NIFI-15918 Stabilize flaky system tests around processor lifecycle#11223

Merged
exceptionfactory merged 2 commits into apache:main from markap14:NIFI-15918
May 14, 2026
Conversation

Contributor

@markap14 markap14 commented May 8, 2026

Framework:

  • LifecycleState.getFutures returns a defensive copy.
  • StandardProcessorNode stop runnable releases its future and decrements active threads if an exception escapes.
  • StandardProcessorNode logs INFO breadcrumbs on silent start-reject and start-abort paths and surfaces validation errors on first retry.
  • KubernetesConfigMapStateProvider.shutdown is null-safe.

System tests:

  • waitForProcessorState throws IOException on timeout.
  • runProcessorOnce waits for validation and fails fast on INVALID.
  • SpawnedStandaloneNiFiInstanceFactory retries process spawn for one minute, polling Management Server health, to absorb TIME_WAIT.
  • ContentClaimTruncationIT pins the shared resource claim and identifies the replayed FlowFile by UUID.

Also adds workspace rules covering single-element-array lambdas and the system-test troubleshooting-logs artifact.
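The LifecycleState.getFutures change above can be sketched as follows. This is a hypothetical minimal illustration of the defensive-copy pattern, not the actual NiFi class: returning a copy means callers iterating the result cannot hit a ConcurrentModificationException when a scheduling thread mutates the underlying set.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.Future;

// Minimal sketch (illustrative names, not the real LifecycleState) of
// returning a defensive copy instead of the live internal set.
public class LifecycleStateSketch {
    private final Set<Future<?>> futures = new HashSet<>();

    public synchronized void addFuture(final Future<?> future) {
        futures.add(future);
    }

    // Returning the live set would let concurrent add/remove calls break
    // iteration in the caller; copying decouples the two.
    public synchronized Set<Future<?>> getFutures() {
        return new HashSet<>(futures);
    }
}
```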

Summary

NIFI-00000

Tracking

Please complete the following tracking steps prior to pull request creation.

Issue Tracking

Pull Request Tracking

  • Pull Request title starts with Apache NiFi Jira issue number, such as NIFI-00000
  • Pull Request commit message starts with Apache NiFi Jira issue number, such as NIFI-00000
  • Pull request contains commits signed with a registered key indicating Verified status

Pull Request Formatting

  • Pull Request based on current revision of the main branch
  • Pull Request refers to a feature branch with one commit containing changes

Verification

Please indicate the verification steps performed prior to pull request creation.

Build

  • Build completed using ./mvnw clean install -P contrib-check
    • JDK 21
    • JDK 25

Licensing

  • New dependencies are compatible with the Apache License 2.0 according to the License Policy
  • New dependencies are documented in applicable LICENSE and NOTICE files

Documentation

  • Documentation formatting appears as expected in rendered files

Contributor

@exceptionfactory exceptionfactory left a comment

Thanks for working on improving test stability @markap14.

On initial review, it seems like no changes to the KubernetesConfigMapStateProvider should be included, because it is not used for any system tests. Was that resulting in unit test failures somewhere?

@markap14
Contributor Author

markap14 commented May 8, 2026

[claude-opus-4.7] Thanks for the review @exceptionfactory. The Kubernetes change addresses a unit-test flake, not a system test. KubernetesConfigMapStateProviderTest's @AfterEach calls provider.shutdown(), and when a @BeforeEach throws (or initialization is skipped) shutdown() NPEs because kubernetesClient is null. The NPE then masks the real failure cause. Observed on PR #11211 in actions run 25513810851:

```
java.lang.NullPointerException: Cannot invoke "io.fabric8.kubernetes.client.KubernetesClient.close()"
because "this.kubernetesClient" is null
at KubernetesConfigMapStateProvider.shutdown(KubernetesConfigMapStateProvider.java:144)
at KubernetesConfigMapStateProviderTest.shutdownProvider(KubernetesConfigMapStateProviderTest.java:113)
```

The fix is just a null check in shutdown() plus a unit test verifying it.
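That null check can be sketched as below. This is a hypothetical stand-in (the real class holds an io.fabric8 KubernetesClient): if initialization never ran because @BeforeEach threw, the field is still null, and skipping close() lets the original failure surface instead of the masking NPE.

```java
// Illustrative sketch of a null-safe shutdown; Client is a stand-in for
// io.fabric8.kubernetes.client.KubernetesClient.
public class NullSafeShutdownSketch {
    interface Client {
        void close();
    }

    private Client kubernetesClient; // stays null when initialization is skipped

    public void shutdown() {
        if (kubernetesClient != null) { // the one-line fix
            kubernetesClient.close();
        }
    }
}
```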

Happy to split it out into a separate PR if you'd prefer the lifecycle PR stays narrowly scoped — let me know.

@exceptionfactory
Copy link
Copy Markdown
Contributor

Thanks for the background reference, keeping it in this PR sounds good.

- Defensive copy in LifecycleState.getFutures() to prevent ConcurrentModificationException.
- Ensure StandardProcessorNode.stop() decrements active threads and completes the
  stop action even if the stop task throws.
- Promote silent processor start abort paths to INFO logs to aid diagnosis.
- Null-safe shutdown in KubernetesConfigMapStateProvider when initialization was skipped.
- SpawnedStandaloneNiFiInstanceFactory: retry NiFi startup until the Management Server
  is reachable (handles HttpServer port reuse race), but only when start() is asked to
  wait for completion. start(false) now spawns the process and returns immediately.
- NiFiClientUtil.runProcessorOnce waits for validation completion and fails fast on
  INVALID; system tests updated to use it where appropriate.
- ContentClaimTruncationIT: keep one small FlowFile queued so its resource claim is
  pinned during replay, preventing archive-cleanup-induced truncation flakes.
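The runProcessorOnce validation wait described above can be sketched as a polling helper. Names here are illustrative, not the actual NiFiClientUtil API: the loop leaves only on VALID, fails fast with IOException on INVALID, and also throws IOException on timeout rather than returning silently.

```java
import java.io.IOException;
import java.util.function.Supplier;

// Hypothetical sketch of a validation-aware wait helper.
public class ValidationWaitSketch {
    enum ValidationStatus { VALIDATING, VALID, INVALID }

    public static void waitForValidation(final Supplier<ValidationStatus> statusSupplier,
            final long timeoutMillis) throws IOException, InterruptedException {
        final long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            final ValidationStatus status = statusSupplier.get();
            if (status == ValidationStatus.VALID) {
                return;
            }
            if (status == ValidationStatus.INVALID) {
                // Fail fast: an INVALID processor will never become runnable.
                throw new IOException("Processor is INVALID; failing fast instead of timing out");
            }
            Thread.sleep(10L);
        }
        throw new IOException("Timed out waiting for processor validation");
    }
}
```

Throwing on timeout instead of returning makes the eventual test failure point at the wait, not at a later assertion.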
Contributor

@exceptionfactory exceptionfactory left a comment

Taking a closer look at the changes to SpawnedStandaloneNiFiInstanceFactory, I noticed the comments mention BindException in some cases. It would be helpful to dig a bit deeper on this front, perhaps ensuring that the StandardManagementServer stops in a more graceful way under some conditions, to avoid keeping the socket in TIME_WAIT status.

…ne instance start

Reverted the start-time retry loop in
SpawnedStandaloneNiFiInstanceFactory.ProcessNiFiInstance and the
associated helpers (startUntilManagementServerReady,
isManagementServerReady) and retry constants. Investigation of the
original CI failure showed shutdowns were consistently clean and the
BindException on the Management Server port appeared once across many
captured failed runs, consistent with a transient ephemeral-port
collision on the GitHub Actions runner rather than a NiFi shutdown
race. The single-shot start path is restored.
@markap14
Contributor Author

[claude-opus-4.7] @exceptionfactory Thanks for pushing back on the retry-based approach. I dug into the actual failure that motivated it (Ubuntu/JDK 25 run 25457358299, ParameterContextIT.testParametersReferencingEL) using the captured troubleshooting archive, and the data does not really support the "TIME_WAIT / shutdown grace" hypothesis:

  • In the failing instance's nifi-app.log, Jetty bound 127.0.0.1:5670 successfully ~9 ms before the management server failed on 127.0.0.1:56730. Same JVM, same IP. A lingering prior NiFi or socket pair from this test would have blocked both ports.
  • In the same run's failsafe output, every prior Shutdown Requested → Shutdown Completed for the standalone instance completed in roughly 400–450 ms, well under the 15 s process.waitFor timeout. None of them fell into the destroyForcibly() path.
  • ParameterSensitivityWithGhostedComponentIT (the test immediately before the cluster gap) cycled the same port 56730 four times in 36 seconds with ~200 ms gaps between stop and the next start, and all four succeeded. The failure was the next start, after a two-minute idle window during the cluster test (which only touches 56710/56720). If the cause were TIME_WAIT from our own process churn, we would expect failures on the rapid restarts, not after a long idle.
  • Across seven captured failed system-tests runs I had access to, the BindException appears in only that one run. The other six failures are entirely unrelated (cluster join issues, content truncation, etc.).

The most plausible cause is that the Linux kernel's ephemeral port allocator (default range 32768–60999 on Ubuntu 24.04, which includes 56730) briefly handed 56730 to some other outbound TCP socket on the runner during the 11-second startup window for the new NiFi JVM. SO_REUSEADDR would not help in that case — it only matters for TIME_WAIT 4-tuples, not an active LISTEN by something else.

Given that this is a rare environmental transient rather than a NiFi shutdown bug, I have reverted the retry loop in SpawnedStandaloneNiFiInstanceFactory in commit 785e1de8e3d. start() is back to the single-shot form it had before NIFI-15918. The rest of the PR (LifecycleState defensive copy, StandardProcessorNode cleanup, KubernetesConfigMapStateProvider null-safety, system-test helper changes) is unchanged.

If we see this pattern recur, the cleaner real fix is probably to bind the management server to port 0 inside the child JVM and have it publish the kernel-assigned port to a file the launcher reads — which sidesteps the launcher-vs-child socket-option mismatch that broke the previous port-scan attempt. Happy to take that on as a separate Jira if it shows up again.
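The bind-to-port-0 idea above can be sketched as follows. This is a speculative illustration, not NiFi code: the child binds port 0 so the kernel assigns an unused port, then writes that port to a file the launcher reads, sidestepping any race for a pre-chosen ephemeral port.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: let the kernel pick the management port and publish it.
public class KernelAssignedPortSketch {
    public static int bindAndPublish(final Path portFile) throws IOException {
        try (ServerSocket serverSocket = new ServerSocket()) {
            serverSocket.bind(new InetSocketAddress("127.0.0.1", 0)); // port 0: kernel assigns
            final int assignedPort = serverSocket.getLocalPort();
            Files.writeString(portFile, Integer.toString(assignedPort));
            return assignedPort; // sketch only; a real server would keep listening
        }
    }
}
```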

Contributor

@exceptionfactory exceptionfactory left a comment

Thanks for reverting the restart loop changes, that sounds like the best approach under the circumstances. The rest of the changes look good. +1 merging

@exceptionfactory exceptionfactory merged commit 0887d8e into apache:main May 14, 2026
13 checks passed