NIFI-15918 Stabilize flaky system tests around processor lifecycle#11223

Merged
exceptionfactory merged 2 commits into apache:main from markap14:NIFI-15918
May 14, 2026
Conversation

Contributor

@markap14 markap14 commented May 8, 2026

Framework:

  • LifecycleState.getFutures returns a defensive copy.
  • StandardProcessorNode stop runnable releases its future and decrements active threads if an exception escapes.
  • StandardProcessorNode logs INFO breadcrumbs on silent start-reject and start-abort paths and surfaces validation errors on first retry.
  • KubernetesConfigMapStateProvider.shutdown is null-safe.

System tests:

  • waitForProcessorState throws IOException on timeout.
  • runProcessorOnce waits for validation and fails fast on INVALID.
  • SpawnedStandaloneNiFiInstanceFactory retries process spawn for one minute, polling Management Server health, to absorb TIME_WAIT.
  • ContentClaimTruncationIT pins the shared resource claim and identifies the replayed FlowFile by UUID.

Also adds workspace rules covering single-element-array lambdas and the system-test troubleshooting-logs artifact.
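The LifecycleState.getFutures change above can be sketched as follows. This is a hypothetical minimal illustration of the defensive-copy pattern, not the actual NiFi class: returning a copy means callers iterating the result cannot hit a ConcurrentModificationException when a scheduling thread mutates the underlying set.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.Future;

// Minimal sketch (illustrative names, not the real LifecycleState) of
// returning a defensive copy instead of the live internal set.
public class LifecycleStateSketch {
    private final Set<Future<?>> futures = new HashSet<>();

    public synchronized void addFuture(final Future<?> future) {
        futures.add(future);
    }

    // Returning the live set would let concurrent add/remove calls break
    // iteration in the caller; copying decouples the two.
    public synchronized Set<Future<?>> getFutures() {
        return new HashSet<>(futures);
    }
}
```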

Summary

NIFI-00000

Tracking

Please complete the following tracking steps prior to pull request creation.

Issue Tracking

Pull Request Tracking

  • Pull Request title starts with Apache NiFi Jira issue number, such as NIFI-00000
  • Pull Request commit message starts with Apache NiFi Jira issue number, such as NIFI-00000
  • Pull request contains commits signed with a registered key indicating Verified status

Pull Request Formatting

  • Pull Request based on current revision of the main branch
  • Pull Request refers to a feature branch with one commit containing changes

Verification

Please indicate the verification steps performed prior to pull request creation.

Build

  • Build completed using ./mvnw clean install -P contrib-check
    • JDK 21
    • JDK 25

Licensing

  • New dependencies are compatible with the Apache License 2.0 according to the License Policy
  • New dependencies are documented in applicable LICENSE and NOTICE files

Documentation

  • Documentation formatting appears as expected in rendered files

Contributor

@exceptionfactory exceptionfactory left a comment

Thanks for working on improving test stability @markap14.

On initial review, it seems like no changes to the KubernetesConfigMapStateProvider should be included, because it is not used for any system tests. Was that resulting in unit test failures somewhere?

@markap14
Contributor Author

markap14 commented May 8, 2026

[claude-opus-4.7] Thanks for the review @exceptionfactory. The Kubernetes change addresses a unit-test flake, not a system test. KubernetesConfigMapStateProviderTest's @AfterEach calls provider.shutdown(), and when a @BeforeEach throws (or initialization is skipped) shutdown() NPEs because kubernetesClient is null. The NPE then masks the real failure cause. Observed on PR #11211 in actions run 25513810851:

```
java.lang.NullPointerException: Cannot invoke "io.fabric8.kubernetes.client.KubernetesClient.close()"
because "this.kubernetesClient" is null
at KubernetesConfigMapStateProvider.shutdown(KubernetesConfigMapStateProvider.java:144)
at KubernetesConfigMapStateProviderTest.shutdownProvider(KubernetesConfigMapStateProviderTest.java:113)
```

The fix is just a null check in shutdown() plus a unit test verifying it.
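That null check can be sketched as below. This is a hypothetical stand-in (the real class holds an io.fabric8 KubernetesClient): if initialization never ran because @BeforeEach threw, the field is still null, and skipping close() lets the original failure surface instead of the masking NPE.

```java
// Illustrative sketch of a null-safe shutdown; Client is a stand-in for
// io.fabric8.kubernetes.client.KubernetesClient.
public class NullSafeShutdownSketch {
    interface Client {
        void close();
    }

    private Client kubernetesClient; // stays null when initialization is skipped

    public void shutdown() {
        if (kubernetesClient != null) { // the one-line fix
            kubernetesClient.close();
        }
    }
}
```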

Happy to split it out into a separate PR if you'd prefer the lifecycle PR stays narrowly scoped — let me know.

@exceptionfactory
Copy link
Copy Markdown
Contributor

Thanks for the background reference, keeping it in this PR sounds good.

- Defensive copy in LifecycleState.getFutures() to prevent ConcurrentModificationException.
- Ensure StandardProcessorNode.stop() decrements active threads and completes the
  stop action even if the stop task throws.
- Promote silent processor start abort paths to INFO logs to aid diagnosis.
- Null-safe shutdown in KubernetesConfigMapStateProvider when initialization was skipped.
- SpawnedStandaloneNiFiInstanceFactory: retry NiFi startup until the Management Server
  is reachable (handles HttpServer port reuse race), but only when start() is asked to
  wait for completion. start(false) now spawns the process and returns immediately.
- NiFiClientUtil.runProcessorOnce waits for validation completion and fails fast on
  INVALID; system tests updated to use it where appropriate.
- ContentClaimTruncationIT: keep one small FlowFile queued so its resource claim is
  pinned during replay, preventing archive-cleanup-induced truncation flakes.
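The runProcessorOnce validation wait described above can be sketched as a polling helper. Names here are illustrative, not the actual NiFiClientUtil API: the loop leaves only on VALID, fails fast with IOException on INVALID, and also throws IOException on timeout rather than returning silently.

```java
import java.io.IOException;
import java.util.function.Supplier;

// Hypothetical sketch of a validation-aware wait helper.
public class ValidationWaitSketch {
    enum ValidationStatus { VALIDATING, VALID, INVALID }

    public static void waitForValidation(final Supplier<ValidationStatus> statusSupplier,
            final long timeoutMillis) throws IOException, InterruptedException {
        final long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            final ValidationStatus status = statusSupplier.get();
            if (status == ValidationStatus.VALID) {
                return;
            }
            if (status == ValidationStatus.INVALID) {
                // Fail fast: an INVALID processor will never become runnable.
                throw new IOException("Processor is INVALID; failing fast instead of timing out");
            }
            Thread.sleep(10L);
        }
        throw new IOException("Timed out waiting for processor validation");
    }
}
```

Throwing on timeout instead of returning makes the eventual test failure point at the wait, not at a later assertion.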
Contributor

@exceptionfactory exceptionfactory left a comment

Taking a closer look at the changes to SpawnedStandaloneNiFiInstanceFactory, I noticed the comments mention BindException in some cases. It would be helpful to dig a bit deeper on this front, perhaps ensuring that the StandardManagementServer stops in a more graceful way under some conditions, to avoid keeping the socket in TIME_WAIT status.

…ne instance start

Reverted the start-time retry loop in
SpawnedStandaloneNiFiInstanceFactory.ProcessNiFiInstance and the
associated helpers (startUntilManagementServerReady,
isManagementServerReady) and retry constants. Investigation of the
original CI failure showed shutdowns were consistently clean and the
BindException on the Management Server port appeared once across many
captured failed runs, consistent with a transient ephemeral-port
collision on the GitHub Actions runner rather than a NiFi shutdown
race. The single-shot start path is restored.
@markap14
Contributor Author

[claude-opus-4.7] @exceptionfactory Thanks for pushing back on the retry-based approach. I dug into the actual failure that motivated it (Ubuntu/JDK 25 run 25457358299, ParameterContextIT.testParametersReferencingEL) using the captured troubleshooting archive, and the data does not really support the "TIME_WAIT / shutdown grace" hypothesis:

  • In the failing instance's nifi-app.log, Jetty bound 127.0.0.1:5670 successfully ~9 ms before the management server failed on 127.0.0.1:56730. Same JVM, same IP. A lingering prior NiFi or socket pair from this test would have blocked both ports.
  • In the same run's failsafe output, every prior Shutdown Requested → Shutdown Completed for the standalone instance completed in roughly 400–450 ms, well under the 15 s process.waitFor timeout. None of them fell into the destroyForcibly() path.
  • ParameterSensitivityWithGhostedComponentIT (the test immediately before the cluster gap) cycled the same port 56730 four times in 36 seconds with ~200 ms gaps between stop and the next start, and all four succeeded. The failure was the next start, after a two-minute idle window during the cluster test (which only touches 56710/56720). If the cause were TIME_WAIT from our own process churn, we would expect failures on the rapid restarts, not after a long idle.
  • Across seven captured failed system-tests runs I had access to, the BindException appears in only that one run. The other six failures are entirely unrelated (cluster join issues, content truncation, etc.).

The most plausible cause is that the Linux kernel's ephemeral port allocator (default range 32768–60999 on Ubuntu 24.04, which includes 56730) briefly handed 56730 to some other outbound TCP socket on the runner during the 11-second startup window for the new NiFi JVM. SO_REUSEADDR would not help in that case — it only matters for TIME_WAIT 4-tuples, not an active LISTEN by something else.

Given that this is a rare environmental transient rather than a NiFi shutdown bug, I have reverted the retry loop in SpawnedStandaloneNiFiInstanceFactory in commit 785e1de8e3d. start() is back to the single-shot form it had before NIFI-15918. The rest of the PR (LifecycleState defensive copy, StandardProcessorNode cleanup, KubernetesConfigMapStateProvider null-safety, system-test helper changes) is unchanged.

If we see this pattern recur, the cleaner real fix is probably to bind the management server to port 0 inside the child JVM and have it publish the kernel-assigned port to a file the launcher reads — which sidesteps the launcher-vs-child socket-option mismatch that broke the previous port-scan attempt. Happy to take that on as a separate Jira if it shows up again.
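The bind-to-port-0 idea above can be sketched as follows. This is a speculative illustration, not NiFi code: the child binds port 0 so the kernel assigns an unused port, then writes that port to a file the launcher reads, sidestepping any race for a pre-chosen ephemeral port.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: let the kernel pick the management port and publish it.
public class KernelAssignedPortSketch {
    public static int bindAndPublish(final Path portFile) throws IOException {
        try (ServerSocket serverSocket = new ServerSocket()) {
            serverSocket.bind(new InetSocketAddress("127.0.0.1", 0)); // port 0: kernel assigns
            final int assignedPort = serverSocket.getLocalPort();
            Files.writeString(portFile, Integer.toString(assignedPort));
            return assignedPort; // sketch only; a real server would keep listening
        }
    }
}
```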

Contributor

@exceptionfactory exceptionfactory left a comment

Thanks for reverting the restart loop changes, that sounds like the best approach under the circumstances. The rest of the changes look good. +1 merging

@exceptionfactory exceptionfactory merged commit 0887d8e into apache:main May 14, 2026
13 checks passed