NIFI-15918 Stabilize flaky system tests around processor lifecycle#11223
Conversation
exceptionfactory
left a comment
There was a problem hiding this comment.
Thanks for working on improving test stability @markap14.
On initial review, it seems like no changes to the KubernetesConfigMapStateProvider should be included, because it is not used for any system tests. Was that resulting in unit test failures somewhere?
|
[claude-opus-4.7] Thanks for the review @exceptionfactory. The Kubernetes change addresses a unit-test flake, not a system test. ``` The fix is just a null check in Happy to split it out into a separate PR if you'd prefer the lifecycle PR stays narrowly scoped — let me know. |
Thanks for the background reference, keeping it in this PR sounds good. |
- Defensive copy in LifecycleState.getFutures() to prevent ConcurrentModificationException. - Ensure StandardProcessorNode.stop() decrements active threads and completes the stop action even if the stop task throws. - Promote silent processor start abort paths to INFO logs to aid diagnosis. - Null-safe shutdown in KubernetesConfigMapStateProvider when initialization was skipped. - SpawnedStandaloneNiFiInstanceFactory: retry NiFi startup until the Management Server is reachable (handles HttpServer port reuse race), but only when start() is asked to wait for completion. start(false) now spawns the process and returns immediately. - NiFiClientUtil.runProcessorOnce waits for validation completion and fails fast on INVALID; system tests updated to use it where appropriate. - ContentClaimTruncationIT: keep one small FlowFile queued so its resource claim is pinned during replay, preventing archive-cleanup-induced truncation flakes.
exceptionfactory
left a comment
There was a problem hiding this comment.
Taking a closer look at the changes to SpawnedStandaloneNiFiInstanceFactory, I noticed the comments mention BindException in some cases. It would be helpful to dig a bit deeper on this front, perhaps ensuring that the StandardManagementServer stops in a more graceful way under some conditions, to avoid keeping the socket in TIME_WAIT status.
…ne instance start Reverted the start-time retry loop in SpawnedStandaloneNiFiInstanceFactory.ProcessNiFiInstance and the associated helpers (startUntilManagementServerReady, isManagementServerReady) and retry constants. Investigation of the original CI failure showed shutdowns were consistently clean and the BindException on the Management Server port appeared once across many captured failed runs, consistent with a transient ephemeral-port collision on the GitHub Actions runner rather than a NiFi shutdown race. The single-shot start path is restored.
|
[claude-opus-4.7] @exceptionfactory Thanks for pushing back on the retry-based approach. I dug into the actual failure that motivated it (Ubuntu/JDK 25 run 25457358299,
The most plausible cause is that the Linux kernel's ephemeral port allocator (default range Given that this is a rare environmental transient rather than a NiFi shutdown bug, I have reverted the retry loop in If we see this pattern recur, the cleaner real fix is probably to bind the management server to port 0 inside the child JVM and have it publish the kernel-assigned port to a file the launcher reads — which sidesteps the launcher-vs-child socket-option mismatch that broke the previous port-scan attempt. Happy to take that on as a separate Jira if it shows up again. |
exceptionfactory
left a comment
There was a problem hiding this comment.
Thanks for reverting the restart loop changes, that sounds like the best approach under the circumstances. The rest of the changes look good. +1 merging
Framework:
System tests:
Workspace rules covering single-element-array lambdas and the system-test troubleshooting-logs artifact.
Summary
NIFI-00000
Tracking
Please complete the following tracking steps prior to pull request creation.
Issue Tracking
Pull Request Tracking
NIFI-00000NIFI-00000VerifiedstatusPull Request Formatting
mainbranchVerification
Please indicate the verification steps performed prior to pull request creation.
Build
./mvnw clean install -P contrib-checkLicensing
LICENSEandNOTICEfilesDocumentation