NIFI-15698 - Fix Python bridge hang during startup with many Python processors #11002
pvillard31 wants to merge 2 commits into apache:main from
exceptionfactory left a comment
Thanks for working on this issue @pvillard31.
On an initial review, I'm concerned about the increased complexity of the PythonProcess, extending the contract and exposing more internals. There are some inherent limitations with Python Processors, which prompts some questions about how much complexity to introduce in order to support a larger number of Python Processors.
Although the current behavior is certainly problematic, I would like to give this closer consideration before moving forward with the new locking approach.
Reproduced on 2.6.0; I stumbled over this independently. There are some gateway communication attempts (25?) and then it fails.
Summary
NIFI-15698 - Fix Python bridge hang during startup with many Python processors
It has been challenging to work on this one, and I was unable to come up with a system test that systematically reproduces the issue. It was, however, very easy to reproduce the problem by following the steps in the repository shared by the reporter: https://github.com/distroitt/nifi-bug
I confirmed the issue on the latest release and confirmed that the fix solves the problem by building the 2.9.0-SNAPSHOT Docker image and running the same tests.
When loading a flow with many Python processors, NiFi can hang during startup or restart and never reach "Started Application". The root cause is virtual thread pinning in `NiFiPythonGateway`. The four methods that guard the `activeInvocations` list (`beginInvocation`, `endInvocation`, `putNewObject`, `putObject`) use `synchronized`, which pins virtual threads to their carrier threads in JDK 21. During flow synchronization, the main thread and many processor-initialization virtual threads all contend for this single intrinsic lock. Because each waiting virtual thread pins its carrier, the ForkJoinPool carrier threads are quickly exhausted, and no thread can make progress - including the one holding the lock. This change replaces the `synchronized` methods with a `ReentrantLock`, which is virtual-thread-friendly: blocked virtual threads yield their carrier thread instead of pinning it.

The `PythonProcess` lifecycle has been updated so that a process is only handed out to callers after `discoverExtensions()` completes. New `isReady()`, `waitUntilReady()`, and `markReadyAndNotify()` methods prevent the main thread or initialization threads from calling into a Python process that is still loading extensions, which was another source of hangs on first start.

The `getProcessForNextComponent` method in `StandardPythonBridge` has been restructured to hold the bridge lock only for the decision phase (picking or creating a process), then release it before performing blocking operations like `start()` and `discoverExtensions()`. Previously the entire method was `synchronized`, blocking all other processor creation threads during these slow operations.

The `createProcessorBridge` method now receives the already-resolved `PythonProcessorDetails` from its caller instead of calling `getProcessorTypes()` again. This eliminates two redundant Python proxy round-trips per processor creation, reducing gateway lock contention during startup.

A workaround has been added in `ProcessorInspection.py` for a CPython 3.11+ bug (gh-95185) where `ast.parse()` can raise `SystemError: AST constructor recursion depth mismatch` under concurrent load. The error is caught and the file is treated as a non-processor module so that extension loading continues.

For reference, extract of thread dump when reproducing the issue:
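As a rough illustration of the locking change described in this summary (not an excerpt from the PR diff), the sketch below shows the general pattern of guarding a shared invocation list with an explicit `ReentrantLock` instead of `synchronized` methods. The class `InvocationTrackerSketch`, its nested `Invocation` type, and all method signatures are hypothetical and simplified; only the method names `beginInvocation`, `putObject`, and `endInvocation` echo the ones mentioned above, and the real `NiFiPythonGateway` code may look quite different.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical, simplified stand-in for a gateway that tracks active invocations.
// This is an assumption-laden sketch, not the NiFiPythonGateway implementation.
final class InvocationTrackerSketch {

    // One entry per in-flight invocation; equality is by identity.
    static final class Invocation {
        private final List<String> objectIds = new ArrayList<>();
    }

    // Explicit lock used in place of the intrinsic monitor taken by synchronized methods.
    private final ReentrantLock lock = new ReentrantLock();

    // Guarded by lock.
    private final List<Invocation> activeInvocations = new ArrayList<>();

    // Previously this kind of method would have been declared synchronized.
    Invocation beginInvocation() {
        lock.lock();
        try {
            Invocation invocation = new Invocation();
            activeInvocations.add(invocation);
            return invocation;
        } finally {
            lock.unlock(); // always release, even if the body throws
        }
    }

    // Records an object ID as bound during the given invocation.
    void putObject(Invocation invocation, String objectId) {
        lock.lock();
        try {
            invocation.objectIds.add(objectId);
        } finally {
            lock.unlock();
        }
    }

    // Removes the invocation and returns how many object IDs it had bound.
    int endInvocation(Invocation invocation) {
        lock.lock();
        try {
            activeInvocations.remove(invocation);
            return invocation.objectIds.size();
        } finally {
            lock.unlock();
        }
    }

    int activeCount() {
        lock.lock();
        try {
            return activeInvocations.size();
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        InvocationTrackerSketch tracker = new InvocationTrackerSketch();
        Thread[] workers = new Thread[4];
        for (int i = 0; i < workers.length; i++) {
            String objectId = "object-" + i;
            // Platform threads keep the sketch portable across JDK versions.
            workers[i] = new Thread(() -> {
                Invocation invocation = tracker.beginInvocation();
                tracker.putObject(invocation, objectId);
                tracker.endInvocation(invocation);
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            worker.join();
        }
        System.out.println("Active invocations after all workers finish: " + tracker.activeCount());
    }
}
```

The `lock()` / `try` / `finally unlock()` shape is the standard idiom for `ReentrantLock`. On JDK 21 the workers could be started with `Thread.startVirtualThread(...)` instead; a virtual thread that blocks in `lock.lock()` can be unmounted from its carrier, which is the property the fix relies on.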
Tracking
Please complete the following tracking steps prior to pull request creation.
Issue Tracking
Pull Request Tracking
`NIFI-00000` `NIFI-00000` `Verified` status
Pull Request Formatting
`main` branch
Verification
Please indicate the verification steps performed prior to pull request creation.
Build
`./mvnw clean install -P contrib-check`
Licensing
`LICENSE` and `NOTICE` files
Documentation