[FLINK-32212][runtime] Fix infinite restart loop from BlobLibraryCach… #27579
+132
−36
…eManager during JM failover
What is the purpose of the change
When a JobManager restarts (common in K8s with spot instances/node drains), re-uploaded JARs produce `PermanentBlobKey` objects with different random components despite identical content. The TaskManager's cached classloader holds the old blob keys, and `BlobLibraryCacheManager.verifyClassLoader()` throws `IllegalStateException` because the new keys don't match the old ones, even though the JAR content is identical. This causes an infinite restart loop, since every task deployment attempt hits the same mismatch.

This pull request makes `BlobLibraryCacheManager` handle blob key mismatches gracefully by catching the `IllegalStateException` and re-creating the classloader with the new blob keys, instead of propagating the exception.

Brief change log
- `LibraryCacheEntry.getOrResolveClassLoader()` now catches `IllegalStateException` from `verifyClassLoader()` and re-creates the classloader with the new blob keys
- Adds a `createResolvedClassLoader()` helper to eliminate duplication between the initial creation and re-creation paths

Verifying this change
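For orientation, the catch-and-re-create behavior can be sketched in isolation. This is a simplified model, not the Flink code in this PR: `ClassLoaderRecreationSketch`, `CacheEntry`, `TrackingLoader`, and the string keys are hypothetical stand-ins for `LibraryCacheEntry`, the resolved classloader, and `PermanentBlobKey`. Only `java.net.URLClassLoader` is a real API here.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class ClassLoaderRecreationSketch {

    /** Hypothetical classloader that records whether close() was called. */
    static final class TrackingLoader extends URLClassLoader {
        volatile boolean closed = false;

        TrackingLoader() {
            super(new URL[0], ClassLoaderRecreationSketch.class.getClassLoader());
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    /** Hypothetical stand-in for a per-job cache entry; keys are plain strings. */
    static final class CacheEntry {
        private Set<String> keys;
        private TrackingLoader loader;

        synchronized TrackingLoader getOrResolve(Set<String> requiredKeys) {
            if (loader == null) {
                createResolved(requiredKeys);
            } else {
                try {
                    verify(requiredKeys);
                } catch (IllegalStateException e) {
                    // Key mismatch: build a fresh loader for the new keys.
                    // The old loader is deliberately NOT closed, because
                    // in-flight tasks may still hold a reference to it.
                    createResolved(requiredKeys);
                }
            }
            return loader;
        }

        private void verify(Set<String> requiredKeys) {
            if (!keys.equals(requiredKeys)) {
                throw new IllegalStateException("Blob keys do not match cached keys");
            }
        }

        // Shared by the initial-creation and re-creation paths.
        private void createResolved(Set<String> requiredKeys) {
            keys = new HashSet<>(requiredKeys);
            loader = new TrackingLoader();
        }
    }

    public static void main(String[] args) {
        CacheEntry entry = new CacheEntry();
        // Same logical content, different key after a simulated failover.
        TrackingLoader before = entry.getOrResolve(Collections.singleton("content/random-1"));
        TrackingLoader after = entry.getOrResolve(Collections.singleton("content/random-2"));
        System.out.println("re-created: " + (before != after));
        System.out.println("old loader closed: " + before.closed);
    }
}
```

In this model, a request with unchanged keys returns the cached loader, while a request with different keys swaps in a new one and leaves the old instance open.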
This change added tests and can be verified as follows:
- `classloaderIsRecreatedWhenBlobKeysChangeForSameJob`: uploads identical content twice to produce different `PermanentBlobKey`s (simulating JM failover) and verifies that the classloader is transparently re-created
- `classloaderRecreationDoesNotCloseOldClassloader`: verifies that the old classloader is not closed during re-creation, since in-flight tasks may still reference it
- Updates the tests in `BlobLibraryCacheManagerTest` that previously expected `IllegalStateException` so that they verify the new re-creation behavior

Does this pull request potentially affect one of the following parts:
- `@Public(Evolving)`: no

Documentation