@nateab (Contributor) commented Feb 11, 2026

…eManager during JM failover

What is the purpose of the change

When a JobManager restarts (common on Kubernetes with spot instances or node drains), re-uploaded JARs produce PermanentBlobKey objects with different random components even though the JAR content is identical. The TaskManager's cached classloader still holds the old blob keys, so BlobLibraryCacheManager.verifyClassLoader() throws IllegalStateException when the new keys do not match. Because every subsequent task deployment attempt hits the same mismatch, the job enters an infinite restart loop.
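
For reviewers less familiar with the key mismatch, here is a minimal, self-contained sketch of the failure precondition. It assumes a key made of a content-derived hash plus a random part generated per upload. `SketchBlobKey`, `forUpload`, and `sameContentAs` are hypothetical names for illustration only and are not Flink's `PermanentBlobKey` API; the hash algorithm and random-part length are also assumptions.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;
import java.util.Arrays;

/** Hypothetical stand-in for a blob key; not Flink's PermanentBlobKey. */
final class SketchBlobKey {
    private static final SecureRandom RANDOM = new SecureRandom();

    private final byte[] contentHash; // derived from the JAR bytes
    private final byte[] randomPart;  // freshly generated on every upload

    private SketchBlobKey(byte[] contentHash, byte[] randomPart) {
        this.contentHash = contentHash;
        this.randomPart = randomPart;
    }

    /** Simulates one upload: same bytes give the same hash but a new random part. */
    static SketchBlobKey forUpload(byte[] content) {
        try {
            byte[] hash = MessageDigest.getInstance("SHA-256").digest(content);
            byte[] random = new byte[16];
            RANDOM.nextBytes(random);
            return new SketchBlobKey(hash, random);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** True when both keys were derived from identical content. */
    boolean sameContentAs(SketchBlobKey other) {
        return Arrays.equals(contentHash, other.contentHash);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof SketchBlobKey)) {
            return false;
        }
        SketchBlobKey other = (SketchBlobKey) o;
        // Key equality includes the random part, so re-uploads never compare equal.
        return Arrays.equals(contentHash, other.contentHash)
                && Arrays.equals(randomPart, other.randomPart);
    }

    @Override
    public int hashCode() {
        return 31 * Arrays.hashCode(contentHash) + Arrays.hashCode(randomPart);
    }
}
```

Uploading the same bytes twice with `forUpload` yields two keys for which `sameContentAs` holds but `equals` does not, which is the situation a strict key comparison on the TaskManager rejects.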

This pull request makes BlobLibraryCacheManager handle blob key mismatches gracefully by catching the IllegalStateException and re-creating the classloader with the new blob keys, instead of propagating the exception.

Brief change log

  • LibraryCacheEntry.getOrResolveClassLoader() now catches IllegalStateException from verifyClassLoader() and re-creates the classloader with the new blob keys
  • Extracted createResolvedClassLoader() helper to eliminate duplication between initial creation and re-creation paths
  • The old classloader is intentionally not closed because in-flight tasks being cancelled may still reference it
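
The change log above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code in this PR: `SketchLibraryCacheEntry` is a hypothetical class, blob keys are modelled as plain strings, and the classloader factory is injected so the example runs on its own. Only the method names `getOrResolveClassLoader`, `verifyClassLoader`, and `createResolvedClassLoader` are taken from the description; their signatures here are assumed.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical, simplified cache entry; blob keys are modelled as strings. */
final class SketchLibraryCacheEntry {
    private final Function<Set<String>, ClassLoader> classLoaderFactory;
    private Set<String> resolvedKeys;
    private ClassLoader resolvedClassLoader;

    SketchLibraryCacheEntry(Function<Set<String>, ClassLoader> classLoaderFactory) {
        this.classLoaderFactory = classLoaderFactory;
    }

    synchronized ClassLoader getOrResolveClassLoader(Collection<String> requiredKeys) {
        if (resolvedClassLoader == null) {
            // Initial creation path.
            return createResolvedClassLoader(requiredKeys);
        }
        try {
            verifyClassLoader(requiredKeys);
        } catch (IllegalStateException e) {
            // Keys changed (for example after a JM failover): re-create instead of
            // propagating. The old classloader is deliberately NOT closed here,
            // because tasks that are still being cancelled may reference it.
            return createResolvedClassLoader(requiredKeys);
        }
        return resolvedClassLoader;
    }

    private void verifyClassLoader(Collection<String> requiredKeys) {
        if (!resolvedKeys.equals(new HashSet<>(requiredKeys))) {
            throw new IllegalStateException("Blob keys do not match the cached classloader");
        }
    }

    /** Shared by the initial-creation and re-creation paths. */
    private ClassLoader createResolvedClassLoader(Collection<String> keys) {
        resolvedKeys = new HashSet<>(keys);
        resolvedClassLoader = classLoaderFactory.apply(resolvedKeys);
        return resolvedClassLoader;
    }
}
```

In this sketch a repeated request with the same keys returns the cached classloader, while a request with different keys returns a fresh one and leaves the previous instance open.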

Verifying this change

This change added tests and can be verified as follows:

  • Added classloaderIsRecreatedWhenBlobKeysChangeForSameJob: uploads identical content twice to produce different PermanentBlobKeys (simulating JM failover), verifies the classloader is transparently re-created
  • Added classloaderRecreationDoesNotCloseOldClassloader: verifies the old classloader is not closed during re-creation, since in-flight tasks may still reference it
  • Updated existing tests in BlobLibraryCacheManagerTest that previously expected IllegalStateException to verify the new re-creation behavior
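
One possible way to check the "old classloader is not closed" property is a classloader that records calls to `close()`. The sketch below is an assumption about how such a check could look, not the test added in this PR; `CloseTrackingClassLoader` and `SketchLoaderHolder` are hypothetical helpers, and the holder stands in for the cache entry's re-creation path.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical test helper: records whether close() was ever called. */
final class CloseTrackingClassLoader extends URLClassLoader {
    private final AtomicBoolean closed = new AtomicBoolean(false);

    CloseTrackingClassLoader() {
        super(new URL[0]);
    }

    @Override
    public void close() throws IOException {
        closed.set(true);
        super.close();
    }

    boolean isClosed() {
        return closed.get();
    }
}

/** Hypothetical holder that swaps in a new loader without closing the old one. */
final class SketchLoaderHolder {
    private CloseTrackingClassLoader current = new CloseTrackingClassLoader();

    CloseTrackingClassLoader current() {
        return current;
    }

    /** Re-creation path: replaces the reference and leaves the old loader open. */
    CloseTrackingClassLoader recreate() {
        current = new CloseTrackingClassLoader();
        return current;
    }
}
```

A test built on this would keep a reference to the loader returned before re-creation and assert that `isClosed()` is still false afterwards.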

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: yes
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

@flinkbot (Collaborator) commented Feb 11, 2026

CI report:


…eManager during JM failover

When a JobManager restarts (e.g. K8s spot instance preemption),
re-uploaded JARs produce PermanentBlobKeys with different random
components despite identical content. The TaskManager's cached
classloader holds the old keys, causing verifyClassLoader() to
throw IllegalStateException on every task deployment attempt,
resulting in an infinite restart loop.

Catch IllegalStateException from verifyClassLoader() and re-create
the classloader with the new blob keys instead of propagating
the exception. The old classloader is intentionally not closed
because in-flight tasks may still reference it.
@nateab nateab force-pushed the fix/FLINK-32212-blob-library-cache-classloader-recreation branch from 95d333c to d1b040d Compare February 11, 2026 08:11