Skip to content

AP-25628: add checkpoint/restore (CRaC support in executor)#88

Open
bernd-wiswedel wants to merge 1 commit intomasterfrom
todo/AP-25628-c-ra-c-po-c-for-faster-executor-statup
Open

AP-25628: add checkpoint/restore (CRaC support in executor)#88
bernd-wiswedel wants to merge 1 commit intomasterfrom
todo/AP-25628-c-ra-c-po-c-for-faster-executor-statup

Conversation

@bernd-wiswedel
Copy link
Copy Markdown
Member

AP-25628 (PoC: "CRaC" for faster executor startup (suspend VM after start))

@bernd-wiswedel bernd-wiswedel requested a review from a team as a code owner March 18, 2026 10:16
@bernd-wiswedel bernd-wiswedel requested review from Copilot and knime-ghub-bot and removed request for a team March 18, 2026 10:16
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a CRaC (Coordinated Restore at Checkpoint) hook to the KNIME Python gateway tracking layer so Python processes are terminated before the JVM is checkpointed, aiming to improve executor startup/restore behavior.

Changes:

  • Register a PhasedInit callback in PythonGatewayTracker to run cleanup before checkpointing.
  • Reuse existing gateway cleanup logic (clear()) to forcefully terminate tracked Python gateways/processes.

Comment on lines +84 to +92
// Support CRaC (Coordinated Restore at Checkpoint) and close all connections prior checkpointing
PhasedInitSupport.registerOrActivate(new PhasedInit<RuntimeException>() {
@Override
public void beforeCheckpoint() throws RuntimeException {
try {
clear();
} catch (IOException ex) {
LOGGER.warn("Error when forcefully terminating Python processes during phased initialization", ex);
}
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is valid and should be changed:

  • clear should accept a Consumer that allows the log message to be customizable by the caller
  • Exception should fall through

Comment on lines +85 to +94
PhasedInitSupport.registerOrActivate(new PhasedInit<RuntimeException>() {
@Override
public void beforeCheckpoint() throws RuntimeException {
try {
clear();
} catch (IOException ex) {
LOGGER.warn("Error when forcefully terminating Python processes during phased initialization", ex);
}
}
});
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are race conditions on checkpoint/restore, which should not happen by design. Restore willl be a controlled phase, and checkpoint will only happen when the init of all bundles has stabilized (but no workflows have been executed). So, leaving unchanged!

try {
clear();
} catch (IOException ex) {
LOGGER.warn("Error when forcefully terminating Python processes during phased initialization", ex);
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. Let's rethrow and fix later, if at all needed.


private PythonGatewayTracker() {
m_openGateways = gatewaySet();
// Support CRaC (Coordinated Restore at Checkpoint) and close all connections prior checkpointing
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will fix

@bernd-wiswedel bernd-wiswedel force-pushed the todo/AP-25628-c-ra-c-po-c-for-faster-executor-statup branch 2 times, most recently from 8479e06 to dd8bd9d Compare April 1, 2026 17:40
AP-25628 (PoC: "CRaC" for faster executor startup (suspend VM after start))
@bernd-wiswedel bernd-wiswedel force-pushed the todo/AP-25628-c-ra-c-po-c-for-faster-executor-statup branch from dd8bd9d to ed41788 Compare April 4, 2026 12:29
@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud bot commented Apr 4, 2026

Quality Gate Failed Quality Gate failed

Failed conditions
61.5% Coverage on New Code (required ≥ 85%)

See analysis details on SonarQube Cloud

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants