AP-25628: add checkpoint/restore (CRaC support in executor) #88
bernd-wiswedel wants to merge 1 commit into master
Conversation
Pull request overview
Adds a CRaC (Coordinated Restore at Checkpoint) hook to the KNIME Python gateway tracking layer so Python processes are terminated before the JVM is checkpointed, aiming to improve executor startup/restore behavior.
Changes:
- Register a `PhasedInit` callback in `PythonGatewayTracker` to run cleanup before checkpointing.
- Reuse existing gateway cleanup logic (`clear()`) to forcefully terminate tracked Python gateways/processes.
    // Support CRaC (Coordinated Restore at Checkpoint) and close all connections prior checkpointing
    PhasedInitSupport.registerOrActivate(new PhasedInit<RuntimeException>() {
        @Override
        public void beforeCheckpoint() throws RuntimeException {
            try {
                clear();
            } catch (IOException ex) {
                LOGGER.warn("Error when forcefully terminating Python processes during phased initialization", ex);
            }
This is valid and should be changed:
- `clear` should accept a `Consumer` that allows the log message to be customizable by the caller
- Exception should fall through
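One possible reading of this suggestion, as a minimal sketch: `clear` takes a `Consumer<String>` so the caller decides how the message is reported, and any `IOException` propagates instead of being logged and swallowed. `GatewayTrackerSketch`, `Gateway`, `add` and `size` are hypothetical stand-ins, not the real `PythonGatewayTracker` API, and the message text is made up for illustration.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical stand-in for PythonGatewayTracker; not the real KNIME class. */
final class GatewayTrackerSketch {

    /** Hypothetical stand-in for a tracked Python gateway. */
    interface Gateway {
        void close() throws IOException;
    }

    private final List<Gateway> m_openGateways = new ArrayList<>();

    void add(final Gateway gateway) {
        m_openGateways.add(gateway);
    }

    int size() {
        return m_openGateways.size();
    }

    /**
     * Closes all tracked gateways. The caller supplies the sink for the message,
     * so the wording and log level are decided at the call site.
     * IOExceptions are not logged here; they fall through to the caller.
     */
    void clear(final Consumer<String> messageSink) throws IOException {
        if (m_openGateways.isEmpty()) {
            return;
        }
        messageSink.accept(m_openGateways.size() + " Python gateway(s) still open");
        IOException failure = null;
        for (final Gateway gateway : m_openGateways) {
            try {
                gateway.close();
            } catch (IOException ex) {
                // Keep closing the remaining gateways; remember the first failure.
                if (failure == null) {
                    failure = ex;
                } else {
                    failure.addSuppressed(ex);
                }
            }
        }
        m_openGateways.clear();
        if (failure != null) {
            throw failure;
        }
    }
}
```

Under these assumptions a checkpoint hook could call `clear(msg -> LOGGER.warn(msg))` while another caller passes a different sink. Collecting the first failure and attaching the rest as suppressed exceptions is a design choice of this sketch, not something the review asks for.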
    PhasedInitSupport.registerOrActivate(new PhasedInit<RuntimeException>() {
        @Override
        public void beforeCheckpoint() throws RuntimeException {
            try {
                clear();
            } catch (IOException ex) {
                LOGGER.warn("Error when forcefully terminating Python processes during phased initialization", ex);
            }
        }
    });
These are race conditions on checkpoint/restore, which should not happen by design. Restore will be a controlled phase, and checkpoint will only happen once the init of all bundles has stabilized (but before any workflows have been executed). So, leaving this unchanged!
    try {
        clear();
    } catch (IOException ex) {
        LOGGER.warn("Error when forcefully terminating Python processes during phased initialization", ex);
Agree. Let's rethrow and fix later, if at all needed.
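A minimal sketch of what rethrowing could look like: because the callback in the diff is declared with `RuntimeException`, a checked `IOException` has to be wrapped, here in `java.io.UncheckedIOException`. `CheckpointHook`, `Cleanup` and `hookFor` are hypothetical names that only mirror the shape of the `PhasedInit` callback shown in the diff; the real interface may differ.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

final class RethrowSketch {

    /** Hypothetical stand-in mirroring the shape of the PhasedInit callback in the diff. */
    interface CheckpointHook<E extends Exception> {
        void beforeCheckpoint() throws E;
    }

    /** Hypothetical stand-in for a cleanup step such as clear() that may fail with an IOException. */
    interface Cleanup {
        void run() throws IOException;
    }

    /** Builds a hook that rethrows cleanup failures instead of logging and swallowing them. */
    static CheckpointHook<RuntimeException> hookFor(final Cleanup cleanup) {
        return () -> {
            try {
                cleanup.run();
            } catch (IOException ex) {
                // Wrap the checked exception so it can pass through a hook typed on RuntimeException.
                throw new UncheckedIOException(
                    "Error when forcefully terminating Python processes before checkpoint", ex);
            }
        };
    }
}
```

With this shape a failed cleanup aborts the callback and surfaces to whoever triggered the checkpoint, with the original `IOException` available via `getCause()`.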
    private PythonGatewayTracker() {
        m_openGateways = gatewaySet();
        // Support CRaC (Coordinated Restore at Checkpoint) and close all connections prior checkpointing
8479e06 to dd8bd9d
AP-25628 (PoC: "CRaC" for faster executor startup (suspend VM after start))
dd8bd9d to ed41788