reap torchrun agent + workers when extraction is interrupted #131
Merged
run_torchrun_worker Popen'd the torch.distributed.run agent but had no teardown: an interrupt in the poll loop unwound the parent while the agent and its GPU-holding workers kept running as orphans, and killing only the workers let the elastic agent respawn them. Launch the agent in its own session (start_new_session=True) and add terminate_process_group(), which SIGTERMs the whole group, waits, then SIGKILLs. Wrap the poll/wait body in try/finally so Ctrl-C (KeyboardInterrupt) tears the group down; it is a no-op on the normal path. Also install a SIGTERM handler that raises KeyboardInterrupt (main thread only, restored in finally) so a programmatic kill of the parent triggers the same teardown, not just Ctrl-C. Verified with real torch.distributed.run: SIGINT (foreground), SIGTERM, and normal completion all leave zero agent/worker processes behind.
Problem
run_torchrun_worker (slide2vec/runtime/distributed.py) Popens the torch.distributed.run agent and polls it, but had no teardown. If the poll loop is interrupted (Ctrl-C, or a programmatic kill of the parent), the
parent unwinds while the agent — and the GPU-holding workers it manages — keep
running as orphans. Killing only the workers doesn't help: the surviving
elastic agent respawns them. This bit during downstream agent-triggered
extraction, where an interrupted run left orphaned workers pinning GPU memory.
Fix
Launch the agent with start_new_session=True so it leads its own session / process group, with its workers inside it.
Add terminate_process_group(): SIGTERM the whole group → wait grace_seconds → SIGKILL. Killing the agent is what prevents elastic respawn.
Wrap the poll/wait body in a try/finally that calls it: it fires on KeyboardInterrupt and on the failure RuntimeError, and is a no-op on the normal path (agent already exited).
Install a SIGTERM → KeyboardInterrupt handler (main thread only, restored in finally) so a programmatic kill <parent> also triggers teardown, not just Ctrl-C.
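For readers who want to see how the four pieces fit together, here is a minimal POSIX-only sketch. It is not the code from this PR: terminate_process_group and grace_seconds are names taken from the description, but the signature and body shown here are assumptions, and run_with_teardown, _raise_keyboard_interrupt and poll_interval are hypothetical names used only for illustration. The sketch runs an arbitrary command rather than the real torch.distributed.run agent.

```python
import os
import signal
import subprocess
import sys
import threading
import time


def terminate_process_group(proc, grace_seconds=5.0):
    """SIGTERM the child's whole process group, wait, then SIGKILL it."""
    if proc.poll() is not None:
        return  # normal path: the child already exited, nothing to do
    # With start_new_session=True the child is a session and group leader,
    # so its pid is also its process group id.
    pgid = proc.pid
    try:
        os.killpg(pgid, signal.SIGTERM)
    except ProcessLookupError:
        return
    try:
        proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        try:
            os.killpg(pgid, signal.SIGKILL)
        except ProcessLookupError:
            pass
        proc.wait()


def _raise_keyboard_interrupt(signum, frame):
    # Turn SIGTERM into the same exception Ctrl-C raises.
    raise KeyboardInterrupt


def run_with_teardown(cmd, poll_interval=0.1, grace_seconds=5.0):
    """Hypothetical stand-in for the poll loop: run cmd and reap its group."""
    # Signal handlers can only be installed from the main thread.
    on_main = threading.current_thread() is threading.main_thread()
    previous = None
    proc = None
    try:
        if on_main:
            previous = signal.signal(signal.SIGTERM, _raise_keyboard_interrupt)
        # The child leads its own session, so its descendants share its group.
        proc = subprocess.Popen(cmd, start_new_session=True)
        while proc.poll() is None:
            time.sleep(poll_interval)
        if proc.returncode != 0:
            raise RuntimeError(f"command exited with code {proc.returncode}")
        return proc.returncode
    finally:
        if proc is not None:
            terminate_process_group(proc, grace_seconds)
        if on_main:
            # signal.signal returns None for handlers not installed from Python.
            signal.signal(
                signal.SIGTERM,
                previous if previous is not None else signal.SIG_DFL,
            )


if __name__ == "__main__":
    run_with_teardown([sys.executable, "-c", "print('done')"])
```

Signalling the group rather than individual worker pids is the point of the design: the group leader (the agent, in the PR's case) receives the signal together with its children, so nothing is left alive to respawn them.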