reap torchrun agent + workers when extraction is interrupted #131

Merged
clemsgrs merged 1 commit into main from reap-torchrun-workers-on-interrupt on May 30, 2026

@clemsgrs clemsgrs commented May 30, 2026

Problem

run_torchrun_worker (slide2vec/runtime/distributed.py) Popens the
torch.distributed.run agent and polls it, but had no teardown. If the
poll loop was interrupted (Ctrl-C, or a programmatic kill of the parent), the
parent unwound while the agent, and the GPU-holding workers it manages, kept
running as orphans. Killing only the workers didn't help: the surviving
elastic agent respawned them. This bit during downstream agent-triggered
extraction, where an interrupted run left orphaned workers pinning GPU memory.

Fix

  • Launch the agent with start_new_session=True so it leads its own session /
    process group, with its workers inside it.
  • Add terminate_process_group(): SIGTERM the whole group → wait
    grace_seconds → SIGKILL. Killing the agent is what prevents elastic
    respawn.
  • Wrap the poll/wait body in try/finally that calls it: it fires on
    KeyboardInterrupt and on the failure RuntimeError, and is a no-op on the
    normal path (agent already exited).
  • Install a SIGTERM → KeyboardInterrupt handler (main thread only, restored
    in finally) so a programmatic kill <parent> also triggers teardown, not
    just Ctrl-C.
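
For readers who want to see the shape of the first three bullets, here is a minimal sketch. It is not the code merged in this PR. The wrapper name run_and_reap, the poll_interval parameter, and the 10-second default for grace_seconds are assumptions made for illustration, and the sketch is POSIX-only because it relies on os.killpg. As a simplification it escalates to SIGKILL only when the group leader is still alive after the grace period.

```python
import os
import signal
import subprocess
import time


def terminate_process_group(proc: subprocess.Popen, grace_seconds: float = 10.0) -> None:
    """SIGTERM the child's whole process group, wait, then SIGKILL it."""
    if proc.poll() is not None:
        return  # normal path: the leader already exited, nothing to do
    # The child was started with start_new_session=True, so it leads its own
    # session and process group and its pgid equals its pid.
    pgid = proc.pid
    try:
        os.killpg(pgid, signal.SIGTERM)
    except ProcessLookupError:
        return  # the group vanished between poll() and killpg()
    try:
        proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        try:
            os.killpg(pgid, signal.SIGKILL)
        except ProcessLookupError:
            pass
        proc.wait()  # reap the leader so it does not linger as a zombie


def run_and_reap(cmd: list[str], poll_interval: float = 0.1) -> int:
    """Hypothetical stand-in for the poll loop: launch, poll, always tear down."""
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        while proc.poll() is None:
            time.sleep(poll_interval)
        if proc.returncode != 0:
            raise RuntimeError(f"command failed with exit code {proc.returncode}")
        return proc.returncode
    finally:
        # Runs on KeyboardInterrupt and on the RuntimeError above;
        # returns immediately when the child has already exited.
        terminate_process_group(proc)
```

Signalling the group rather than individual workers matters here because the agent is a member of that group, so it receives the same signal and cannot respawn the workers.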

run_torchrun_worker Popen'd the torch.distributed.run agent but had no
teardown: an interrupt in the poll loop unwound the parent while the agent
and its GPU-holding workers kept running as orphans, and killing only the
workers let the elastic agent respawn them.

Launch the agent in its own session (start_new_session=True) and add
terminate_process_group(), which SIGTERMs the whole group, waits, then
SIGKILLs. Wrap the poll/wait body in try/finally so Ctrl-C (KeyboardInterrupt)
tears the group down; it is a no-op on the normal path. Also install a
SIGTERM handler that raises KeyboardInterrupt (main thread only, restored in
finally) so a programmatic kill of the parent triggers the same teardown,
not just Ctrl-C.
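
The SIGTERM handling described above could look roughly like the following. This is a sketch under assumptions, not the merged code: the context-manager form and the name sigterm_as_keyboard_interrupt are hypothetical, chosen only to show the "install on the main thread, restore in finally" pattern.

```python
import signal
import threading
from contextlib import contextmanager


def _raise_keyboard_interrupt(signum, frame):
    # Turn SIGTERM into the same exception that Ctrl-C produces.
    raise KeyboardInterrupt


@contextmanager
def sigterm_as_keyboard_interrupt():
    """Temporarily make SIGTERM raise KeyboardInterrupt (main thread only)."""
    # signal.signal() can only be called from the main thread, so other
    # threads run the body without installing anything.
    if threading.current_thread() is not threading.main_thread():
        yield
        return
    previous = signal.signal(signal.SIGTERM, _raise_keyboard_interrupt)
    try:
        yield
    finally:
        # previous is None when the old handler was not installed from Python;
        # in that case there is nothing that can be restored from here.
        if previous is not None:
            signal.signal(signal.SIGTERM, previous)
```

Wrapping the poll loop in this context manager means a `kill <parent>` surfaces as KeyboardInterrupt inside the loop, so the same try/finally teardown runs as for Ctrl-C.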

Verified with real torch.distributed.run: SIGINT (foreground), SIGTERM, and
normal completion all leave zero agent/worker processes behind.
@clemsgrs clemsgrs changed the title from "Reap torchrun agent + workers when extraction is interrupted" to "reap torchrun agent + workers when extraction is interrupted" May 30, 2026
@clemsgrs clemsgrs merged commit 1db3f4d into main May 30, 2026
3 checks passed
@clemsgrs clemsgrs deleted the reap-torchrun-workers-on-interrupt branch May 30, 2026 20:04