reap torchrun agent + workers when extraction is interrupted by clemsgrs · Pull Request #131 · clemsgrs/slide2vec

clemsgrs · 2026-05-30T16:02:49Z

Problem

run_torchrun_worker (slide2vec/runtime/distributed.py) launched the
torch.distributed.run agent with Popen and polled it, but had no teardown.
If the poll loop is interrupted (Ctrl-C, or a programmatic kill of the
parent), the parent unwinds while the agent and the GPU-holding workers it
manages keep running as orphans. Killing only the workers doesn't help: the
surviving elastic agent respawns them. This surfaced during downstream
agent-triggered extraction, where an interrupted run left orphaned workers
pinning GPU memory.

Fix

- Launch the agent with start_new_session=True so it leads its own session /
  process group, with its workers inside it.
- Add terminate_process_group(): SIGTERM the whole group → wait
  grace_seconds → SIGKILL. Killing the agent is what prevents elastic
  respawn.
- Wrap the poll/wait body in a try/finally that calls it. This fires on
  KeyboardInterrupt and on the failure RuntimeError, and is a no-op on the
  normal path (agent already exited).
- Install a SIGTERM → KeyboardInterrupt handler (main thread only, restored
  in finally) so a programmatic kill <parent> also triggers teardown, not
  just Ctrl-C.
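
The first two items can be sketched roughly as follows. This is a minimal, POSIX-only illustration written from the description above, not the code merged in this PR: the launch_in_own_session helper, the default grace_seconds value, and the exact error handling are assumptions.

```python
import os
import signal
import subprocess


def launch_in_own_session(cmd):
    # Hypothetical helper: start_new_session=True makes the child a session
    # and process-group leader, so processes it spawns land in the same group.
    return subprocess.Popen(cmd, start_new_session=True)


def terminate_process_group(proc, grace_seconds=10.0):
    """SIGTERM the child's whole group, wait, then SIGKILL if still alive."""
    if proc.poll() is not None:
        return  # normal path: the group leader already exited, nothing to do
    try:
        pgid = os.getpgid(proc.pid)
        os.killpg(pgid, signal.SIGTERM)
    except ProcessLookupError:
        return  # the group vanished between poll() and the signal
    try:
        proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        # The leader ignored SIGTERM for the whole grace period: escalate.
        try:
            os.killpg(pgid, signal.SIGKILL)
        except ProcessLookupError:
            pass
        proc.wait()
```

Signalling the group rather than individual workers is the point: the agent receives the signal too, so it cannot respawn the workers it manages.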

run_torchrun_worker Popen'd the torch.distributed.run agent but had no teardown: an interrupt in the poll loop unwound the parent while the agent and its GPU-holding workers kept running as orphans, and killing only the workers let the elastic agent respawn them.

Launch the agent in its own session (start_new_session=True) and add terminate_process_group(), which SIGTERMs the whole group, waits, then SIGKILLs. Wrap the poll/wait body in try/finally so Ctrl-C (KeyboardInterrupt) tears the group down; it is a no-op on the normal path. Also install a SIGTERM handler that raises KeyboardInterrupt (main thread only, restored in finally) so a programmatic kill of the parent triggers the same teardown, not just Ctrl-C.

Verified with real torch.distributed.run: SIGINT (foreground), SIGTERM, and normal completion all leave zero agent/worker processes behind.
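
The try/finally wrapper and the SIGTERM handler might look something like the sketch below. It is an illustration under assumptions, not the merged implementation: run_with_teardown, its teardown parameter (standing in for terminate_process_group), and poll_interval are hypothetical names, and the real poll loop in run_torchrun_worker may differ.

```python
import signal
import subprocess
import threading
import time


def _raise_keyboard_interrupt(signum, frame):
    # Turn SIGTERM into the same exception Ctrl-C produces.
    raise KeyboardInterrupt


def run_with_teardown(cmd, teardown, poll_interval=0.2):
    """Poll a child launched in its own session; always call teardown(proc)."""
    proc = subprocess.Popen(cmd, start_new_session=True)
    # signal.signal() may only be called from the main thread.
    in_main = threading.current_thread() is threading.main_thread()
    previous = None
    if in_main:
        previous = signal.signal(signal.SIGTERM, _raise_keyboard_interrupt)
    try:
        while proc.poll() is None:
            time.sleep(poll_interval)
        if proc.returncode != 0:
            raise RuntimeError(f"child exited with code {proc.returncode}")
        return proc.returncode
    finally:
        # Runs on KeyboardInterrupt, on the RuntimeError, and on success.
        teardown(proc)
        # previous is None if the old handler was not installed from Python.
        if in_main and previous is not None:
            signal.signal(signal.SIGTERM, previous)
```

Routing SIGTERM through KeyboardInterrupt keeps a single teardown path: the finally clause does not need to know which signal interrupted the loop.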

clemsgrs changed the title from "Reap torchrun agent + workers when extraction is interrupted" to "reap torchrun agent + workers when extraction is interrupted" on May 30, 2026

clemsgrs merged commit 1db3f4d into main May 30, 2026
3 checks passed

clemsgrs deleted the reap-torchrun-workers-on-interrupt branch May 30, 2026 20:04
