Skip to content

fix(container): kill orphaned in-container process on execute() timeout#31

Open
kunchenguid wants to merge 1 commit into
facebookresearch:mainfrom
kunchenguid:fix/kill-orphaned-process-on-execute-timeout
Open

fix(container): kill orphaned in-container process on execute() timeout#31
kunchenguid wants to merge 1 commit into
facebookresearch:mainfrom
kunchenguid:fix/kill-orphaned-process-on-execute-timeout

Conversation

@kunchenguid
Copy link
Copy Markdown

What's going on

When ContainerEnvironment.execute() hits its host-side subprocess.run(..., timeout=N) deadline, Python SIGKILLs the local docker exec CLI. The docker daemon does not propagate that into the container - only graceful SIGINT/SIGTERM propagate. So the test process inside the container keeps running. With the sleep-as-PID-1 cleanroom design, it never gets reaped: it just orphans onto PID 1 and keeps consuming CPU while the next docker exec call competes for the same resources.

For most workloads this is fine because tests exit cleanly. The corpus has a lot of CLI tools where they don't: TUIs that wait for input, watchers that loop until signaled, network listeners that bind forever. When the test runner invokes one of these as a subprocess and never tells it to exit, pytest blocks waiting for it, programbench's _run_step hits its outer timeout, the in-container process is orphaned, programbench retries the branch, the retry spawns another instance alongside the orphan, and so on.

What I observed

Running an 8-arm per-language evaluation over 200 tasks, this was the dominant cost on hang-prone tasks:

  • blacknon__hwatch.edfcb62 consistently spent 3600s + retry (~2 hr per arm) timing out in run_tests
  • ~14 min of CPU contention on the eval host even after the timeout returned, because the orphan inside the container kept running
  • Across 8 arms in a single stripe, the cumulative waste was ~1.5-2 hr of wall clock just on one task

The user-visible symptom from the perspective of the eval pipeline is correct (the eval returns results_read_failed and moves on), but the underlying container is left in a state where the next exec call has to fight an orphan for CPU.

The fix

On TimeoutExpired, send a follow-up:

docker exec <container> bash -c "kill -KILL -1 2>/dev/null; sleep 0.2; kill -KILL -1 2>/dev/null; true"

kill -KILL -1 from a non-PID-1 process sends SIGKILL to every other process the caller can signal in its PID namespace, except PID 1. Running it as root inside the cleanroom hits everything we spawned (the bash login shell, pytest, the test subprocess tree).

This is safe specifically because ContainerEnvironment starts the cleanroom with a long-lived sleep as PID 1 (see the existing __init__). The only non-PID-1 processes are the ones we created via the exec we just timed out on. The double-kill with a short sleep is to catch stragglers that the first round didn't reap.

The kill itself is bounded by its own 30s timeout, so a wedged docker daemon can't pin us a second time. On failure it emits a warning rather than swallowing the error silently.

Backwards compatibility

Behavior on successful runs is unchanged. On timeout, the dict returned from execute() is identical to before; only the in-container side effect is new.

How I'd test it

Smoke test against a known hang-prone task (chroma's pprof test, hwatch's TUI tests). Before this PR: after the run_tests timeout fires, docker top <container> still shows the test process running. After this PR: docker top shows only PID 1 (sleep). The next compile_env.execute(...) call inside the same evaluator completes at normal speed instead of fighting for CPU.

Happy to add a focused unit test that mocks the docker exec and asserts the second exec is issued on TimeoutExpired - let me know if you want that in this PR or a follow-up.

When subprocess.run([docker exec ...], timeout=N) hits TimeoutExpired,
Python SIGKILLs the local docker exec CLI. The docker daemon does not
propagate that into the container (only SIGINT/SIGTERM propagate
gracefully), so the test process spawned inside the container keeps
running. With sleep-as-PID-1 cleanrooms it never gets reaped: it just
orphans onto PID 1 and keeps burning CPU while the next docker exec
call competes for the same resources.

On a multi-arm evaluation run over the full task set, this surfaces as:
a TUI-style task hits the 3600s outer timeout, returns rc=-1,
programbench retries the branch, the retry SIGKILLs another docker
exec, the second orphan accumulates alongside the first, and so on.
Observed cost on one run: ~2 hr wasted per arm per hang-prone task,
plus measurable CPU contention slowing surrounding evals in the same
stripe.

Fix: on TimeoutExpired, send a follow-up
  docker exec <c> bash -c "kill -KILL -1; sleep 0.2; kill -KILL -1; true"
which signals every non-PID-1 process in the container's PID namespace.
Safe here because ContainerEnvironment starts cleanrooms with a long-
lived sleep as PID 1; the only non-PID-1 processes are the test
subprocess tree we just spawned. Bounded with a 30s timeout on the
kill exec itself, and warns if it fails so the failure is debuggable
instead of silent.

Behavior on successful runs is unchanged.
@meta-cla
Copy link
Copy Markdown

meta-cla Bot commented May 23, 2026

Hi @kunchenguid!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Meta Open Source bot. label May 23, 2026
@meta-cla
Copy link
Copy Markdown

meta-cla Bot commented May 23, 2026

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Meta Open Source bot.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant