fix(container): kill orphaned in-container process on execute() timeout#31
Conversation
When subprocess.run([docker exec ...], timeout=N) hits TimeoutExpired, Python SIGKILLs the local docker exec CLI. The docker daemon does not propagate that into the container (only SIGINT/SIGTERM propagate gracefully), so the test process spawned inside the container keeps running. With sleep-as-PID-1 cleanrooms it never gets reaped: it just orphans onto PID 1 and keeps burning CPU while the next docker exec call competes for the same resources. On a multi-arm evaluation run over the full task set, this surfaces as: a TUI-style task hits the 3600s outer timeout, returns rc=-1, programbench retries the branch, the retry SIGKILLs another docker exec, the second orphan accumulates alongside the first, and so on. Observed cost on one run: ~2 hr wasted per arm per hang-prone task, plus measurable CPU contention slowing surrounding evals in the same stripe. Fix: on TimeoutExpired, send a follow-up docker exec <c> bash -c "kill -KILL -1; sleep 0.2; kill -KILL -1; true" which signals every non-PID-1 process in the container's PID namespace. Safe here because ContainerEnvironment starts cleanrooms with a long- lived sleep as PID 1; the only non-PID-1 processes are the test subprocess tree we just spawned. Bounded with a 30s timeout on the kill exec itself, and warns if it fails so the failure is debuggable instead of silent. Behavior on successful runs is unchanged.
|
Hi @kunchenguid! Thank you for your pull request and welcome to our community. Action RequiredIn order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you. ProcessIn order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks! |
|
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks! |
What's going on
When
ContainerEnvironment.execute()hits its host-sidesubprocess.run(..., timeout=N)deadline, Python SIGKILLs the localdocker execCLI. The docker daemon does not propagate that into the container - only graceful SIGINT/SIGTERM propagate. So the test process inside the container keeps running. With the sleep-as-PID-1 cleanroom design, it never gets reaped: it just orphans onto PID 1 and keeps consuming CPU while the nextdocker execcall competes for the same resources.For most workloads this is fine because tests exit cleanly. The corpus has a lot of CLI tools where they don't: TUIs that wait for input, watchers that loop until signaled, network listeners that bind forever. When the test runner invokes one of these as a subprocess and never tells it to exit, pytest blocks waiting for it, programbench's
_run_stephits its outer timeout, the in-container process is orphaned, programbench retries the branch, the retry spawns another instance alongside the orphan, and so on.What I observed
Running an 8-arm per-language evaluation over 200 tasks, this was the dominant cost on hang-prone tasks:
blacknon__hwatch.edfcb62consistently spent 3600s + retry (~2 hr per arm) timing out inrun_testsThe user-visible symptom from the perspective of the eval pipeline is correct (the eval returns
results_read_failedand moves on), but the underlying container is left in a state where the next exec call has to fight an orphan for CPU.The fix
On
TimeoutExpired, send a follow-up:kill -KILL -1from a non-PID-1 process sends SIGKILL to every other process the caller can signal in its PID namespace, except PID 1. Running it as root inside the cleanroom hits everything we spawned (the bash login shell, pytest, the test subprocess tree).This is safe specifically because
ContainerEnvironmentstarts the cleanroom with a long-livedsleepas PID 1 (see the existing__init__). The only non-PID-1 processes are the ones we created via the exec we just timed out on. The double-kill with a short sleep is to catch stragglers that the first round didn't reap.The kill itself is bounded by its own 30s timeout, so a wedged docker daemon can't pin us a second time. On failure it emits a warning rather than swallowing the error silently.
Backwards compatibility
Behavior on successful runs is unchanged. On timeout, the dict returned from
execute()is identical to before; only the in-container side effect is new.How I'd test it
Smoke test against a known hang-prone task (chroma's pprof test, hwatch's TUI tests). Before this PR: after the run_tests timeout fires,
docker top <container>still shows the test process running. After this PR:docker topshows only PID 1 (sleep). The nextcompile_env.execute(...)call inside the same evaluator completes at normal speed instead of fighting for CPU.Happy to add a focused unit test that mocks the docker exec and asserts the second exec is issued on TimeoutExpired - let me know if you want that in this PR or a follow-up.