## Problem#710
Open
copybara-service[bot] wants to merge 1 commit intomainfrom
Open
Conversation
6dcb6c9 to
c3919b5
Compare
When concurrent_execute / concurrent_map workers are blocked in synchronous I/O (e.g., HTTP calls to Gemini), future.cancel() on timeout cannot reclaim the worker threads. The finally-block executor.shutdown(wait=False, cancel_futures=True) returns immediately, leaving orphan threads behind. Across the langfun Gemini retry loop (num_attempts=10) this leaks ~211-224 threads per inner action and eventually exhausts the LM max_concurrency semaphore, deadlocking ParallelRunner. ## Fix Add opt-in shutdown_wait_on_timeout: bool = False parameter to both concurrent_execute and concurrent_map. When True, the finally-block shutdown waits for in-flight workers to finish, preventing orphan accumulation. Default False preserves existing behavior — strictly opt-in, zero behavior change for existing callers. ## Evidence (from prior smoke validation, cl/912681428 source) - Baseline: 2 threads -> after fix: 1 thread (smoke test). - swe_bench_timeout_test PASS in 132.3s. - Thread count bounded 206-218 across a 4h26m benchmark run. ## Adversarial test coverage (concurrent_shutdown_test.py, 10 tests) - (a) Baseline cleanup across 5 timed-out tasks. - (b) Backward-compat: default False still leaks (proves opt-in semantics). - (c) Worker ignoring cancellation: bounded-hang documentation. - (d) Worker raising during shutdown: no leak, no swallowed exception. - (e) Concurrent shutdown calls: no deadlock. - (f) Nested executor pools: both layers clean up. - (g) High-fanout stress (50 workers, mixed): bounded. - (h) num_attempts retry interaction: bounded across 10 retries. - (i) Resource fault injection on shutdown: graceful. - (j) GC + weakref: executor collectable post-shutdown. ## Backward compatibility Default shutdown_wait_on_timeout=False — all existing callers see byte-identical behavior. Test (b) explicitly asserts the unchanged-default leak signature is preserved, guaranteeing the fix is purely additive. PiperOrigin-RevId: 913061535
c3919b5 to
2ce4eb1
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
When concurrent_execute / concurrent_map workers are blocked in synchronous I/O
(e.g., HTTP calls to Gemini), future.cancel() on timeout cannot reclaim the
worker threads. The finally-block executor.shutdown(wait=False, cancel_futures=True)
returns immediately, leaving orphan threads behind. Across the langfun Gemini
retry loop (num_attempts=10) this leaks ~211-224 threads per inner action and
eventually exhausts the LM max_concurrency semaphore, deadlocking ParallelRunner.
Fix
Add opt-in shutdown_wait_on_timeout: bool = False parameter to both
concurrent_execute and concurrent_map. When True, the finally-block shutdown
waits for in-flight workers to finish, preventing orphan accumulation. Default
False preserves existing behavior — strictly opt-in, zero behavior change for
existing callers.
Evidence (from prior smoke validation, cl/912681428 source)
Adversarial test coverage (concurrent_shutdown_test.py, 10 tests)
Backward compatibility
Default shutdown_wait_on_timeout=False — all existing callers see byte-identical
behavior. Test (b) explicitly asserts the unchanged-default leak signature is
preserved, guaranteeing the fix is purely additive.