
## Problem #710

Open

copybara-service[bot] wants to merge 1 commit into main from test_913061535


## Problem

When `concurrent_execute` / `concurrent_map` workers are blocked in synchronous I/O
(e.g., HTTP calls to Gemini), `future.cancel()` on timeout cannot reclaim the
worker threads. The finally-block `executor.shutdown(wait=False, cancel_futures=True)`
returns immediately, leaving orphan threads behind. Across the langfun Gemini
retry loop (`num_attempts=10`) this leaks ~211-224 threads per inner action and
eventually exhausts the LM `max_concurrency` semaphore, deadlocking `ParallelRunner`.
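The failure mode can be reproduced with a few lines of standard-library Python. This is a minimal sketch, not langfun code: `blocking_io` stands in for a synchronous HTTP call.

```python
import concurrent.futures
import threading
import time

def blocking_io():
    # Stand-in for a synchronous HTTP call: a sleep (like a blocked
    # socket read) cannot be interrupted from another thread.
    time.sleep(2)

executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
future = executor.submit(blocking_io)
time.sleep(0.1)  # give the worker time to start running

# Future.cancel() only succeeds on futures that have not started yet.
cancelled = future.cancel()  # False: the task is already running

# Returns immediately; the blocked worker thread is now orphaned.
executor.shutdown(wait=False, cancel_futures=True)
orphans = threading.active_count() - 1  # worker still alive after shutdown
print(cancelled, orphans)
```

Because `cancel_futures=True` only cancels *pending* work and `wait=False` skips the join, a worker mid-I/O survives the shutdown, which is exactly the per-retry leak described above.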

## Fix

Add an opt-in `shutdown_wait_on_timeout: bool = False` parameter to both
`concurrent_execute` and `concurrent_map`. When True, the finally-block shutdown
waits for in-flight workers to finish, preventing orphan accumulation. The
default of False preserves existing behavior: strictly opt-in, zero behavior
change for existing callers.
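A minimal sketch of the opt-in semantics, assuming a simplified `concurrent_execute` (the real langfun signature takes more parameters and richer timeout handling):

```python
import concurrent.futures

def concurrent_execute(fn, inputs, timeout=None, max_workers=32,
                       shutdown_wait_on_timeout=False):
    """Simplified sketch: map fn over inputs on a thread pool."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=max_workers)
    futures = [executor.submit(fn, x) for x in inputs]
    try:
        return [f.result(timeout=timeout) for f in futures]
    finally:
        # wait=True blocks until in-flight workers return, so no orphan
        # threads survive a timeout; wait=False is the legacy behavior.
        executor.shutdown(wait=shutdown_wait_on_timeout,
                          cancel_futures=True)

# Opting in: a timed-out call now joins its workers before returning.
results = concurrent_execute(lambda x: x * 2, [1, 2, 3],
                             shutdown_wait_on_timeout=True)
```

The trade-off is explicit: with the flag on, a timeout blocks until the slowest in-flight worker finishes (a bounded hang), instead of returning promptly and leaking the thread.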

## Evidence (from prior smoke validation, cl/912681428 source)

- Baseline: 2 threads -> after fix: 1 thread (smoke test).
- swe_bench_timeout_test PASS in 132.3s.
- Thread count bounded 206-218 across a 4h26m benchmark run.

## Adversarial test coverage (concurrent_shutdown_test.py, 10 tests)

- (a) Baseline cleanup across 5 timed-out tasks.
- (b) Backward-compat: default False still leaks (proves opt-in semantics).
- (c) Worker ignoring cancellation: bounded-hang documentation.
- (d) Worker raising during shutdown: no leak, no swallowed exception.
- (e) Concurrent shutdown calls: no deadlock.
- (f) Nested executor pools: both layers clean up.
- (g) High-fanout stress (50 workers, mixed): bounded.
- (h) num_attempts retry interaction: bounded across 10 retries.
- (i) Resource fault injection on shutdown: graceful.
- (j) GC + weakref: executor collectable post-shutdown.
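Test (a)'s cleanup check might take a shape like the following self-contained sketch. The test name is hypothetical, and a local stand-in replaces the patched langfun function so the example runs on its own:

```python
import concurrent.futures
import threading
import time
import unittest

def concurrent_execute(fn, inputs, timeout=None,
                       shutdown_wait_on_timeout=False):
    # Minimal local stand-in for the patched API under test.
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=8)
    futures = [executor.submit(fn, x) for x in inputs]
    try:
        return [f.result(timeout=timeout) for f in futures]
    finally:
        executor.shutdown(wait=shutdown_wait_on_timeout,
                          cancel_futures=True)

class ShutdownWaitTest(unittest.TestCase):
    # Hypothetical shape of test (a): every worker thread is reclaimed
    # after 5 timed-out tasks when shutdown_wait_on_timeout=True.
    def test_timeout_cleanup(self):
        baseline = threading.active_count()
        with self.assertRaises(concurrent.futures.TimeoutError):
            concurrent_execute(lambda _: time.sleep(0.3), range(5),
                               timeout=0.05,
                               shutdown_wait_on_timeout=True)
        # shutdown(wait=True) joined the sleeping workers before the
        # TimeoutError propagated, so the thread count is back to baseline.
        self.assertEqual(threading.active_count(), baseline)
```

Note that the `finally` clause runs before the `TimeoutError` escapes the call, which is what lets a single thread-count assertion distinguish the waited shutdown from the leaky one.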

## Backward compatibility

The default `shutdown_wait_on_timeout=False` means all existing callers see
byte-identical behavior. Test (b) explicitly asserts that the unchanged-default
leak signature is preserved, guaranteeing the fix is purely additive.


PiperOrigin-RevId: 913061535