Skip to content

fix(pool): replace yield_now with exponential backoff in acquire retry loop#4185

Open
rubenfiszel wants to merge 3 commits intolaunchbadge:mainfrom
rubenfiszel:fix/pool-acquire-spin-loop
Open

fix(pool): replace yield_now with exponential backoff in acquire retry loop#4185
rubenfiszel wants to merge 3 commits intolaunchbadge:mainfrom
rubenfiszel:fix/pool-acquire-spin-loop

Conversation

@rubenfiszel
Copy link

@rubenfiszel rubenfiszel commented Mar 11, 2026

Summary

  • Fix 100% CPU spin loop in PoolInner::acquire() caused by "phantom permits" — semaphore permits that don't correspond to idle connections or available pool slots
  • Replace yield_now().await (returns almost immediately) with bounded exponential backoff (1ms256ms max) in the retry path where idle queue is empty and pool is at max_connections
  • Most visible on ARM/aarch64 where weak memory ordering widens the race window, but the bug exists on all architectures

Root cause

When a connection is discarded (expired via max_lifetime, errored) while checked out, DecrementSizeGuard::drop() decrements pool size and releases a semaphore permit. A waiting task immediately acquires the permit, but pop_idle() returns None (the connection was never returned to the idle queue) and try_increment_size() fails (other connections still checked out push size back to max). The permit is dropped and re-released — it's now a phantom that causes infinite retries.

On ARM/aarch64, the weak memory model can also cause semaphore.release() in the release() path to become visible to other cores before idle_conns.push(), creating phantom permits even during normal connection return.

Reproducing the issue

The test synthetically injects phantom permits (semaphore has permits, idle queue empty, size == max) — the exact state observed via GDB on aarch64 — then counts retry iterations:

Code path Retries in 2s CPU behavior
yield_now() (old) 928,904 100% CPU spin
sleep(backoff) (new) ~15 sleeping between retries
cargo test -p sqlx-core --features _rt-tokio,any phantom_permit

The test fails deterministically with the old yield_now code on any architecture, not just ARM.

rubenfiszel and others added 3 commits March 11, 2026 22:03
…y loop

On ARM/aarch64, the pool's acquire() loop can spin at 100% CPU when
"phantom permits" accumulate — the semaphore has permits but the idle
queue is empty and the pool is at max_connections. This happens when
connections are discarded (expired/errored) while checked out: the
DecrementSizeGuard releases a semaphore permit but the connection is
never returned to the idle queue.

The previous yield_now() returns almost immediately, causing a tight
spin. Replace it with bounded exponential backoff (1ms to 256ms) to
break the spin while giving checked-out connections time to be returned.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds two unit tests that synthetically reproduce the phantom permit
condition observed on ARM/aarch64. The tests inject extra semaphore
permits (without corresponding idle connections) into a pool at
max_connections, then verify that acquire() times out cleanly via
exponential backoff rather than spinning at 100% CPU.

Run with: cargo test -p sqlx-core --features _rt-tokio,any phantom_permit

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds a unit test that synthetically reproduces the phantom permit
condition observed on ARM/aarch64. The test injects extra semaphore
permits (without corresponding idle connections) into a pool at
max_connections, then verifies that acquire() retries are bounded
by exponential backoff.

Key result:
- With yield_now (old code): 928,904 retries in 2s (100% CPU spin)
- With backoff (new code): ~15 retries in 2s (sleeping between retries)

The test uses a #[cfg(test)] retry counter on PoolInner to directly
assert on iteration count, making it fail deterministically with the
old yield_now code on any architecture.

Run with: cargo test -p sqlx-core --features _rt-tokio,any phantom_permit

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant