fix(pool): replace yield_now with exponential backoff in acquire retry loop#4185
Open
rubenfiszel wants to merge 3 commits intolaunchbadge:mainfrom
Open
fix(pool): replace yield_now with exponential backoff in acquire retry loop#4185rubenfiszel wants to merge 3 commits intolaunchbadge:mainfrom
rubenfiszel wants to merge 3 commits intolaunchbadge:mainfrom
Conversation
…y loop On ARM/aarch64, the pool's acquire() loop can spin at 100% CPU when "phantom permits" accumulate — the semaphore has permits but the idle queue is empty and the pool is at max_connections. This happens when connections are discarded (expired/errored) while checked out: the DecrementSizeGuard releases a semaphore permit but the connection is never returned to the idle queue. The previous yield_now() returns almost immediately, causing a tight spin. Replace it with bounded exponential backoff (1ms to 256ms) to break the spin while giving checked-out connections time to be returned. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds two unit tests that synthetically reproduce the phantom permit condition observed on ARM/aarch64. The tests inject extra semaphore permits (without corresponding idle connections) into a pool at max_connections, then verify that acquire() times out cleanly via exponential backoff rather than spinning at 100% CPU. Run with: cargo test -p sqlx-core --features _rt-tokio,any phantom_permit Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds a unit test that synthetically reproduces the phantom permit condition observed on ARM/aarch64. The test injects extra semaphore permits (without corresponding idle connections) into a pool at max_connections, then verifies that acquire() retries are bounded by exponential backoff. Key result: - With yield_now (old code): 928,904 retries in 2s (100% CPU spin) - With backoff (new code): ~15 retries in 2s (sleeping between retries) The test uses a #[cfg(test)] retry counter on PoolInner to directly assert on iteration count, making it fail deterministically with the old yield_now code on any architecture. Run with: cargo test -p sqlx-core --features _rt-tokio,any phantom_permit Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
PoolInner::acquire()caused by "phantom permits" — semaphore permits that don't correspond to idle connections or available pool slotsyield_now().await(returns almost immediately) with bounded exponential backoff (1ms→256msmax) in the retry path where idle queue is empty and pool is atmax_connectionsRoot cause
When a connection is discarded (expired via
max_lifetime, errored) while checked out,DecrementSizeGuard::drop()decrements pool size and releases a semaphore permit. A waiting task immediately acquires the permit, butpop_idle()returnsNone(the connection was never returned to the idle queue) andtry_increment_size()fails (other connections still checked out push size back to max). The permit is dropped and re-released — it's now a phantom that causes infinite retries.On ARM/aarch64, the weak memory model can also cause
semaphore.release()in therelease()path to become visible to other cores beforeidle_conns.push(), creating phantom permits even during normal connection return.Reproducing the issue
The test synthetically injects phantom permits (semaphore has permits, idle queue empty, size == max) — the exact state observed via GDB on aarch64 — then counts retry iterations:
yield_now()(old)sleep(backoff)(new)The test fails deterministically with the old
yield_nowcode on any architecture, not just ARM.