
Early yield on 429 throttling on barrier requests #48914

Open: mbhaskar wants to merge 6 commits into Azure:main from mbhaskar:early-yield-on-429-throttling


Conversation

@mbhaskar
Member

@mbhaskar mbhaskar commented Apr 23, 2026

Description

This PR introduces early yield on 429s during barrier requests.
When 429s are received under strong consistency, the quorum reader/writer code does not yield early enough, which creates multiple stack traces and results in resource constraints on the client side.

All SDK Contribution checklist:

  • The pull request does not introduce [breaking changes]
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

  • Pull request includes test coverage for the included changes.

Copilot AI review requested due to automatic review settings April 23, 2026 18:04
@mbhaskar mbhaskar requested review from a team and kirankumarkolli as code owners April 23, 2026 18:04
@mbhaskar mbhaskar changed the title Early yield on 429 throttling Early yield on 429 throttling on barrier requests Apr 23, 2026
Contributor

Copilot AI left a comment


Pull request overview

This PR updates Cosmos direct connectivity quorum/barrier logic to “yield early” when replica reads are uniformly throttled (HTTP 429), allowing the existing ResourceThrottleRetryPolicy to apply appropriate backoff instead of progressing into additional quorum/primary/barrier attempts.

Changes:

  • Add StoreResult.isThrottledException to cheaply detect 429 responses.
  • In QuorumReader, propagate 429 immediately when all collected replica results are throttled (including barrier paths).
  • In ConsistencyWriter, track throttling during write barriers and, when retries are exhausted and the last attempt was fully throttled, throw a RequestTimeoutException with a new substatus code; add unit tests for the new behaviors.
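
As a rough illustration of the first two bullets, the sketch below precomputes a throttling flag per replica result and checks whether every collected result is a 429. `ReplicaResultSketch` and `allThrottled` are hypothetical names, not the SDK's `StoreResult` API, and treating an empty result list as "not throttled" is an assumption.

```java
import java.util.List;

// Hypothetical stand-in for a replica result; this is not the SDK's StoreResult.
final class ReplicaResultSketch {
    static final int TOO_MANY_REQUESTS = 429;

    final int statusCode;
    // Computed once at construction so later checks are a plain field read.
    final boolean isThrottledException;

    ReplicaResultSketch(int statusCode) {
        this.statusCode = statusCode;
        this.isThrottledException = statusCode == TOO_MANY_REQUESTS;
    }

    // Assumption: an empty result list does not count as "all throttled".
    static boolean allThrottled(List<ReplicaResultSketch> results) {
        if (results.isEmpty()) {
            return false;
        }
        for (ReplicaResultSketch result : results) {
            if (!result.isThrottledException) {
                return false;
            }
        }
        return true;
    }
}
```

A caller would propagate the 429 only when `allThrottled` returns true; a single non-throttled result keeps the normal quorum path.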

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/directconnectivity/StoreResult.java | Adds a computed flag to identify throttling (429) on replica results. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/directconnectivity/QuorumReader.java | Early-yields on replica-wide throttling to let throttle retry policy handle backoff. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/directconnectivity/ConsistencyWriter.java | Tracks throttling during write barriers and surfaces a distinct timeout substatus when retries are exhausted. |
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/HttpConstants.java | Introduces a new substatus code for write-barrier throttling exhaustion. |
| sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/directconnectivity/QuorumReaderTest.java | Adds unit tests covering 429 propagation and Gone+429 interactions. |
| sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/directconnectivity/ConsistencyWriterTest.java | Adds unit tests for write-barrier behavior under sustained throttling and mixed outcomes. |

@xinlian12
Member

@sdkReviewAgent

@xinlian12
Member

Review complete (49:16)

Posted 1 inline comment(s).

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage

@mbhaskar
Member Author

/azp run java - cosmos - tests

@azure-pipelines

No pipelines are associated with this pull request.

@mbhaskar
Member Author

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

mbhaskar and others added 6 commits May 27, 2026 14:57
Port of .NET PR #1667829: When receiving repeated 429 (Too Many Requests)
responses with strong consistency, QuorumReader and ConsistencyWriter now
handle throttling more efficiently.

QuorumReader (reads):
- waitForReadBarrierAsync: yield early when all replicas return 429 in both
  single-region and multi-region barrier loops
- ensureQuorumSelectedStoreResponse: yield early when all replicas throttled
  during initial quorum read
- All cases throw the 429 exception to let ResourceThrottleRetryPolicy
  handle retry with appropriate backoff
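
The read-side behavior described above can be sketched roughly as follows. This is a simplified model, not the actual QuorumReader code: `ReadBarrierSketch`, `ThrottledException`, and the `headRequest` callback are hypothetical, and "a replica returned 200" stands in for the real LSN comparison that decides whether the barrier is met.

```java
import java.util.List;
import java.util.function.IntFunction;

final class ReadBarrierSketch {
    // Hypothetical exception standing in for the propagated 429.
    static final class ThrottledException extends RuntimeException {
        final int attemptsMade;

        ThrottledException(int attemptsMade) {
            super("all contacted replicas returned 429");
            this.attemptsMade = attemptsMade;
        }
    }

    // headRequest maps an attempt index to the status codes of the replicas
    // contacted in that attempt. Simplification: a 200 means that replica
    // satisfied the barrier; the real check compares LSNs.
    static boolean waitForReadBarrier(IntFunction<List<Integer>> headRequest, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            List<Integer> statuses = headRequest.apply(attempt);
            boolean allThrottled = !statuses.isEmpty();
            boolean barrierMet = false;
            for (int status : statuses) {
                allThrottled &= status == 429;
                barrierMet |= status == 200;
            }
            if (allThrottled) {
                // Yield early: surface the 429 so the caller's retry policy backs off.
                throw new ThrottledException(attempt + 1);
            }
            if (barrierMet) {
                return true;
            }
        }
        return false; // barrier not met within the attempt budget
    }
}
```

In this model a fully throttled first attempt ends the loop after one round of requests instead of consuming the whole attempt budget, while a partially throttled attempt can still satisfy the barrier.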

ConsistencyWriter (writes):
- waitForWriteBarrierAsync: track lastAttemptWasThrottled flag per iteration
- Do NOT yield early (preserves idempotency guarantees)
- When all retries exhausted due to consistent throttling, throw
  RequestTimeoutException (408) with substatus SERVER_WRITE_BARRIER_THROTTLED
  (21013) instead of returning barrier-not-met
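
For contrast with the read path, here is a minimal sketch of the write-side behavior under the same simplified model. `WriteBarrierSketch` and `BarrierTimeoutException` are hypothetical; only the 408 status, the `SERVER_WRITE_BARRIER_THROTTLED` name, and the 21013 value come from the description above.

```java
import java.util.List;
import java.util.function.IntFunction;

final class WriteBarrierSketch {
    static final int REQUEST_TIMEOUT = 408;
    static final int SERVER_WRITE_BARRIER_THROTTLED = 21013;

    // Hypothetical exception carrying a status code and a substatus code.
    static final class BarrierTimeoutException extends RuntimeException {
        final int statusCode;
        final int subStatusCode;

        BarrierTimeoutException(int statusCode, int subStatusCode) {
            super("write barrier retries exhausted; last attempt was throttled");
            this.statusCode = statusCode;
            this.subStatusCode = subStatusCode;
        }
    }

    // Simplification: a 200 means the barrier is met; the real check compares LSNs.
    static boolean waitForWriteBarrier(IntFunction<List<Integer>> headRequest, int maxAttempts) {
        boolean lastAttemptWasThrottled = false;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            List<Integer> statuses = headRequest.apply(attempt);
            boolean allThrottled = !statuses.isEmpty();
            boolean barrierMet = false;
            for (int status : statuses) {
                allThrottled &= status == 429;
                barrierMet |= status == 200;
            }
            if (barrierMet) {
                return true;
            }
            // Overwritten every iteration, so the flag describes only the latest
            // attempt. Unlike the read path, the loop does not exit early on 429.
            lastAttemptWasThrottled = allThrottled;
        }
        if (lastAttemptWasThrottled) {
            throw new BarrierTimeoutException(REQUEST_TIMEOUT, SERVER_WRITE_BARRIER_THROTTLED);
        }
        return false; // plain barrier-not-met
    }
}
```

Because the flag is reassigned on each pass, a throttled attempt followed by a non-throttled one falls through to the ordinary barrier-not-met result rather than the 408.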

Other changes:
- Added isThrottledException field to StoreResult
- Added SERVER_WRITE_BARRIER_THROTTLED (21013) substatus code
- Unit tests for all throttling scenarios

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ed replica yields early

Port of .NET test ValidatesReadMultipleReplicaAsyncExcludesGoneReplicas.
Validates that when replicas return a mix of 410 (Gone) and 429 (TooManyRequests):
- Gone replicas are excluded from results by StoreReader (isValid=false for GONE)
- The 429 replica with valid LSN headers is kept (isValid=true for non-GONE with lsn>=0)
- Since all remaining replicas are throttled, early yield triggers
- The 429 exception propagates to ResourceThrottleRetryPolicy
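
The filtering sequence validated by this test can be modeled roughly as below. `GoneFilterSketch` and its `Result` type are hypothetical, and `isValid` encodes only the rule stated in the list above (false for GONE, true for non-GONE with lsn >= 0), not the full StoreReader logic.

```java
import java.util.ArrayList;
import java.util.List;

final class GoneFilterSketch {
    // Hypothetical replica result with just the fields this rule needs.
    static final class Result {
        final int statusCode;
        final long lsn;

        Result(int statusCode, long lsn) {
            this.statusCode = statusCode;
            this.lsn = lsn;
        }

        // Simplified validity rule taken from the commit message.
        boolean isValid() {
            return statusCode != 410 && lsn >= 0;
        }
    }

    // Drops invalid results (for example 410 Gone) before any throttling check.
    static List<Result> validResults(List<Result> raw) {
        List<Result> valid = new ArrayList<>();
        for (Result result : raw) {
            if (result.isValid()) {
                valid.add(result);
            }
        }
        return valid;
    }

    // Early yield applies only if something survived filtering and all of it is 429.
    static boolean shouldYieldEarly(List<Result> raw) {
        List<Result> valid = validResults(raw);
        if (valid.isEmpty()) {
            return false;
        }
        for (Result result : valid) {
            if (result.statusCode != 429) {
                return false;
            }
        }
        return true;
    }
}
```

Note that in this model a 429 without a valid LSN is filtered out as well, so it cannot trigger the early yield on its own.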

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fixes from xinlian12 (blocking):
- Fix lastAttemptWasThrottled stale state: reset flag before
  avoidQuorumSelection early return to prevent incorrect 408 when
  prior iteration was throttled but current iteration hits 410
- Fix readStrong_AllReplicasThrottled_Returns429 false positive: set LSN
  on exception so StoreReader marks isValid=true, ensuring the early yield
  path is actually exercised. Add transport invocation count assertions
  to verify primary read is NOT attempted.
- Add readStrong_BarrierRequestsThrottled_Returns429 test covering the
  waitForReadBarrierAsync barrier path (quorum succeeds, then barrier
  HEAD requests return 429)

Fixes from Copilot review:
- Fix checkstyle: add missing spaces around = operator (2 places)
- Fix log wording: 'All replicas' -> 'All contacted replicas' (more
  accurate since not all replicas may be contacted per attempt)
- Fix ConsistencyWriter log: 'consistent throttling' -> 'last attempt
  was throttled' (flag only tracks last attempt, not all attempts)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tling

Unit tests (4 new, 10 total throttling tests):
- writeBarrier_AvoidQuorumSelectionAfterThrottling_NoFalse408: validates
  lastAttemptWasThrottled reset on avoidQuorumSelection path (stale state fix)
- writeBarrier_NRegionCommit_AllReplicasThrottled_Returns408: N-region
  synchronous commit barrier throttling produces 408/21013
- readStrong_QuorumNotSelected_PrimaryThrottled_Returns429: primary 429
  propagates correctly through QuorumNotSelected → readPrimary path
- readStrong_BarrierPartialThrottle_StillSucceeds: barrier succeeds when
  one replica is throttled but other meets LSN (no false-negative yield)

Fault injection E2E tests (3 new, require strong consistency account):
- faultInjection_readBarrierThrottled_yieldsEarly: inject 429 on
  HEAD_COLLECTION + GCLSN interceptor → verify early yield on reads
- faultInjection_writeBarrierThrottled_returns408: inject 429 on
  HEAD_COLLECTION + GCLSN interceptor → verify 408 on writes
- faultInjection_readBarrierThrottled_thenRecovers: inject 429 with
  hitLimit(2) → verify read succeeds after throttle clears

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Read and write barrier requests are only triggered on multi-region strong
consistency accounts (numberOfReadRegions > 0). The emulator is single-region,
so the GCLSN interceptor never triggers barriers and the tests fail with
empty supplementalResponseStatisticsList.

Added accountLevelReadRegions.size() > 1 skip check to all three E2E fault
injection tests so they correctly skip on single-region environments.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@mbhaskar mbhaskar force-pushed the early-yield-on-429-throttling branch from 28b1fae to 3cd6163 Compare May 27, 2026 21:59

3 participants