Improve detection of connect/handshake failures and the retry-policy by mgravell · Pull Request #3038 · StackExchange/StackExchange.Redis

mgravell · 2026-03-19T11:53:47Z

This PR investigates and resolves #3036, but also addresses some other related failure modes.

The fundamental issue in ^^^ was:

in PhysicalBridge, when invoking the retry-policy, we pass a count of how many retries we've already done, which is used by the policy, for example to implement exponential decay
this counter is reset when the shouldResetConnectionRetryCount flag is true
the code was incorrectly setting this flag to true even in "establishing" cases, due to an otherwise-intentional case fall-thru

So the simple part of this fix is simply: move the assignment of shouldResetConnectionRetryCount so that we only reset the retry count after a successful handshake (ConnectionEstablished).

This fix was tested using the in-process-server infrastructure, with failure modes:

actively (fault) reject connections
passively (no completion) reject connections
accept connections but never return responses
accept connections but return garbage responses

This testing highlit a second failure mode, whereby the "never return responses" scenario failed to reconnect. This turned out to be because our message timeout logic only considered "async" scenarios, since "sync" timeout is active from the perspective of the caller (Monitor.Wait). This means that if the handshake "critical tracer" (the echo or ping that we use with a payload that gets validated) never responds... well, nothing happens. The handshake is neither sync nor async, because it is F+F without a result-box.

We therefore extend the timeout logic to explicitly identify this handshake scenario (via a new pseudo-CommandFlags value), and burn the connection accordingly.

context: #3036

mgravell added 3 commits March 18, 2026 15:19

tests for args passed to retry-policy

7051aca

context: #3036

more failure modes

8955d30

detect timout of critical tracer

bb74010

mgravell mentioned this pull request Mar 19, 2026

SE.Redis Connection Storms & High CPU #3036

Closed

mgravell requested a review from philon-msft March 19, 2026 12:04

philon-msft approved these changes Mar 19, 2026

View reviewed changes

mgravell merged commit f9af64f into main Mar 19, 2026
8 checks passed

mgravell deleted the marc/ddos branch March 19, 2026 16:36

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve detection of connect/handshake failures and the retry-policy#3038

Improve detection of connect/handshake failures and the retry-policy#3038
mgravell merged 3 commits intomainfrom
marc/ddos

mgravell commented Mar 19, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

mgravell commented Mar 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

mgravell commented Mar 19, 2026 •

edited

Loading