Conversation
philon-msft
approved these changes
Mar 19, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR investigates and resolves #3036, but also addresses some other related failure modes.
The fundamental issue in ^^^ was:
PhysicalBridge, when invoking the retry-policy, we pass a count of how many retries we've already done, which is used by the policy, for example to implement exponential decayshouldResetConnectionRetryCountflag istruetrueeven in "establishing" cases, due to an otherwise-intentionalcasefall-thruSo the simple part of this fix is simply: move the assignment of
shouldResetConnectionRetryCountso that we only reset the retry count after a successful handshake (ConnectionEstablished).This fix was tested using the in-process-server infrastructure, with failure modes:
This testing highlit a second failure mode, whereby the "never return responses" scenario failed to reconnect. This turned out to be because our message timeout logic only considered "async" scenarios, since "sync" timeout is active from the perspective of the caller (
Monitor.Wait). This means that if the handshake "critical tracer" (theechoorpingthat we use with a payload that gets validated) never responds... well, nothing happens. The handshake is neither sync nor async, because it is F+F without a result-box.We therefore extend the timeout logic to explicitly identify this handshake scenario (via a new pseudo-
CommandFlagsvalue), and burn the connection accordingly.