group: add MutableAutoSelect outbound #266
Open
garmr-ulfr wants to merge 5 commits into
A client-side server-selection group that complements MutableURLTest by distinguishing probe-URL health from user-traffic health. A passing probe cannot launder a failing outbound: userFailures only decrements on a successful non-empty Read through the data-plane wrapper, and a no-traffic stall on a live conn (default 30s) bumps the counter and triggers reselection. Ranking applies a three-tier demote (clean / soft when userFailures > 0 / hard when the demote threshold trips) per network, with switch-tolerance hysteresis to dampen churn. A two-step reconnection ladder runs on dial/listen failure: a fast retry of the current target (skipped when userFailures > 0 so probe success can't relaunder), then a full parallel re-probe; exhaustion fires a buffered signal for upper layers. Probe cadence adapts between active and idle intervals. Per-tag history persists across restarts via AutoSelectHistoryStorage.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- ConsecutiveFailureLimit doc now reflects that the bound also applies to the user-failure counter.
- demoteLevel ordering rationale states the real why (clean < soft < hard) instead of pointing at candidateKind.
- Ladder skip-step-1 log matches the userFailureCount() > 0 gate.
- Drop reference to a spec file that doesn't exist in the repo.
- Trim test comments that narrated history or restated names.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
This PR introduces a new outbound group type, mutableautoselect, intended to improve server selection robustness versus mutableurltest by separating probe-URL health from real user-traffic health and by demoting/avoiding candidates that “probe fine” but stall under actual traffic.
Changes:
- Add MutableAutoSelect outbound group with parallel probing, user-failure tracking, stall watchdog wrappers, reconnection ladder, stickiness hysteresis, and an exhaustion signal.
- Add persistent history plumbing via AutoSelectHistoryStorage (with an in-memory default implementation) to carry health signals across restarts.
- Extend tests and mocks to cover per-network candidate selection and the new group’s ranking/demotion behavior.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| protocol/register.go | Registers the new mutableautoselect group type with the outbound registry. |
| protocol/group/urltest_test.go | Updates the shared test mock to support per-outbound network lists. |
| protocol/group/mutableautoselect.go | Core MutableAutoSelect implementation: selection, probing lifecycle, ladder, stickiness, exhaustion signaling, background loop. |
| protocol/group/mutableautoselect_test.go | Comprehensive unit tests covering ranking tiers, stickiness, ladder behavior, history persistence, stall wrappers, and pause-manager integration. |
| protocol/group/mutableautoselect_protocols.go | Defines per-protocol probe behavior (timeouts, pool exclusions, substituted delays). |
| protocol/group/mutableautoselect_probe.go | Implements per-member HTTP probe execution and bounded-concurrency probe fan-out. |
| protocol/group/mutableautoselect_history.go | Implements in-memory history ring, demotion logic, and hydration/snapshot conversions to persisted form. |
| protocol/group/mutableautoselect_dataplane.go | Adds data-plane stream/packet wrappers with idle-stall watchdog and activity hooks. |
| protocol/group/fallback_test.go | Extends the shared mock outbound struct used across group tests. |
| option/group.go | Adds MutableAutoSelectOutboundOptions and documents default behaviors/knobs. |
| constant/proxy.go | Adds TypeMutableAutoSelect constant. |
| adapter/group.go | Adds ExhaustionSignaler interface. |
| adapter/autoselect_history.go | Adds AutoSelectHistoryStorage interface + in-memory implementation and TagHistory/Outcome types. |
Per Copilot review on #266: once stalled is set (timer fired or Close ran), a successful late Read could otherwise call onRead and reset the userFailure the stall just recorded, masking the failing server. The Reset and onActivity calls in the same path are also wasted on a terminal conn.
Also a comment-hygiene sweep across the autoselect files: drop type-doc previews that restate the body, doc blocks that enumerate every branch when only one is non-obvious, restatement docs on unexported helpers, and the `// --- ... ---` banners in the test file.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the outcome ring on TagHistory with scalar probe fields (LastSuccessDelayMs, LastOutcomeAt, ConsecutiveFailures) plus a laundering-resistant UserFailures sliding window. Probe outcomes and user-traffic events live on separate tracks so a passing probe can't relaunder a failing outbound.
Adds gates to the user-traffic-side failure path so a single stall on an idle keep-alive or an orphan-conn cascade no longer demotes the active outbound:
- userFailureDedupeWindow (30s) collapses bursts of user_failures on the same outbound into one entry, and the ladder kick is gated on the same dedupe so orphan-conn cascades don't trigger N redundant probe sweeps.
- The data-plane stall watchdog requires the most recent IO to have been a Write. A proven conn whose last activity was a Read is user-idle, not a broken tunnel.
- softFailLimit (default 2) replaces the "one failure = soft demote" rule; with many alternatives in the pool, a single transient stall shouldn't push the active member behind every clean peer.
- Switch-penalty: the hard-demote threshold doubles when the best real-seeded alternative is more than 3x slower, gated on kindRealSeeded so synthetic delays don't anchor the comparison.
Removes the samizdat substituteDelay binding; the kindSubstituted infrastructure stays dormant.
Renames vouched/vouchBytes to proven/provedReadBytes through the watchdog code, tests, and option wire format (data_plane_proved_read_bytes).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
New `mutableautoselect` outbound group that complements `mutableurltest` by distinguishing probe-URL health from real user-traffic health. Targets the connection-oscillation case (engineering#3484) where a tunnel handshakes successfully and probes pass on every cycle, but bytes stop flowing, and the URL-test layer keeps re-picking the same dead server because the probe URL says it's healthy.
Why
`MutableURLTest` treats a passing probe URL as full health. In hostile networks where DPI lets the probe through but throttles user traffic, or where a tunnel handshakes successfully but stops carrying data, a failing outbound looks healthy on every cycle. There is no signal that distinguishes "this server probes fine" from "this server is actually carrying bytes."
What's in this PR
Laundering-resistant user-failure window
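As a rough illustration of the window this section describes, the following is a minimal sketch, not the PR's code. The type `userFailures`, its methods, and the helper `at` are hypothetical names chosen for the example; only the 5 min window and 30 s dedupe defaults come from the description. Note that nothing on the probe path has a way to remove entries: they only age out.

```go
package main

import (
	"fmt"
	"time"
)

// userFailures is a hypothetical stand-in for the per-member failure window.
type userFailures struct {
	window time.Duration // how long a failure keeps counting
	dedupe time.Duration // bursts inside this span collapse into one entry
	stamps []time.Time   // recorded failures, oldest first
}

// newUserFailures uses the defaults quoted in the description.
func newUserFailures() *userFailures {
	return &userFailures{window: 5 * time.Minute, dedupe: 30 * time.Second}
}

// record appends a failure at now unless it falls inside the dedupe window
// of the latest entry. It reports whether a new entry was added.
func (u *userFailures) record(now time.Time) bool {
	if n := len(u.stamps); n > 0 && now.Sub(u.stamps[n-1]) < u.dedupe {
		return false
	}
	u.stamps = append(u.stamps, now)
	return true
}

// count drops entries that have aged out and returns how many remain.
func (u *userFailures) count(now time.Time) int {
	i := 0
	for i < len(u.stamps) && now.Sub(u.stamps[i]) >= u.window {
		i++
	}
	u.stamps = u.stamps[i:]
	return len(u.stamps)
}

// at returns a fixed base time plus sec seconds so the demo is deterministic.
func at(sec int) time.Time {
	return time.Unix(1_700_000_000, 0).Add(time.Duration(sec) * time.Second)
}

func main() {
	u := newUserFailures()
	fmt.Println(u.record(at(0)))  // first stall is recorded
	fmt.Println(u.record(at(10))) // orphan conn 10 s later is deduped
	fmt.Println(u.record(at(45))) // outside the dedupe window, recorded
	fmt.Println(u.count(at(60)))  // both entries are still in the window
	fmt.Println(u.count(at(330))) // the first entry has aged out
}
```

The boolean returned by `record` is what a caller could use to decide whether to kick the ladder, so a deduped event does not start another probe sweep.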
Each member carries a sliding window of `user_failures` timestamps, separate from `consecutive_failures`. Appended on dial errors, listen errors, and data-plane stalls. Probe successes never touch it: a passing probe URL cannot launder a failing outbound. Failures age out naturally over `userFailureWindow` (default 5 min) without depending on traffic being routed through the demoted member. A `userFailureDedupeWindow` (default 30 s) collapses bursts on the same outbound into one entry, so a broken outbound with N orphaned keep-alive conns hitting their stall timer in sequence doesn't inflate one event into many.
Data-plane stall watchdog
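The decision the watchdog makes when its idle timer fires can be sketched as below. This is a simplified model under stated assumptions, not the PR's wrapper: `stallGate` and its methods are hypothetical names, the real code wraps a conn and arms a `time.AfterFunc` timer, and the "last IO was a Write" gate is taken from the commit message earlier on this page. What happens to the timer when a gate is not clear is outside this sketch.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// stallGate is a hypothetical reduction of the per-conn watchdog state.
type stallGate struct {
	provedReadBytes int64        // threshold of cumulative Read payload
	readBytes       atomic.Int64 // bytes returned by non-empty Reads so far
	lastWasWrite    atomic.Bool  // true when the most recent IO was a Write
	done            atomic.Bool  // set once, by the timer or by Close
}

func newStallGate() *stallGate {
	return &stallGate{provedReadBytes: 4096} // default from the description
}

// onRead records payload from a non-empty Read; ignored on a terminal conn.
func (g *stallGate) onRead(n int) {
	if n > 0 && !g.done.Load() {
		g.readBytes.Add(int64(n))
		g.lastWasWrite.Store(false)
	}
}

// onWrite records that the most recent IO was a Write.
func (g *stallGate) onWrite() {
	if !g.done.Load() {
		g.lastWasWrite.Store(true)
	}
}

// expire is what the idle timer would call. It reports whether this expiry
// counts as a user failure: both gates must clear, and the CAS makes the
// result one-shot per conn.
func (g *stallGate) expire() bool {
	if g.readBytes.Load() < g.provedReadBytes || !g.lastWasWrite.Load() {
		return false
	}
	return g.done.CompareAndSwap(false, true)
}

// close marks the conn terminal so a later expiry cannot report a stall.
func (g *stallGate) close() { g.done.Store(true) }

func main() {
	g := newStallGate()
	g.onWrite()
	fmt.Println(g.expire()) // false: the conn is not proven yet
	g.onRead(4096)
	fmt.Println(g.expire()) // false: last IO was a Read, so user-idle
	g.onWrite()
	fmt.Println(g.expire()) // true: proven conn, request sent, no reply
	fmt.Println(g.expire()) // false: already reported once
}
```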
Every dialed conn and listened packet conn is wrapped (`dataPlaneStream` / `dataPlanePacket`) with a per-conn `time.AfterFunc` idle timer (default 60 s). Two gates have to clear before a stall counts:
- The conn is proven: it has carried at least `provedReadBytes` (default 4096) of cumulative non-empty Read payload, distinguishing a stalled tunnel from a handshake-only or keepalive-only conn that never carried traffic.
- The most recent IO on the conn was a Write: a proven conn whose last activity was a Read is user-idle, not a broken tunnel.
When both gates clear and the timer expires, the wrapper appends one user-failure timestamp and triggers the reconnection ladder. CAS-guarded one-shot per conn so a late read returning bytes just before `Close` can't deliver a phantom stall.
Three-tier demote ranking, per network
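The ordering this section describes can be sketched as a lexicographic sort. The code below is illustrative only: the `candidate` struct and the `level` and `rank` helpers are hypothetical, network filtering is left out, and placing `kindRealSeeded` ahead of `kindUnknown` is an assumption about the tie-break direction.

```go
package main

import (
	"fmt"
	"sort"
)

type demoteLevel int

const (
	clean demoteLevel = iota
	soft
	hard
)

type candidateKind int

const (
	kindRealSeeded candidateKind = iota // assumed to sort ahead of unknown
	kindUnknown
)

const (
	softFailLimit        = 2 // default from the description
	consecutiveFailLimit = 3 // default from the description
)

// candidate is a hypothetical flattened view of one member's health.
type candidate struct {
	tag                 string
	consecutiveFailures int
	userFailsInWindow   int
	kind                candidateKind
	delayMs             int
}

// level maps the two failure counters onto the three tiers.
func level(c candidate) demoteLevel {
	switch {
	case c.consecutiveFailures >= consecutiveFailLimit ||
		c.userFailsInWindow >= consecutiveFailLimit:
		return hard
	case c.userFailsInWindow >= softFailLimit:
		return soft
	default:
		return clean
	}
}

// rank returns a copy sorted by (demoteLevel, candidateKind, delayMs).
func rank(cs []candidate) []candidate {
	out := append([]candidate(nil), cs...)
	sort.SliceStable(out, func(i, j int) bool {
		a, b := out[i], out[j]
		if la, lb := level(a), level(b); la != lb {
			return la < lb
		}
		if a.kind != b.kind {
			return a.kind < b.kind
		}
		return a.delayMs < b.delayMs
	})
	return out
}

func main() {
	ranked := rank([]candidate{
		{tag: "dead", consecutiveFailures: 3, delayMs: 10},
		{tag: "fast-soft", userFailsInWindow: 2, delayMs: 40},
		{tag: "slow-clean", delayMs: 300},
	})
	for _, c := range ranked {
		fmt.Println(c.tag, level(c))
	}
}
```

In this demo the slow but clean member ranks first even though both other members have lower probe delays, which is the point of ranking by tier before delay.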
Candidates rank by `(demoteLevel, candidateKind, delayMs)`:
- `clean`: fewer than `softFailLimit` user-failures in window, and consecutive-failure threshold not tripped.
- `soft`: `userFailsInWindow >= softFailLimit` (default 2). Loses to every clean peer but beats hard. A single transient stall is not enough; two within `userFailureWindow` is corroboration.
- `hard`: `consec >= consecutiveFailLimit` or `userFailsInWindow >= consecutiveFailLimit` (default 3).
`splitHealthyFor` filters by network first, then returns clean → soft → filtered (hard as last resort). Per-network filtering matters because a clean TCP-only candidate must not starve a soft UDP-only peer.
Switch-penalty boost
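A small worked sketch of the boost in this section, using the 100 ms and 990 ms figures from the description. The `member` struct and `hardLimitFor` are hypothetical, and treating `switchPenaltyAltFactor` as the integer 3 is an assumption.

```go
package main

import "fmt"

const (
	consecutiveFailLimit   = 3 // default hard-demote threshold
	switchPenaltyAltFactor = 3 // "more than 3x slower"; integer form assumed
)

// member is a hypothetical view of one candidate's probe delay.
type member struct {
	tag        string
	delayMs    int
	realSeeded bool // the delay came from a real probe
}

// hardLimitFor returns the hard-demote threshold for m given the other
// members. It doubles only when m has a real-seeded delay and the best
// real-seeded alternative is more than switchPenaltyAltFactor times slower.
func hardLimitFor(m member, others []member) int {
	if !m.realSeeded {
		return consecutiveFailLimit
	}
	best := -1 // lowest real-seeded alternative delay; -1 means none found
	for _, o := range others {
		if o.tag == m.tag || !o.realSeeded {
			continue
		}
		if best < 0 || o.delayMs < best {
			best = o.delayMs
		}
	}
	if best >= 0 && best > switchPenaltyAltFactor*m.delayMs {
		return 2 * consecutiveFailLimit
	}
	return consecutiveFailLimit
}

func main() {
	active := member{tag: "a", delayMs: 100, realSeeded: true}

	// Only alternative is 990 ms: more than 3x slower, so the limit doubles.
	fmt.Println(hardLimitFor(active, []member{{tag: "b", delayMs: 990, realSeeded: true}}))

	// A 250 ms alternative exists: switching is cheap, so no boost.
	fmt.Println(hardLimitFor(active, []member{
		{tag: "b", delayMs: 990, realSeeded: true},
		{tag: "c", delayMs: 250, realSeeded: true},
	}))
}
```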
The hard-demote threshold doubles when the candidate has a real-seeded delay and the best real-seeded alternative is more than 3× slower (`switchPenaltyAltFactor`). Hard-demoting a 100 ms member onto a 990 ms alternative on three transient failures is itself a cost; the boost requires more evidence when switching is meaningfully worse. Synthetic and unknown delays are excluded from the `bestAlt` comparison so the rescue is anchored on real measurements only.
Switch-tolerance hysteresis
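The hysteresis in this section can be pictured with the sketch below. It is not `applyStickiness` itself: the `sticky` type and `pick` method are hypothetical, the caller is assumed to pass only eligible members in `delays`, and switching when the gain is exactly the tolerance is an assumption about the boundary.

```go
package main

import "fmt"

const switchToleranceMs = 50 // default from the description

// sticky keeps one current tag per network, so TCP and UDP are independent.
type sticky struct {
	tags map[string]string // network ("tcp", "udp") -> sticky tag
}

func newSticky() *sticky { return &sticky{tags: map[string]string{}} }

// pick returns the tag to use for network. best is the top-ranked tag and
// delays maps each eligible tag to its delay in milliseconds. The previous
// sticky tag is kept unless best is faster by at least switchToleranceMs
// or the previous tag is no longer eligible.
func (s *sticky) pick(network, best string, delays map[string]int) string {
	prev, had := s.tags[network]
	if prevDelay, eligible := delays[prev]; had && eligible {
		if prevDelay-delays[best] < switchToleranceMs {
			return prev
		}
	}
	s.tags[network] = best
	return best
}

func main() {
	s := newSticky()
	fmt.Println(s.pick("tcp", "a", map[string]int{"a": 100, "b": 120})) // first pick
	fmt.Println(s.pick("tcp", "b", map[string]int{"a": 100, "b": 70}))  // 30 ms gain: stay
	fmt.Println(s.pick("tcp", "b", map[string]int{"a": 100, "b": 40}))  // 60 ms gain: switch
	fmt.Println(s.pick("udp", "b", map[string]int{"a": 100, "b": 70}))  // UDP has no sticky tag yet
}
```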
`applyStickiness` keeps the previous sticky tag unless a new winner beats it by `SwitchToleranceMs` (default 50 ms). Dampens churn from tiny probe-time deltas. TCP and UDP have independent sticky tags.
Dial-site fast failover
Before the ladder runs, a `DialContext`/`ListenPacket` error retries the dial once against the next-best member from the current rank, excluding the failing tag. The active 3-minute probe cadence keeps history fresh enough to pick a working alternative without re-probing. Only one alternate is attempted; the dial caller already retries on its own schedule.
Reconnection ladder
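The three moving parts of the ladder in this section (a CAS guard, a parallel re-probe under a deadline, and a non-blocking send on a buffered channel) can be sketched as follows. All names here (`ladder`, `probeFunc`, `run`, `signal`) are hypothetical, the channel capacity of 1 is an assumption, and the dedupe gate on the triggering event is left to the caller.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

// probeFunc stands in for probing one candidate; true means it succeeded.
type probeFunc func(ctx context.Context) bool

type ladder struct {
	running   atomic.Bool   // CAS guard that collapses concurrent runs
	exhausted chan struct{} // buffered; capacity 1 is an assumption
}

func newLadder() *ladder { return &ladder{exhausted: make(chan struct{}, 1)} }

// run probes all candidates in parallel within budgetSeconds. ran is false
// when another run was already in flight; found reports whether any probe
// succeeded before the deadline.
func (l *ladder) run(budgetSeconds int, probes []probeFunc) (ran, found bool) {
	if !l.running.CompareAndSwap(false, true) {
		return false, false
	}
	defer l.running.Store(false)

	budget := time.Duration(budgetSeconds) * time.Second
	ctx, cancel := context.WithTimeout(context.Background(), budget)
	defer cancel()

	results := make(chan bool, len(probes)) // buffered so senders never block
	for _, p := range probes {
		go func(p probeFunc) { results <- p(ctx) }(p)
	}
	for range probes {
		select {
		case ok := <-results:
			if ok {
				return true, true
			}
		case <-ctx.Done():
			l.signal()
			return true, false
		}
	}
	l.signal()
	return true, false
}

// signal emits at most one pending exhaustion value without blocking.
func (l *ladder) signal() {
	select {
	case l.exhausted <- struct{}{}:
	default:
	}
}

func probeOK(context.Context) bool   { return true }
func probeFail(context.Context) bool { return false }

func main() {
	l := newLadder()
	fmt.Println(l.run(10, []probeFunc{probeFail, probeOK}))
	fmt.Println(l.run(10, []probeFunc{probeFail, probeFail}))
	fmt.Println(len(l.exhausted)) // one pending exhaustion signal
}
```

A host would receive from `exhausted` in its own goroutine; because the send is non-blocking, repeated exhaustion while the host is busy leaves a single pending value rather than stalling the group.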
Triggered from dial/listen errors (after the fast-failover step) and from the stall watchdog. Full parallel re-probe across the candidate pool inside a `ladderTotalBudgetSeconds` (default 10 s) deadline. The ladder kick is gated on the user-failure dedupe: if the triggering event was deduped, the ladder skips so orphan-conn cascades don't trigger N redundant probe sweeps. If no candidate succeeds within the budget, the group emits one value on the buffered exhaustion channel. Concurrent ladder invocations collapse via a CAS guard.
Adaptive background probe cadence
A low-priority background loop keeps alternative members warm, registered with sing-box's pause manager. Faster active interval (default 180 s) when traffic has flowed recently through a wrapped conn; slower idle interval (default 900 s) otherwise. Brand-new groups with no observed traffic are treated as idle so we don't burn the fast cadence on a tunnel nobody is using.
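The interval choice described above reduces to a small function. The sketch below is hypothetical: `nextProbeInterval`, `noTraffic`, and `at` are illustration-only names, and the 10-minute `idleThreshold` is a made-up value, since the description mentions an idle threshold knob without stating its default.

```go
package main

import (
	"fmt"
	"time"
)

const (
	activeInterval = 180 * time.Second // default from the description
	idleInterval   = 900 * time.Second // default from the description
	idleThreshold  = 10 * time.Minute  // hypothetical value for the demo
)

// noTraffic is the zero time, meaning no wrapped conn has carried traffic.
var noTraffic time.Time

// nextProbeInterval picks the background probe interval. A group that has
// never seen traffic, or has not seen any within idleThreshold, is idle.
func nextProbeInterval(now, lastTraffic time.Time) time.Duration {
	if lastTraffic.IsZero() || now.Sub(lastTraffic) > idleThreshold {
		return idleInterval
	}
	return activeInterval
}

// at returns a fixed base time plus sec seconds so the demo is deterministic.
func at(sec int) time.Time {
	return time.Unix(1_700_000_000, 0).Add(time.Duration(sec) * time.Second)
}

func main() {
	fmt.Println(nextProbeInterval(at(0), noTraffic)) // brand-new group: idle
	fmt.Println(nextProbeInterval(at(60), at(0)))    // traffic a minute ago: active
	fmt.Println(nextProbeInterval(at(3600), at(0)))  // quiet for an hour: idle
}
```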
Persistent history
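The age cut-off applied on hydrate, which this section describes, could look roughly like the sketch below. The field names follow the `TagHistory` scalars named in the commit message earlier on this page, but their Go types are guesses, measuring age from `LastOutcomeAt` is an assumption, and hydrating a whole map at once is a simplification of the lazy per-member hydrate.

```go
package main

import (
	"fmt"
	"time"
)

const maxPersistedAge = 15 * time.Minute // default from the description

// tagHistory mirrors the persisted fields named on this page; the types
// are assumptions made for this sketch.
type tagHistory struct {
	LastSuccessDelayMs  int
	LastOutcomeAt       time.Time
	ConsecutiveFailures int
	UserFailures        []time.Time
}

// hydrate returns only the persisted entries that are recent enough to
// trust; older snapshots are dropped so the member starts without history.
func hydrate(persisted map[string]tagHistory, now time.Time) map[string]tagHistory {
	fresh := make(map[string]tagHistory, len(persisted))
	for tag, h := range persisted {
		if now.Sub(h.LastOutcomeAt) > maxPersistedAge {
			continue
		}
		fresh[tag] = h
	}
	return fresh
}

// at returns a fixed base time plus sec seconds so the demo is deterministic.
func at(sec int) time.Time {
	return time.Unix(1_700_000_000, 0).Add(time.Duration(sec) * time.Second)
}

func main() {
	persisted := map[string]tagHistory{
		"stale": {LastSuccessDelayMs: 80, LastOutcomeAt: at(0)},
		"fresh": {LastSuccessDelayMs: 120, LastOutcomeAt: at(1000)},
	}
	got := hydrate(persisted, at(1200))
	_, staleKept := got["stale"]
	_, freshKept := got["fresh"]
	fmt.Println(staleKept, freshKept)
}
```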
A new `AutoSelectHistoryStorage` service interface lets hosts plug in durable storage. The group writes a full `TagHistory` snapshot (probe scalars + `user_failures` window) after every state-changing mutation. The included in-memory implementation is the default; hosts wiring durable storage register their own and wire `SetHook` to flush snapshots out-of-band. On `Add`, history is lazy-hydrated; persisted entries older than `maxPersistedAgeSeconds` (default 15 min) are dropped on hydrate so a stale snapshot from a previous poll window doesn't start the member innocent. On `Remove` or URL-override change, the affected entries are dropped.
Exhaustion signal
`ExhaustionSignaler` interface. The group sends a value on a buffered channel when the ladder's full budget elapses with no successful candidate, letting the host refetch `/config-new` instead of endlessly re-probing the same dead set.
Per-protocol probe behavior
Each protocol gets a probe timeout suited to its handshake cost: longer for multi-step handshakes like algeneva/ssh/shadowtls/reflex/trojan, longest for self-probing protocols like outline (`StrategyFinder` discovers a winning strategy on first use), and shorter for UDP-based protocols where blocked ports time out silently rather than failing fast. Peer/network protocols (`tor`, `unbounded`) are excluded from the candidate pool entirely; they run alongside the group with their own connection managers.
Candidate kind (tie-break under demote tier)
- `kindRealSeeded`: measured delay from a real probe.
- `kindUnknown`: no in-memory data, no persisted seed.
Configurable knobs
`option.MutableAutoSelectOutboundOptions` exposes every threshold (demote limits, user-failure window + dedupe window, max persisted age, cadence intervals, idle threshold, switch tolerance, ladder budget, data-plane idle + proved-read bytes) as a config field with documented defaults.