feat(outbound): unified circuit breaker with success-rate and Retry-After awareness #4546
Draft
unleashed wants to merge 18 commits into
Update the linkerd2-proxy-api dependency to version 0.19.0, which contains the restructured FailureAccrual type with a direct consecutive_failures field, the new LoadBiasConfig and RetryAfterConfig message types, and the ejection field on BalanceP2c required by config conversions. The circuit breaker work depends on these. Signed-off-by: Alejandro Martinez Ruiz <amr@buoyant.io>
…and success rate Introduce SuccessRateConfig, LoadBiasConfig, and RetryAfterConfig as typed domain models that bridge proto messages to proxy internals. This captures the configuration surface for each respective feature. Signed-off-by: Alejandro Martinez Ruiz <amr@buoyant.io>
… variants Plug LoadBiasConfig and RetryAfterConfig in the Http1, Http2, and Grpc protocol variants as Option fields defaulting to None. Consumer sites that destructure these variants (sidecar.rs, ingress.rs) gain `..` to skip the new fields until later work uses them in the balance layer. Signed-off-by: Alejandro Martinez Ruiz <amr@buoyant.io>
…sions Signed-off-by: Alejandro Martinez Ruiz <amr@buoyant.io>
Concrete<T> moves from derived PartialEq, Eq, and Hash to manual implementations. All three exclude failure_accrual, which cannot implement Eq or Hash because SuccessRateConfig holds f64 fields. Identity over the remaining fields stays the same. Signed-off-by: Alejandro Martinez Ruiz <amr@buoyant.io>
When the broadcast channel is full, a classification result was discarded without any trace. Each send site on the response future and body now reports the drop at debug level, so backpressure on the classification channel becomes observable instead of silent. State also gains a hand-written Debug implementation in place of the derived one, so only the classifier needs to be printable and the class type no longer has to be. Signed-off-by: Alejandro Martinez Ruiz <amr@buoyant.io>
FailureAccrual changes from an enum with None and ConsecutiveFailures variants into a struct holding a consecutive field plus an optional success_rate field, and consumers now hold an Option<FailureAccrual> in place of the former None variant. With success_rate present, the proto conversion reads that field, which it dropped before, and rejects degenerate settings. A threshold outside the zero-to-one range, a decay below the moving-average floor, or a cold-start request count above a safety ceiling all return an error rather than a breaker that can never trip or tracks an unusable window. The struct cannot derive Eq or Hash because SuccessRateConfig keeps an f64 threshold, so those impls are dropped from ClientPolicy, Protocol, Http1, Http2, and Grpc. The migration itself keeps the same runtime behavior. Signed-off-by: Alejandro Martinez Ruiz <amr@buoyant.io>
These tests pin the success-rate side of the failure-accrual proto conversion. They cover the accept case, the 10s default applied when decay is unset, threshold rejection outside the [0.0, 1.0] range including NaN, and the inclusive 1ms decay floor. A negative decay is also covered. It fails the duration conversion and surfaces as a backoff duration error rather than a value error. Signed-off-by: Alejandro Martinez Ruiz <amr@buoyant.io>
Rate-limited responses include backoff hints the circuit breaker can reuse. A new retry_after.rs adds DurationHintStore plus the RetryAfterStore and GrpcRetryPushbackStore wrappers around it, and the RetryAfterClassify and GrpcRetryPushbackClassifyEos classifiers that read Retry-After and grpc-retry-pushback-ms values off the response and record them for the breaker backoff. Each store keeps the longest hint seen and timestamps it so old values turn stale. The two hint sources stay separate. A Retry-After header records only into the HTTP store, and gRPC pushback records only into the gRPC store, so the Retry-After store never holds a value that did not come from a Retry-After header. The classify::grpc_code helper becomes pub so the classifier can read gRPC status codes from response headers. Signed-off-by: Alejandro Martinez Ruiz <amr@buoyant.io>
A unified breaker replaces the single consecutive-failures policy with one that watches two signals at once: a run of consecutive 5xx responses, and an EWMA success rate that also counts 429 and gRPC RESOURCE_EXHAUSTED as failures. Either signal can trip the circuit on its own. The breaker moves through three states, open while it accepts traffic and tracks both signals, closed while it backs off, and probation while it admits one probe to decide whether to reopen. Cold-start protection covers only the success rate. The EWMA starts at 1.0 and cannot trip until min_requests responses arrive, and a long idle gap resets that counter so one late response does not dominate the average. The consecutive policy keeps no such grace and trips as soon as the run reaches its limit. The probe check stays mode-aware to keep the old breaker's behavior. With the success rate off (min_requests at usize::MAX) it defers to the default classifier through class.is_success(), so 429 is judged as before. With the success rate on it asks for a clean response, treating 429 and RESOURCE_EXHAUSTED as a still-limiting endpoint. A stored Retry-After or gRPC pushback hint sets a floor for the first backoff, and the larger of the two wins when both are present. Signed-off-by: Alejandro Martinez Ruiz <amr@buoyant.io>
Put a breaker-controlled gate in front of each balancer endpoint. When failure accrual is set for the target, the endpoint gets its own Retry-After and gRPC pushback stores, and its response classifier is wrapped so 429 and RESOURCE_EXHAUSTED hints land in those stores. The wrapping happens inline during call(), not through a separate layer, and the stores are built per endpoint so a hint from one endpoint does not extend the backoff of another. When accrual is absent, the endpoint uses the stock classifier behind a gate that never shuts, with no stores and no hint parsing, matching an endpoint that has no circuit breaking. An accrual policy that can never trip, where max_failures is zero and any success-rate threshold is at or below zero, resolves the same way, so it allocates no stores and spawns no breaker. The breaker Params struct holds the per-endpoint stores and the Retry-After cap, passing them on to the UnifiedBreaker. Signed-off-by: Alejandro Martinez Ruiz <amr@buoyant.io>
The consecutive-failures breaker is replaced by a unified breaker that follows both consecutive failures and an EWMA success rate. The per-target builder fills a UnifiedBreakerConfig: with a success rate it uses the configured threshold, decay, and minimum request count. Without one, consecutive-only mode sets the threshold to 0.0 and the minimum request count to usize::MAX, so the success-rate policy never trips while probe semantics stay in place. With this in place the consecutive_failures module goes away, since the unified breaker covers its behavior. The balance layer drops NewClassifyGateSet for NewRetryAfterGateSet, which adds Retry-After extraction to each endpoint gate. A per-target RetryAfterConfig now reaches the breaker. The HTTP and gRPC policy protocols already hold this configuration, so the sidecar and ingress route builders pass it into the route params, the router moves it onto each Concrete, and the balancer reads it as a parameter to cap the honored Retry-After duration. The Concrete cache key still leaves out failure_accrual and retry_after, since they configure the breaker but do not decide which backend a key selects, and a config-only change must not rebuild the backend cache. Signed-off-by: Alejandro Martinez Ruiz <amr@buoyant.io>
Document that closed() creates a new backoff stream on each trip, so exponential escalation is not kept across trip-recover-trip cycles. A second trip starts at the base backoff. Signed-off-by: Alejandro Martinez Ruiz <amr@buoyant.io>
With max_failures=0 and the success-rate policy off (threshold=0.0, min_requests=usize::MAX), the gate must stay open no matter how many errors arrive. The test floods 5xx then 429 responses and checks the gate never trips. Signed-off-by: Alejandro Martinez Ruiz <amr@buoyant.io>
Add a test module for the unified circuit breaker. It checks the success-rate-only trip when consecutive failures are disabled, gRPC response classification, and that a Retry-After hint and a gRPC pushback trailer extend the backoff, driving real responses through the classifier into the stores the breaker reads. Two cases deal with backwards compatibility. In consecutive-only mode a 429 probe still counts as success, and a hint recorded in a store the breaker does not hold never reaches it. A further case confirms that a disabled failure-accrual config spawns no breaker task and leaves the gate open. Signed-off-by: Alejandro Martinez Ruiz <amr@buoyant.io>
A gRPC pushback hint recorded in the store before the breaker trips must hold the gate shut past the normal exponential backoff. With a 5s hint and a 1s backoff, the gate stays shut at 4s and enters probation only after the hint elapses. Signed-off-by: Alejandro Martinez Ruiz <amr@buoyant.io>
When both an HTTP Retry-After and a gRPC pushback hint are present, the breaker takes the larger of the two for backoff extension. The integration test records http=3s and grpc=7s, trips via three 5xx, and checks the gate stays shut past the 3s mark, still shut at 6s, then enters probation once the 7s window ends. Signed-off-by: Alejandro Martinez Ruiz <amr@buoyant.io>
After an idle gap of many decay windows the request counter resets, so the success-rate policy again needs min_requests samples before it can trip. Signed-off-by: Alejandro Martinez Ruiz <amr@buoyant.io>
This PR is opened against #4537 to minimize the diff, and builds on top of it. It contains the
circuit breaker implementation as it stood before any changes to linkerd2-proxy-api that affect
the API, its messages, and the design of the circuit breaker. Those changes are expected to
follow as a set of changes on top, once the original feature set has been posted in full.
Summary
This branch reworks outbound HTTP failure handling. It replaces the consecutive-failures
circuit breaker with a unified breaker that trips on either consecutive 5xx failures or a
declining EWMA success rate, and it teaches that breaker to honor server backpressure
signals (`Retry-After`, `grpc-retry-pushback-ms`). It also lands the building blocks for
failure-aware load balancing (a standalone `linkerd-ewma` crate and a `linkerd-load-biaser`
crate) and extends the client-policy configuration to carry the new knobs from
`linkerd2-proxy-api` 0.19.0.

Every new behavior is opt-in. With no policy configured, an endpoint gets a gate that never
shuts, no breaker task, no hint stores, and no load biasing: the prior behavior, unchanged.
What is live vs. staged
- `Retry-After` / `grpc-retry-pushback-ms` to breaker backoff floor (live)
- `linkerd-ewma` crate (live)
- `FailureAccrual` (struct), `SuccessRateConfig`, `RetryAfterConfig` (live)
- `linkerd-load-biaser` crate (staged)
- `LoadBiasConfig` (`load_bias`) (staged)

The balancer itself is unchanged on this branch (`linkerd/proxy/balance` has no diff); it
still uses Tower's `PeakEwma`. The load-biaser is a self-contained replacement prepared for
a follow-up that swaps the balancer's load metric. This is called out so the dormant
`load_bias` plumbing is not mistaken for an active code path.

Architecture
Component map
flowchart TD
  API["linkerd2-proxy-api 0.19.0 (proto)<br/>FailureAccrual, SuccessRate,<br/>LoadBiasConfig, RetryAfterConfig"]
  subgraph CP["linkerd-proxy-client-policy"]
    CPT["Http1 / Http2 / Grpc<br/>failure_accrual (active)<br/>retry_after (active)<br/>load_bias (staged)"]
  end
  subgraph OUT["linkerd-app-outbound : http/breaker"]
    UNI["unified.rs: UnifiedBreaker state machine"]
    WRAP["wrap_classify.rs: NewRetryAfterGateSet<br/>per-endpoint gate, classifier, task"]
    RA["retry_after.rs: DurationHintStore,<br/>RetryAfterClassify"]
    BAL["concrete/balance.rs: wires gate-set<br/>beneath the PeakEwma balancer"]
  end
  subgraph LEAF["leaf crates"]
    EWMA["linkerd-ewma (active)<br/>time-decayed average, get_at(now)"]
    CLS["linkerd-http-classify<br/>retry_after parsers,<br/>BroadcastClassification"]
    LB["linkerd-load-biaser (staged)<br/>Load = max(rtt x (pending+1), penalty)"]
  end
  API -->|TryFrom + validation| CPT
  CPT -->|sidecar.rs / ingress.rs| OUT
  UNI -->|success rate| EWMA
  RA --> CLS
  WRAP --> RA
  BAL --> WRAP
  LB -. uses .-> EWMA
  LB -. not yet wired .-> BAL

Where the breaker sits in the outbound HTTP stack
The outbound stack resolves a logical service to a set of backends, and each backend to a
set of endpoint addresses behind a load balancer. The breaker lives at the bottom of that
balancer, one instance per resolved endpoint.
flowchart TD
  LOG["logical (per service): routes to backends"]
  CONC["concrete (per backend)"]
  BALANCER["NewBalance: P2C over PeakEwma<br/>selects a ready endpoint"]
  LOG --> CONC --> BALANCER
  BALANCER -->|one gated stack per endpoint| E1
  BALANCER --> E2
  BALANCER --> E3
  subgraph E1["endpoint 10.0.0.1"]
    direction TB
    G1["svc::Gate (admission)"] --> C1["Insert + BroadcastClassification"] --> H1["HTTP client"]
    T1[["UnifiedBreaker task"]]
  end
  subgraph E2["endpoint 10.0.0.2"]
    direction TB
    G2["svc::Gate"] --> C2["Insert + BroadcastClassification"] --> H2["HTTP client"]
    T2[["UnifiedBreaker task"]]
  end
  subgraph E3["endpoint 10.0.0.3"]
    direction TB
    G3["svc::Gate"] --> C3["Insert + BroadcastClassification"] --> H3["HTTP client"]
    T3[["UnifiedBreaker task"]]
  end
  C1 -. classify::Class .-> T1
  T1 -. gate state .-> G1
  C2 -. classify::Class .-> T2
  T2 -. gate state .-> G2
  C3 -. classify::Class .-> T3
  T3 -. gate state .-> G3

`concrete/balance.rs` pushes `breaker::NewRetryAfterGateSet` onto the endpoint stack before
pushing `http::NewBalance`. In Tower, the later push is the outer layer, so the balancer
wraps a set of already-gated endpoints. Because `NewBalance` instantiates the inner
`NewService` once per resolved endpoint, each endpoint gets its own gate, its own
classification channel, its own hint stores, and its own breaker task. A trip on one
endpoint never affects another.
This reuses the proxy's existing endpoint-ejection model. The breaker does not reject
requests directly; it actuates a `svc::Gate` (`linkerd/stack/src/gate.rs`), and the gate's
readiness is what the balancer observes:
- `Open`: `poll_ready` succeeds, so the endpoint is selectable by P2C.
- `Shut`: `poll_ready` stays `Pending`, so the balancer's readiness cache drops the
  endpoint from the selectable set and traffic shifts to healthy peers.
- `Limited(Semaphore(1))`: exactly one request is admitted; this is the recovery probe.

Construction-time wiring (per endpoint)
flowchart LR
  A["FailureAccrual + RetryAfterConfig"] --> B["NewRetryAfterGateSet::new_service<br/>(per endpoint)"]
  B --> C["build RetryAfterStore +<br/>GrpcRetryPushbackStore"]
  B --> D["create gate channel +<br/>classification mpsc"]
  B --> E["spawn UnifiedBreaker::run()"]
  B --> F["assemble: Gate to InsertRetryAfterClassify<br/>to BroadcastClassification to HTTP client"]

When the resolved accrual is absent, or present but unable to ever trip, the endpoint
takes a no-breaker path instead: the stock `BroadcastClassification` and a gate that never
shuts, with no stores and no task. The "unable to ever trip" filter
(`is_effectively_disabled`) catches a policy with `max_failures == 0` and no usable
success-rate threshold, pinning its cost to that of having no circuit breaking.
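The "unable to ever trip" filter can be pictured with a minimal sketch. The types and the `never_trips` function below are hypothetical stand-ins, not the PR's `is_effectively_disabled` or its real config structs; the only rule taken from the description is that a policy is inert when `max_failures` is zero and any success-rate threshold is at or below zero.

```rust
/// Hypothetical, simplified stand-in for the success-rate config.
#[derive(Clone, Copy, Debug)]
pub struct SuccessRate {
    pub threshold: f64,
}

/// Hypothetical, simplified stand-in for the failure-accrual config.
#[derive(Clone, Copy, Debug)]
pub struct Accrual {
    pub max_failures: usize,
    pub success_rate: Option<SuccessRate>,
}

/// Returns true when no breaker needs to be spawned: either there is no
/// accrual at all, or neither policy could ever trip.
pub fn never_trips(accrual: Option<&Accrual>) -> bool {
    match accrual {
        None => true,
        Some(a) => {
            // The consecutive policy is off when the limit is zero.
            let consecutive_off = a.max_failures == 0;
            // The success-rate policy is off when absent or when the
            // threshold is not strictly positive (a NaN also lands here).
            let rate_off = a
                .success_rate
                .map_or(true, |sr| !(sr.threshold > 0.0));
            consecutive_off && rate_off
        }
    }
}
```

A caller in this sketch would check `never_trips` first and, when it returns true, build the stock classifier behind a gate that never shuts.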
Configuration flow
Configuration is validated once at the proto boundary and then carried as plain target
params down to the point where the breaker task is spawned.
flowchart TD
  P["linkerd2-proxy-api 0.19.0<br/>outbound.proxy_protocol Http1 / Http2 / Grpc"]
  CP["client-policy types<br/>failure_accrual: Option<FailureAccrual><br/>retry_after: Option<RetryAfterConfig><br/>load_bias: Option<LoadBiasConfig> (staged)"]
  R["logical policy router<br/>builds Concrete<T> with failure_accrual + retry_after"]
  B["Balance<T> param impls<br/>HasFailureAccrual<br/>Param Option<RetryAfterConfig><br/>Param EwmaConfig"]
  G["NewRetryAfterGateSet::new_service<br/>reads accrual + max_duration<br/>filters effectively-disabled policies"]
  U["UnifiedBreakerConfig<br/>spawn UnifiedBreaker::run()"]
  P -->|TryFrom + validation| CP
  CP -->|sidecar.rs / ingress.rs extract per protocol| R
  R --> B --> G --> U

Validation happens in the `TryFrom` impls in client-policy: the success-rate `threshold`
must be within `[0.0, 1.0]` (which also rejects `NaN`), `decay` must be at least the EWMA
floor (`MIN_DECAY`, 1 ms) and defaults to 10 s when absent, and `min_requests` is bounded so
cold-start can always be satisfied. Those range checks reject out-of-bounds values at
conversion with an `InvalidValue` error. A config that is in range yet can never trip
(`threshold == 0.0`, say) is not rejected at conversion; it is accepted and then collapsed to
the no-breaker path at stack-build time by `is_effectively_disabled` (`wrap_classify.rs`),
rather than producing a live breaker that can never trip.
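The range checks described above might look roughly like the sketch below. The `validate` function, the error enum, the config struct and the `MAX_MIN_REQUESTS` ceiling are all hypothetical: the description states the 1 ms floor and the 10 s default, but does not give the ceiling value or the real `TryFrom` signatures.

```rust
use std::time::Duration;

const MIN_DECAY: Duration = Duration::from_millis(1);
const DEFAULT_DECAY: Duration = Duration::from_secs(10);
/// Assumed value for illustration only; the real ceiling is not stated.
const MAX_MIN_REQUESTS: u32 = 10_000;

#[derive(Debug, PartialEq)]
pub enum InvalidValue {
    Threshold,
    Decay,
    MinRequests,
}

#[derive(Debug, PartialEq)]
pub struct SuccessRateConfig {
    pub threshold: f64,
    pub decay: Duration,
    pub min_requests: u32,
}

pub fn validate(
    threshold: f64,
    decay: Option<Duration>,
    min_requests: u32,
) -> Result<SuccessRateConfig, InvalidValue> {
    // Every comparison with NaN is false, so `contains` rejects NaN too.
    if !(0.0..=1.0).contains(&threshold) {
        return Err(InvalidValue::Threshold);
    }
    // An unset decay falls back to the default; the floor is inclusive.
    let decay = decay.unwrap_or(DEFAULT_DECAY);
    if decay < MIN_DECAY {
        return Err(InvalidValue::Decay);
    }
    if min_requests > MAX_MIN_REQUESTS {
        return Err(InvalidValue::MinRequests);
    }
    Ok(SuccessRateConfig { threshold, decay, min_requests })
}
```

Note that `validate(0.0, None, 10)` succeeds in this sketch: a threshold of zero is in range, so filtering it out is left to the stack-build step rather than to conversion.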
`Concrete<T>` carries `failure_accrual` and `retry_after`, but its `Eq`/`Hash` are
hand-written to exclude them. Two reasons: `SuccessRateConfig` holds an `f64`, which has no
total `Hash`/`Eq`; and breaker configuration is not part of backend identity: it controls
how a backend is treated, not which backend it is. The backend cache key stays
`{target, authority, parent, parent_ref, backend_ref}`.

The unified circuit breaker
The breaker is a single async task per endpoint (`UnifiedBreaker::run`) that owns a
three-state machine. It consumes `classify::Class` verdicts from an mpsc channel and drives
the gate.
stateDiagram-v2
  [*] --> Open: start (EWMA = 1.0, counters = 0)
  Open --> Closed: TRIP
  Closed --> Probation: backoff elapsed
  Probation --> Open: probe succeeds (reset state)
  Probation --> Closed: probe fails (advance backoff)
  note right of Open
    gate.open() - all requests admitted.
    Per response: update consecutive (5xx only), EWMA (5xx or 429 or RESOURCE_EXHAUSTED scores 0.0), request_count; reset count after idle beyond 3x decay.
    TRIP when consecutive reaches max_failures OR (count reaches min_requests AND EWMA below threshold).
  end note
  note right of Closed
    gate.shut() - endpoint ejected from the balancer.
    Wait max(backoff_step, hint); the first wait is floored by max(Retry-After, gRPC pushback), capped.
  end note
  note left of Probation
    gate.limit(1) - exactly one probe admitted.
    Dual mode: probe must be non-5xx AND non-429.
    Consecutive-only mode: class.is_success().
  end note

Two policies, one circuit. Consecutive-failure tracking counts configured failures
(typically 5xx) and resets on any success; it has no cold-start protection because repeated
hard failures are a strong signal at any sample size. The success-rate policy feeds an EWMA
where both 5xx and 429 / gRPC `RESOURCE_EXHAUSTED` count as failure, so rate limiting drags
the rate down. Either policy can trip the circuit. When a single response crosses both
thresholds at once, the trip is attributed to consecutive failures because that check runs
first; the `TripReason` is for observability only.

Cold-start protection applies to the success-rate policy in three layers: the EWMA
initializes optimistically at `1.0`; the circuit cannot trip on success rate until
`min_requests` responses have been seen; and after an idle gap longer than `3 x decay` the
request counter resets, because at that point a single new sample would dominate the decayed
average and could trip the circuit on its own. Idle is measured against the last response
time, tracked separately from the EWMA whose timestamp freezes while the circuit is shut and
probing.
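The trip decision and its cold-start guards can be sketched as follows. This is an illustration under stated assumptions, not the PR's `UnifiedBreaker`: the type and method names are hypothetical, a fixed smoothing factor `ALPHA` stands in for the time-decayed EWMA, and a 429 is assumed to reset the consecutive run the way a success does.

```rust
use std::time::{Duration, Instant};

/// Fixed smoothing factor; a simplification of the time-decayed EWMA.
const ALPHA: f64 = 0.5;

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Outcome { Success, ServerError, RateLimited }

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum TripReason { ConsecutiveFailures, SuccessRate }

pub struct Policy {
    pub max_failures: usize,
    pub threshold: f64,
    pub min_requests: usize,
    pub decay: Duration,
}

pub struct Signals {
    pub consecutive: usize,
    pub rate: f64,
    pub request_count: usize,
    pub last_response: Option<Instant>,
}

impl Signals {
    /// The success rate starts optimistically at 1.0.
    pub fn new() -> Self {
        Self { consecutive: 0, rate: 1.0, request_count: 0, last_response: None }
    }

    pub fn observe(&mut self, p: &Policy, now: Instant, outcome: Outcome) -> Option<TripReason> {
        // After an idle gap longer than 3 x decay, re-arm cold-start.
        if let Some(last) = self.last_response {
            if now.saturating_duration_since(last) > p.decay * 3 {
                self.request_count = 0;
            }
        }
        self.last_response = Some(now);
        self.request_count += 1;

        match outcome {
            Outcome::ServerError => self.consecutive += 1,
            // Assumption: a 429 ends the run of hard failures.
            Outcome::Success | Outcome::RateLimited => self.consecutive = 0,
        }
        // 5xx and 429 both score 0.0 in the success rate.
        let sample = if outcome == Outcome::Success { 1.0 } else { 0.0 };
        self.rate += ALPHA * (sample - self.rate);

        // The consecutive check runs first, so it wins attribution.
        if p.max_failures > 0 && self.consecutive >= p.max_failures {
            Some(TripReason::ConsecutiveFailures)
        } else if self.request_count >= p.min_requests && self.rate < p.threshold {
            Some(TripReason::SuccessRate)
        } else {
            None
        }
    }
}
```

In this sketch the idle check runs before the counter is incremented, so the first response after a long gap counts as sample one of a fresh cold-start window.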
Recovery is mode-aware, which is how backwards compatibility is preserved:

| `success_rate` | `max_failures` | |
| --- | --- | --- |
| `None` | > 0 | `class.is_success()`, exactly the prior breaker's semantics |
| `Some` | 0 | `Code::Ok` (gRPC) |
| `Some` | > 0 | |
| `None` / threshold <= 0 | 0 | |

The mode is fixed at construction, not re-derived at recovery: when the success-rate policy
is absent, `wrap_classify.rs` sets `min_requests = usize::MAX` (and `threshold = 0.0`), and
the breaker keys probe behavior off that `min_requests == usize::MAX` sentinel rather than
re-checking `success_rate.is_some()`.

In consecutive-only mode the breaker delegates the probe to the default classifier, so a 429
is judged exactly as the old breaker judged it. In any mode with success-rate active, a 429
during probation is treated as failure, because reopening to a still-rate-limited endpoint
would immediately re-trip.
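A minimal sketch of the mode-aware probe check follows. The `ProbeClass` struct and `probe_succeeded` function are hypothetical simplifications: the real breaker works with `classify::Class`, whereas this sketch flattens a verdict into three booleans so the two branches are easy to compare.

```rust
/// Hypothetical, flattened view of a probe's classification.
#[derive(Clone, Copy, Debug)]
pub struct ProbeClass {
    /// Verdict of the default classifier (what `class.is_success()` would say).
    pub default_success: bool,
    /// The response was a 5xx.
    pub server_error: bool,
    /// The response was a 429 or gRPC RESOURCE_EXHAUSTED.
    pub rate_limited: bool,
}

/// Decides whether a probation probe lets the circuit accept traffic again.
/// `min_requests == usize::MAX` is the sentinel for consecutive-only mode.
pub fn probe_succeeded(min_requests: usize, class: &ProbeClass) -> bool {
    if min_requests == usize::MAX {
        // Consecutive-only: defer to the default classifier, as before.
        class.default_success
    } else {
        // Success rate active: require a clean response.
        !class.server_error && !class.rate_limited
    }
}
```

With a 429 that the default classifier judges a success, the probe passes in consecutive-only mode and fails once the success-rate policy is active.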
Request and response data flow
Two channels connect the request path to the breaker task, both created per endpoint at
construction time: a gate channel (`watch<State>`) the breaker writes, and an mpsc channel
of classifications the breaker reads.
sequenceDiagram
  autonumber
  participant BAL as Balancer (P2C)
  participant G as svc::Gate
  participant I as InsertRetryAfterClassify
  participant B as BroadcastClassification
  participant U as Upstream endpoint
  participant S as Hint stores
  participant CB as UnifiedBreaker task
  Note over G,CB: gate channel and classification mpsc are created per endpoint
  BAL->>G: poll_ready (pick a ready endpoint)
  G->>G: admit only if Open or holding a Limited permit
  G->>I: request
  I->>I: read classify::Response from extensions,<br/>wrap as RetryAfterClassify, re-insert
  I->>B: request
  B->>U: request (response body wrapped)
  U-->>B: response / trailers / error
  B->>S: drives RetryAfterClassify / GrpcRetryPushbackClassifyEos,<br/>which record Retry-After (429/503) and grpc-retry-pushback-ms<br/>(RESOURCE_EXHAUSTED, gRPC headers only on HTTP 200 OK)
  B->>CB: try_send(classify::Class)
  B-->>I: response
  I-->>G: response
  G-->>BAL: response
  CB->>CB: update consecutive, EWMA, counters
  alt policy breached
    CB->>S: drain hints for the backoff floor
    CB->>G: gate.shut() (endpoint ejected)
  end

The hint stores (`DurationHintStore`, an `Arc<Mutex<Option<(Instant, Duration)>>>`) decouple
the classifier from the breaker. Recording is max-value-wins; `take(max_age)` returns the
remaining wait (recorded duration minus elapsed) only if the hint is still fresh, consumes
it so it is used once, and discards stale entries. Backpressure on the classification channel
is non-blocking: `try_send` failures are logged at debug and dropped rather than stalling the
response path. The breaker also drains the channel while shut, so a closed channel ends the
task instead of spinning.
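The store semantics above can be sketched with the standard library alone. `HintStore` and `first_backoff` are hypothetical names, `now` is passed explicitly so the behavior is easy to test, and keeping the older timestamp when a smaller hint arrives is an assumption; the PR's `DurationHintStore` may differ in all of these details.

```rust
use std::sync::{Arc, Mutex};
use std::time::{Duration, Instant};

/// Shared slot holding the longest hint seen and when it was recorded.
#[derive(Clone, Default)]
pub struct HintStore(Arc<Mutex<Option<(Instant, Duration)>>>);

impl HintStore {
    /// Max-value-wins: a shorter hint never replaces a longer one.
    pub fn record(&self, now: Instant, hint: Duration) {
        let mut slot = self.0.lock().unwrap();
        match *slot {
            Some((_, existing)) if existing >= hint => {}
            _ => *slot = Some((now, hint)),
        }
    }

    /// Consumes the hint. Returns the remaining wait if it is still fresh,
    /// and None if it is stale, already elapsed, or absent.
    pub fn take(&self, now: Instant, max_age: Duration) -> Option<Duration> {
        let (recorded_at, hint) = self.0.lock().unwrap().take()?;
        let elapsed = now.saturating_duration_since(recorded_at);
        if elapsed > max_age {
            return None;
        }
        let remaining = hint.saturating_sub(elapsed);
        (remaining > Duration::ZERO).then_some(remaining)
    }
}

/// First backoff after a trip: the base step, floored by the larger of the
/// two hints, with the hint capped at `cap`.
pub fn first_backoff(
    base: Duration,
    http: Option<Duration>,
    grpc: Option<Duration>,
    cap: Duration,
) -> Duration {
    // `Option` orders None below Some, so `max` picks the larger hint.
    match http.max(grpc) {
        Some(hint) => base.max(hint.min(cap)),
        None => base,
    }
}
```

Because `take` empties the slot even when the hint turns out to be stale, a hint can influence at most one backoff in this sketch.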
`Retry-After` parsing handles both the delay-seconds and HTTP-date forms (429/503 only); gRPC
pushback parsing reads `grpc-retry-pushback-ms`, treating a negative value as "do not retry"
(no hint). Unary gRPC failures carry the pushback in headers (parsed at `start`); streaming
failures carry it in trailers (parsed at `eos` via a `GrpcRetryPushbackClassifyEos` wrapper).
All hints are capped by the per-endpoint `RetryAfterConfig.max_duration`, defaulting to
`DEFAULT_RETRY_AFTER_MAX_DURATION` (300 s).

Supporting crates
`linkerd-ewma` is a standalone time-decayed exponentially-weighted moving average. The
reason it exists separately from Tower's internal RTT estimator is `get_at(now)`: a
non-mutating, time-projected read (`value * exp(-elapsed/decay)`), which lets the breaker
sample the success rate without perturbing it, and lets the future load-biaser read load
under a shared lock.
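The projected read can be illustrated with a small sketch. The struct layout below is hypothetical and is not the `linkerd-ewma` implementation; only the `get_at(now)` formula and the `new_with_value(decay, now, value)` argument order are taken from the description.

```rust
use std::time::{Duration, Instant};

/// Hypothetical layout: a stored value plus the time it was last updated.
#[derive(Clone, Copy, Debug)]
pub struct Ewma {
    value: f64,
    decay: Duration,
    updated_at: Instant,
}

impl Ewma {
    /// Assumes `decay` is non-zero (the config floor is 1 ms).
    pub fn new_with_value(decay: Duration, now: Instant, value: f64) -> Self {
        Self { value, decay, updated_at: now }
    }

    /// Projects the stored value forward to `now` without mutating it:
    /// value * exp(-elapsed / decay). Taking `&self` is what allows the
    /// read to happen behind a shared (read) lock.
    pub fn get_at(&self, now: Instant) -> f64 {
        let elapsed = now.saturating_duration_since(self.updated_at).as_secs_f64();
        self.value * (-elapsed / self.decay.as_secs_f64()).exp()
    }
}
```

Reading twice at the same instant returns the same number in this sketch, since nothing is written back between reads.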
`add` blends with a time-aware alpha and drops samples that share a
timestamp with the last update. The breaker initializes it with
`new_with_value(decay, now, 1.0)`.

`linkerd-load-biaser` (staged) wraps a service to track an RTT EWMA and a penalty EWMA,
exposing `Load = max(rtt * (pending + 1), penalty)` for P2C selection. It classifies
responses into `FailureHint::{RateLimited, ServiceUnavailable, InternalError}` across HTTP
and gRPC, and amplifies the penalty when a `Retry-After` / pushback hint is present so a
rate-limited endpoint stays de-prioritized through the server's stated window. It is
unit-tested in isolation; wiring it into the balancer in place of Tower `PeakEwma` is left to
a follow-up, which is also where `LoadBiasConfig` becomes live.

Backwards Compatibility
- `None`
- `FailureAccrual` enum-to-struct migration preserves the no-accrual semantics (absent accrual still yields a gate that never shuts)
- `linkerd/proxy/balance` is unchanged; load-balancing behavior is unchanged
- `None` explicitly (`discover.rs`)

Proto wire compatibility (fresh field numbers, optional fields) is owned by the companion
`linkerd2-proxy-api` branch and is verified there, not in this repo.

Testing
- Tests in `unified.rs` cover starts-open, consecutive trips, success resetting
  the consecutive count, cold-start and idle re-arming, retry-after backoff flooring, and the
  dual / consecutive-only probe semantics.
- `breaker/integration_tests.rs` drives the breaker task against the real `svc::Gate`,
  feeding classifications through the per-endpoint channel: trip-and-recover, HTTP
  `Retry-After` and gRPC pushback extending backoff, and a combined max-value-wins case. Two
  cases (`*_end_to_end`) route an actual response through the real classifier into the hint
  store; the rest pre-seed the store directly. (Cold-start re-arming after idle is covered in
  `unified.rs`, above, not here.)
- Tests in `http/logical/tests/failure_accrual.rs` exercise the full outbound
  stack: consecutive-failure accrual, the balancer dropping tripped endpoints from selection
  (`balancer_doesnt_select_tripped_breakers`), and per-endpoint `Retry-After` isolation with
  no cross-endpoint bleed.
- Tests in `breaker/retry_after.rs` check max-value-wins, staleness, and
  elapsed-adjustment behavior.
- success-rate configs (threshold, decay, and `min_requests` bounds).
- Leaf crates (`ewma`, `load-biaser`, classify `retry_after`) are independently
  unit-tested.
Reviewing by commit groups
The history is ordered to build bottom-up and is reviewable in roughly these groups:
- `ewma`, `load-biaser`, classify `retry_after` parsers, classify channel API.
- `FailureAccrual` enum-to-struct migration and new config types, proto bump to 0.19.0.
- `NewRetryAfterGateSet` wiring.
- `concrete/balance.rs`, `logical*`, `sidecar.rs`, `ingress.rs`.

Cross-Repo Dependencies
Requires `linkerd2-proxy-api` 0.19.0, which introduces the `SuccessRate`, `LoadBiasConfig`,
and `RetryAfterConfig` proto messages and the `FailureAccrual` struct fields used by the
config conversions.