feat(policy): support load-bias and retry-after balancer annotations #15317

Open
unleashed wants to merge 9 commits into main from amr/load-bias-retry-after

@unleashed

Add control-plane support for two new Service annotations that configure
advanced load balancing behavior in the proxy's outbound path:

  • Load biasing penalizes recently-failed endpoints, shifting traffic toward
    healthier backends. Controlled by exponential-decay penalty parameters.
  • Retry-After honors Retry-After headers (HTTP 429/503) and
    grpc-retry-pushback-ms trailers, temporarily removing overloaded endpoints
    from the load balancer pool.
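
As a rough illustration of the exponential-decay penalty named in the first bullet, the sketch below halves a penalty once per half-life. The function name and the choice to model the penalty as a `Duration` are assumptions made for this example; the proxy's actual formula and the way it feeds the penalty into endpoint selection are not described in this PR.

```rust
use std::time::Duration;

/// Hypothetical helper, not the proxy's code: the penalty remaining after
/// `elapsed`, given the initial penalty and a decay half-life.
fn decayed_penalty(initial: Duration, half_life: Duration, elapsed: Duration) -> Duration {
    // The annotations require durations > 0, so a zero half-life is a bug here.
    assert!(!half_life.is_zero(), "half-life must be greater than zero");
    // Every elapsed half-life halves what is left of the penalty.
    let half_lives = elapsed.as_secs_f64() / half_life.as_secs_f64();
    initial.mul_f64(0.5f64.powf(half_lives))
}
```

With the documented defaults (5s penalty, 10s half-life), this model leaves about 2.5s of penalty ten seconds after a failure and about 1.25s after twenty.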

Both features are opt-in via balancer.alpha.linkerd.io/* annotations and have
no effect on unannotated resources. They complement the existing
balancer.linkerd.io/failure-accrual consecutive-failures circuit breaker.
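
To make the Retry-After behavior concrete, here is a minimal sketch of how a requested hold-out time could be capped. Everything in it is an assumption for illustration: the function name, the precedence of the header over the trailer, and the handling of only the delta-seconds form of `Retry-After` (the HTTP-date form is ignored). The real logic lives in the proxy and may differ.

```rust
use std::time::Duration;

/// Hypothetical helper, not the proxy's code: how long to keep an endpoint
/// out of the pool, given raw header/trailer values and a configured cap.
fn hold_out_duration(
    retry_after: Option<&str>,
    pushback_ms: Option<&str>,
    max: Duration,
) -> Option<Duration> {
    // `Retry-After: 120` means 120 seconds; non-numeric values are skipped.
    let from_header = retry_after
        .and_then(|v| v.trim().parse::<u64>().ok())
        .map(Duration::from_secs);
    // `grpc-retry-pushback-ms: 1500` means 1500 ms; negative values fail to
    // parse as u64 and are skipped in this sketch.
    let from_trailer = pushback_ms
        .and_then(|v| v.trim().parse::<u64>().ok())
        .map(Duration::from_millis);
    // Assumed precedence: header first, then trailer.
    let requested = from_header.or(from_trailer)?;
    // Never honor more than the configured maximum.
    Some(requested.min(max))
}
```

Under these assumptions, a backend asking for an hour is held out for only the configured maximum (300s by default).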

This branch is the configuration foundation for the broader load-biaser and
circuit-breaking feature set. Further changes in subsequent branches will add:

  • Unified circuit breaker with success-rate accrual (in addition to
    consecutive failures)
  • Endpoint ejection protection, with last-N endpoint safeguards preventing the
    circuit breaker from emptying a pool.

Additionally, duration parsing is now stricter. It rejects negative values,
fractional values (suggesting an equivalent in a smaller unit, such as ms),
bare numbers without units, and values that overflow.
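
The rejection rules above could be sketched roughly as follows. This is not the controller's `parse_duration`: the function name, error strings, and hint wording are made up for the example, and accepting a bare `0` is an inference from the commit message, which only requires a unit on non-zero bare numbers.

```rust
use std::time::Duration;

/// Hypothetical strict parser illustrating the rejection rules.
fn parse_duration_strict(s: &str) -> Result<Duration, String> {
    let s = s.trim();
    if s.starts_with('-') {
        return Err(format!("negative durations are not supported: '{s}'"));
    }
    // Split the leading numeric part from the unit suffix.
    let idx = s
        .find(|c: char| !c.is_ascii_digit() && c != '.')
        .unwrap_or(s.len());
    let (num, unit) = s.split_at(idx);
    if num.contains('.') {
        return Err(format!(
            "fractional durations are not supported: '{s}' (use a smaller unit, e.g. '500ms' instead of '0.5s')"
        ));
    }
    // Values too large for u64 are rejected here.
    let n: u64 = num
        .parse()
        .map_err(|e| format!("invalid duration '{s}': {e}"))?;
    let ms_per_unit: u64 = match unit {
        "" if n == 0 => return Ok(Duration::ZERO),
        "" => return Err(format!("missing unit in '{s}' (did you mean '{n}s'?)")),
        "ms" => 1,
        "s" => 1_000,
        "m" => 60_000,
        "h" => 3_600_000,
        "d" => 86_400_000,
        other => return Err(format!("unknown duration unit '{other}'")),
    };
    // Values that overflow once converted to milliseconds are rejected here.
    n.checked_mul(ms_per_unit)
        .map(Duration::from_millis)
        .ok_or_else(|| format!("duration '{s}' overflows"))
}
```

In this sketch `500ms`, `2h`, and `1d` parse, while `-5s`, `0.5s`, and `10` each return an error that names the problem.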

Annotations

These features are new and are exposed through alpha-level annotations so
that their maturity level is explicit to operators.

| Annotation | Type | Default | Validation | Notes |
| --- | --- | --- | --- | --- |
| balancer.alpha.linkerd.io/load-bias | "true" or "false" | not set (disabled) | Mode must be exactly "true" or "false" | Enables load biasing on the Service's balancer. "false" and absent are equivalent. |
| balancer.alpha.linkerd.io/load-bias-penalty | duration (10s, 500ms, etc.) | 5s | Must be > 0; supports ms, s, m, h, d units | Duration of the penalty applied to a failing endpoint. Only read when load-bias is "true". |
| balancer.alpha.linkerd.io/load-bias-penalty-decay | duration | 10s | Must be > 0; same units as above | Half-life for exponential decay of the penalty. Only read when load-bias is "true". |
| balancer.alpha.linkerd.io/retry-after | "true" or "false" | not set (disabled) | Mode must be exactly "true" or "false" | Enables Retry-After / grpc-retry-pushback-ms handling. |
| balancer.alpha.linkerd.io/retry-after-max-duration | duration | 300s (5 min) | Must be > 0; same units as above | Cap on how long an endpoint can be held out of the pool via Retry-After. Only read when retry-after is "true". |
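
The following sketch shows one way the retry-after annotations could map to a typed config, following the "read annotation, validate mode, parse sub-annotation" flow. The annotation keys and the 300s default come from the table; the struct field, the function signature, and the whole-seconds `parse_secs` stand-in are assumptions for illustration, not the code in this PR.

```rust
use std::collections::BTreeMap;
use std::time::Duration;

const RETRY_AFTER: &str = "balancer.alpha.linkerd.io/retry-after";
const RETRY_AFTER_MAX: &str = "balancer.alpha.linkerd.io/retry-after-max-duration";
const DEFAULT_RETRY_AFTER_MAX_DURATION: Duration = Duration::from_secs(300);

/// Assumed shape; the real struct's fields are not shown in this PR page.
#[derive(Debug, PartialEq)]
struct RetryAfterConfig {
    max_duration: Duration,
}

/// Stand-in for the real duration parser: whole seconds only (e.g. "30s").
fn parse_secs(s: &str) -> Result<Duration, String> {
    let n = s
        .strip_suffix('s')
        .ok_or_else(|| format!("expected a value like '30s', got '{s}'"))?;
    n.parse::<u64>()
        .map(Duration::from_secs)
        .map_err(|e| format!("invalid duration '{s}': {e}"))
}

/// Hypothetical sketch: returns `None` when the feature is not enabled.
fn parse_retry_after_sketch(
    annotations: &BTreeMap<String, String>,
) -> Result<Option<RetryAfterConfig>, String> {
    // Absent and "false" are equivalent; anything but "true" is an error.
    match annotations.get(RETRY_AFTER).map(String::as_str) {
        None | Some("false") => return Ok(None),
        Some("true") => {}
        Some(other) => {
            return Err(format!(
                "{} must be \"true\" or \"false\", got \"{}\"",
                RETRY_AFTER, other
            ))
        }
    }
    // The sub-annotation is only read once the mode is "true".
    let max_duration = match annotations.get(RETRY_AFTER_MAX) {
        Some(v) => parse_secs(v)?,
        None => DEFAULT_RETRY_AFTER_MAX_DURATION,
    };
    if max_duration.is_zero() {
        return Err(format!("{} must be greater than zero", RETRY_AFTER_MAX));
    }
    Ok(Some(RetryAfterConfig { max_duration }))
}
```

An unannotated Service yields `Ok(None)` here, which matches the stated guarantee that the features have no effect on unannotated resources.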

Validation scope

These annotations are read from Service objects at indexer time.
The admission webhook validates them on Route resources (HTTPRoute,
GRPCRoute) that reference a Service parent, but does not intercept
core/v1 Services directly, matching the existing behavior for
balancer.linkerd.io/failure-accrual and timeout.linkerd.io/*.
Parse errors on Service annotations surface as controller log warnings
and fall back to the unconfigured default.
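
The difference between the two paths can be sketched as one parse function with two callers. All names here are hypothetical, and `eprintln!` stands in for whatever logging the controller actually uses.

```rust
/// Hypothetical shared parser: accepts only "true", "false", or absent.
fn parse_mode(value: Option<&str>) -> Result<bool, String> {
    match value {
        None | Some("false") => Ok(false),
        Some("true") => Ok(true),
        Some(other) => Err(format!(
            "mode must be \"true\" or \"false\", got \"{other}\""
        )),
    }
}

/// Admission path: the error is returned, so the apply is rejected.
fn admit(value: Option<&str>) -> Result<(), String> {
    parse_mode(value).map(|_| ())
}

/// Indexer path: the error is logged and the unconfigured default is used.
fn index(value: Option<&str>) -> bool {
    parse_mode(value).unwrap_or_else(|error| {
        eprintln!("warning: ignoring invalid balancer annotation: {error}");
        false
    })
}
```

Because both callers share `parse_mode`, a value the webhook would reject is exactly a value the indexer would warn about and ignore.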

EgressNetwork resources ignore these annotations (they use Forward
backends, not Balancer) and log a warning if they are set.

Proto dependency

Requires linkerd2-proxy-api 0.19.0, which restructures
FailureAccrual from oneof kind to flat consecutive_failures +
success_rate fields and adds LoadBiasConfig, RetryAfterConfig,
and BalanceP2c.ejection. The wire encoding is backwards-compatible
(field numbers preserved), and new optional fields are omitted when
unset and ignored by older proxies.

For details, see linkerd/linkerd2-proxy-api#556.

Data plane implementation

The corresponding data-plane implementation for this PR is in linkerd/linkerd2-proxy#4537.

unleashed added 7 commits May 26, 2026 19:22
Bump the workspace linkerd2-proxy-api dependency from 0.18.0 to 0.19.0,
which includes the new LoadBiasConfig, RetryAfterConfig, and ejection
proto messages along with the FailureAccrual restructuring from oneof
to direct fields.

Signed-off-by: Alejandro Martinez Ruiz <amr@buoyant.io>
Add LoadBiasConfig, RetryAfterConfig, and their associated default
constants (DEFAULT_LOAD_BIAS_PENALTY, DEFAULT_LOAD_BIAS_PENALTY_DECAY,
DEFAULT_RETRY_AFTER_MAX_DURATION) alongside the existing Backoff type.
OutboundPolicy gains load_bias and retry_after fields, both defaulting
to None.

Signed-off-by: Alejandro Martinez Ruiz <amr@buoyant.io>
…ation

to_proto() now converts LoadBiasConfig and RetryAfterConfig into their
proto representations and passes their values through every HTTP and
gRPC protocol dispatch site.

Each protocol function receives the two new Option parameters after the
existing failure_accrual argument and passes them into the Http1, Http2,
and Grpc proto struct constructions. Set the new ejection field to None
on every BalanceP2c initializer since endpoint ejection is not yet
configured through annotations.

Signed-off-by: Alejandro Martinez Ruiz <amr@buoyant.io>
The proto-api branch changed FailureAccrual from a kind-oneof to direct
consecutive_failures and success_rate fields. Update the test helper
to match.

Signed-off-by: Alejandro Martinez Ruiz <amr@buoyant.io>
Reject negative durations early, reject fractional values with
actionable suggestions (e.g. try '500ms' instead of '0.5s'), and
require a unit suffix on non-zero bare numbers with a hint suggesting
the likely intent. Existing accepted inputs are unchanged.

Add unit tests for all parse_duration code paths including the new
rejection conditions and the previously untested "h" and "d" units.

Signed-off-by: Alejandro Martinez Ruiz <amr@buoyant.io>
Add parsing functions to the outbound index for load-bias and
retry-after. Both follow the parse_accrual_config pattern: read
annotation, validate mode, parse sub-annotations with parse_duration,
return typed config.

Update outbound_api test helpers to account for the new load_bias and
retry_after fields in the proto structs.

Signed-off-by: Alejandro Martinez Ruiz <amr@buoyant.io>
Extend Validate implementations to call parse_load_bias_config and
parse_retry_after_config. Since the admission webhook reuses the same
parse functions as the indexer, every rejection is automatically
enforced at apply time.

Signed-off-by: Alejandro Martinez Ruiz <amr@buoyant.io>
@unleashed unleashed requested a review from a team as a code owner May 26, 2026 17:35
unleashed added 2 commits May 26, 2026 19:41
Add E2E integration tests exercising the full annotation-to-gRPC
pipeline for load bias and retry-after.

Signed-off-by: Alejandro Martinez Ruiz <amr@buoyant.io>
The ValidatingWebhookConfiguration intercepts Route resources but not
core/v1 Services, so balancer annotations set directly on a Service
are only parsed at indexer time. Invalid values surface as controller
log warnings and fall back to defaults rather than being rejected at
apply time.

Document this for future contributors.

Signed-off-by: Alejandro Martinez Ruiz <amr@buoyant.io>
@unleashed unleashed force-pushed the amr/load-bias-retry-after branch from b4c407f to dd457c4 Compare May 26, 2026 17:41
@unleashed

Edit: fixed build failure due to missing trait import.

bail!("{s} value is sub-millisecond; minimum resolution is 1ms");
}
} else {
bail!("fractional values not supported for duration unit '{unit}'");

by my reading, we'll always bail for fractional values for one reason or another. why bother with all these different reasons and not just say "fractional values not supported"?

Comment on lines 632 to +634
outbound_index::parse_accrual_config(annotations)?;
outbound_index::parse_load_bias_config(annotations)?;
outbound_index::parse_retry_after_config(annotations)?;

I actually think this is an error that we validate accrual config on HttpRoutes here, because accrual can only be set on Services or EgressNetworks. Should be the same for load bias and retry after.

@@ -194,17 +200,52 @@ where
pub fn failure_accrual_consecutive(
accrual: Option<&grpc::outbound::FailureAccrual>,
) -> &grpc::outbound::failure_accrual::ConsecutiveFailures {

I expect this function to change substantially based on linkerd/linkerd2-proxy-api#559
