feat(policy): support load-bias and retry-after balancer annotations#15317
Open
unleashed wants to merge 9 commits into
Bump the workspace linkerd2-proxy-api dependency from 0.18.0 to 0.19.0, which includes the new LoadBiasConfig, RetryAfterConfig, and ejection proto messages along with the FailureAccrual restructuring from oneof to direct fields. Signed-off-by: Alejandro Martinez Ruiz <amr@buoyant.io>
Add LoadBiasConfig, RetryAfterConfig, and their associated default constants (DEFAULT_LOAD_BIAS_PENALTY, DEFAULT_LOAD_BIAS_PENALTY_DECAY, DEFAULT_RETRY_AFTER_MAX_DURATION) alongside the existing Backoff type. OutboundPolicy gains load_bias and retry_after fields, both defaulting to None. Signed-off-by: Alejandro Martinez Ruiz <amr@buoyant.io>
…ation to_proto() now converts LoadBiasConfig and RetryAfterConfig into their proto representations. Use their actual values through every HTTP and gRPC protocol dispatch site. Each protocol function receives the two new Option parameters after the existing failure_accrual argument and passes them into the Http1, Http2, and Grpc proto struct constructions. Set the new ejection field to None on every BalanceP2c initializer since endpoint ejection is not yet configured through annotations. Signed-off-by: Alejandro Martinez Ruiz <amr@buoyant.io>
The proto-api branch changed FailureAccrual from a kind-oneof to direct consecutive_failures and success_rate fields. Update the test helper to match. Signed-off-by: Alejandro Martinez Ruiz <amr@buoyant.io>
Reject negative durations early, reject fractional values with actionable suggestions (e.g. try '500ms' instead of '0.5s'), and require a unit suffix on non-zero bare numbers, with a hint suggesting the likely intent. Existing accepted inputs are unchanged. Add unit tests for all parse_duration code paths, including the new rejection conditions and the previously untested "h" and "d" units. Signed-off-by: Alejandro Martinez Ruiz <amr@buoyant.io>
Add parsing functions to the outbound index for load-bias and retry-after. Both follow the parse_accrual_config pattern: read annotation, validate mode, parse sub-annotations with parse_duration, return typed config. Update outbound_api test helpers to account for the new load_bias and retry_after fields in the proto structs. Signed-off-by: Alejandro Martinez Ruiz <amr@buoyant.io>
Extend Validate implementations to call parse_load_bias_config and parse_retry_after_config. Since the admission webhook reuses the same parse functions as the indexer, every rejection is automatically enforced at apply time. Signed-off-by: Alejandro Martinez Ruiz <amr@buoyant.io>
Add E2E integration tests exercising the full annotation-to-gRPC pipeline for load bias and retry-after. Signed-off-by: Alejandro Martinez Ruiz <amr@buoyant.io>
The ValidatingWebhookConfiguration intercepts Route resources but not core/v1 Services, so balancer annotations set directly on a Service are only parsed at indexer time. Invalid values surface as controller log warnings and fall back to defaults rather than being rejected at apply time. Document this for future contributors. Signed-off-by: Alejandro Martinez Ruiz <amr@buoyant.io>
b4c407f to dd457c4
Member
Author
Edit: fixed build failure due to missing trait import.
adleong reviewed May 27, 2026
        bail!("{s} value is sub-millisecond; minimum resolution is 1ms");
    }
} else {
    bail!("fractional values not supported for duration unit '{unit}'");
Member
by my reading, we'll always bail for fractional values for one reason or another. why bother with all these different reasons and not just say "fractional values not supported"?
Comment on lines 632 to +634
outbound_index::parse_accrual_config(annotations)?;
outbound_index::parse_load_bias_config(annotations)?;
outbound_index::parse_retry_after_config(annotations)?;
Member
I actually think this is an error that we validate accrual config on HttpRoutes here, because accrual can only be set on Services or EgressNetworks. Should be the same for load bias and retry after.
@@ -194,17 +200,52 @@ where
pub fn failure_accrual_consecutive(
    accrual: Option<&grpc::outbound::FailureAccrual>,
) -> &grpc::outbound::failure_accrual::ConsecutiveFailures {
Member
I expect this function to change substantially based on linkerd/linkerd2-proxy-api#559
Add control-plane support for two new Service annotations that configure
advanced load balancing behavior in the proxy's outbound path:
- load-bias: […] healthier backends. Controlled by exponential-decay penalty parameters.
- retry-after: […] grpc-retry-pushback-ms trailers, temporarily removing overloaded endpoints
  from the load balancer pool.
Both features are opt-in via balancer.alpha.linkerd.io/* annotations and have
no effect on unannotated resources. They complement the existing
balancer.linkerd.io/failure-accrual consecutive-failures circuit breaker.

This branch is the configuration foundation for the broader load-biaser and
circuit-breaking feature set. Further changes in subsequent branches will add:
- […] circuit breaker from emptying a pool.
Additionally, duration parsing has been hardened: it now rejects
negative values, fractional seconds (suggesting the ms equivalent), bare numbers
without units, and overflowing values.
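The rejection rules above can be illustrated with a minimal sketch. This is not the PR's parse_duration; the function body, error wording, and the millisecond-based arithmetic are assumptions made for illustration. Only the accepted units (ms, s, m, h, d) and the rejected cases come from the description.

```rust
use std::time::Duration;

/// Hypothetical strict duration parser; a sketch, not the controller's code.
fn parse_duration(s: &str) -> Result<Duration, String> {
    let s = s.trim();
    // Reject negative values before any further parsing.
    if s.starts_with('-') {
        return Err(format!("negative durations are not supported: '{s}'"));
    }
    // Split into the leading numeric part (digits and '.') and the unit suffix.
    let split = s
        .find(|c: char| !c.is_ascii_digit() && c != '.')
        .unwrap_or(s.len());
    let (num, unit) = s.split_at(split);
    if num.is_empty() {
        return Err(format!("missing numeric value in '{s}'"));
    }
    // Reject fractional values and point at a smaller unit instead.
    if num.contains('.') {
        return Err(format!(
            "fractional values are not supported: '{s}'; use a smaller unit (e.g. '500ms' instead of '0.5s')"
        ));
    }
    let n = num
        .parse::<u64>()
        .map_err(|e| format!("invalid number '{num}': {e}"))?;
    let millis_per_unit: u64 = match unit {
        // A bare zero is unambiguous; any other bare number needs a unit.
        "" if n == 0 => return Ok(Duration::ZERO),
        "" => return Err(format!("missing unit; did you mean '{n}s' or '{n}ms'?")),
        "ms" => 1,
        "s" => 1_000,
        "m" => 60_000,
        "h" => 3_600_000,
        "d" => 86_400_000,
        other => return Err(format!("unknown unit '{other}'")),
    };
    // Reject values that overflow when converted to milliseconds.
    n.checked_mul(millis_per_unit)
        .map(Duration::from_millis)
        .ok_or_else(|| format!("duration overflows: '{s}'"))
}
```

In this sketch, inputs such as "500ms", "1d", and a bare "0" are accepted, while "-5s", "0.5s", and "10" each return an error that names the problem.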
Annotations
These features are new and are exposed through alpha-level annotations, so that
operators can set their expectations according to that explicit maturity level.
| Annotation | Values | Default | Notes |
|---|---|---|---|
| balancer.alpha.linkerd.io/load-bias | "true" \| "false" | | […] "true" or "false". "false" and absent are equivalent. |
| balancer.alpha.linkerd.io/load-bias-penalty | […] 10s, 500ms, etc.) | 5s | ms, s, m, h, d units. […] load-bias is "true". |
| balancer.alpha.linkerd.io/load-bias-penalty-decay | | 10s | […] load-bias is "true". |
| balancer.alpha.linkerd.io/retry-after | "true" \| "false" | | […] "true" or "false" |
| balancer.alpha.linkerd.io/retry-after-max-duration | | 300s (5 min) | […] retry-after is "true". |

Validation scope
These annotations are read from Service objects at indexer time.
The admission webhook validates them on Route resources (HTTPRoute,
GRPCRoute) that reference a Service parent, but does not intercept
core/v1 Services directly, matching the existing behavior for
balancer.linkerd.io/failure-accrual and timeout.linkerd.io/*.
Parse errors on Service annotations surface as controller log warnings
and fall back to the unconfigured default.
EgressNetwork resources ignore these annotations (they use Forward
backends, not Balancer) and log a warning if they are set.
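As an illustration of the indexer-time behavior described above (validate the mode, parse sub-annotations, warn and fall back on error), here is a minimal sketch for the load-bias annotations. The annotation keys and default values are taken from the Annotations section; the struct field names, the helper load_bias_or_default, the simplified parse_duration stand-in, and the choice to ignore sub-annotations unless load-bias is "true" are all assumptions, not the PR's actual code.

```rust
use std::collections::BTreeMap;
use std::time::Duration;

// Annotation keys and defaults as listed in the PR description.
const LOAD_BIAS: &str = "balancer.alpha.linkerd.io/load-bias";
const LOAD_BIAS_PENALTY: &str = "balancer.alpha.linkerd.io/load-bias-penalty";
const LOAD_BIAS_PENALTY_DECAY: &str = "balancer.alpha.linkerd.io/load-bias-penalty-decay";
const DEFAULT_LOAD_BIAS_PENALTY: Duration = Duration::from_secs(5);
const DEFAULT_LOAD_BIAS_PENALTY_DECAY: Duration = Duration::from_secs(10);

/// Field names here are hypothetical.
#[derive(Debug, Clone, PartialEq)]
struct LoadBiasConfig {
    penalty: Duration,
    penalty_decay: Duration,
}

/// Simplified stand-in for the controller's duration parser (ms and s only).
fn parse_duration(s: &str) -> Result<Duration, String> {
    let (num, millis_per_unit) = if let Some(n) = s.strip_suffix("ms") {
        (n, 1u64)
    } else if let Some(n) = s.strip_suffix('s') {
        (n, 1_000u64)
    } else {
        return Err(format!("unsupported duration '{s}'"));
    };
    let n = num
        .parse::<u64>()
        .map_err(|e| format!("invalid duration '{s}': {e}"))?;
    n.checked_mul(millis_per_unit)
        .map(Duration::from_millis)
        .ok_or_else(|| format!("duration overflows: '{s}'"))
}

/// Read the mode annotation, validate it, then parse the sub-annotations.
fn parse_load_bias_config(
    annotations: &BTreeMap<String, String>,
) -> Result<Option<LoadBiasConfig>, String> {
    match annotations.get(LOAD_BIAS).map(String::as_str) {
        // "false" and absent are equivalent: the feature stays off.
        None | Some("false") => return Ok(None),
        Some("true") => {}
        Some(other) => {
            return Err(format!(
                "{LOAD_BIAS} must be \"true\" or \"false\", got '{other}'"
            ))
        }
    }
    // Use the documented default when a sub-annotation is absent.
    let duration_or = |key: &str, default: Duration| match annotations.get(key) {
        Some(v) => parse_duration(v).map_err(|e| format!("{key}: {e}")),
        None => Ok(default),
    };
    Ok(Some(LoadBiasConfig {
        penalty: duration_or(LOAD_BIAS_PENALTY, DEFAULT_LOAD_BIAS_PENALTY)?,
        penalty_decay: duration_or(LOAD_BIAS_PENALTY_DECAY, DEFAULT_LOAD_BIAS_PENALTY_DECAY)?,
    }))
}

/// Indexer-style handling: log a warning and fall back to "unconfigured".
fn load_bias_or_default(annotations: &BTreeMap<String, String>) -> Option<LoadBiasConfig> {
    parse_load_bias_config(annotations).unwrap_or_else(|error| {
        eprintln!("warning: ignoring invalid load-bias annotations: {error}");
        None
    })
}
```

Keeping the fallible parse function separate from the warn-and-fall-back wrapper mirrors the split the description draws: an admission webhook could surface the Err directly, while an indexer-style caller degrades to the unconfigured default.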
Proto dependency
Requires linkerd2-proxy-api 0.19.0, which restructures FailureAccrual from
oneof kind to flat consecutive_failures + success_rate fields and adds
LoadBiasConfig, RetryAfterConfig, and BalanceP2c.ejection. The wire encoding is
backwards-compatible (field numbers preserved), and new optional fields are
omitted when unset and ignored by older proxies.
For details check out linkerd/linkerd2-proxy-api#556.
Data plane implementation
The matching data-plane implementation for this PR is in linkerd/linkerd2-proxy#4537.