new feature: introduce retry budget to prevent retry storm #7376

@dentiny

Description

Feature Description

Hi team, regarding the retry policy implemented in RetryLayer, currently we:

  • rely on backon for exponential backoff + jitter, with support for configuring retry intervals and the maximum number of attempts
  • can, when integrated with the timeout layer, bound the overall time spent on retries, regardless of how many attempts are made internally

But we don't have any mechanism to prevent a retry storm: when a storage backend is already overloaded, piling on more retries only makes the service worse.

One way to achieve this is to introduce a retry budget:

  • when a request succeeds, we deposit into the budget
  • when a request fails, we withdraw from the budget
  • a retry can only happen when there is sufficient budget
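The deposit/withdraw mechanics above can be sketched as a small token-bucket-style struct. This is only an illustration of the idea, not OpenDAL's actual API; all names (`RetryBudget`, `on_success`, `try_withdraw`) and parameter values are hypothetical:

```rust
// Hypothetical sketch of a retry budget: successes deposit a fractional
// token, each retry withdraws a whole one, and retries are only allowed
// while the balance can cover the withdrawal.
struct RetryBudget {
    balance: f64,     // current budget in "retry tokens"
    max_balance: f64, // cap so quiet periods can't bank unlimited retries
    deposit: f64,     // credited per successful request
    withdraw: f64,    // debited per retry attempt
}

impl RetryBudget {
    fn new(max_balance: f64, deposit: f64, withdraw: f64) -> Self {
        Self { balance: max_balance, max_balance, deposit, withdraw }
    }

    /// Called after every successful request.
    fn on_success(&mut self) {
        self.balance = (self.balance + self.deposit).min(self.max_balance);
    }

    /// Called before retrying a failed request; returns whether the
    /// retry is allowed and, if so, pays for it.
    fn try_withdraw(&mut self) -> bool {
        if self.balance >= self.withdraw {
            self.balance -= self.withdraw;
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut budget = RetryBudget::new(10.0, 0.5, 1.0);
    // Ten retries drain the initial balance...
    for _ in 0..10 {
        assert!(budget.try_withdraw());
    }
    // ...after which retries are rejected until successes refill it.
    assert!(!budget.try_withdraw());
    for _ in 0..10 {
        budget.on_success();
    }
    assert!(budget.try_withdraw());
}
```

With a deposit of 0.5 per success and a withdrawal of 1.0 per retry, at most one retry is permitted for every two successful requests once the initial balance is spent, which naturally throttles retries when the backend's failure rate climbs.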

As a reference, the AWS S3 transfer manager has such an implementation in its retry policy.

A somewhat relevant reference from the Google SRE book:

Consider having a server-wide retry budget. For example, only allow 60 retries per minute in a process, and if the retry budget is exceeded, don’t retry; just fail the request.

Reference: https://sre.google/sre-book/addressing-cascading-failures/
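The SRE-book variant quoted above is even simpler: a fixed process-wide budget of N retries per rolling window. A minimal sketch (names and the `allow_retry` entry point are assumptions for illustration, not an existing API):

```rust
use std::time::{Duration, Instant};

// Hypothetical server-wide budget: at most `limit` retries per one-minute
// window; once exhausted, callers fail fast instead of retrying.
struct PerMinuteRetryBudget {
    limit: u32,
    used: u32,
    window_start: Instant,
}

impl PerMinuteRetryBudget {
    fn new(limit: u32) -> Self {
        Self { limit, used: 0, window_start: Instant::now() }
    }

    /// Returns true and consumes one unit of budget if a retry is
    /// allowed within the current one-minute window.
    fn allow_retry(&mut self) -> bool {
        // Reset the counter when a new one-minute window begins.
        if self.window_start.elapsed() >= Duration::from_secs(60) {
            self.window_start = Instant::now();
            self.used = 0;
        }
        if self.used < self.limit {
            self.used += 1;
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut budget = PerMinuteRetryBudget::new(60);
    // The 61st retry within the same minute is rejected.
    for _ in 0..60 {
        assert!(budget.allow_retry());
    }
    assert!(!budget.allow_retry());
}
```

Compared to the deposit/withdraw scheme, this fixed window does not adapt to traffic volume, but it is trivial to reason about and matches the SRE book's "60 retries per minute" example.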

Problem and Solution

I want to avoid retry storms in production; introducing a retry budget should solve, or at least alleviate, the problem.

Additional Context

No response

Are you willing to contribute to the development of this feature?

  • Yes, I am willing to contribute to the development of this feature.

Labels: enhancement (New feature or request), releases-note/feat (The PR implements a new feature or has a title that begins with "feat")
