Feature Description
Hi team, regarding the retry policy implemented in RetryLayer, we currently:
- rely on backon for exponential backoff + jitter, with a configurable retry interval and maximum number of attempts
- can bound the overall duration of retries when RetryLayer is combined with the timeout layer, no matter how many attempts have been made internally
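To make the current behaviour concrete, here is a minimal, std-only sketch of exponential backoff with "full jitter", plus an outer time limit standing in for the timeout layer. The function backoff_delays, its parameters and the tiny XorShift generator are hypothetical; they do not mirror backon's API or its exact jitter formula, and the outer limit is simplified to count only sleep time, not request time.

```rust
use std::time::Duration;

/// Tiny xorshift64 PRNG so the sketch needs no external crates (illustrative only).
struct XorShift(u64);

impl XorShift {
    /// Returns a pseudo-random f64 in [0, 1).
    fn next_f64(&mut self) -> f64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        // Keep the top 53 bits so the value fits exactly in an f64 mantissa.
        (self.0 >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Hypothetical helper: exponential backoff with full jitter, i.e. retry n sleeps
/// a random duration in [0, min(max_delay, min_delay * factor^n)).
/// Delays stop once their sum would exceed `total_timeout`, mimicking an outer
/// timeout that bounds the whole retry loop regardless of `max_times`.
fn backoff_delays(
    min_delay: Duration,
    factor: f64,
    max_delay: Duration,
    max_times: usize,
    total_timeout: Duration,
    seed: u64,
) -> Vec<Duration> {
    let mut rng = XorShift(seed.max(1)); // xorshift state must be non-zero
    let mut delays = Vec::new();
    let mut elapsed = Duration::ZERO;
    for n in 0..max_times {
        let cap = (min_delay.as_secs_f64() * factor.powi(n as i32)).min(max_delay.as_secs_f64());
        let delay = Duration::from_secs_f64(cap * rng.next_f64());
        if elapsed + delay > total_timeout {
            break; // the outer timeout would fire before this retry could run
        }
        elapsed += delay;
        delays.push(delay);
    }
    delays
}

fn main() {
    let delays = backoff_delays(
        Duration::from_millis(100),
        2.0,
        Duration::from_secs(5),
        10,
        Duration::from_secs(3),
        42,
    );
    let total: Duration = delays.iter().sum();
    println!("{} retries, {:?} total backoff", delays.len(), total);
}
```

Nothing in this loop looks at how other requests are doing: every caller retries independently, which is what allows a retry storm against an already overloaded backend.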
However, we don't have any mechanism to prevent retry storms: when a storage backend is already overloaded, sending more retries only makes the service worse.
One way to achieve this is to introduce a retry budget (a token bucket):
- when a request succeeds, we deposit tokens into the budget
- when a request fails, we withdraw tokens from the budget
- a retry is only allowed when there are sufficient tokens in the budget
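The deposit/withdraw rules above can be sketched as a token bucket shared by all requests. RetryBudget and call_with_budget are hypothetical names, not existing OpenDAL APIs, and the capacity and per-request amounts are illustrative placeholders rather than proposed defaults.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical retry budget shared by all requests of one client.
/// Values passed to `new` are illustrative, not proposed defaults.
pub struct RetryBudget {
    tokens: AtomicU64,
    capacity: u64,
    deposit_on_success: u64,
    withdraw_on_failure: u64,
}

impl RetryBudget {
    /// Starts with a full bucket.
    pub fn new(capacity: u64, deposit_on_success: u64, withdraw_on_failure: u64) -> Self {
        Self {
            tokens: AtomicU64::new(capacity),
            capacity,
            deposit_on_success,
            withdraw_on_failure,
        }
    }

    /// A request succeeded: deposit tokens, never exceeding `capacity`.
    pub fn deposit(&self) {
        // The closure always returns Some, so fetch_update cannot fail here.
        let _ = self.tokens.fetch_update(Ordering::AcqRel, Ordering::Acquire, |t| {
            Some(t.saturating_add(self.deposit_on_success).min(self.capacity))
        });
    }

    /// A request failed: atomically withdraw tokens. Returns true if the balance
    /// was sufficient (a retry may happen), false otherwise.
    pub fn try_withdraw(&self) -> bool {
        self.tokens
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |t| {
                t.checked_sub(self.withdraw_on_failure)
            })
            .is_ok()
    }

    pub fn balance(&self) -> u64 {
        self.tokens.load(Ordering::Acquire)
    }
}

/// Hypothetical retry loop: `max_times` still bounds the number of retries,
/// but each retry also requires a successful withdrawal from the shared budget.
/// Backoff sleeps are omitted for brevity.
pub fn call_with_budget<T, E>(
    budget: &RetryBudget,
    max_times: usize,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut retries = 0;
    loop {
        match op() {
            Ok(v) => {
                budget.deposit();
                return Ok(v);
            }
            Err(e) => {
                // Give up if retries are exhausted or the budget is drained.
                if retries >= max_times || !budget.try_withdraw() {
                    return Err(e);
                }
                retries += 1;
            }
        }
    }
}

fn main() {
    let budget = RetryBudget::new(10, 1, 5);
    let mut calls = 0;
    let result: Result<(), &str> = call_with_budget(&budget, 3, || {
        calls += 1;
        Err("backend overloaded")
    });
    println!("{result:?} after {calls} calls, balance {}", budget.balance());
}
```

Because the bucket is shared, a burst of failures drains it quickly and further retries are refused until successes refill it, while an occasional failure against a healthy backend is still retried.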
As a reference, the AWS S3 transfer manager implements this in its retry policy, and tower provides a similar mechanism in its TpsBudget.
A related recommendation from the Google SRE book:
Consider having a server-wide retry budget. For example, only allow 60 retries per minute in a process, and if the retry budget is exceeded, don’t retry; just fail the request.
Reference: https://sre.google/sre-book/addressing-cascading-failures/
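The SRE book variant caps retries per time window instead of tying them to successes. A minimal sketch with a fixed one-minute window follows; WindowedRetryBudget is a hypothetical name, and fixed-window counting is only one option (a sliding window, or a token bucket refilling at one token per second, would smooth out bursts at window boundaries). The current time is passed in explicitly so the logic is easy to test.

```rust
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Hypothetical process-wide retry limiter: at most `max_retries` per `window`
/// (e.g. 60 per minute, as in the SRE book example).
pub struct WindowedRetryBudget {
    max_retries: u32,
    window: Duration,
    state: Mutex<(Instant, u32)>, // (window start, retries used in this window)
}

impl WindowedRetryBudget {
    pub fn new(max_retries: u32, window: Duration, now: Instant) -> Self {
        Self { max_retries, window, state: Mutex::new((now, 0)) }
    }

    /// Returns true if a retry may proceed at `now`; otherwise the caller
    /// should not retry and just fail the request.
    pub fn try_acquire(&self, now: Instant) -> bool {
        let mut state = self.state.lock().unwrap();
        if now.duration_since(state.0) >= self.window {
            *state = (now, 0); // the previous window has expired: start a new one
        }
        if state.1 < self.max_retries {
            state.1 += 1;
            true
        } else {
            false
        }
    }
}

fn main() {
    let start = Instant::now();
    let budget = WindowedRetryBudget::new(60, Duration::from_secs(60), start);
    let allowed = (0..100).filter(|_| budget.try_acquire(start)).count();
    println!("{allowed} of 100 retries allowed in the first window");
}
```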
Problem and Solution
I want to avoid retry storms in production, and introducing a retry budget should solve or at least alleviate the problem.
Additional Context
No response
Are you willing to contribute to the development of this feature?