new feature: introduce retry budget to prevent retry storm #7376

@dentiny

Description

Feature Description

Hi team, regarding the retry policy implemented in RetryLayer, currently we:

  • rely on backon for exponential backoff + jitter, with support for configuring retry intervals and the maximum number of attempts
  • can, when integrated with the timeout layer, bound the overall time spent on retries, regardless of how many attempts are made internally

But we don't have any mechanism to prevent a retry storm: when a storage backend is already overloaded, piling on more retries only makes the service worse.

One way to achieve this is to introduce a retry budget:

  • when a request succeeds, we deposit into the budget
  • when a request fails, we withdraw from the budget
  • a retry can only happen when there is sufficient budget
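The deposit/withdraw mechanics above can be sketched as a small token-bucket-style struct. This is only an illustration of the idea, not OpenDAL's actual API; all names (`RetryBudget`, `on_success`, `try_withdraw`) and parameter values are hypothetical:

```rust
// Hypothetical sketch of a retry budget: successes deposit a fractional
// token, each retry withdraws a whole one, and retries are only allowed
// while the balance can cover the withdrawal.
struct RetryBudget {
    balance: f64,     // current budget in "retry tokens"
    max_balance: f64, // cap so quiet periods can't bank unlimited retries
    deposit: f64,     // credited per successful request
    withdraw: f64,    // debited per retry attempt
}

impl RetryBudget {
    fn new(max_balance: f64, deposit: f64, withdraw: f64) -> Self {
        Self { balance: max_balance, max_balance, deposit, withdraw }
    }

    /// Called after every successful request.
    fn on_success(&mut self) {
        self.balance = (self.balance + self.deposit).min(self.max_balance);
    }

    /// Called before retrying a failed request; returns whether the
    /// retry is allowed and, if so, pays for it.
    fn try_withdraw(&mut self) -> bool {
        if self.balance >= self.withdraw {
            self.balance -= self.withdraw;
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut budget = RetryBudget::new(10.0, 0.5, 1.0);
    // Ten retries drain the initial balance...
    for _ in 0..10 {
        assert!(budget.try_withdraw());
    }
    // ...after which retries are rejected until successes refill it.
    assert!(!budget.try_withdraw());
    for _ in 0..10 {
        budget.on_success();
    }
    assert!(budget.try_withdraw());
}
```

With a deposit of 0.5 per success and a withdrawal of 1.0 per retry, at most one retry is permitted for every two successful requests once the initial balance is spent, which naturally throttles retries when the backend's failure rate climbs.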

As a reference, the AWS S3 transfer manager has such an implementation in its retry policy.

A somewhat relevant reference from the Google SRE book:

Consider having a server-wide retry budget. For example, only allow 60 retries per minute in a process, and if the retry budget is exceeded, don’t retry; just fail the request.

Reference: https://sre.google/sre-book/addressing-cascading-failures/
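The SRE-book variant quoted above is even simpler: a fixed process-wide budget of N retries per rolling window. A minimal sketch (names and the `allow_retry` entry point are assumptions for illustration, not an existing API):

```rust
use std::time::{Duration, Instant};

// Hypothetical server-wide budget: at most `limit` retries per one-minute
// window; once exhausted, callers fail fast instead of retrying.
struct PerMinuteRetryBudget {
    limit: u32,
    used: u32,
    window_start: Instant,
}

impl PerMinuteRetryBudget {
    fn new(limit: u32) -> Self {
        Self { limit, used: 0, window_start: Instant::now() }
    }

    /// Returns true and consumes one unit of budget if a retry is
    /// allowed within the current one-minute window.
    fn allow_retry(&mut self) -> bool {
        // Reset the counter when a new one-minute window begins.
        if self.window_start.elapsed() >= Duration::from_secs(60) {
            self.window_start = Instant::now();
            self.used = 0;
        }
        if self.used < self.limit {
            self.used += 1;
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut budget = PerMinuteRetryBudget::new(60);
    // The 61st retry within the same minute is rejected.
    for _ in 0..60 {
        assert!(budget.allow_retry());
    }
    assert!(!budget.allow_retry());
}
```

Compared to the deposit/withdraw scheme, this fixed window does not adapt to traffic volume, but it is trivial to reason about and matches the SRE book's "60 retries per minute" example.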

Problem and Solution

I want to avoid retry storms in production; introducing a retry budget should solve, or at least alleviate, the problem.

Additional Context

No response

Are you willing to contribute to the development of this feature?

  • Yes, I am willing to contribute to the development of this feature.

Labels: enhancement (New feature or request), releases-note/feat (The PR implements a new feature or has a title that begins with "feat")
