
fix(flagd): back off before stream reconnect #391

Open

looooown2006 wants to merge 1 commit into open-feature:main from looooown2006:fix/flagd-stream-reconnect-backoff

Conversation

@looooown2006

Summary

Fixes the flagd provider's stream reconnect loop so that both gRPC stream implementations apply an application-level delay before creating the next stream after a stream error or normal stream completion.

Changes:

  • GrpcResolver now waits retry_backoff_max_ms before recreating the RPC EventStream.
  • GrpcWatcher now waits retry_backoff_max_ms before recreating the in-process SyncFlags stream.
  • Shutdown uses an event-backed wait, so shutdown can interrupt the backoff instead of blocking until the timeout expires (see the sketch after this list).
  • Added regression coverage for the RPC and in-process stream error/completion reconnect paths.
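
For context, a minimal sketch of the reconnect loop this PR describes. The class name, listen signature, and open_stream callback are illustrative assumptions; only _wait_before_reconnect, retry_backoff_max_seconds, and the threading.Event usage come from the diff:

import threading
from typing import Callable, Iterable

class StreamLoop:
    # Illustrative loop; GrpcResolver and GrpcWatcher each apply this
    # pattern to their own stream type (EventStream / SyncFlags).
    def __init__(self, retry_backoff_max_ms: int) -> None:
        self.retry_backoff_max_seconds = retry_backoff_max_ms * 0.001
        self._shutdown_event = threading.Event()

    def _wait_before_reconnect(self) -> None:
        # Event.wait returns as soon as shutdown() sets the event, so
        # shutdown is not blocked for the full backoff duration.
        self._shutdown_event.wait(self.retry_backoff_max_seconds)

    def shutdown(self) -> None:
        self._shutdown_event.set()

    def listen(self, open_stream: Callable[[], Iterable[object]]) -> None:
        while not self._shutdown_event.is_set():
            try:
                for message in open_stream():
                    pass  # handle stream messages
            except Exception:
                pass  # log the stream error
            # back off whether the stream errored or completed normally
            self._wait_before_reconnect()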

Closes #390.

Verification

uv run --frozen pytest tests/test_grpc_watcher.py tests/test_grpc_resolver.py -q
# 10 passed
uv run --frozen pytest tests/test_config.py tests/test_errors.py tests/test_file_store.py tests/test_flagd.py tests/test_grpc_watcher.py tests/test_grpc_resolver.py tests/test_in_process.py tests/test_metadata.py tests/test_targeting.py -q
# 199 passed
uv tool run ruff check providers/openfeature-provider-flagd/src/openfeature/contrib/provider/flagd/resolvers/grpc.py providers/openfeature-provider-flagd/src/openfeature/contrib/provider/flagd/resolvers/process/connector/grpc_watcher.py providers/openfeature-provider-flagd/tests/test_grpc_watcher.py providers/openfeature-provider-flagd/tests/test_grpc_resolver.py
# All checks passed
uv run --frozen poe mypy
# Success: no issues found in 26 source files

Note: I also tried the broader non-e2e pytest command, but the local run still initialized the Docker-backed e2e test-harness fixture and failed before running those cases because Docker Desktop's Linux engine was not running on this Windows machine.

Signed-off-by: looooown2006 <102905927+looooown2006@users.noreply.github.com>
@looooown2006 requested review from a team as code owners on May 10, 2026 09:30
@github-actions (bot) requested review from aepfli and federicobond on May 10, 2026 09:30
Contributor

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a _shutdown_event using threading.Event in both GrpcResolver and grpc_watcher to manage reconnection delays and allow cleaner shutdowns. It refactors the listen method in GrpcResolver into smaller helper methods and adds unit tests for the reconnection and backoff logic. The comments below suggest using the initial backoff duration rather than the maximum backoff for reconnect attempts, to avoid excessive delays when recovering from transient failures.

logger.debug(self.config.fatal_status_codes)

self.retry_grace_period = config.retry_grace_period
self.retry_backoff_max_seconds = config.retry_backoff_max_ms * 0.001
Contributor


Severity: medium

Using the maximum backoff duration (retry_backoff_max_ms) as a constant delay for every reconnect attempt can lead to significant delays in recovery from transient failures. It is generally better to use the initial backoff duration (retry_backoff_ms) or implement an exponential backoff strategy. Since gRPC already handles connection-level backoff internally, a small application-level delay is sufficient to prevent tight loops while allowing faster recovery.

Suggested change
- self.retry_backoff_max_seconds = config.retry_backoff_max_ms * 0.001
+ self.retry_backoff_seconds = config.retry_backoff_ms * 0.001
+ self.retry_backoff_max_seconds = config.retry_backoff_max_ms * 0.001
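
For illustration, a sketch of the exponential strategy this comment describes. The class name and the reset() hook are assumptions, not part of this PR or the flagd config:

import threading

class ExponentialBackoff:
    # Illustrative: start at retry_backoff_ms, double per attempt,
    # cap at retry_backoff_max_ms.
    def __init__(self, retry_backoff_ms: int, retry_backoff_max_ms: int) -> None:
        self._initial = retry_backoff_ms * 0.001
        self._max = retry_backoff_max_ms * 0.001
        self._current = self._initial
        self._shutdown_event = threading.Event()

    def wait_before_reconnect(self) -> None:
        # Interruptible wait, then grow the delay for the next attempt.
        self._shutdown_event.wait(self._current)
        self._current = min(self._current * 2, self._max)

    def reset(self) -> None:
        # Call once a stream has been healthy (e.g. past the retry
        # grace period) so the next failure retries quickly again.
        self._current = self._initial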

)

    def _wait_before_reconnect(self) -> None:
        self._shutdown_event.wait(self.retry_backoff_max_seconds)
Contributor


Severity: medium

As mentioned in the initialization, using the maximum backoff for every retry significantly impacts availability. Consider using the initial backoff duration instead. Note that you will also need to update the corresponding regression tests in tests/test_grpc_resolver.py.

Suggested change
- self._shutdown_event.wait(self.retry_backoff_max_seconds)
+ self._shutdown_event.wait(self.retry_backoff_seconds)

return False

    def _wait_before_reconnect(self) -> None:
        self._shutdown_event.wait(self.retry_backoff_max_seconds)
Contributor


Severity: medium

Using the maximum backoff duration (retry_backoff_max_ms) as a constant delay for every reconnect attempt can lead to significant delays in recovery from transient failures. It is generally better to use the initial backoff duration (retry_backoff_ms) or implement an exponential backoff strategy. Since gRPC already handles connection-level backoff internally, a small application-level delay is sufficient to prevent tight loops while allowing faster recovery. Note that you will also need to update the corresponding regression tests in tests/test_grpc_watcher.py.

Suggested change
- self._shutdown_event.wait(self.retry_backoff_max_seconds)
+ self._shutdown_event.wait(self.retry_backoff_seconds)

@looooown2006
Author

Thanks for the review. I kept this tied to retry_backoff_max_ms because #390 and the flagd provider spec both describe retryBackoffMaxMs as the explicit application-level delay before recreating the stream after an error or completion.

This is separate from gRPC's RPC-level retry behavior; the added wait only applies between one stream ending and the next stream being created. That seems to align with the issue/spec and the reconnect-loop scenario they call out.
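
As a rough illustration of the shape that regression coverage can take (the resolver fixture and both test names here are hypothetical, not the PR's actual tests):

import threading
from unittest.mock import patch

def test_wait_before_reconnect_uses_backoff_max(resolver):
    # The backoff should go through the shutdown event with the
    # configured maximum, so shutdown can interrupt it.
    with patch.object(resolver._shutdown_event, "wait") as mock_wait:
        resolver._wait_before_reconnect()
    mock_wait.assert_called_once_with(resolver.retry_backoff_max_seconds)

def test_shutdown_interrupts_backoff(resolver):
    # A pending backoff should unblock promptly once the event is set.
    waiter = threading.Thread(target=resolver._wait_before_reconnect)
    waiter.start()
    resolver._shutdown_event.set()
    waiter.join(timeout=1)
    assert not waiter.is_alive()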



Development

Successfully merging this pull request may close these issues:

fix: add explicit application-level backoff on stream reconnection (#390)
