fix(flagd): back off before stream reconnect #391
Conversation
Signed-off-by: looooown2006 <102905927+looooown2006@users.noreply.github.com>
Code Review
This pull request introduces a _shutdown_event using threading.Event in both GrpcResolver and grpc_watcher to manage reconnection delays and facilitate cleaner shutdowns. It refactors the listen method in GrpcResolver into smaller helper methods and adds comprehensive unit tests for the reconnection and backoff logic. The reviewer suggests using the initial backoff duration rather than the maximum backoff for reconnection attempts to prevent excessive delays during recovery from transient failures.
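For readers outside the codebase, here is a minimal sketch of the interruptible-wait pattern described above; the class and attribute names other than `threading.Event` are illustrative, not the provider's actual API:

```python
import threading


class StreamWorker:
    """Illustrative worker: backs off between reconnects, wakes immediately on shutdown."""

    def __init__(self, backoff_seconds: float) -> None:
        self._shutdown_event = threading.Event()
        self._backoff_seconds = backoff_seconds

    def _wait_before_reconnect(self) -> None:
        # Event.wait(timeout) returns as soon as set() is called, so shutdown
        # is never blocked for the full backoff period.
        self._shutdown_event.wait(self._backoff_seconds)

    def shutdown(self) -> None:
        self._shutdown_event.set()
```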
logger.debug(self.config.fatal_status_codes)

self.retry_grace_period = config.retry_grace_period
self.retry_backoff_max_seconds = config.retry_backoff_max_ms * 0.001
Using the maximum backoff duration (retry_backoff_max_ms) as a constant delay for every reconnect attempt can lead to significant delays in recovery from transient failures. It is generally better to use the initial backoff duration (retry_backoff_ms) or implement an exponential backoff strategy. Since gRPC already handles connection-level backoff internally, a small application-level delay is sufficient to prevent tight loops while allowing faster recovery.
Suggested change:
-    self.retry_backoff_max_seconds = config.retry_backoff_max_ms * 0.001
+    self.retry_backoff_seconds = config.retry_backoff_ms * 0.001
+    self.retry_backoff_max_seconds = config.retry_backoff_max_ms * 0.001
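If the exponential-backoff alternative mentioned above were adopted instead, one possible shape is a small helper like the following (hypothetical, not code from this PR):

```python
def next_backoff_seconds(attempt: int, initial_seconds: float, max_seconds: float) -> float:
    """Double the initial delay for each failed attempt, capped at the configured maximum."""
    return min(initial_seconds * (2 ** attempt), max_seconds)


# e.g. initial 1s, max 120s -> 1, 2, 4, 8, 16, ... then capped at 120
```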
)

def _wait_before_reconnect(self) -> None:
    self._shutdown_event.wait(self.retry_backoff_max_seconds)
As mentioned in the initialization, using the maximum backoff for every retry significantly impacts availability. Consider using the initial backoff duration instead. Note that you will also need to update the corresponding regression tests in tests/test_grpc_resolver.py.
Suggested change:
-    self._shutdown_event.wait(self.retry_backoff_max_seconds)
+    self._shutdown_event.wait(self.retry_backoff_seconds)
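To show where this wait sits in the overall flow, here is a rough sketch of a resolver-style listen loop; everything except `_wait_before_reconnect` and `_shutdown_event` (the helper names, logging, and message handling) is assumed for illustration rather than taken from `GrpcResolver`:

```python
import logging

logger = logging.getLogger(__name__)


def listen(self) -> None:
    # Keep recreating the stream until shutdown is requested.
    while not self._shutdown_event.is_set():
        try:
            for message in self._create_event_stream():  # hypothetical helper
                self._handle_message(message)  # hypothetical helper
        except Exception:
            logger.exception("event stream failed, will reconnect after backoff")
        # Application-level delay between one stream ending and the next being
        # created; returns early if shutdown is signalled in the meantime.
        self._wait_before_reconnect()
```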
return False

def _wait_before_reconnect(self) -> None:
    self._shutdown_event.wait(self.retry_backoff_max_seconds)
Using the maximum backoff duration (retry_backoff_max_ms) as a constant delay for every reconnect attempt can lead to significant delays in recovery from transient failures. It is generally better to use the initial backoff duration (retry_backoff_ms) or implement an exponential backoff strategy. Since gRPC already handles connection-level backoff internally, a small application-level delay is sufficient to prevent tight loops while allowing faster recovery. Note that you will also need to update the corresponding regression tests in tests/test_grpc_watcher.py.
Suggested change:
-    self._shutdown_event.wait(self.retry_backoff_max_seconds)
+    self._shutdown_event.wait(self.retry_backoff_seconds)
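For the regression tests mentioned in both comments, a unit test along these lines could pin the delay passed to the wait; this is sketched against the illustrative `StreamWorker` from the earlier snippet, not the provider's real `GrpcResolver` or `GrpcWatcher` classes:

```python
from unittest import mock


def test_wait_before_reconnect_uses_configured_backoff() -> None:
    worker = StreamWorker(backoff_seconds=2.0)
    # Swap the event for a mock so the test asserts on the duration instead of sleeping.
    worker._shutdown_event = mock.Mock()

    worker._wait_before_reconnect()

    worker._shutdown_event.wait.assert_called_once_with(2.0)
```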
Thanks for the review. I kept this tied to `retry_backoff_max_ms`. This is separate from gRPC's RPC-level retry behavior; the added wait only applies between one stream ending and the next stream being created. That seems to align with the issue/spec and the reconnect-loop scenario they call out.
Summary
Fixes the flagd provider stream reconnect loop so both gRPC stream implementations apply an application-level delay before creating the next stream after a stream error or normal stream completion.
Changes:
- `GrpcResolver` now waits `retry_backoff_max_ms` before recreating the RPC `EventStream`.
- `GrpcWatcher` now waits `retry_backoff_max_ms` before recreating the in-process `SyncFlags` stream.

Closes #390.
Verification
Note: I also tried the broader non-e2e pytest command, but the local run still initialized the Docker-backed e2e test-harness fixture and failed before running those cases because Docker Desktop's Linux engine was not running on this Windows machine.
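If only the unit tests touched by this change need to run locally, targeting the two files the review mentions, e.g. `pytest tests/test_grpc_resolver.py tests/test_grpc_watcher.py`, should avoid the Docker-backed e2e fixture entirely (paths as referenced in the review comments; the exact invocation may differ depending on the repo layout).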