Retry transient network errors in background token monitor#4281
Retry transient network errors in background token monitor#4281aponcedeleonch merged 19 commits intomainfrom
Conversation
When a VPN disconnects overnight the OAuth token refresh endpoint becomes temporarily unreachable. Previously the background monitor treated any refresh failure as fatal and immediately marked the workload as unauthenticated, requiring manual intervention. This change distinguishes transient network errors (DNS failures, TCP-level errors, timeouts) from real auth failures (OAuth 4xx, invalid_grant) and retries the former with exponential backoff (30 s initial, 5 min max, no deadline) until the network recovers. Context cancellation stops the loop cleanly without marking the workload as unauthenticated. Non-transient errors still fail immediately. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4281 +/- ##
==========================================
+ Coverage 68.77% 69.23% +0.45%
==========================================
Files 473 478 +5
Lines 47917 48415 +498
==========================================
+ Hits 32956 33521 +565
- Misses 12300 12310 +10
+ Partials 2661 2584 -77 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Add WithMaxTries and WithMaxElapsedTime to the Token() retry loop so retries are capped at 5 attempts and 5 minutes rather than running until context cancellation or the implicit 15-minute Retry default. Also expose both limits via environment variables and tighten the default initial/max intervals (10s/2m instead of 30s/5m) to recover faster after short network blips. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Parse with strconv.IntSize instead of 64 so ParseUint rejects values that would overflow uint on 32-bit platforms, resolving the CodeQL high-severity finding. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When multiple requests arrive during a transient network outage, each Token() call entered its own independent retry loop, causing a thundering herd of refresh attempts. Wrap the transient-error retry path in singleflight.Do so only one retry loop runs at a time — all concurrent callers share the result and unblock immediately when it resolves. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Code looks good! One thing before we merge though... The "Special notes for reviewers" section says Can you update the PR description and that code comment to match what the code actually does? The bounded retry is the right call, the docs just need to catch up :) Note that the code scanning alert on |
The comments claimed retries had no MaxElapsedTime and relied solely on context cancellation, but retryTransientRefresh already applies WithMaxTries(5) and WithMaxElapsedTime(5m). Update the doc comment and remove the misleading inline comment. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
isTransientNetworkErrorto distinguish DNS failures, TCP-level errors, and timeouts (transient) from OAuth 4xx /invalid_grantresponses (non-transient). The background monitor now retries transient failures with exponential backoff (10 s initial → 2 min max, bounded by 5 attempts and 5 min elapsed time). Context cancellation exits the loop cleanly without marking the workload as unauthenticated. Non-transient errors still fail immediately as before.Type of change
Test plan
task test)Changes
pkg/auth/monitored_token_source.goisTransientNetworkError, exponential backoff retry loop inToken, injectablenewRetryBackOfffor testspkg/auth/monitored_token_source_test.goisTransientNetworkErrorand the three retry scenarios: recovery, context cancellation, transient→non-transient escalationgo.mod/go.sumgithub.com/cenkalti/backoff/v5(direct dep), bump otel exporters and grpc-gateway (incidentalgo mod tidyupdates)Does this introduce a user-facing change?
No — the workload stays authenticated through VPN reconnects instead of going unauthenticated overnight. No CLI or config changes.
Special notes for reviewers
WithMaxTries, default 5) and wall-clock time (WithMaxElapsedTime, default 5 min). Both limits are configurable via environment variables (TOOLHIVE_TOKEN_REFRESH_MAX_TRIES,TOOLHIVE_TOKEN_REFRESH_MAX_ELAPSED_TIME).newRetryBackOfffield is a test seam; it isnilin production and replaced with a fast backoff in tests to avoid multi-second sleeps.resolveTokenRefreshMaxTriesis a false positive —ParseUintwithstrconv.IntSizealready constrains the value to fit inuint.Generated with Claude Code