Skip to content

Retry transient network errors in background token monitor#4281

Merged
aponcedeleonch merged 19 commits intomainfrom
retry-transient-token-refresh-errors
Mar 24, 2026
Merged

Retry transient network errors in background token monitor#4281
aponcedeleonch merged 19 commits intomainfrom
retry-transient-token-refresh-errors

Conversation

@reyortiz3
Copy link
Contributor

@reyortiz3 reyortiz3 commented Mar 19, 2026

Summary

  • OAuth token refresh fails when WARP VPN disconnects overnight — the refresh endpoint becomes temporarily unreachable from the home ISP IP, causing the workload to be immediately marked as unauthenticated and requiring manual intervention.
  • Adds isTransientNetworkError to distinguish DNS failures, TCP-level errors, and timeouts (transient) from OAuth 4xx / invalid_grant responses (non-transient). The background monitor now retries transient failures with exponential backoff (10 s initial → 2 min max, bounded by 5 attempts and 5 min elapsed time). Context cancellation exits the loop cleanly without marking the workload as unauthenticated. Non-transient errors still fail immediately as before.

Type of change

  • Bug fix

Test plan

  • Unit tests (task test)

Changes

File Change
pkg/auth/monitored_token_source.go Add isTransientNetworkError, exponential backoff retry loop in Token, injectable newRetryBackOff for tests
pkg/auth/monitored_token_source_test.go Tests for isTransientNetworkError and the three retry scenarios: recovery, context cancellation, transient→non-transient escalation
go.mod / go.sum Add github.com/cenkalti/backoff/v5 (direct dep), bump otel exporters and grpc-gateway (incidental go mod tidy updates)

Does this introduce a user-facing change?

No — the workload stays authenticated through VPN reconnects instead of going unauthenticated overnight. No CLI or config changes.

Special notes for reviewers

  • Retries are bounded by both attempt count (WithMaxTries, default 5) and wall-clock time (WithMaxElapsedTime, default 5 min). Both limits are configurable via environment variables (TOOLHIVE_TOKEN_REFRESH_MAX_TRIES, TOOLHIVE_TOKEN_REFRESH_MAX_ELAPSED_TIME).
  • The newRetryBackOff field is a test seam; it is nil in production and replaced with a fast backoff in tests to avoid multi-second sleeps.
  • The code scanning alert on resolveTokenRefreshMaxTries is a false positive — ParseUint with strconv.IntSize already constrains the value to fit in uint.

Generated with Claude Code

When a VPN disconnects overnight the OAuth token refresh endpoint becomes
temporarily unreachable. Previously the background monitor treated any
refresh failure as fatal and immediately marked the workload as
unauthenticated, requiring manual intervention.

This change distinguishes transient network errors (DNS failures,
TCP-level errors, timeouts) from real auth failures (OAuth 4xx,
invalid_grant) and retries the former with exponential backoff (30 s
initial, 5 min max, no deadline) until the network recovers. Context
cancellation stops the loop cleanly without marking the workload as
unauthenticated. Non-transient errors still fail immediately.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added the size/M Medium PR: 300-599 lines changed label Mar 19, 2026
@codecov
Copy link

codecov bot commented Mar 19, 2026

Codecov Report

❌ Patch coverage is 70.83333% with 28 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.23%. Comparing base (04ae197) to head (5a59030).
⚠️ Report is 21 commits behind head on main.

Files with missing lines Patch % Lines
pkg/auth/monitored_token_source.go 70.83% 26 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4281      +/-   ##
==========================================
+ Coverage   68.77%   69.23%   +0.45%     
==========================================
  Files         473      478       +5     
  Lines       47917    48415     +498     
==========================================
+ Hits        32956    33521     +565     
- Misses      12300    12310      +10     
+ Partials     2661     2584      -77     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 19, 2026
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 19, 2026
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 20, 2026
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 20, 2026
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 20, 2026
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 20, 2026
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 20, 2026
jhrozek
jhrozek previously approved these changes Mar 20, 2026
@reyortiz3 reyortiz3 requested a review from amirejaz March 20, 2026 17:33
Add WithMaxTries and WithMaxElapsedTime to the Token() retry loop so
retries are capped at 5 attempts and 5 minutes rather than running until
context cancellation or the implicit 15-minute Retry default.

Also expose both limits via environment variables and tighten the default
initial/max intervals (10s/2m instead of 30s/5m) to recover faster after
short network blips.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 20, 2026
Parse with strconv.IntSize instead of 64 so ParseUint rejects values
that would overflow uint on 32-bit platforms, resolving the CodeQL
high-severity finding.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 20, 2026
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 20, 2026
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 20, 2026
When multiple requests arrive during a transient network outage, each
Token() call entered its own independent retry loop, causing a thundering
herd of refresh attempts. Wrap the transient-error retry path in
singleflight.Do so only one retry loop runs at a time — all concurrent
callers share the result and unblock immediately when it resolves.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added size/L Large PR: 600-999 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 24, 2026
@JAORMX
Copy link
Collaborator

JAORMX commented Mar 24, 2026

Code looks good! One thing before we merge though...

The "Special notes for reviewers" section says MaxElapsedTime is intentionally left at zero with no deadline, and that the retry loop runs until the context is cancelled or the network recovers. But retryTransientRefresh() actually passes backoff.WithMaxTries(5) and backoff.WithMaxElapsedTime(5*time.Minute), so retries are bounded. The comment in getRetryBackOff() ("No MaxElapsedTime — context cancellation is the stop signal") is stale for the same reason.

Can you update the PR description and that code comment to match what the code actually does? The bounded retry is the right call, the docs just need to catch up :)

Note that the code scanning alert on resolveTokenRefreshMaxTries is a false positive... ParseUint with strconv.IntSize already constrains the value to fit in uint.

The comments claimed retries had no MaxElapsedTime and relied solely
on context cancellation, but retryTransientRefresh already applies
WithMaxTries(5) and WithMaxElapsedTime(5m). Update the doc comment
and remove the misleading inline comment.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added size/L Large PR: 600-999 lines changed and removed size/L Large PR: 600-999 lines changed labels Mar 24, 2026
@JAORMX JAORMX dismissed amirejaz’s stale review March 24, 2026 07:50

Concerns have been addressed

@aponcedeleonch aponcedeleonch merged commit acbab2a into main Mar 24, 2026
40 checks passed
@aponcedeleonch aponcedeleonch deleted the retry-transient-token-refresh-errors branch March 24, 2026 08:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/L Large PR: 600-999 lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants