feat(vmcp): add auth-retry with circuit breaker to BackendClient#4169
feat(vmcp): add auth-retry with circuit breaker to BackendClient#4169
Conversation
There was a problem hiding this comment.
Large PR Detected
This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.
How to unblock this PR:
Add a section to your PR description with the following format:
## Large PR Justification
[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformationAlternative:
Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.
See our Contributing Guidelines for more details.
This review will be automatically dismissed once you add the justification section.
There was a problem hiding this comment.
Pull request overview
Adds an auth-failure retry decorator (with exponential backoff + per-backend circuit breaker + tracing) around the vMCP BackendClient, expands auth error string detection, and introduces new unit/integration/E2E test coverage plus test utilities to simulate transient HTTP failures.
Changes:
- Wrap
NewHTTPBackendClientwithretryingBackendClientimplementing auth retry/backoff + circuit breaker + OTel spans. - Refactor
IsAuthenticationErrorinto a sharedauthErrorPatternslist (including mcp-go"unauthorized (401)"). - Add test helpers and new unit/integration/E2E tests for auth-retry and circuit breaker behavior.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/vmcp/client/auth_retry.go | Implements auth-retry decorator w/ backoff, circuit breaker, singleflight, and OTel spans |
| pkg/vmcp/client/client.go | Wraps the existing HTTP backend client with the retry decorator |
| pkg/vmcp/errors.go | Refactors authentication error detection into pattern list and expands coverage |
| pkg/vmcp/client/auth_retry_test.go | Unit tests for retry/circuit breaker behavior |
| pkg/vmcp/client/auth_retry_integration_test.go | Integration tests using a real mcp-go HTTP server with transient 401 responses |
| pkg/vmcp/client/client_test.go | Updates tests to account for the new retry decorator return type |
| test/integration/vmcp/helpers/backend.go | Adds WithHTTPMiddleware to inject transient HTTP failures in integration tests |
| test/e2e/thv-operator/virtualmcp/helpers.go | Extends E2E backend deployment helper (Args + PodTemplateSpec) and adds health status const |
| test/e2e/thv-operator/virtualmcp/virtualmcp_auth_retry_test.go | New E2E test validating retry exhaustion behavior against a persistent-401 backend |
| test/e2e/thv-operator/virtualmcp/virtualmcp_circuit_breaker_test.go | Minor refactor to use shared health-status constant |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5ab7f12704
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Large PR justification has been provided. Thank you!
|
✅ Large PR justification has been provided. The size review has been dismissed and this PR can now proceed with normal review. |
There was a problem hiding this comment.
Pull request overview
Adds authentication-failure retry behavior (with backoff + circuit breaker + tracing) to the vMCP BackendClient, expands auth-error detection patterns, and introduces new test hooks/coverage to validate the behavior end-to-end.
Changes:
- Wrap
NewHTTPBackendClientwith aretryingBackendClientdecorator that retriesErrAuthenticationFailedwith exponential backoff and a per-backend circuit breaker; instruments retries with OTel spans. - Refactor
IsAuthenticationErrorto use a pattern list, including the mcp-go"unauthorized (401)"error format. - Extend integration/E2E test helpers and add new unit/integration/E2E tests for auth-retry and circuit-breaker behavior.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
pkg/vmcp/client/auth_retry.go |
Implements auth retry loop with backoff, circuit breaker, singleflight coalescing, and OTel spans. |
pkg/vmcp/client/client.go |
Wraps the HTTP backend client with the retry decorator; routes errors through wrapBackendError. |
pkg/vmcp/errors.go |
Refactors auth error string matching into authErrorPatterns and adds additional patterns. |
pkg/vmcp/client/auth_retry_test.go |
Unit tests for retry behavior, breaker open/reset, concurrency coalescing, and cancellation. |
pkg/vmcp/client/auth_retry_integration_test.go |
Integration test validating transient-401 retry against a real mcp-go streamable HTTP server. |
pkg/vmcp/client/client_test.go |
Updates tests to account for the new retry-wrapper returned by NewHTTPBackendClient. |
test/integration/vmcp/helpers/backend.go |
Adds WithHTTPMiddleware for injecting transient HTTP behavior in integration tests. |
test/e2e/thv-operator/virtualmcp/helpers.go |
Extends backend config to support args/pod template patches and skipping readiness wait; adds a shared health-status constant. |
test/e2e/thv-operator/virtualmcp/virtualmcp_auth_retry_test.go |
New E2E coverage validating persistent 401 leads to “unavailable” backend status. |
test/e2e/thv-operator/virtualmcp/virtualmcp_circuit_breaker_test.go |
Replaces hard-coded "healthy" with shared constant. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #4169 +/- ##
==========================================
- Coverage 69.30% 69.20% -0.10%
==========================================
Files 466 467 +1
Lines 46774 46920 +146
==========================================
+ Hits 32416 32471 +55
+ Misses 11871 11861 -10
- Partials 2487 2588 +101 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
This PR adds an authentication-failure retry decorator (with exponential backoff, per-backend circuit breaker, singleflight coalescing, and OTel spans) around the vMCP BackendClient, and refactors error classification / test helpers to support the new behavior.
Changes:
- Wrap
NewHTTPBackendClientwith aretryingBackendClientand introducehttpStatusRoundTripperto map HTTP 401/403 and 5xx into vmcp sentinel errors forerrors.Ischecks. - Refactor
IsAuthenticationError/IsConnectionErrorto use package-level pattern slices and expand auth matching. - Add/adjust integration + E2E test helpers/tests to inject transient HTTP failures and validate auth-retry / circuit-breaker behavior.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
pkg/vmcp/client/client.go |
Wraps the backend client with retry logic; adds HTTP status → sentinel error mapping; routes more call errors through wrapBackendError. |
pkg/vmcp/client/auth_retry.go |
Implements auth-retry with exponential backoff, per-backend circuit breaker, singleflight coalescing, and OTel span instrumentation. |
pkg/vmcp/errors.go |
Refactors string-based auth/connection error detection to pattern slices and expands auth patterns. |
pkg/vmcp/client/client_test.go |
Updates tests to account for NewHTTPBackendClient now returning a retrying decorator. |
pkg/vmcp/client/auth_retry_test.go |
Adds unit tests covering retry success/exhaustion, breaker open/reset, concurrency coalescing, and context cancellation. |
pkg/vmcp/client/auth_retry_integration_test.go |
Adds integration test exercising retry behavior against an httptest-backed MCP server returning a transient 401. |
test/integration/vmcp/helpers/backend.go |
Adds WithHTTPMiddleware to wrap backend HTTP handlers for injecting transient HTTP failures in integration tests. |
test/e2e/thv-operator/virtualmcp/helpers.go |
Adds helpers/fields to support new E2E scenarios (pod-only readiness wait, backend args/pod template patching, skip readiness gate, health status constant). |
test/e2e/thv-operator/virtualmcp/virtualmcp_circuit_breaker_test.go |
Replaces literal "healthy" comparisons with shared constant. |
test/e2e/thv-operator/virtualmcp/virtualmcp_auth_retry_test.go |
Adds E2E test validating persistent-401 backend transitions to unavailable/degraded while stable backend remains healthy. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
`NewHTTPBackendClient` now wraps the raw HTTP client in a `retryingBackendClient` decorator that: - Intercepts `ErrAuthenticationFailed` (401/403) returned by any `BackendClient` method and retries up to `maxAuthRetries` (3) times with exponential backoff (100 ms base, doubling each attempt). - Maintains a per-backend `authCircuitBreaker` that opens after `authCircuitBreakerThreshold` (5) consecutive fully-exhausted retry sequences, preventing runaway latency from permanently broken credentials. - Uses `singleflight` to coalesce concurrent backoff waits for the same backend, so N goroutines racing on a 401 sleep only once per attempt. - Wraps each retry sequence in an OpenTelemetry `auth.retry` span so the overhead is visible in distributed traces. - Never logs raw credentials. `IsAuthenticationError` is refactored from chained `if` blocks into a package-level `authErrorPatterns` slice (fixes gocyclo limit and adds the mcp-go `"unauthorized (401)"` format that was previously missed). The integration test helper gains `WithHTTPMiddleware` so tests can inject transient HTTP errors without modifying the MCP server logic. New tests: - 9 unit tests (`auth_retry_test.go`) covering success, single-retry, max-retries, non-auth passthrough, circuit breaker open/reset, and context cancellation. - 1 integration test (`auth_retry_integration_test.go`) against a real mcp-go streamable-HTTP server. - 1 E2E Ginkgo suite (`virtualmcp_auth_retry_test.go`) deploying a Python 401 backend alongside a healthy yardstick backend and asserting that the failing backend is marked `BackendStatusUnavailable` while the stable backend remains ready. Closes: #3869
There was a problem hiding this comment.
Pull request overview
Adds an auth-failure retry decorator around the vMCP BackendClient, converting HTTP status codes into sentinel errors to enable errors.Is()-based detection and supporting end-to-end verification of auth-retry exhaustion behavior.
Changes:
- Wrap
httpBackendClientwith a newretryingBackendClientthat retriesErrAuthenticationFailedwith exponential backoff and a per-backend circuit breaker; coalesces concurrent backoff sleeps viasingleflightand instruments retries with anauth.retryspan. - Introduce an HTTP
RoundTripperthat turns 401/403 and selected 5xx responses into structured sentinel errors and updates backend error wrapping to prefer these sentinels. - Expand integration/E2E test helpers and add unit + integration + E2E tests to exercise transient/persistent 401 behavior and health reporting.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
pkg/vmcp/client/client.go |
Wraps the HTTP backend client with retrying decorator; adds httpStatusRoundTripper; updates error wrapping call sites to use shared wrapBackendError. |
pkg/vmcp/client/auth_retry.go |
New retry decorator implementing auth retry/backoff, circuit breaker, singleflight coalescing, and OTel span instrumentation. |
pkg/vmcp/errors.go |
Refactors auth/connection string matchers into pattern slices and expands auth patterns (incl. unauthorized (401)). |
pkg/vmcp/client/client_test.go |
Updates tests to account for NewHTTPBackendClient now returning the retrying decorator. |
pkg/vmcp/client/auth_retry_test.go |
New unit tests for retry success, max-retry exhaustion, circuit breaker behavior, and concurrency coalescing. |
pkg/vmcp/client/auth_retry_integration_test.go |
New integration test validating transient 401 retry succeeds end-to-end with a real mcp-go server. |
test/integration/vmcp/helpers/backend.go |
Adds WithHTTPMiddleware to allow injecting transient HTTP failures in integration tests. |
test/e2e/thv-operator/virtualmcp/helpers.go |
Adds pod-only readiness wait helper; extends backend config to support args, pod template patching, and skipping readiness waits; adds backendHealthStatusHealthy constant. |
test/e2e/thv-operator/virtualmcp/virtualmcp_circuit_breaker_test.go |
Uses the new backendHealthStatusHealthy constant instead of hardcoded "healthy". |
test/e2e/thv-operator/virtualmcp/virtualmcp_auth_retry_test.go |
New E2E test deploying a persistent-401 backend and asserting it becomes unavailable while a stable backend remains healthy. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| func NewHTTPBackendClient(registry vmcpauth.OutgoingAuthRegistry) (vmcp.BackendClient, error) { | ||
| if registry == nil { | ||
| return nil, fmt.Errorf("registry cannot be nil; use UnauthenticatedStrategy for no authentication") | ||
| } | ||
|
|
||
| c := &httpBackendClient{ | ||
| registry: registry, | ||
| } | ||
| c.clientFactory = c.defaultClientFactory | ||
| return c, nil | ||
| return newRetryingBackendClient(c, registry), nil | ||
| } |
| type retryingBackendClient struct { | ||
| inner vmcp.BackendClient | ||
| registry vmcpauth.OutgoingAuthRegistry | ||
|
|
jerm-dro
left a comment
There was a problem hiding this comment.
FYI: I think this potentially overlaps with the embedded auth in vMCP work: https://github.com/stacklok/stacklok-epics/issues/251
As part of that work, we want to solve the token refresh problem. It may make sense for those changes to land first.
| // breakers maps backendID -> *authCircuitBreaker. LoadOrStore is used for concurrent safety. | ||
| breakers sync.Map |
There was a problem hiding this comment.
blocker: can we implement this without a sync.Map?
The multi-session already has the notion of separate clients per backend. We don't actually have to share state, right? If so, then we should have one retrier per backend.
|
ok moving to draft for now, until the work lands |
Summary
NewHTTPBackendClientnow wraps the raw HTTP client in aretryingBackendClientdecorator that:ErrAuthenticationFailed(401/403) returned by anyBackendClientmethod and retries up tomaxAuthRetries(3) times with exponential backoff (100 ms base, doubling each attempt).authCircuitBreakerthat opens afterauthCircuitBreakerThreshold(5) consecutive fully-exhausted retry sequences, preventing runaway latency from permanently broken credentials.singleflightto coalesce concurrent backoff waits for the same backend, so N goroutines racing on a 401 sleep only once per attempt.auth.retryspan so the overhead is visible in distributed traces.IsAuthenticationErroris refactored from chainedifblocks into a package-levelauthErrorPatternsslice (fixes gocyclo limit and adds the mcp-go"unauthorized (401)"format that was previously missed).The integration test helper gains
WithHTTPMiddlewareso tests can inject transient HTTP errors without modifying the MCP server logic.Fixes #3869
Type of change
Test plan
task test)task test-e2e)task lint-fix)Changes
Does this introduce a user-facing change?
Special notes for reviewers
Large PR Justification
This is a complete and isolated functionality, with comprehensive tests. Cannot be split.