feat: aggressive retry budget for /auth/token specifically (v1.11.1)#52
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
…ong line Three ruff issues from CI on PR TheColonyCC#52: - I001 (×2): import blocks unsorted in test helpers - F841: `calls = self._patch_urlopen(...)` in test_auth_token_network_error_also_falls_back never read after assignment - E501: HTTPError construction line ran to 124 chars; broke `body_bytes` onto its own line No test behaviour change. All 5 TestXAPIKeyFallback tests still pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`ruff format --check` is a separate CI gate from `ruff check` — fail on PR TheColonyCC#52 was the line-wrapping style (single-line vs split) for a few longer conditions, not a real issue. No behaviour change. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
When the Colony `/auth/token` endpoint returns transient 5xx errors, the SDK now retries with a separately-configurable, more aggressive budget than the per-call retry config. Closes the failure mode where a `/auth/token` outage bricks every SDK consumer's bootstrap auth. ## Motivation On 2026-05-21 the Colony `/auth/token` endpoint returned 502 Bad Gateway for ~1 hour. Every dogfood agent on the host failed `client.get_me()` as their first call (bootstrap auth) and exited with code 3. The default retry config (max_retries=2, base_delay=1.0, max_delay=10.0) gives up after ~3 attempts in ~7s — not enough for a multi-minute infrastructure blip. ## What changed - New module-level `_DEFAULT_AUTH_RETRY = RetryConfig(max_retries=6, base_delay=2.0, max_delay=60.0)` — gives a ~122s total budget on the full-exhaustion path, with exponential backoff (2, 4, 8, 16, 32, 60 between retries). - `ColonyClient.__init__` accepts `auth_token_retry: RetryConfig | None = None` (defaults to `_DEFAULT_AUTH_RETRY`). - `_ensure_token()` passes `retry_override=self.auth_token_retry` to `_raw_request`. - `_raw_request` accepts a new `retry_override: RetryConfig | None = None` parameter; when set, it replaces `self.retry` for the retry-decision and delay calc for that call chain. Propagates on recursion (both the 401 retry and the transient-retry paths). - No change to non-/auth/token endpoints — they still use `self.retry`. ## Why retry, not X-API-Key fallback An earlier version of this PR added an "X-API-Key fallback when /auth/token is unreachable" path, based on the (false) assumption that the Colony backend accepts `X-API-Key` on authenticated endpoints. It does not — every variant returns 403 "Not authenticated". Only `Authorization: Bearer <jwt>` works. Probe matrix confirmed against the live API on 2026-05-21. Retry-with- backoff is the correct shape: it gives the upstream time to recover without changing the authentication contract. ## Tests 6 new tests in `TestAuthTokenRetry`: - Sanity: default auth_token_retry is more aggressive than default retry - `/auth/token` 502 burst (3 failures then success) recovers; sleeps follow the documented 2/4/8 schedule - `/auth/token` always-5xx eventually raises ColonyServerError (no infinite loop) - `auth_token_retry=RetryConfig(max_retries=0)` restores pre-2026-05-21 legacy single-attempt behaviour - Aggressive budget applies ONLY to `/auth/token`, not to other endpoints (sets `retry=max_retries=0` + `auth_token_retry=max_retries=6` and verifies a 502 on `/users/me` raises after exactly one attempt) - URLError on `/auth/token` raises immediately (documents the contract — URLError is not in the retry_on set by default) Full unit suite: 206 passed. ## Behaviour compatibility - No happy-path behaviour change: when `/auth/token` works on the first attempt, the SDK behaves identically to v1.11.0. - No API breakage: existing call sites get the new behaviour automatically with the default 122s budget. - Opt-out: `ColonyClient(api_key=..., auth_token_retry=RetryConfig( max_retries=0))` restores the legacy fail-fast behaviour for callers who want it. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
513964e to
8a5e237
Compare
|
Force-pushed to fix a false-premise error in the original PR. Original approach: "X-API-Key fallback when Reality (probed against the live API after this PR went up): the Colony backend does NOT accept Corrected approach: retry Apologies for the noise — the rewrite has a smaller surface (60 net lines vs 250) and tests prove it. Verified locally against the live API: |
Summary
When the Colony
/auth/tokenendpoint returns transient 5xx errors, the SDK now retries with a separately-configurable, more aggressive budget than the per-call retry config. Closes the failure mode where a/auth/tokenoutage bricks every SDK consumer's bootstrap auth.Note on the rewrite
This PR was force-pushed on 2026-05-21 to fix a false-premise error in the original approach. The initial commits proposed an "X-API-Key fallback when /auth/token is unreachable" path. While probing the live Colony API after the initial PR went up, the assumption turned out to be wrong:
X-API-Key: <key>(all 4 case variants)Authorization: ApiKey/Api-Key/Token <key>Authorization: <key>(bare)?api_key=<key>(query param)Authorization: Bearer <jwt>Only
Authorization: Bearer <jwt>works. The Colony backend does NOT accept X-API-Key. The retry-with-backoff approach in this rewrite is the correct shape: gives upstream time to recover without changing the authentication contract.Motivation
On 2026-05-21 the Colony
/auth/tokenendpoint returned 502 Bad Gateway for ~1 hour. Every dogfood agent on the host failedclient.get_me()as their first call (bootstrap auth) and exited with code 3. The default retry config (max_retries=2, base_delay=1.0, max_delay=10.0) gives up after ~3 attempts in ~7 seconds — not enough for a multi-minute infrastructure blip.What changed
_DEFAULT_AUTH_RETRY = RetryConfig(max_retries=6, base_delay=2.0, max_delay=60.0)— gives a ~122s total budget on the full-exhaustion path, with exponential backoff (2, 4, 8, 16, 32, 60 between retries).ColonyClient.__init__acceptsauth_token_retry: RetryConfig | None = None(defaults to_DEFAULT_AUTH_RETRY)._ensure_token()passesretry_override=self.auth_token_retryto_raw_request._raw_requestaccepts a newretry_override: RetryConfig | None = Noneparameter; when set, it replacesself.retryfor the retry-decision and delay calc for that call chain. Propagates on recursion (both the 401 retry and the transient-retry paths).self.retrywith the regular 3-attempt budget.Tests
6 new tests in
TestAuthTokenRetry:auth_token_retry.max_retries > retry.max_retries/auth/tokenfails 3 times then succeeds → SDK rides through; sleeps follow the documented[2.0, 4.0, 8.0]schedule (mocked)ColonyServerErroris surfaced after the budget exhausts; no infinite loopauth_token_retry=RetryConfig(max_retries=0)restores pre-2026-05-21 single-attempt behaviour/auth/token. Test setsretry=max_retries=0+auth_token_retry=max_retries=6, verifies that a 502 on/users/mestill raises after exactly one attempt/auth/token(URLError is not in the retry-on set by default)Full unit suite: 206 passed.
Behaviour compatibility
/auth/tokenworks on the first attempt, the SDK behaves identically to v1.11.0.ColonyClient(api_key=..., auth_token_retry=RetryConfig(max_retries=0))restores legacy fail-fast behaviour.🤖 Generated with Claude Code