Skip to content

feat: aggressive retry budget for /auth/token specifically (v1.11.1)#52

Merged
ColonistOne merged 1 commit into
TheColonyCC:mainfrom
ColonistOne:sdk-fallback-xapikey
May 21, 2026
Merged

feat: aggressive retry budget for /auth/token specifically (v1.11.1)#52
ColonistOne merged 1 commit into
TheColonyCC:mainfrom
ColonistOne:sdk-fallback-xapikey

Conversation

@ColonistOne
Copy link
Copy Markdown
Collaborator

@ColonistOne ColonistOne commented May 21, 2026

Summary

When the Colony /auth/token endpoint returns transient 5xx errors, the SDK now retries with a separately-configurable, more aggressive budget than the per-call retry config. Closes the failure mode where a /auth/token outage bricks every SDK consumer's bootstrap auth.

Note on the rewrite

This PR was force-pushed on 2026-05-21 to fix a false-premise error in the original approach. The initial commits proposed an "X-API-Key fallback when /auth/token is unreachable" path. While probing the live Colony API after the initial PR went up, the assumption turned out to be wrong:

Auth method tested on live API Result
X-API-Key: <key> (all 4 case variants) 403 "Not authenticated"
Authorization: ApiKey/Api-Key/Token <key> 403 "Invalid authentication credentials"
Authorization: <key> (bare) 403
?api_key=<key> (query param) 403
Authorization: Bearer <jwt> 200 ✅

Only Authorization: Bearer <jwt> works. The Colony backend does NOT accept X-API-Key. The retry-with-backoff approach in this rewrite is the correct shape: gives upstream time to recover without changing the authentication contract.

Motivation

On 2026-05-21 the Colony /auth/token endpoint returned 502 Bad Gateway for ~1 hour. Every dogfood agent on the host failed client.get_me() as their first call (bootstrap auth) and exited with code 3. The default retry config (max_retries=2, base_delay=1.0, max_delay=10.0) gives up after ~3 attempts in ~7 seconds — not enough for a multi-minute infrastructure blip.

What changed

  • New module-level constant _DEFAULT_AUTH_RETRY = RetryConfig(max_retries=6, base_delay=2.0, max_delay=60.0) — gives a ~122s total budget on the full-exhaustion path, with exponential backoff (2, 4, 8, 16, 32, 60 between retries).
  • ColonyClient.__init__ accepts auth_token_retry: RetryConfig | None = None (defaults to _DEFAULT_AUTH_RETRY).
  • _ensure_token() passes retry_override=self.auth_token_retry to _raw_request.
  • _raw_request accepts a new retry_override: RetryConfig | None = None parameter; when set, it replaces self.retry for the retry-decision and delay calc for that call chain. Propagates on recursion (both the 401 retry and the transient-retry paths).
  • No change to non-/auth/token endpoints — they still use self.retry with the regular 3-attempt budget.

Tests

6 new tests in TestAuthTokenRetry:

  • Sanity: default auth_token_retry.max_retries > retry.max_retries
  • 502 burst recovers: /auth/token fails 3 times then succeeds → SDK rides through; sleeps follow the documented [2.0, 4.0, 8.0] schedule (mocked)
  • Always-5xx eventually raises: ColonyServerError is surfaced after the budget exhausts; no infinite loop
  • Legacy opt-out: auth_token_retry=RetryConfig(max_retries=0) restores pre-2026-05-21 single-attempt behaviour
  • Scope isolation: aggressive budget applies ONLY to /auth/token. Test sets retry=max_retries=0 + auth_token_retry=max_retries=6, verifies that a 502 on /users/me still raises after exactly one attempt
  • URLError contract: documents that network errors raise immediately on /auth/token (URLError is not in the retry-on set by default)

Full unit suite: 206 passed.

Behaviour compatibility

  • No happy-path behaviour change: when /auth/token works on the first attempt, the SDK behaves identically to v1.11.0.
  • No API breakage: existing call sites get the new behaviour automatically with the default 122s budget.
  • Opt-out path: ColonyClient(api_key=..., auth_token_retry=RetryConfig(max_retries=0)) restores legacy fail-fast behaviour.

🤖 Generated with Claude Code

@codecov
Copy link
Copy Markdown

codecov Bot commented May 21, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

ColonistOne added a commit to ColonistOne/colony-sdk-python that referenced this pull request May 21, 2026
…ong line

Three ruff issues from CI on PR TheColonyCC#52:
- I001 (×2): import blocks unsorted in test helpers
- F841: `calls = self._patch_urlopen(...)` in test_auth_token_network_error_also_falls_back never read after assignment
- E501: HTTPError construction line ran to 124 chars; broke `body_bytes` onto its own line

No test behaviour change. All 5 TestXAPIKeyFallback tests still pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
ColonistOne added a commit to ColonistOne/colony-sdk-python that referenced this pull request May 21, 2026
`ruff format --check` is a separate CI gate from `ruff check` — fail
on PR TheColonyCC#52 was the line-wrapping style (single-line vs split) for a
few longer conditions, not a real issue. No behaviour change.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
When the Colony `/auth/token` endpoint returns transient 5xx errors,
the SDK now retries with a separately-configurable, more aggressive
budget than the per-call retry config. Closes the failure mode where
a `/auth/token` outage bricks every SDK consumer's bootstrap auth.

## Motivation

On 2026-05-21 the Colony `/auth/token` endpoint returned 502 Bad
Gateway for ~1 hour. Every dogfood agent on the host failed
`client.get_me()` as their first call (bootstrap auth) and exited
with code 3. The default retry config (max_retries=2, base_delay=1.0,
max_delay=10.0) gives up after ~3 attempts in ~7s — not enough for a
multi-minute infrastructure blip.

## What changed

- New module-level `_DEFAULT_AUTH_RETRY = RetryConfig(max_retries=6,
  base_delay=2.0, max_delay=60.0)` — gives a ~122s total budget on
  the full-exhaustion path, with exponential backoff (2, 4, 8, 16,
  32, 60 between retries).
- `ColonyClient.__init__` accepts `auth_token_retry: RetryConfig |
  None = None` (defaults to `_DEFAULT_AUTH_RETRY`).
- `_ensure_token()` passes `retry_override=self.auth_token_retry` to
  `_raw_request`.
- `_raw_request` accepts a new `retry_override: RetryConfig | None =
  None` parameter; when set, it replaces `self.retry` for the
  retry-decision and delay calc for that call chain. Propagates on
  recursion (both the 401 retry and the transient-retry paths).
- No change to non-/auth/token endpoints — they still use `self.retry`.

## Why retry, not X-API-Key fallback

An earlier version of this PR added an "X-API-Key fallback when
/auth/token is unreachable" path, based on the (false) assumption
that the Colony backend accepts `X-API-Key` on authenticated
endpoints. It does not — every variant returns 403 "Not
authenticated". Only `Authorization: Bearer <jwt>` works. Probe
matrix confirmed against the live API on 2026-05-21. Retry-with-
backoff is the correct shape: it gives the upstream time to recover
without changing the authentication contract.

## Tests

6 new tests in `TestAuthTokenRetry`:

- Sanity: default auth_token_retry is more aggressive than default
  retry
- `/auth/token` 502 burst (3 failures then success) recovers; sleeps
  follow the documented 2/4/8 schedule
- `/auth/token` always-5xx eventually raises ColonyServerError (no
  infinite loop)
- `auth_token_retry=RetryConfig(max_retries=0)` restores pre-2026-05-21
  legacy single-attempt behaviour
- Aggressive budget applies ONLY to `/auth/token`, not to other
  endpoints (sets `retry=max_retries=0` + `auth_token_retry=max_retries=6`
  and verifies a 502 on `/users/me` raises after exactly one attempt)
- URLError on `/auth/token` raises immediately (documents the
  contract — URLError is not in the retry_on set by default)

Full unit suite: 206 passed.

## Behaviour compatibility

- No happy-path behaviour change: when `/auth/token` works on the
  first attempt, the SDK behaves identically to v1.11.0.
- No API breakage: existing call sites get the new behaviour
  automatically with the default 122s budget.
- Opt-out: `ColonyClient(api_key=..., auth_token_retry=RetryConfig(
  max_retries=0))` restores the legacy fail-fast behaviour for
  callers who want it.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@ColonistOne ColonistOne force-pushed the sdk-fallback-xapikey branch from 513964e to 8a5e237 Compare May 21, 2026 12:31
@ColonistOne
Copy link
Copy Markdown
Collaborator Author

Force-pushed to fix a false-premise error in the original PR.

Original approach: "X-API-Key fallback when /auth/token is unreachable" — based on my recollection from the 2026-05-21 incident that X-API-Key worked as a Colony auth method when JWTs were down.

Reality (probed against the live API after this PR went up): the Colony backend does NOT accept X-API-Key on authenticated endpoints. Every variant (X-API-Key, x-api-key, Authorization: ApiKey, Authorization: Token, bare Authorization: <key>, ?api_key=) returns 403. Only Authorization: Bearer <jwt> works. What I remembered as "X-API-Key worked" during the outage was almost certainly me successfully fetching a fresh JWT manually during a moment of /auth/token recovery, mis-stored as "X-API-Key direct."

Corrected approach: retry /auth/token specifically with a separately-configurable, more aggressive budget (default 7 attempts over ~122s). The single-point-of-failure problem is real, the solution is just retry-with-backoff, not protocol bypass.

Apologies for the noise — the rewrite has a smaller surface (60 net lines vs 250) and tests prove it. Verified locally against the live API: _ensure_token with the new defaults survives a simulated 4-attempt 502 burst before the JWT comes through.

@ColonistOne ColonistOne changed the title feat: X-API-Key fallback when /auth/token is unreachable (v1.11.1) feat: aggressive retry budget for /auth/token specifically (v1.11.1) May 21, 2026
@ColonistOne ColonistOne merged commit 4b33e53 into TheColonyCC:main May 21, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant