Summary
The proxy auth middleware makes multiple external calls to the registry on every inbound request. If the registry is slow or down, all message delivery stops with 503 errors. There is no circuit breaker, limited caching fallback, and the in-memory nonce cache is lost on Durable Object eviction.
Current Auth Pipeline (per request)
Inbound request arrives at proxy
→ Parse Authorization: Claw <AIT>
→ Verify AIT signature (needs registry signing keys — cached 1hr)
→ Check CRL for revocation (needs registry CRL endpoint — cached with TTL)
→ Verify PoP (X-Claw-Timestamp + X-Claw-Nonce + X-Claw-Signature)
→ Check nonce for replay (in-memory cache)
→ Assert agent is known/trusted (trust store — D1/KV)
→ Validate agent access token (POST to registry /v1/auth/agent/validate — NO cache)
→ Route to handler
Issues
🔴 P0: Registry is a single point of failure
File: apps/proxy/src/auth-middleware.ts
Every authenticated request requires the registry for at least one of:
- Signing keys fetch (/.well-known/claw-keys.json)
- CRL fetch (/v1/crl)
- Agent access token validation (/v1/auth/agent/validate) — called on every /hooks/agent and /relay/connect request with NO caching
If registry.clawdentity.com goes down, ALL proxies return 503 on every request. Every agent-to-agent message globally stops.
The agent access validation call (line ~520) is the worst offender — it is never cached and hits the registry synchronously on every single request.
Fix:
- Cache agent access validation with short TTL (30-60s). Same access token + agentDid + aitJti = same result within TTL window
- Circuit breaker on registry calls: after N consecutive failures (e.g., 5), open circuit for M seconds (e.g., 30s). During open circuit, use cached/stale data with degraded trust
- Stale-while-revalidate for signing keys: serve cached keys even if refresh fails, with a maximum staleness window (e.g., 24 hours)
- Local fallback for CRL: if registry CRL is unavailable AND cache is stale beyond max age, make a policy decision (fail-open with logging vs fail-closed). The current staleBehavior config handles this partially, but the signing keys path does not
🔴 P0: Agent access validation has zero caching
File: apps/proxy/src/auth-middleware.ts (lines ~510-545)
```typescript
// This runs on EVERY /hooks/agent and /relay/connect request:
validateResponse = await fetchImpl(agentAuthValidateUrl, {
  method: "POST",
  headers: { ... },
  body: JSON.stringify({ agentDid: claims.sub, aitJti: claims.jti }),
});
```
This is a synchronous HTTP call to the registry with no caching. Under load, this adds latency to every message and creates a hard dependency on registry availability.
Fix:
- Cache validation result keyed on (accessToken, agentDid, aitJti) with 30-60s TTL
- On cache hit, skip the HTTP call
- On cache miss or expiry, validate and cache result
- On registry failure with valid cache entry, use cached result with warning log
```typescript
const cacheKey = `${accessToken}:${claims.sub}:${claims.jti}`;
const cached = agentAccessCache.get(cacheKey);
if (cached && cached.validUntilMs > clock()) {
  // Cache hit: skip the registry call, reuse the cached validation
} else {
  // Cache miss or expired: validate with the registry, cache the result
}
```
🟠 P1: Signing keys cache has no stale-while-revalidate
File: apps/proxy/src/auth-middleware.ts
The signing keys cache checks clock() - fetchedAtMs <= registryKeysCacheTtlMs (1 hour TTL). If the cache expires and the registry is down, the next request gets a hard 503.
The CRL cache has staleBehavior config, but signing keys do not. Since keys rotate very rarely (months/years), serving a 2-hour-stale key is far better than failing all auth.
Fix:
- Add stale-while-revalidate: serve cached keys even after TTL expires, trigger background refresh
- Add maximum staleness: reject keys only after a hard limit (e.g., 24 hours stale)
- Log warnings when serving stale keys
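A sketch of the stale-serving path, under assumptions: fetchKeys, the KeysCache shape, and the limits are illustrative stand-ins for the real config, and the background-refresh part is elided in favor of refresh-on-expiry for brevity:

```typescript
// Stale-while-revalidate sketch for the signing keys cache.
interface KeysCache {
  keys: unknown;
  fetchedAtMs: number;
}

const FRESH_TTL_MS = 60 * 60 * 1000;      // serve without refresh (current 1h TTL)
const MAX_STALE_MS = 24 * 60 * 60 * 1000; // hard rejection limit

async function getSigningKeys(
  cache: KeysCache | undefined,
  fetchKeys: () => Promise<unknown>,
  clock: () => number = Date.now,
): Promise<{ keys: KeysCache; stale: boolean }> {
  const age = cache ? clock() - cache.fetchedAtMs : Infinity;
  if (cache && age <= FRESH_TTL_MS) return { keys: cache, stale: false };
  try {
    // TTL expired (or no cache yet): try to refresh from the registry
    return { keys: { keys: await fetchKeys(), fetchedAtMs: clock() }, stale: false };
  } catch (err) {
    // Refresh failed: serve stale keys up to the hard limit, else fail
    if (cache && age <= MAX_STALE_MS) {
      console.warn(`serving signing keys ${Math.round(age / 1000)}s stale`, err);
      return { keys: cache, stale: true };
    }
    throw err;
  }
}
```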
🟠 P1: Nonce cache is in-memory — DO eviction loses replay protection
File: apps/proxy/src/auth-middleware.ts
createNonceCache() is an in-memory store. When the Cloudflare Durable Object is evicted (idle timeout, region migration), the nonce cache is lost. Between eviction and re-population, previously-seen nonces are accepted again, opening a replay attack window.
The window is bounded by the timestamp skew (300s), but within that window, a captured request can be replayed after DO eviction.
Fix:
- Persist nonces to DO storage (KV or storage API) instead of in-memory only
- Or use a hybrid: in-memory for fast lookup, periodic flush to durable storage
- On DO wake, load recent nonces from storage before accepting requests
- Size consideration: with 300s skew window, even at 100 req/s, that is only ~30K nonces to persist
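The hybrid approach could look like the sketch below. The NonceStorage interface is a stand-in for the Durable Object storage API (the real API's put/list signatures differ); the class names and key prefix are assumptions:

```typescript
// Hybrid nonce cache sketch: in-memory Set for fast lookup, write-through to
// durable storage so replay protection survives eviction.
interface NonceStorage {
  put(key: string, value: number): Promise<void>;
  list(prefix: string): Promise<Map<string, number>>;
}

class DurableNonceCache {
  private seen = new Set<string>();

  constructor(
    private storage: NonceStorage,
    private skewWindowMs = 300_000,
    private clock: () => number = Date.now,
  ) {}

  // On DO wake, reload nonces still inside the skew window before serving.
  async hydrate(): Promise<void> {
    const rows = await this.storage.list("nonce:");
    for (const [key, seenAtMs] of rows) {
      if (this.clock() - seenAtMs <= this.skewWindowMs) {
        this.seen.add(key.slice("nonce:".length));
      }
    }
  }

  // Returns false if the nonce was already used (replay).
  async checkAndRecord(nonce: string): Promise<boolean> {
    if (this.seen.has(nonce)) return false;
    this.seen.add(nonce);
    await this.storage.put(`nonce:${nonce}`, this.clock()); // write-through
    return true;
  }
}
```

Write-through adds one storage put per request; the periodic-flush variant would trade that latency for a small re-widened replay window between flushes.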
🟠 P1: No circuit breaker on registry calls
File: apps/proxy/src/auth-middleware.ts
Each request independently calls the registry. If the registry is returning 500s, every concurrent request makes its own failing call. Under load (100 agents, 10 msgs/sec each), that is 1000 failing HTTP calls per second to a down registry — making recovery harder.
Fix:
- Implement circuit breaker pattern with three states:
- Closed (normal): all calls go through
- Open (after N failures in M seconds): skip registry call, use cached data or reject fast
- Half-open (after cooldown): allow one probe request to test recovery
- Apply to: signing keys fetch, CRL fetch, and agent access validation independently
- Log state transitions for observability
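The three states above can be sketched as a small class; the thresholds mirror the N=5 / M=30s examples and the names are illustrative, not a prescribed implementation:

```typescript
// Circuit breaker sketch with the three states described above.
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAtMs = 0;

  constructor(
    private failureThreshold = 5,
    private cooldownMs = 30_000,
    private clock: () => number = Date.now,
  ) {}

  // Whether a registry call should be attempted right now.
  allowRequest(): boolean {
    if (this.state === "open") {
      if (this.clock() - this.openedAtMs >= this.cooldownMs) {
        this.state = "half-open"; // cooldown elapsed: let a probe through
        return true;
      }
      return false; // fail fast; caller falls back to cached data
    }
    return true;
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  recordFailure(): void {
    this.failures += 1;
    if (this.state === "half-open" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAtMs = this.clock();
    }
  }
}
```

One instance per dependency (signing keys, CRL, agent access validation) keeps a CRL outage from tripping the keys path.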
🟡 P2: No timeout on registry calls
File: apps/proxy/src/auth-middleware.ts
Registry HTTP calls (signing keys, CRL, agent access validation) use bare fetchImpl() with no abort signal or timeout. A slow registry (e.g., 30s response time under load) blocks the entire auth pipeline.
Fix:
- Add AbortSignal.timeout(5_000) (or configurable) to all registry calls
- A slow registry should fail fast, not propagate latency to every message
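A minimal wrapper sketch; the helper name and default are assumptions, and the real change would thread the signal through the existing fetchImpl calls:

```typescript
// Bound every registry call with an abort signal so a slow registry fails
// fast instead of propagating latency to every message.
async function fetchWithTimeout(
  url: string,
  init: RequestInit = {},
  timeoutMs = 5_000,
): Promise<Response> {
  // AbortSignal.timeout() rejects the fetch with a TimeoutError once
  // timeoutMs elapses.
  return fetch(url, { ...init, signal: AbortSignal.timeout(timeoutMs) });
}
```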
🟡 P2: Clock skew boundary is unmonitored
File: apps/proxy/src/auth-middleware.ts
The 300s (5 min) skew window is generous but there is no alerting or logging when agents consistently operate near the boundary. A drift trend is invisible until it crosses the threshold and breaks auth completely.
Fix:
- Log a warning when timestamp skew exceeds 200s (approaching the 300s limit)
- Include skew value in auth success logs for monitoring
- Consider a /v1/time endpoint for agents to self-calibrate
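The skew check plus warning threshold can be sketched as a pure helper; the function name and return shape are illustrative, the 200s/300s thresholds come from the text above:

```typescript
// Skew monitoring sketch: accept within 300s, warn above 200s, and surface
// the skew value so it can be included in auth success logs.
const MAX_SKEW_MS = 300_000;
const WARN_SKEW_MS = 200_000;

function checkTimestampSkew(
  requestTimestampMs: number,
  nowMs: number,
): { ok: boolean; skewMs: number; warn: boolean } {
  const skewMs = Math.abs(nowMs - requestTimestampMs);
  return {
    ok: skewMs <= MAX_SKEW_MS,
    skewMs, // log this on success to make drift trends visible
    warn: skewMs > WARN_SKEW_MS && skewMs <= MAX_SKEW_MS,
  };
}
```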
🟡 P2: Trust store unavailability returns 503 — no degraded mode
File: apps/proxy/src/trust-policy.ts
If the trust store (D1/KV) is transiently unavailable, assertKnownTrustedAgent throws 503. There is no cached trust state to fall back on.
Fix:
- Cache trust lookups in-memory with short TTL (60-120s)
- On trust store failure with valid cache, use cached trust decision
- On trust store failure with no cache, fail closed (403) rather than 503 (more informative for sender)
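A sketch of the degraded-mode lookup, assuming an injected lookupTrust function in place of the real D1/KV query (TrustCache and the 90s TTL are illustrative):

```typescript
// Degraded-mode trust lookup: short-TTL cache in front of the store, cached
// fallback when the store is down, fail-closed for unknown agents.
interface TrustDecision {
  trusted: boolean;
  cachedAtMs: number;
}

class TrustCache {
  private cache = new Map<string, TrustDecision>();

  constructor(
    private lookupTrust: (agentDid: string) => Promise<boolean>,
    private ttlMs = 90_000,
    private clock: () => number = Date.now,
  ) {}

  async isTrusted(agentDid: string): Promise<boolean> {
    const cached = this.cache.get(agentDid);
    if (cached && this.clock() - cached.cachedAtMs <= this.ttlMs) {
      return cached.trusted;
    }
    try {
      const trusted = await this.lookupTrust(agentDid);
      this.cache.set(agentDid, { trusted, cachedAtMs: this.clock() });
      return trusted;
    } catch (err) {
      if (cached) {
        // Stale but known: better than rejecting an established agent
        console.warn(`trust store unavailable, using cached decision for ${agentDid}`, err);
        return cached.trusted;
      }
      return false; // fail closed: unknown agent while store is down
    }
  }
}
```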
Acceptance Criteria
Test Scenarios
- Registry goes down for 10 min → proxy serves requests using cached keys/CRL/access validation, logs warnings, no 503 for cached agents
- Registry returns 500s under load → circuit breaker opens after 5 failures, stops hammering registry, probes recovery every 30s
- DO evicted and restarted → nonces reloaded from storage, replay protection intact
- Registry responds in 15s → auth timeout fires at 5s, request fails fast
- Clock skew reaches 250s → warning logged, admin alerted before auth breaks at 300s
- D1 trust store returns 500 → cached trust used for known agents, new agents rejected