feat(auth-cache): SWR cache for credentials, IAM policies, and table-key info by engjan · Pull Request #122 · ExtendDB/extenddb

engjan · 2026-05-23T03:59:07Z

Why

Each DynamoDB request issues ~6 catalog queries before dispatch can begin: credential lookup, identity policies, group policies, permissions boundary, principal tags, resource tags, plus a 7th lookup for TableKeyInfo on item-level operations. Queries 2–6 fan out concurrently via tokio::try_join!, but they each consume a catalog-pool connection and reparse policy JSON on every hit. Under load this is the dominant pre-dispatch overhead — no pool size fixes it, because the bottleneck is the round-trip itself.

This PR layers an in-memory stale-while-revalidate cache over each of those reads to drive steady-state catalog roundtrips to ~0 while keeping the pre-cache behavior reachable via a kill switch.

What

Adds an extenddb-cache crate (SwrCache<K, V, E> primitive on top of moka) plus three wrappers:

CachedCredentialStore wraps any CredentialStore. Re-validates expires_at on every cache hit so cached ASIA* sessions cannot outlive their issued lifetime.
CachedAuthzStore wraps any AuthorizationStore with 9 sub-caches (user/role policies, group policies, boundaries, principal tags, resource tags, session data). Cache values are Arc-wrapped so a hit is one atomic increment instead of an N-element clone.
CachedTableKeyInfoStore wraps StorageEngine.table_key_info with negative caching for TableNotFound.
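
To make the Arc point above concrete, here is a minimal std-only sketch. `PolicyCache` and its methods are hypothetical names, and a plain `HashMap` stands in for the moka-backed cache; this is not the code in the PR.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical stand-in for one authz sub-cache (a HashMap replaces moka here).
pub struct PolicyCache {
    entries: HashMap<String, Arc<Vec<String>>>,
}

impl PolicyCache {
    pub fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    /// Store the parsed policy list once, behind an Arc.
    pub fn insert(&mut self, principal: &str, policies: Vec<String>) {
        self.entries.insert(principal.to_string(), Arc::new(policies));
    }

    /// A hit clones the Arc (one atomic refcount increment),
    /// not the Vec it points to.
    pub fn get(&self, principal: &str) -> Option<Arc<Vec<String>>> {
        self.entries.get(principal).map(Arc::clone)
    }
}
```

Two hits for the same principal share one allocation, so the cost of a hit does not grow with the number of policies attached to that principal.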

A new AuthCacheRegistry threads invalidation hooks through ServerComponents and AppState so every IAM mutation — from both the management API and the web console — invalidates the matching cache entry write-through. Self-induced changes propagate instantly within the local instance.

Configuration ([auth.cache] in extenddb.toml):

enabled              = true   # master kill switch
ttl_seconds          = 60     # hard TTL — entries beyond this are full misses
soft_ttl_seconds     = 30     # stale-but-usable threshold; refresh-ahead
negative_ttl_seconds = 5      # how long "not found" results are remembered
max_entries          = 10000  # per-cache LRU cap

When enabled = false, the wrappers operate in pass-through mode and forward every request directly to the underlying store. The pass-through path allocates no underlying moka instance (Option<MokaCache>) — it's a true kill switch, not a configured-to-expire-instantly cache.
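
As a rough sketch of how the settings above interact (std-only; `AuthCacheConfig`, `Freshness`, and `classify` are hypothetical names, not the crate's API): the validation rules follow the commit message for this change (soft_ttl > ttl, negative_ttl > ttl, and zero values are rejected), while treating `max_entries = 0` as invalid and the exact boundary comparisons are assumptions.

```rust
use std::time::Duration;

/// Hypothetical mirror of the [auth.cache] section, with the documented defaults.
#[derive(Debug, Clone)]
pub struct AuthCacheConfig {
    pub enabled: bool,
    pub ttl: Duration,
    pub soft_ttl: Duration,
    pub negative_ttl: Duration,
    pub max_entries: u64,
}

impl Default for AuthCacheConfig {
    fn default() -> Self {
        Self {
            enabled: true,
            ttl: Duration::from_secs(60),
            soft_ttl: Duration::from_secs(30),
            negative_ttl: Duration::from_secs(5),
            max_entries: 10_000,
        }
    }
}

impl AuthCacheConfig {
    /// Reject zero values and TTLs that exceed the hard TTL.
    pub fn validate(&self) -> Result<(), String> {
        if self.ttl.is_zero()
            || self.soft_ttl.is_zero()
            || self.negative_ttl.is_zero()
            || self.max_entries == 0
        {
            return Err("zero values are not allowed".to_string());
        }
        if self.soft_ttl > self.ttl {
            return Err("soft_ttl_seconds must not exceed ttl_seconds".to_string());
        }
        if self.negative_ttl > self.ttl {
            return Err("negative_ttl_seconds must not exceed ttl_seconds".to_string());
        }
        Ok(())
    }
}

/// How a positive entry of a given age is treated on lookup.
#[derive(Debug, PartialEq, Eq)]
pub enum Freshness {
    Fresh,   // serve as-is
    Stale,   // serve, and refresh in the background
    Expired, // treat as a full miss
}

pub fn classify(age: Duration, cfg: &AuthCacheConfig) -> Freshness {
    if age >= cfg.ttl {
        Freshness::Expired
    } else if age >= cfg.soft_ttl {
        Freshness::Stale
    } else {
        Freshness::Fresh
    }
}
```

With the defaults, an entry younger than 30 s is served directly, one between 30 s and 60 s is served while a refresh runs, and anything older is reloaded before the request proceeds.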

Tradeoffs

Bounded staleness for IAM data. AWS IAM itself documents policy-propagation delays of ~30s–2min; we expose the same trade-off as auth.cache.ttl_seconds, configurable. This is a deliberate, scoped revision of the project's "No In-Process State" rule (docs/design/11-high-availability.md §D1) — see that doc for the full discussion. The data path remains uncached.

No verdict caching. We cache the inputs to authorization (policies, tags), not Allow/Deny decisions. Verdict caching has subtle correctness issues with conditions that depend on per-request context (aws:CurrentTime, IP, leading keys) and would need invalidation on every condition-relevant input change. Re-evaluating policies in CPU per request is essentially free once the inputs are cached.
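
A toy illustration of why the inputs are cached rather than the verdict: the same cached policy set can yield different answers for two requests. The policy model below is heavily simplified and entirely hypothetical (`Statement`, `RequestContext`, and `is_allowed` are not types from this PR).

```rust
use std::sync::Arc;

/// Hypothetical, simplified statement: allows `action`, optionally only
/// until `not_after` (epoch seconds), loosely modelling an aws:CurrentTime condition.
pub struct Statement {
    pub action: String,
    pub not_after: Option<u64>,
}

/// Per-request context that must never be baked into a cached value.
pub struct RequestContext {
    pub action: String,
    pub now_epoch_secs: u64,
}

/// Evaluate the cached inputs against this request's context.
pub fn is_allowed(policies: &Arc<Vec<Statement>>, ctx: &RequestContext) -> bool {
    policies.iter().any(|s| {
        s.action == ctx.action && s.not_after.map_or(true, |t| ctx.now_epoch_secs <= t)
    })
}
```

A cached Allow from the first request would be wrong for a later request that arrives after `not_after`; re-evaluating against the shared `Arc` avoids that without another catalog round-trip.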

Async fanout invalidation has a small window. Single-key invalidations (DeleteAccessKey, PutUserPolicy) drop the cache slot synchronously. Fanout invalidations (DeleteAccount, DeleteRole session sweep, DeleteGroup member fanout) use moka's invalidate_entries_if, which evaluates predicates asynchronously — there is a brief (~ms) window between the API responding success and the matching entries actually being evicted. Documented in the operator guide; for hard cutover (e.g. revoking a compromised key), prefer the single-key path.

Memory is bounded. Default 10,000 entries per cache (≈90,000 across the 9-cache authz block, plus 10,000 each for the credential and table-key-info caches), LRU-evicted under pressure. Empty results from tag and policy loaders cache at the short negative_ttl so probes don't fill the LRU with empty positive entries.

Out of scope (deferred)

Multi-instance cache invalidation. In a multi-frontend deployment, off-instance changes (a separate process modifying the catalog directly, or a different instance) wait up to ttl_seconds to propagate locally. Cross-process invalidation fanout is designed in detail in Appendix B of docs/design/12-auth-authz-cache.md (Postgres LISTEN/NOTIFY + durable backstop table for reconnect catch-up) but not implemented in this PR. The implementation is gated on a separate review pass — failure modes, schema, ordering guarantees, and operational backpressure semantics all need engineering input before the change lands. For now, multi-instance deployments must accept auth.cache.ttl_seconds as the worst-case lag; tune accordingly against IAM-revocation timing requirements.

Public type changes

StoredCredential gains one public field:

pub expires_at: Option<time::OffsetDateTime>,

None for long-lived AKIA* credentials, Some for ASIA* sessions. Required to enforce session expiry on the cache-hit path. No out-of-tree consumers of extenddb-auth exist today, but third-party CredentialStore implementations (if any) would need to populate this field.
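
The hit-path check this field enables could look roughly like the following. It is a sketch only: `std::time::SystemTime` stands in for `time::OffsetDateTime` to keep the example std-only, `CachedCredential` and `is_usable` are hypothetical names, and treating the expiry instant itself as already expired is an assumption.

```rust
use std::time::SystemTime;

/// Hypothetical, trimmed-down view of a cached credential.
pub struct CachedCredential {
    pub access_key_id: String,
    /// None for long-lived AKIA* keys, Some for ASIA* sessions.
    pub expires_at: Option<SystemTime>,
}

/// True if a cached credential may still be served at `now`.
/// Run on every cache hit, so an expired session is rejected
/// even when the cache entry itself has not aged out yet.
pub fn is_usable(cred: &CachedCredential, now: SystemTime) -> bool {
    match cred.expires_at {
        None => true,
        Some(deadline) => now < deadline,
    }
}
```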

Observability

/management/auth-cache-metrics (admin-authenticated JSON) exposes per-cache hit / stale-hit / miss / negative-hit / refresh-success / refresh-failure / refresh-skipped-inflight / refresh-dropped-epoch / invalidation counters and entry counts. Each cache also reports a pass_through: bool so operators can distinguish "cache disabled" from "cache cold."

The endpoint is admin-only because cache hit-rate and entry-count signals are workload-fingerprinting telemetry; operators who want Prometheus scraping should expose this via reverse proxy. (A future PR could mirror these into /metrics.)

Test plan

cargo fmt --all -- --check
cargo clippy --release --all-targets -- -D warnings
cargo test --workspace — 27 cache crate tests, 132 auth tests, 18 server tests, 219 core, 16 engine, 3 storage; all green.
extenddb init + serve against a clean Postgres; /health and /management/auth-cache-metrics smoke-tested.
Targeted pytest -k "test_abac or test_cache_coherence or test_console_cache_coherence": 17/17 pass.
Full pytest: 354 passed, 6 failed. The 6 failures are pre-existing concurrency + import/export issues unrelated to this work and are unchanged from main.

New tests added in this PR:

crates/cache/src/tests.rs: 27 unit tests covering single-flight, soft/hard TTL, negative caching, refresh-vs-invalidate races, refresh-vs-hard-miss races (Arc identity guard), panic safety, LRU bounds, pass-through, config validation.
Round-trip + invalidation tests in credential_cache.rs and authz_cache.rs.
tests/test_cache_coherence.py: end-to-end mgmt-API → SigV4 round-trips against a live server.
tests/test_console_cache_coherence.py: same invariants driven via the web console (form posts with CSRF) so management / console parity stays locked in against future refactor drift.

Documentation

docs/design/12-auth-authz-cache.md — full design (motivation, SWR refresh strategy, library choice, cache key shapes, error semantics, metrics, operator guide).
Appendix B of the same doc: deferred multi-instance fanout proposal.
docs/design/11-high-availability.md §D1 / §A3 — D1's "No In-Process State" rule reconciled with the new cache (scoped revision for IAM data only; data path remains uncached).
docs/manuals/05-admin-guide.md — [auth.cache] config reference + invalidation timing notes (sync vs. async + off-instance).
docs/getting-started.md — operator-level summary.

jcshepherd · 2026-05-27T18:49:27Z

+            .await;
+    }
+
+    /// Invalidate user-group-policy entries for **every** member of `group_name`.


The comment doesn't jibe with the function: there's no "group_name" in play in the code. Also, why are user policy boundaries and tags not in scope of this invalidate?

Good catch on both points.

On the stale comment: you're right, group_name was left over from an earlier draft that used invalidate_if over a group key. The function got refactored to take the member list from the caller, but the doc never caught up. Fixed in the latest commit.

On boundaries and tags: they're intentionally out of scope here, but the function name didn't make that obvious. Looking at every call site, this only fires on group-membership events: DeleteGroup, PutGroupPolicy, DeleteGroupPolicy. Those affect what policies a user inherits via group membership, i.e. only the user_group_policies cache. A user's permissions boundary and tags are per-user attributes that group events don't touch; they have their own dedicated invalidation methods (invalidate_user_boundary, invalidate_user_tags) called from the corresponding per-user mutation sites (PutUserPermissionsBoundary, TagUser, etc.).

So the function was actually doing slightly more than necessary (it was also dropping user_policies, which group events also don't affect). I've:

Renamed it to invalidate_users_group_policies so the scope is obvious from the name.

Dropped the redundant invalidate_user_policies call inside the loop.

Rewritten the doc comment to describe the actual fan-out behavior and explicitly note which per-user attributes are out of scope and why.

See c90a041 (the rename + tightening commit on top of the original feature).
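
For readers without the diff open, the tightened fan-out has roughly this shape. This is a std-only sketch with `HashMap`s standing in for the moka sub-caches; the struct, key type, signature, and return value are illustrative assumptions, not the code in c90a041.

```rust
use std::collections::HashMap;

/// (account_id, user_name); hypothetical key shape.
pub type UserKey = (String, String);

/// Hypothetical subset of the authz sub-caches.
pub struct AuthzCaches {
    pub user_policies: HashMap<UserKey, Vec<String>>,
    pub user_group_policies: HashMap<UserKey, Vec<String>>,
}

impl AuthzCaches {
    /// Drop the group-inherited policy entry for each listed member.
    /// The caller supplies the member list. user_policies, boundaries, and
    /// tags are per-user attributes and are deliberately left untouched.
    /// Returns how many entries were actually dropped.
    pub fn invalidate_users_group_policies(
        &mut self,
        account_id: &str,
        members: &[&str],
    ) -> usize {
        let mut dropped = 0;
        for &user in members {
            let key = (account_id.to_string(), user.to_string());
            if self.user_group_policies.remove(&key).is_some() {
                dropped += 1;
            }
        }
        dropped
    }
}
```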

jcshepherd

One or two minor comments/questions, but I think this looks okay.

One thing to maybe consider as a fast follow is to provide a way, through the console and the 'extenddb manage' CLI, to force a cache invalidation instead of requiring a server bounce.

jcshepherd · 2026-05-27T20:58:11Z

Please rebase and resolve conflicts; then we can merge. Thanks!

…key info

Adds an end-to-end stale-while-revalidate cache that eliminates the per-request catalog roundtrip for auth/authz data and table key info, plus the JSON-parse cost of policy documents. ~6 catalog queries per request in steady state become ~0; cold-start performance is unchanged.

Architecture

* New extenddb-cache crate: SwrCache<K, V, E> primitive built on moka, with config validation, single-flight on hard miss, refresh-ahead on stale hit, panic-safe RAII guards, epoch + Arc-identity guards against refresh-vs-write races, and a kill-switch pass-through mode that allocates no underlying moka.
* CachedCredentialStore wraps any CredentialStore. Re-validates session expires_at on every cache hit so cached ASIA* sessions cannot outlive their issued lifetime.
* CachedAuthzStore wraps any AuthorizationStore with 9 sub-caches (user/role policies, group policies, boundaries, principal tags, resource tags, session data). Cache values are wrapped in Arc so a hit is one atomic increment instead of an N-element clone. Empty results route to the short negative_ttl so probes don't poison the LRU.
* CachedTableKeyInfoStore wraps StorageEngine.table_key_info; maps TableNotFound to a negative cache, other errors propagate uncached.
* AuthCacheRegistry threads invalidation hooks through ServerComponents and AppState. Both the management API and the web console call the same hooks at every IAM mutation, so self-induced changes propagate instantly within the local instance.

Configuration ([auth.cache] in extenddb.toml)

enabled              true   # master kill switch
ttl_seconds          60     # hard TTL
soft_ttl_seconds     30     # stale-but-usable threshold
negative_ttl_seconds 5      # "not found" TTL
max_entries          10000  # per-cache LRU cap

Validated eagerly at startup; soft_ttl > ttl, negative_ttl > ttl, and zero values are rejected.

Invalidation coverage

Management API and web console mutations both call into the registry:

- Create/Delete/Import AccessKey, AssumeRole -> credential cache
- Put/Delete user/role/group policies -> matching policy cache
- Add/RemoveGroupMember -> user_group_policies
- Tag/Untag user/role/resource -> matching tags cache
- Set/Delete user/role boundary -> matching boundary cache
- DeleteUser / DeleteRole / DeleteGroup -> full cascade
- DeleteAccount -> sweeps every cache

Engine handlers invalidate after storage write succeeds:

- Create/Delete/Update Table -> table_key_info (+ resource_tags on Create/Delete)
- Import/Restore Table -> table_key_info
- Tag/UntagResource -> resource_tags

Observability

/management/auth-cache-metrics (admin-authenticated JSON) exposes per-cache hit/miss/refresh/invalidation counters, entry counts, and a pass_through flag distinguishing "cache disabled" from "cache cold."

Out of scope (deferred)

Multi-instance cache invalidation fanout. Designed in detail in Appendix B of docs/design/12-auth-authz-cache.md (Postgres LISTEN/NOTIFY + durable backstop) but not implemented in this PR. For multi-frontend deployments, off-instance changes wait up to ttl_seconds to propagate; operators tune ttl_seconds against their IAM-revocation timing requirements.

Public type changes

StoredCredential gained a public field expires_at: Option<OffsetDateTime>. None for AKIA* keys, Some for ASIA* sessions. Required to enforce session expiry on the cache-hit path.

Tests

- 27 unit tests in extenddb-cache covering single-flight, soft/hard TTL, negative caching, refresh-vs-invalidate / refresh-vs-hard-miss races, panic safety, LRU bounds, pass-through.
- Round-trip + invalidation tests for credential, authz, key-info.
- tests/test_cache_coherence.py: end-to-end mgmt-API -> SigV4 round-trips against a live server.
- tests/test_console_cache_coherence.py: same invariants driven via the web console (form posts with CSRF) so management/console parity stays locked in.

Verification

cargo fmt --check, clippy --all-targets -D warnings, cargo test --workspace all clean. Full pytest against a freshly-init'd server: 354 passed, 6 failed (pre-existing concurrency + import_export failures unrelated to this work).

Design doc: docs/design/12-auth-authz-cache.md. Relationship to "No In-Process State" rule: documented as a deliberate, scoped revision in docs/design/11-high-availability.md §D1.

…p_policies

Addresses PR review feedback. The previous name and stale doc comment implied broader scope (invalidating all per-user attributes) than the function actually had. Every caller is a group-membership event (DeleteGroup, PutGroupPolicy, DeleteGroupPolicy) that only affects the user_group_policies cache; user_policies, user_boundary, and user_tags are per-user attributes untouched by group events and have their own dedicated invalidation methods called from the corresponding per-user mutation sites.

* Rename the method on CachedAuthzStore and the AuthzCacheInvalidator trait + AuthCacheRegistry wrapper.
* Drop the redundant invalidate_user_policies call inside the loop.
* Rewrite the doc comment to describe the actual fan-out behavior and explicitly note which per-user attributes are out of scope and why.
* Update all four call sites (management iam_group/iam_policy, console group_pages/policy_pages).

…console page

Adds manual cache invalidation as a complement to the automatic write-through hooks. Operators reach for this when off-instance changes have not yet expired, when a write-through hook is suspected of being broken, or when a test needs a deterministic flush. Documented in docs/design/12-auth-authz-cache.md §6.1.

* POST /management/cache/invalidate: admin-authenticated, single endpoint with a tagged-enum scope discriminator. Supported scopes: all, account, credential, user, role, group_members, table_key_info, resource_tags. Composite scopes (user, role) report each subcache they touched in the response. scope=all requires confirm:true to prevent accidental flushes; the credential, authz, and table-key-info caches are swept independently.
* extenddb manage cache invalidate <scope> ... CLI subcommand, matching the API's scope taxonomy 1:1. Uses the same admin Basic-auth flow as the rest of the manage CLI; --yes is required for `cache invalidate all` (mirrors destroy/import-access-key).
* /console/cache admin-only page and POST /console/cache/invalidate form handler. Both delegate to the same `apply` helper as the management API so behavior stays byte-identical regardless of how the operator triggers it. scope=all requires typing INVALIDATE in the confirmation field.
* Audit trail: every invocation logs at INFO with admin name + scope + selectors. Operators correlate this with the per-subcache `invalidations` counter on /management/auth-cache-metrics.
* Plumbing: new invalidate_all() on CachedAuthzStore (sweeps every subcache) and CachedTableKeyInfoStore. ConsoleState gains the concrete authz/table-key-info handles already held by ManagementState so the console can call the shared helper.
* Tests: pytest cases covering scope=user end-to-end, scope=account, scope=all confirmation gating, missing-selector → 400, IAM-user → 403, unauthenticated → 401, console form-submit success page, and the typed-confirmation requirement on scope=all.

Addresses items F1–F8 from the PR review on the manual cache invalidation feature.

* F1: drop the broken /console/docs/12-auth-authz-cache link from the console cache page (design docs are not published through the docs manifest).
* F2: audit log now includes every selector via `selectors = ?s` (Debug skips None fields), matching the §6.1 example. Forensic value goes from `scope=user account_id=...` to the full target.
* F3: console parse_scope now drives off Scope's snake_case Deserialize impl. New scope variants no longer require a parallel edit in the console layer.
* F4: console success page renders Scope via a new Scope::as_str() that returns the canonical snake_case identifier, matching the JSON wire format and the design doc table. Audit log uses the same helper.
* F5: AuthzCacheInvalidator and TableKeyInfoCacheInvalidator gain `invalidate_all`, and AuthCacheRegistry gains `invalidate_all_caches`. `apply` now takes only `&AuthCacheRegistry`; `ConsoleState` no longer needs the concrete authz/table-key-info handles. Cleans up the awkward triple-handle signature.
* F6: design doc §6.1 now flags that `scope=account` does not sweep `table_key_info` (per-table cache; use `scope=all` or `scope=table_key_info` per table).
* F7: progressive disclosure on the console form. Selector fields are now hidden when they don't apply to the chosen scope. The page still works without JS: every field renders, the server ignores irrelevant ones.
* F8: test_manual_invalidate_user_increments_counters → renamed to test_manual_invalidate_user_forces_refetch and now asserts the `misses` counter increases on the next request, proving the cached entry was actually dropped (not just that the `invalidations` counter ticked, which always does).

engjan · 2026-05-29T18:02:00Z

Added admin break-glass cache invalidation per the suggestion to expose this via the CLI and console.

Surface

POST /management/cache/invalidate — admin-authenticated, single endpoint with a scope discriminator and per-scope selectors.
extenddb manage cache invalidate ... — CLI subcommand, scope taxonomy 1:1 with the API.
/console/cache — admin-only page with a form (progressive disclosure on the selector inputs).

All three paths share the same apply helper, so behavior is identical regardless of how an operator triggers it.

Scopes

┌────────────────┬─────────────────────────────────────┬────────────────────────────────────────────────────────────────┐
│ scope │ Selectors │ Fans out to │
├────────────────┼─────────────────────────────────────┼────────────────────────────────────────────────────────────────┤
│ all │ (requires --yes / typed INVALIDATE) │ every cache │
├────────────────┼─────────────────────────────────────┼────────────────────────────────────────────────────────────────┤
│ account │ account_id │ authz + credentials for the account │
├────────────────┼─────────────────────────────────────┼────────────────────────────────────────────────────────────────┤
│ credential │ access_key_id │ one cached key │
├────────────────┼─────────────────────────────────────┼────────────────────────────────────────────────────────────────┤
│ user │ account_id, user_name │ user_policies + group_policies + boundary + tags + credentials │
├────────────────┼─────────────────────────────────────┼────────────────────────────────────────────────────────────────┤
│ role │ account_id, role_name │ role_policies + boundary + tags + sessions + credentials │
├────────────────┼─────────────────────────────────────┼────────────────────────────────────────────────────────────────┤
│ group_members │ account_id, user_names[] │ user_group_policies for each member │
├────────────────┼─────────────────────────────────────┼────────────────────────────────────────────────────────────────┤
│ table_key_info │ account_id, table_name │ one cached table │
├────────────────┼─────────────────────────────────────┼────────────────────────────────────────────────────────────────┤
│ resource_tags │ arn │ one ARN │
└────────────────┴─────────────────────────────────────┴────────────────────────────────────────────────────────────────┘

Composite scopes (user, role) match what an operator usually wants ("forget everything about user X"); narrow scopes are still exposed for surgical use.
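
A sketch of how the scope taxonomy in the table might be modelled. The variant and field layout is a guess from the selectors column; serde derives are omitted to keep the example std-only, and `check_confirmation` is a hypothetical helper. Only `as_str()` returning the snake_case identifier is taken from the commit notes.

```rust
/// Hypothetical model of the invalidation scopes and their selectors.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Scope {
    All,
    Account { account_id: String },
    Credential { access_key_id: String },
    User { account_id: String, user_name: String },
    Role { account_id: String, role_name: String },
    GroupMembers { account_id: String, user_names: Vec<String> },
    TableKeyInfo { account_id: String, table_name: String },
    ResourceTags { arn: String },
}

impl Scope {
    /// Canonical snake_case identifier, matching the names in the table.
    pub fn as_str(&self) -> &'static str {
        match self {
            Scope::All => "all",
            Scope::Account { .. } => "account",
            Scope::Credential { .. } => "credential",
            Scope::User { .. } => "user",
            Scope::Role { .. } => "role",
            Scope::GroupMembers { .. } => "group_members",
            Scope::TableKeyInfo { .. } => "table_key_info",
            Scope::ResourceTags { .. } => "resource_tags",
        }
    }

    /// Only the full flush needs explicit confirmation.
    pub fn check_confirmation(&self, confirm: bool) -> Result<(), String> {
        if matches!(self, Scope::All) && !confirm {
            Err("scope=all requires explicit confirmation".to_string())
        } else {
            Ok(())
        }
    }
}
```

Carrying the selectors inside each variant means a request with a missing selector cannot be represented at all, which is one plausible way to get the missing-selector rejection for free at parse time.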

Safety

scope=all requires explicit confirmation (--yes from CLI, typed token from console). The endpoint refuses without it.
Audit log line per invocation (tracing::info! with admin name + scope + selectors), in addition to the existing per-subcache invalidations counters on /management/auth-cache-metrics.
Disabled-cache mode is fine: registry methods are no-ops, endpoint returns 200 — the operator's contract is trivially satisfied.
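
The disabled-mode behavior can be pictured with a small std-only sketch. `KeyInfoCache`, its fields, and the `loads` counter are hypothetical; a `HashMap` stands in for moka, and TTLs and single-flight are left out.

```rust
use std::collections::HashMap;

/// Hypothetical wrapper: `inner` is None when the cache is disabled,
/// so pass-through mode allocates no map at all.
pub struct KeyInfoCache {
    inner: Option<HashMap<String, String>>,
    /// Number of calls that reached the underlying store.
    pub loads: u64,
}

impl KeyInfoCache {
    pub fn new(enabled: bool) -> Self {
        Self { inner: enabled.then(HashMap::new), loads: 0 }
    }

    pub fn pass_through(&self) -> bool {
        self.inner.is_none()
    }

    /// Serve from the cache when enabled; otherwise forward every call.
    pub fn get_or_load(&mut self, key: &str, load: impl FnOnce(&str) -> String) -> String {
        if let Some(map) = self.inner.as_mut() {
            if let Some(hit) = map.get(key) {
                return hit.clone();
            }
            let value = load(key);
            map.insert(key.to_string(), value.clone());
            self.loads += 1;
            return value;
        }
        self.loads += 1;
        load(key)
    }

    /// No-op when the cache is disabled, so callers never need to branch.
    pub fn invalidate(&mut self, key: &str) {
        if let Some(map) = self.inner.as_mut() {
            map.remove(key);
        }
    }
}
```

Because `invalidate` silently does nothing in pass-through mode, an endpoint built on top of it can return success either way, which matches the contract described above.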

Tests

9 new pytest cases covering the API, the auth gating (401/403/admin-only), the missing-selector validation, the confirm requirement on all, the console form, and the typed-confirmation token. All pass against a live server.

Out of scope (deferred)

Cross-instance fanout. Manual invalidation only clears the local instance, same as the automatic write-through hooks. Documented in §6.1.
Cache content inspection. Different threat model (admin-readable mirror of cached credential material).

Design doc: docs/design/12-auth-authz-cache.md §6.1.

Branch is rebased.

jcshepherd

Thanks for clarifying the group invalidation behavior and for adding the ability to manually invalidate caches: we should move ahead with it.

engjan requested review from LeeroyHannigan, jcshepherd and pdf-amzn May 23, 2026 04:18

jcshepherd reviewed May 27, 2026

jcshepherd previously approved these changes May 27, 2026

Jan Engelsberg added 4 commits May 29, 2026 10:41

engjan dismissed jcshepherd’s stale review via c90a041 May 29, 2026 17:56

engjan force-pushed the feat/auth-cache-pr branch from d640a77 to c90a041 Compare May 29, 2026 17:57

engjan requested review from amrith and c33howard as code owners May 29, 2026 17:57

jcshepherd approved these changes May 29, 2026

Merge branch 'main' into feat/auth-cache-pr

b959abb

engjan wants to merge 5 commits into main from feat/auth-cache-pr

3 participants
