feat(auth-cache): SWR cache for credentials, IAM policies, and table-key info#122

Open
engjan wants to merge 5 commits into main from feat/auth-cache-pr

Conversation

@engjan
Collaborator

@engjan engjan commented May 23, 2026

Why

Each DynamoDB request issues ~6 catalog queries before dispatch can begin: credential lookup, identity policies, group policies, permissions boundary, principal tags, resource tags, plus a 7th lookup for TableKeyInfo on item-level operations. Queries 2–6 fan out concurrently via tokio::try_join!, but they each consume a catalog-pool connection and reparse policy JSON on every hit. Under load this is the dominant pre-dispatch overhead — no pool size fixes it, because the bottleneck is the round-trip itself.

This PR layers an in-memory stale-while-revalidate cache over each of those reads to drive steady-state catalog roundtrips to ~0 while keeping the pre-cache behavior reachable via a kill switch.

What

Adds an extenddb-cache crate (SwrCache<K, V, E> primitive on top of moka) plus three wrappers:

  • CachedCredentialStore wraps any CredentialStore. Re-validates expires_at on every cache hit so cached ASIA* sessions cannot outlive their issued lifetime.
  • CachedAuthzStore wraps any AuthorizationStore with 9 sub-caches (user/role policies, group policies, boundaries, principal tags, resource tags, session data). Cache values are Arc-wrapped so a hit is one atomic increment instead of an N-element clone.
  • CachedTableKeyInfoStore wraps StorageEngine.table_key_info with negative caching for TableNotFound.
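
To illustrate the Arc-wrapping point above, here is a minimal std-only sketch. The function name `serve_hit` and the `Vec<String>` payload are assumptions for illustration, not the crate's actual types; the idea is only that a cache hit hands out another handle to the same allocation rather than copying the policy list.

```rust
use std::sync::Arc;

/// Hypothetical hit path: returns a new handle to the cached value.
/// `Arc::clone` bumps a reference count; the underlying Vec is not copied.
pub fn serve_hit(cached: &Arc<Vec<String>>) -> Arc<Vec<String>> {
    Arc::clone(cached)
}
```

Both handles point at the same allocation, so the cost of a hit does not grow with the number of cached policies.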

A new AuthCacheRegistry threads invalidation hooks through ServerComponents and AppState so every IAM mutation — from both the management API and the web console — invalidates the matching cache entry write-through. Self-induced changes propagate instantly within the local instance.

Configuration ([auth.cache] in extenddb.toml):

enabled              = true   # master kill switch
ttl_seconds          = 60     # hard TTL — entries beyond this are full misses
soft_ttl_seconds     = 30     # stale-but-usable threshold; refresh-ahead
negative_ttl_seconds = 5      # how long "not found" results are remembered
max_entries          = 10000  # per-cache LRU cap
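
As a rough sketch of how the two TTLs above partition an entry's lifetime: the enum and function names below are hypothetical, and the exact boundary behavior (whether an entry aged exactly `ttl_seconds` counts as expired) is an assumption, not something the PR states.

```rust
use std::time::Duration;

/// How a cached entry is treated based on its age (illustrative names).
#[derive(Debug, PartialEq, Eq)]
pub enum Freshness {
    /// Younger than the soft TTL: serve as-is.
    Fresh,
    /// Between the soft and hard TTL: serve, and refresh in the background.
    Stale,
    /// At or past the hard TTL: treat as a full miss.
    Expired,
}

/// Classifies an entry; `soft_ttl` and `hard_ttl` mirror
/// `soft_ttl_seconds` and `ttl_seconds` from the config.
pub fn classify(age: Duration, soft_ttl: Duration, hard_ttl: Duration) -> Freshness {
    if age >= hard_ttl {
        Freshness::Expired
    } else if age >= soft_ttl {
        Freshness::Stale
    } else {
        Freshness::Fresh
    }
}
```

With the defaults, an entry read 45 seconds after it was loaded would fall in the stale-but-usable band.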

When enabled = false, the wrappers operate in pass-through mode and forward every request directly to the underlying store. The pass-through path allocates no underlying moka instance (Option<MokaCache>) — it's a true kill switch, not a configured-to-expire-instantly cache.
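
The kill-switch shape described above might look roughly like the following. This is a simplified sketch that uses a plain `HashMap` in place of moka and a hypothetical `PassThroughCache` type; it only shows that the disabled path holds `None` and calls the loader on every request.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a cache wrapper. `inner` is `None` when the
/// kill switch is off, so no cache storage is allocated at all.
pub struct PassThroughCache {
    inner: Option<HashMap<String, String>>,
}

impl PassThroughCache {
    pub fn new(enabled: bool) -> Self {
        Self { inner: enabled.then(HashMap::new) }
    }

    pub fn is_pass_through(&self) -> bool {
        self.inner.is_none()
    }

    /// In pass-through mode `load` runs on every call; otherwise it runs
    /// only when the key is not cached yet.
    pub fn get_or_load(&mut self, key: &str, load: impl FnOnce(&str) -> String) -> String {
        match self.inner.as_mut() {
            None => load(key),
            Some(map) => {
                if let Some(value) = map.get(key) {
                    return value.clone();
                }
                let value = load(key);
                map.insert(key.to_string(), value.clone());
                value
            }
        }
    }
}
```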

Tradeoffs

Bounded staleness for IAM data. AWS IAM itself documents policy-propagation delays of ~30s–2min; we expose the same trade-off as auth.cache.ttl_seconds, configurable. This is a deliberate, scoped revision of the project's "No In-Process State" rule (docs/design/11-high-availability.md §D1) — see that doc for the full discussion. The data path remains uncached.

No verdict caching. We cache the inputs to authorization (policies, tags), not Allow/Deny decisions. Verdict caching has subtle correctness issues with conditions that depend on per-request context (aws:CurrentTime, IP, leading keys) and would need invalidation on every condition-relevant input change. Re-evaluating policies in CPU per request is essentially free once the inputs are cached.
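
A small sketch of why the inputs, not the verdicts, are what gets cached. The `Policy` and `RequestContext` types are deliberately simplified and hypothetical (the real policy model is richer); the point is that the same cached policies can yield different decisions as per-request context such as the current time changes.

```rust
use std::sync::Arc;

/// Simplified, hypothetical policy: allows one action, optionally only
/// until a deadline (standing in for an aws:CurrentTime condition).
pub struct Policy {
    pub allow_action: String,
    pub not_after_epoch_secs: Option<u64>,
}

/// Per-request inputs that are never cached.
pub struct RequestContext<'a> {
    pub action: &'a str,
    pub now_epoch_secs: u64,
}

/// Re-evaluates the cached policies against the current request.
pub fn evaluate(policies: &[Arc<Policy>], ctx: &RequestContext<'_>) -> bool {
    policies.iter().any(|p| {
        p.allow_action == ctx.action
            && p.not_after_epoch_secs.map_or(true, |deadline| ctx.now_epoch_secs <= deadline)
    })
}
```

A cached Allow for this policy would be wrong once the deadline passes, whereas re-evaluating the cached inputs stays correct.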

Async fanout invalidation has a small window. Single-key invalidations (DeleteAccessKey, PutUserPolicy) drop the cache slot synchronously. Fanout invalidations (DeleteAccount, DeleteRole session sweep, DeleteGroup member fanout) use moka's invalidate_entries_if, which evaluates predicates asynchronously — there is a brief (~ms) window between the API responding success and the matching entries actually being evicted. Documented in the operator guide; for hard cutover (e.g. revoking a compromised key), prefer the single-key path.
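
The two invalidation shapes can be sketched as follows. This uses a plain `HashMap` and an assumed `(account_id, access_key_id)` key shape, both hypothetical; unlike moka's `invalidate_entries_if`, the sweep here completes synchronously, so it does not reproduce the brief eviction window described above.

```rust
use std::collections::HashMap;

/// Hypothetical cache key: (account_id, access_key_id).
pub type CredKey = (String, String);

/// Single-key path: the entry is gone as soon as this returns.
pub fn invalidate_one(cache: &mut HashMap<CredKey, String>, key: &CredKey) -> bool {
    cache.remove(key).is_some()
}

/// Fan-out path: drops every entry belonging to one account and
/// reports how many entries were removed.
pub fn invalidate_account(cache: &mut HashMap<CredKey, String>, account_id: &str) -> usize {
    let before = cache.len();
    cache.retain(|key, _| key.0.as_str() != account_id);
    before - cache.len()
}
```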

Memory is bounded. Default 10,000 entries per cache (≈90,000 across the 9-cache authz block + credential + table-key-info), LRU-evicted under pressure. Empty results from tag and policy loaders cache at the short negative_ttl so probes don't fill the LRU with empty positive entries.

Out of scope (deferred)

Multi-instance cache invalidation. In a multi-frontend deployment, off-instance changes (a separate process modifying the catalog directly, or a different instance) wait up to ttl_seconds to propagate locally. Cross-process invalidation fanout is designed in detail in Appendix B of docs/design/12-auth-authz-cache.md (Postgres LISTEN/NOTIFY + durable backstop table for reconnect catch-up) but not implemented in this PR. The implementation is gated on a separate review pass — failure modes, schema, ordering guarantees, and operational backpressure semantics all need engineering input before the change lands. For now, multi-instance deployments must accept auth.cache.ttl_seconds as the worst-case lag; tune accordingly against IAM-revocation timing requirements.

Public type changes

StoredCredential gains one public field:

pub expires_at: Option<time::OffsetDateTime>,

None for long-lived AKIA* credentials, Some for ASIA* sessions. Required to enforce session expiry on the cache-hit path. No out-of-tree consumers of extenddb-auth exist today, but third-party CredentialStore implementations (if any) would need to populate this field.
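
A minimal sketch of the hit-path expiry check this field enables. The struct below is a simplified stand-in for `StoredCredential`, and it uses `std::time::SystemTime` instead of `time::OffsetDateTime` purely to stay dependency-free; the function name is hypothetical.

```rust
use std::time::SystemTime;

/// Simplified stand-in for StoredCredential (the real field type is
/// Option<time::OffsetDateTime>).
#[derive(Clone, Debug)]
pub struct CachedCredential {
    pub access_key_id: String,
    pub expires_at: Option<SystemTime>,
}

/// Hypothetical check run on every cache hit: `None` never expires,
/// `Some(deadline)` is usable only while `now` is before the deadline.
pub fn usable_on_hit(cred: &CachedCredential, now: SystemTime) -> bool {
    match cred.expires_at {
        None => true,
        Some(deadline) => now < deadline,
    }
}
```

A cached session entry can therefore be rejected before its cache TTL runs out, which is what keeps it from outliving its issued lifetime.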

Observability

/management/auth-cache-metrics (admin-authenticated JSON) exposes per-cache hit / stale-hit / miss / negative-hit / refresh-success / refresh-failure / refresh-skipped-inflight / refresh-dropped-epoch / invalidation counters and entry counts. Each cache also reports a pass_through: bool so operators can distinguish "cache disabled" from "cache cold."
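
To show how an operator-side tool might use the `pass_through` flag, here is a hedged sketch. The `CacheSnapshot` struct and its field names are assumptions loosely modeled on the counters listed above, not the endpoint's actual JSON schema.

```rust
/// Hypothetical subset of one cache's metrics.
#[derive(Debug, Default, Clone)]
pub struct CacheSnapshot {
    pub hits: u64,
    pub stale_hits: u64,
    pub misses: u64,
    pub negative_hits: u64,
    pub entries: u64,
    pub pass_through: bool,
}

/// Distinguishes "cache disabled" from "cache cold" from "cache in use".
pub fn describe(s: &CacheSnapshot) -> &'static str {
    let lookups = s.hits + s.stale_hits + s.misses + s.negative_hits;
    if s.pass_through {
        "disabled"
    } else if lookups == 0 && s.entries == 0 {
        "cold"
    } else {
        "warm"
    }
}
```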

The endpoint is admin-only because cache hit-rate and entry-count signals are workload-fingerprinting telemetry; operators who want Prometheus scraping should expose this via reverse proxy. (A future PR could mirror these into /metrics.)

Test plan

  • cargo fmt --all -- --check
  • cargo clippy --release --all-targets -- -D warnings
  • cargo test --workspace — 27 cache crate tests, 132 auth tests, 18 server tests, 219 core, 16 engine, 3 storage; all green.
  • extenddb init + serve against a clean Postgres; /health and /management/auth-cache-metrics smoke-tested.
  • Targeted pytest -k "test_abac or test_cache_coherence or test_console_cache_coherence": 17/17 pass.
  • Full pytest: 354 passed, 6 failed. The 6 failures are pre-existing concurrency + import/export issues unrelated to this work and are unchanged from main.

New tests added in this PR:

  • crates/cache/src/tests.rs: 27 unit tests covering single-flight, soft/hard TTL, negative caching, refresh-vs-invalidate races, refresh-vs-hard-miss races (Arc identity guard), panic safety, LRU bounds, pass-through, config validation.
  • Round-trip + invalidation tests in credential_cache.rs and authz_cache.rs.
  • tests/test_cache_coherence.py: end-to-end mgmt-API → SigV4 round-trips against a live server.
  • tests/test_console_cache_coherence.py: same invariants driven via the web console (form posts with CSRF) so management / console parity stays locked in against future refactor drift.

Documentation

  • docs/design/12-auth-authz-cache.md — full design (motivation, SWR refresh strategy, library choice, cache key shapes, error semantics, metrics, operator guide).
  • Appendix B of the same doc: deferred multi-instance fanout proposal.
  • docs/design/11-high-availability.md §D1 / §A3 — D1's "No In-Process State" rule reconciled with the new cache (scoped revision for IAM data only; data path remains uncached).
  • docs/manuals/05-admin-guide.md: [auth.cache] config reference + invalidation timing notes (sync vs. async + off-instance).
  • docs/getting-started.md — operator-level summary.

Comment thread crates/server/src/authz_cache.rs Outdated
.await;
}

/// Invalidate user-group-policy entries for **every** member of `group_name`.
Collaborator

The comment doesn't jibe with the function: there's no "group_name" in play in the code. Also, why are user policy boundaries and tags not in scope of this invalidation?

Collaborator Author

Good catch on both points.

On the stale comment: you're right. group_name was left over from an earlier draft that used invalidate_if over a group key. The function got refactored to take the member list from the caller, but the doc never caught up. Fixed in the latest commit.

On boundaries and tags: they're intentionally out of scope here, but the function name didn't make that obvious. Looking at every call site, this only fires on group-membership events: DeleteGroup, PutGroupPolicy, DeleteGroupPolicy. Those affect what policies a user inherits via group membership, i.e. only the user_group_policies cache. A user's permissions boundary and tags are per-user attributes that group events don't touch; they have their own dedicated invalidation methods (invalidate_user_boundary, invalidate_user_tags) called from the corresponding per-user mutation sites (PutUserPermissionsBoundary, TagUser, etc.).

So the function was actually doing slightly more than necessary (it was also dropping user_policies, which group events also don't affect). I've:

  1. Renamed it to invalidate_users_group_policies so the scope is obvious from the name.
  2. Dropped the redundant invalidate_user_policies call inside the loop.
  3. Rewritten the doc comment to describe the actual fan-out behavior and explicitly note which per-user attributes are out of scope and why.

See c90a041 (the rename + tightening commit on top of the original feature).

jcshepherd
jcshepherd previously approved these changes May 27, 2026
Collaborator

@jcshepherd jcshepherd left a comment

One or two minor comments/questions, but I think this looks okay.

One thing to maybe consider as a fast follow is to provide a way through the console and the 'extenddb manage' CLI to force a cache invalidation, instead of requiring a server bounce.

@jcshepherd
Collaborator

Please rebase and resolve conflicts; then we can merge. Thanks!

Jan Engelsberg added 4 commits May 29, 2026 10:41
…key info

Adds an end-to-end stale-while-revalidate cache that eliminates the per-
request catalog roundtrip for auth/authz data and table key info, plus
the JSON-parse cost of policy documents. ~6 catalog queries per request
in steady state become ~0; cold-start performance is unchanged.

Architecture

  * New extenddb-cache crate: SwrCache<K, V, E> primitive built on moka,
    with config validation, single-flight on hard miss, refresh-ahead
    on stale hit, panic-safe RAII guards, epoch + Arc-identity guards
    against refresh-vs-write races, and a kill-switch pass-through mode
    that allocates no underlying moka.

  * CachedCredentialStore wraps any CredentialStore. Re-validates
    session expires_at on every cache hit so cached ASIA* sessions
    cannot outlive their issued lifetime.

  * CachedAuthzStore wraps any AuthorizationStore with 9 sub-caches
    (user/role policies, group policies, boundaries, principal tags,
    resource tags, session data). Cache values are wrapped in Arc so a
    hit is one atomic increment instead of an N-element clone. Empty
    results route to the short negative_ttl so probes don't poison the
    LRU.

  * CachedTableKeyInfoStore wraps StorageEngine.table_key_info; maps
    TableNotFound to a negative cache, other errors propagate uncached.

  * AuthCacheRegistry threads invalidation hooks through ServerComponents
    and AppState. Both the management API and the web console call the
    same hooks at every IAM mutation, so self-induced changes propagate
    instantly within the local instance.
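
The TableNotFound handling in the list above can be sketched as a small decision function. The `LookupError` and `CacheDecision` types are hypothetical simplifications; they only illustrate the stated rule that a successful lookup is cached, TableNotFound is remembered as a negative entry, and any other error propagates without being cached.

```rust
/// Hypothetical error type for a table-key-info lookup.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum LookupError {
    TableNotFound,
    Storage(String),
}

/// What the cache does with the result of a loader call.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum CacheDecision<V> {
    /// Store the value under the normal TTLs.
    StorePositive(V),
    /// Remember "not found" for the short negative TTL.
    StoreNegative,
    /// Return the error to the caller and cache nothing.
    DoNotCache(LookupError),
}

pub fn decide<V>(result: Result<V, LookupError>) -> CacheDecision<V> {
    match result {
        Ok(value) => CacheDecision::StorePositive(value),
        Err(LookupError::TableNotFound) => CacheDecision::StoreNegative,
        Err(other) => CacheDecision::DoNotCache(other),
    }
}
```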

Configuration ([auth.cache] in extenddb.toml)

  enabled              true        # master kill switch
  ttl_seconds          60          # hard TTL
  soft_ttl_seconds     30          # stale-but-usable threshold
  negative_ttl_seconds 5           # "not found" TTL
  max_entries          10000       # per-cache LRU cap

  Validated eagerly at startup; soft_ttl > ttl, negative_ttl > ttl, and
  zero values are rejected.
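
The validation rules above might be expressed roughly like this. The struct mirrors the `[auth.cache]` keys, but the type name, the `validate` function, and the error strings are assumptions; in particular, treating `max_entries = 0` as one of the rejected zero values is a guess.

```rust
/// Hypothetical mirror of the [auth.cache] section.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct AuthCacheConfig {
    pub enabled: bool,
    pub ttl_seconds: u64,
    pub soft_ttl_seconds: u64,
    pub negative_ttl_seconds: u64,
    pub max_entries: u64,
}

impl Default for AuthCacheConfig {
    /// Defaults as listed in the configuration block.
    fn default() -> Self {
        Self {
            enabled: true,
            ttl_seconds: 60,
            soft_ttl_seconds: 30,
            negative_ttl_seconds: 5,
            max_entries: 10_000,
        }
    }
}

/// Eager startup check: rejects zero values and TTLs above the hard TTL.
pub fn validate(cfg: &AuthCacheConfig) -> Result<(), String> {
    if cfg.ttl_seconds == 0
        || cfg.soft_ttl_seconds == 0
        || cfg.negative_ttl_seconds == 0
        || cfg.max_entries == 0
    {
        return Err("zero values are not allowed".to_string());
    }
    if cfg.soft_ttl_seconds > cfg.ttl_seconds {
        return Err("soft_ttl_seconds must not exceed ttl_seconds".to_string());
    }
    if cfg.negative_ttl_seconds > cfg.ttl_seconds {
        return Err("negative_ttl_seconds must not exceed ttl_seconds".to_string());
    }
    Ok(())
}
```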

Invalidation coverage

  Management API and web console mutations both call into the registry:
    - Create/Delete/Import AccessKey, AssumeRole -> credential cache
    - Put/Delete user/role/group policies        -> matching policy cache
    - Add/RemoveGroupMember                      -> user_group_policies
    - Tag/Untag user/role/resource               -> matching tags cache
    - Set/Delete user/role boundary              -> matching boundary cache
    - DeleteUser / DeleteRole / DeleteGroup      -> full cascade
    - DeleteAccount                              -> sweeps every cache
  Engine handlers invalidate after storage write succeeds:
    - Create/Delete/Update Table   -> table_key_info (+ resource_tags
                                       on Create/Delete)
    - Import/Restore Table         -> table_key_info
    - Tag/UntagResource            -> resource_tags

Observability

  /management/auth-cache-metrics (admin-authenticated JSON) exposes per-
  cache hit/miss/refresh/invalidation counters, entry counts, and a
  pass_through flag distinguishing "cache disabled" from "cache cold."

Out of scope (deferred)

  Multi-instance cache invalidation fanout. Designed in detail in
  Appendix B of docs/design/12-auth-authz-cache.md (Postgres LISTEN/
  NOTIFY + durable backstop) but not implemented in this PR. For
  multi-frontend deployments, off-instance changes wait up to
  ttl_seconds to propagate; operators tune ttl_seconds against their
  IAM-revocation timing requirements.

Public type changes

  StoredCredential gained a public field expires_at:
  Option<OffsetDateTime>. None for AKIA* keys, Some for ASIA* sessions.
  Required to enforce session expiry on the cache-hit path.

Tests

  - 27 unit tests in extenddb-cache covering single-flight, soft/hard
    TTL, negative caching, refresh-vs-invalidate / refresh-vs-hard-miss
    races, panic safety, LRU bounds, pass-through.
  - Round-trip + invalidation tests for credential, authz, key-info.
  - tests/test_cache_coherence.py: end-to-end mgmt-API -> SigV4 round-
    trips against a live server.
  - tests/test_console_cache_coherence.py: same invariants driven via
    the web console (form posts with CSRF) so management/console
    parity stays locked in.

Verification

  cargo fmt --check, clippy --all-targets -D warnings, cargo test
  --workspace all clean. Full pytest against a freshly-init'd server:
  354 passed, 6 failed (pre-existing concurrency + import_export
  failures unrelated to this work).

Design doc: docs/design/12-auth-authz-cache.md.
Relationship to "No In-Process State" rule: documented as a deliberate,
scoped revision in docs/design/11-high-availability.md §D1.
…p_policies

Addresses PR review feedback. The previous name and stale doc comment
implied broader scope (invalidating all per-user attributes) than the
function actually had. Every caller is a group-membership event
(DeleteGroup, PutGroupPolicy, DeleteGroupPolicy) that only affects the
user_group_policies cache; user_policies, user_boundary, and user_tags
are per-user attributes untouched by group events and have their own
dedicated invalidation methods called from the corresponding per-user
mutation sites.

  * Rename the method on CachedAuthzStore and the
    AuthzCacheInvalidator trait + AuthCacheRegistry wrapper.
  * Drop the redundant invalidate_user_policies call inside the loop.
  * Rewrite the doc comment to describe the actual fan-out behavior
    and explicitly note which per-user attributes are out of scope
    and why.
  * Update all four call sites (management iam_group/iam_policy,
    console group_pages/policy_pages).
…console page

Adds manual cache invalidation as a complement to the automatic write-
through hooks. Operators reach for this when off-instance changes have
not yet expired, when a write-through hook is suspected of being broken,
or when a test needs a deterministic flush. Documented in
docs/design/12-auth-authz-cache.md §6.1.

  * POST /management/cache/invalidate — admin-authenticated, single
    endpoint with a tagged-enum scope discriminator. Supported scopes:
    all, account, credential, user, role, group_members, table_key_info,
    resource_tags. Composite scopes (user, role) report each subcache
    they touched in the response. scope=all requires confirm:true to
    prevent accidental flushes; the credential, authz, and table-key-
    info caches are swept independently.

  * extenddb manage cache invalidate <scope> ... CLI subcommand,
    matching the API's scope taxonomy 1:1. Uses the same admin Basic-
    auth flow as the rest of the manage CLI; --yes is required for
    `cache invalidate all` (mirrors destroy/import-access-key).

  * /console/cache admin-only page and POST /console/cache/invalidate
    form handler. Both delegate to the same `apply` helper as the
    management API so behavior stays byte-identical regardless of how
    the operator triggers it. scope=all requires typing INVALIDATE in
    the confirmation field.

  * Audit trail: every invocation logs at INFO with admin name + scope +
    selectors. Operators correlate this with the per-subcache
    `invalidations` counter on /management/auth-cache-metrics.

  * Plumbing: new invalidate_all() on CachedAuthzStore (sweeps every
    subcache) and CachedTableKeyInfoStore. ConsoleState gains the
    concrete authz/table-key-info handles already held by
    ManagementState so the console can call the shared helper.

  * Tests: pytest cases covering scope=user end-to-end, scope=account,
    scope=all confirmation gating, missing-selector → 400, IAM-user
    → 403, unauthenticated → 401, console form-submit success page,
    and the typed-confirmation requirement on scope=all.
Addresses items F1–F8 from the PR review on the manual cache
invalidation feature.

  * F1: drop the broken /console/docs/12-auth-authz-cache link from
    the console cache page (design docs are not published through
    the docs manifest).
  * F2: audit log now includes every selector via `selectors = ?s`
    (Debug skips None fields), matching the §6.1 example. Forensic
    value goes from `scope=user account_id=...` to the full target.
  * F3: console parse_scope now drives off Scope's snake_case
    Deserialize impl. New scope variants no longer require a parallel
    edit in the console layer.
  * F4: console success page renders Scope via a new Scope::as_str()
    that returns the canonical snake_case identifier, matching the
    JSON wire format and the design doc table. Audit log uses the
    same helper.
  * F5: AuthzCacheInvalidator and TableKeyInfoCacheInvalidator gain
    `invalidate_all`, and AuthCacheRegistry gains `invalidate_all_caches`.
    `apply` now takes only `&AuthCacheRegistry`; `ConsoleState` no longer
    needs the concrete authz/table-key-info handles. Cleans up the
    awkward triple-handle signature.
  * F6: design doc §6.1 now flags that `scope=account` does not sweep
    `table_key_info` (per-table cache; use `scope=all` or
    `scope=table_key_info` per table).
  * F7: progressive disclosure on the console form. Selector fields
    are now hidden when they don't apply to the chosen scope. The
    page still works without JS — every field renders, the server
    ignores irrelevant ones.
  * F8: test_manual_invalidate_user_increments_counters → renamed to
    test_manual_invalidate_user_forces_refetch and now asserts the
    `misses` counter increases on the next request, proving the
    cached entry was actually dropped (not just that the
    `invalidations` counter ticked, which always does).
@engjan
Collaborator Author

engjan commented May 29, 2026

Added admin break-glass cache invalidation per the suggestion to expose this via the CLI and console.

Surface

  • POST /management/cache/invalidate — admin-authenticated, single endpoint with a scope discriminator and per-scope selectors.
  • extenddb manage cache invalidate ... — CLI subcommand, scope taxonomy 1:1 with the API.
  • /console/cache — admin-only page with a form (progressive disclosure on the selector inputs).

All three paths share the same apply helper, so behavior is identical regardless of how an operator triggers it.

Scopes

┌────────────────┬─────────────────────────────────────┬────────────────────────────────────────────────────────────────┐
│ Scope          │ Selectors                           │ Fans out to                                                    │
├────────────────┼─────────────────────────────────────┼────────────────────────────────────────────────────────────────┤
│ all            │ (requires --yes / typed INVALIDATE) │ every cache                                                    │
├────────────────┼─────────────────────────────────────┼────────────────────────────────────────────────────────────────┤
│ account        │ account_id                          │ authz + credentials for the account                            │
├────────────────┼─────────────────────────────────────┼────────────────────────────────────────────────────────────────┤
│ credential     │ access_key_id                       │ one cached key                                                 │
├────────────────┼─────────────────────────────────────┼────────────────────────────────────────────────────────────────┤
│ user           │ account_id, user_name               │ user_policies + group_policies + boundary + tags + credentials │
├────────────────┼─────────────────────────────────────┼────────────────────────────────────────────────────────────────┤
│ role           │ account_id, role_name               │ role_policies + boundary + tags + sessions + credentials       │
├────────────────┼─────────────────────────────────────┼────────────────────────────────────────────────────────────────┤
│ group_members  │ account_id, user_names[]            │ user_group_policies for each member                            │
├────────────────┼─────────────────────────────────────┼────────────────────────────────────────────────────────────────┤
│ table_key_info │ account_id, table_name              │ one cached table                                               │
├────────────────┼─────────────────────────────────────┼────────────────────────────────────────────────────────────────┤
│ resource_tags  │ arn                                 │ one ARN                                                        │
└────────────────┴─────────────────────────────────────┴────────────────────────────────────────────────────────────────┘

Composite scopes (user, role) match what an operator usually wants ("forget everything about user X"); narrow scopes are still exposed for surgical use.
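
The scope taxonomy in the table could be modeled along these lines. This is a simplified sketch: the commit notes mention a tagged-enum `Scope` with an `as_str()` helper returning snake_case identifiers, but the selector-free variants, `parse`, and `required_selectors` here are hypothetical.

```rust
/// Simplified scope discriminator (the real enum likely carries selectors).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Scope {
    All,
    Account,
    Credential,
    User,
    Role,
    GroupMembers,
    TableKeyInfo,
    ResourceTags,
}

impl Scope {
    pub const ALL_VARIANTS: [Scope; 8] = [
        Scope::All, Scope::Account, Scope::Credential, Scope::User,
        Scope::Role, Scope::GroupMembers, Scope::TableKeyInfo, Scope::ResourceTags,
    ];

    /// Canonical snake_case identifier, matching the table's first column.
    pub fn as_str(self) -> &'static str {
        match self {
            Scope::All => "all",
            Scope::Account => "account",
            Scope::Credential => "credential",
            Scope::User => "user",
            Scope::Role => "role",
            Scope::GroupMembers => "group_members",
            Scope::TableKeyInfo => "table_key_info",
            Scope::ResourceTags => "resource_tags",
        }
    }

    /// Inverse of `as_str`; unknown names yield `None`.
    pub fn parse(name: &str) -> Option<Scope> {
        Scope::ALL_VARIANTS.iter().copied().find(|s| s.as_str() == name)
    }

    /// Selector fields each scope needs, per the table's second column.
    pub fn required_selectors(self) -> &'static [&'static str] {
        match self {
            Scope::All => &[],
            Scope::Account => &["account_id"],
            Scope::Credential => &["access_key_id"],
            Scope::User => &["account_id", "user_name"],
            Scope::Role => &["account_id", "role_name"],
            Scope::GroupMembers => &["account_id", "user_names"],
            Scope::TableKeyInfo => &["account_id", "table_name"],
            Scope::ResourceTags => &["arn"],
        }
    }
}
```

Driving both parsing and display from one mapping is one way to keep the API, CLI, and console from drifting apart when a scope is added.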

Safety

  • scope=all requires explicit confirmation (--yes from CLI, typed token from console). The endpoint refuses without it.
  • Audit log line per invocation (tracing::info! with admin name + scope + selectors), in addition to the existing per-subcache invalidations counters on /management/auth-cache-metrics.
  • Disabled-cache mode is fine: registry methods are no-ops, endpoint returns 200 — the operator's contract is trivially satisfied.
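
The confirmation gate on scope=all could be sketched as below. The `Confirmation` type and `is_confirmed` function are hypothetical, and the exact, case-sensitive match on the typed token is an assumption about the console's behavior.

```rust
/// How the caller confirmed a destructive request (hypothetical type).
pub enum Confirmation<'a> {
    /// The API's `confirm` field or the CLI's `--yes` flag.
    Flag(bool),
    /// Text typed into the console confirmation field.
    TypedToken(&'a str),
}

/// Returns true when the request may proceed.
pub fn is_confirmed(scope: &str, confirmation: &Confirmation<'_>) -> bool {
    if scope != "all" {
        // Only scope=all is gated; narrower scopes need no confirmation.
        return true;
    }
    match confirmation {
        Confirmation::Flag(yes) => *yes,
        Confirmation::TypedToken(text) => *text == "INVALIDATE",
    }
}
```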

Tests

  • 9 new pytest cases covering the API, the auth gating (401/403/admin-only), the missing-selector validation, the confirm requirement on all, the console form, and the typed-confirmation token. All pass against a live server.

Out of scope (deferred)

  • Cross-instance fanout. Manual invalidation only clears the local instance, same as the automatic write-through hooks. Documented in §6.1.
  • Cache content inspection. Different threat model (admin-readable mirror of cached credential material).

Design doc: docs/design/12-auth-authz-cache.md §6.1.

Branch is rebased.

Collaborator

@jcshepherd jcshepherd left a comment

Thanks for clarifying the group invalidation behavior and for adding the ability to manually invalidate caches: we should move ahead with it.


3 participants