feat(auth-cache): SWR cache for credentials, IAM policies, and table-key info #122
engjan wants to merge 5 commits into
    .await;
}

/// Invalidate user-group-policy entries for **every** member of `group_name`.
The comment doesn't jibe with the function: there's no `group_name` in play in the code. Also, why are user policy boundaries and tags not in scope of this invalidation?
Good catch on both points.
On the stale comment: you're right — group_name was leftover from an earlier draft that used invalidate_if over a group key. The function got refactored to take the member list from the
caller, but the doc never caught up. Fixed in the latest commit.
On boundaries and tags: they're intentionally out of scope here, but the function name didn't make that obvious. Looking at every call site, this only fires on group-membership events:
DeleteGroup, PutGroupPolicy, DeleteGroupPolicy. Those affect what policies a user inherits via group membership — i.e. only the user_group_policies cache. A user's permissions boundary and
tags are per-user attributes that group events don't touch; they have their own dedicated invalidation methods (invalidate_user_boundary, invalidate_user_tags) called from the
corresponding per-user mutation sites (PutUserPermissionsBoundary, TagUser, etc.).
So the function was actually doing slightly more than necessary (it was also dropping user_policies, which group events also don't affect). I've:
- Renamed it to invalidate_users_group_policies so the scope is obvious from the name.
- Dropped the redundant invalidate_user_policies call inside the loop.
- Rewritten the doc comment to describe the actual fan-out behavior and explicitly note which per-user attributes are out of scope and why.
See c90a041 (the rename + tightening commit on top of the original feature).
jcshepherd left a comment
One or two minor comments/questions, but I think this looks okay.
One thing to maybe consider as a fast follow is to provide a way, through the console and the 'extenddb manage' CLI, to force a cache invalidation instead of requiring a server bounce.
Please rebase and resolve conflicts; then we can merge. Thanks!
…key info
Adds an end-to-end stale-while-revalidate cache that eliminates the per-
request catalog roundtrip for auth/authz data and table key info, plus
the JSON-parse cost of policy documents. ~6 catalog queries per request
in steady state become ~0; cold-start performance is unchanged.
Architecture
* New extenddb-cache crate: SwrCache<K, V, E> primitive built on moka,
with config validation, single-flight on hard miss, refresh-ahead
on stale hit, panic-safe RAII guards, epoch + Arc-identity guards
against refresh-vs-write races, and a kill-switch pass-through mode
that allocates no underlying moka.
* CachedCredentialStore wraps any CredentialStore. Re-validates
session expires_at on every cache hit so cached ASIA* sessions
cannot outlive their issued lifetime.
* CachedAuthzStore wraps any AuthorizationStore with 9 sub-caches
(user/role policies, group policies, boundaries, principal tags,
resource tags, session data). Cache values are wrapped in Arc so a
hit is one atomic increment instead of an N-element clone. Empty
results route to the short negative_ttl so probes don't poison the
LRU.
* CachedTableKeyInfoStore wraps StorageEngine.table_key_info; maps
TableNotFound to a negative cache, other errors propagate uncached.
* AuthCacheRegistry threads invalidation hooks through ServerComponents
and AppState. Both the management API and the web console call the
same hooks at every IAM mutation, so self-induced changes propagate
instantly within the local instance.
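As a rough illustration of the soft/hard TTL split that the SwrCache primitive is described as using, the sketch below classifies an entry by age. It is a minimal sketch under stated assumptions, not the crate's code: the `Freshness` enum and `classify` function are hypothetical names, and the exact boundary handling (`>=` versus `>`) is an assumption.

```rust
use std::time::{Duration, Instant};

/// How a cached entry should be treated at lookup time.
/// Hypothetical type for illustration; not taken from extenddb-cache.
#[derive(Debug, PartialEq, Eq)]
enum Freshness {
    /// Younger than the soft TTL: serve the cached value as-is.
    Fresh,
    /// Between the soft and hard TTL: serve the cached value and
    /// trigger a background refresh (refresh-ahead on stale hit).
    Stale,
    /// At or past the hard TTL: treat as a miss and load synchronously.
    Expired,
}

/// Classify an entry by its age. Assumes soft_ttl <= hard_ttl, which the
/// config validation described in this commit is said to enforce.
fn classify(
    inserted_at: Instant,
    now: Instant,
    soft_ttl: Duration,
    hard_ttl: Duration,
) -> Freshness {
    // saturating_duration_since avoids a panic if `now` is before `inserted_at`.
    let age = now.saturating_duration_since(inserted_at);
    if age >= hard_ttl {
        Freshness::Expired
    } else if age >= soft_ttl {
        Freshness::Stale
    } else {
        Freshness::Fresh
    }
}
```

With soft and hard TTLs of 30s and 60s, an entry aged 45s would be served from cache while a refresh runs in the background, which is what keeps the steady-state request path off the catalog.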
Configuration ([auth.cache] in extenddb.toml)
  enabled               true     # master kill switch
  ttl_seconds           60       # hard TTL
  soft_ttl_seconds      30       # stale-but-usable threshold
  negative_ttl_seconds  5        # "not found" TTL
  max_entries           10000    # per-cache LRU cap
Validated eagerly at startup; soft_ttl > ttl, negative_ttl > ttl, and
zero values are rejected.
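The validation rules stated above could be sketched as follows. This is an illustrative sketch only: `AuthCacheConfig`, `ConfigError`, and `validate` are hypothetical names, and treating every numeric key (including max_entries) as subject to the zero check is an assumption about what "zero values" covers.

```rust
/// Hypothetical mirror of the [auth.cache] keys listed above.
#[derive(Debug, Clone)]
struct AuthCacheConfig {
    enabled: bool,
    ttl_seconds: u64,
    soft_ttl_seconds: u64,
    negative_ttl_seconds: u64,
    max_entries: u64,
}

impl Default for AuthCacheConfig {
    /// Values as shown in the configuration listing above.
    fn default() -> Self {
        AuthCacheConfig {
            enabled: true,
            ttl_seconds: 60,
            soft_ttl_seconds: 30,
            negative_ttl_seconds: 5,
            max_entries: 10_000,
        }
    }
}

#[derive(Debug, PartialEq, Eq)]
enum ConfigError {
    /// A numeric key was zero; carries the key name.
    ZeroValue(&'static str),
    SoftTtlExceedsTtl,
    NegativeTtlExceedsTtl,
}

impl AuthCacheConfig {
    /// Eager startup check: reject zero values, soft_ttl > ttl,
    /// and negative_ttl > ttl.
    fn validate(&self) -> Result<(), ConfigError> {
        let numeric = [
            ("ttl_seconds", self.ttl_seconds),
            ("soft_ttl_seconds", self.soft_ttl_seconds),
            ("negative_ttl_seconds", self.negative_ttl_seconds),
            ("max_entries", self.max_entries),
        ];
        for &(name, value) in numeric.iter() {
            if value == 0 {
                return Err(ConfigError::ZeroValue(name));
            }
        }
        if self.soft_ttl_seconds > self.ttl_seconds {
            return Err(ConfigError::SoftTtlExceedsTtl);
        }
        if self.negative_ttl_seconds > self.ttl_seconds {
            return Err(ConfigError::NegativeTtlExceedsTtl);
        }
        Ok(())
    }
}
```

Failing at startup rather than at first use means a mistyped TTL surfaces as a boot error instead of as silently wrong cache behavior.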
Invalidation coverage
Management API and web console mutations both call into the registry:
- Create/Delete/Import AccessKey, AssumeRole -> credential cache
- Put/Delete user/role/group policies -> matching policy cache
- Add/RemoveGroupMember -> user_group_policies
- Tag/Untag user/role/resource -> matching tags cache
- Set/Delete user/role boundary -> matching boundary cache
- DeleteUser / DeleteRole / DeleteGroup -> full cascade
- DeleteAccount -> sweeps every cache
Engine handlers invalidate after storage write succeeds:
- Create/Delete/Update Table -> table_key_info (+ resource_tags
on Create/Delete)
- Import/Restore Table -> table_key_info
- Tag/UntagResource -> resource_tags
Observability
/management/auth-cache-metrics (admin-authenticated JSON) exposes per-
cache hit/miss/refresh/invalidation counters, entry counts, and a
pass_through flag distinguishing "cache disabled" from "cache cold."
Out of scope (deferred)
Multi-instance cache invalidation fanout. Designed in detail in
Appendix B of docs/design/12-auth-authz-cache.md (Postgres LISTEN/
NOTIFY + durable backstop) but not implemented in this PR. For
multi-frontend deployments, off-instance changes wait up to
ttl_seconds to propagate; operators tune ttl_seconds against their
IAM-revocation timing requirements.
Public type changes
StoredCredential gained a public field expires_at:
Option<OffsetDateTime>. None for AKIA* keys, Some for ASIA* sessions.
Required to enforce session expiry on the cache-hit path.
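The cache-hit expiry check that this field enables might look roughly like the sketch below. It is an assumption-laden illustration: `usable_on_hit` is a hypothetical helper, `std::time::SystemTime` stands in for the `OffsetDateTime` used by the real field so the example stays std-only, and treating "now equals expires_at" as expired is an assumption.

```rust
use std::time::SystemTime;

/// Simplified stand-in for the real StoredCredential.
#[derive(Debug, Clone)]
struct StoredCredential {
    access_key_id: String,
    /// None for long-lived AKIA* keys, Some for ASIA* sessions.
    /// The PR uses Option<OffsetDateTime>; SystemTime is a substitution here.
    expires_at: Option<SystemTime>,
}

/// Returns the cached credential only if it is still within its issued
/// lifetime. A None result would be handled like a cache miss by the caller.
fn usable_on_hit(cached: &StoredCredential, now: SystemTime) -> Option<&StoredCredential> {
    match cached.expires_at {
        // Session credential whose lifetime has elapsed: do not serve it,
        // even though the cache entry itself has not reached its TTL.
        Some(expiry) if now >= expiry => None,
        // Long-lived key, or a session that is still valid.
        _ => Some(cached),
    }
}
```

The point of the check is that the cache TTL and the session lifetime are independent clocks; without it, a session issued shortly before expiry could be served for up to ttl_seconds past its expires_at.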
Tests
- 27 unit tests in extenddb-cache covering single-flight, soft/hard
TTL, negative caching, refresh-vs-invalidate / refresh-vs-hard-miss
races, panic safety, LRU bounds, pass-through.
- Round-trip + invalidation tests for credential, authz, key-info.
- tests/test_cache_coherence.py: end-to-end mgmt-API -> SigV4 round-
trips against a live server.
- tests/test_console_cache_coherence.py: same invariants driven via
the web console (form posts with CSRF) so management/console
parity stays locked in.
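The refresh-vs-invalidate race named in the unit-test list can be illustrated with an epoch counter. The sketch below is a simplified model, not the SwrCache implementation: `EpochGuardedCache` and its methods are hypothetical, it uses a single global epoch instead of whatever granularity the real crate uses, and it omits the Arc-identity guard entirely.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

/// Toy cache whose refreshes are discarded if an invalidation
/// happened after the refresh started.
struct EpochGuardedCache {
    epoch: AtomicU64,
    entries: Mutex<HashMap<String, String>>,
}

impl EpochGuardedCache {
    fn new() -> Self {
        EpochGuardedCache {
            epoch: AtomicU64::new(0),
            entries: Mutex::new(HashMap::new()),
        }
    }

    /// A refresh records the epoch before it starts loading from the store.
    fn begin_refresh(&self) -> u64 {
        self.epoch.load(Ordering::Acquire)
    }

    /// Invalidation bumps the epoch and drops the entry under the same lock.
    fn invalidate(&self, key: &str) {
        let mut entries = self.entries.lock().unwrap();
        self.epoch.fetch_add(1, Ordering::AcqRel);
        entries.remove(key);
    }

    /// Installs the refreshed value only if no invalidation ran since
    /// `begin_refresh`. Returns whether the value was installed.
    fn complete_refresh(&self, key: &str, value: String, started_epoch: u64) -> bool {
        let mut entries = self.entries.lock().unwrap();
        if self.epoch.load(Ordering::Acquire) != started_epoch {
            // The value was loaded before a write landed; drop it.
            return false;
        }
        entries.insert(key.to_string(), value);
        true
    }

    fn get(&self, key: &str) -> Option<String> {
        let entries = self.entries.lock().unwrap();
        entries.get(key).cloned()
    }
}
```

Without a guard of this kind, a background refresh that read the catalog just before a PutUserPolicy could reinstall the old policy right after the write-through hook evicted it.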
Verification
cargo fmt --check, clippy --all-targets -D warnings, cargo test
--workspace all clean. Full pytest against a freshly-init'd server:
354 passed, 6 failed (pre-existing concurrency + import_export
failures unrelated to this work).
Design doc: docs/design/12-auth-authz-cache.md.
Relationship to "No In-Process State" rule: documented as a deliberate,
scoped revision in docs/design/11-high-availability.md §D1.
…p_policies
Addresses PR review feedback. The previous name and stale doc comment
implied broader scope (invalidating all per-user attributes) than the
function actually had. Every caller is a group-membership event
(DeleteGroup, PutGroupPolicy, DeleteGroupPolicy) that only affects the
user_group_policies cache; user_policies, user_boundary, and user_tags
are per-user attributes untouched by group events and have their own
dedicated invalidation methods called from the corresponding per-user
mutation sites.
* Rename the method on CachedAuthzStore and the
AuthzCacheInvalidator trait + AuthCacheRegistry wrapper.
* Drop the redundant invalidate_user_policies call inside the loop.
* Rewrite the doc comment to describe the actual fan-out behavior
and explicitly note which per-user attributes are out of scope
and why.
* Update all four call sites (management iam_group/iam_policy,
console group_pages/policy_pages).
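The narrowed fan-out described in this commit can be sketched as below. This is an illustration under assumptions: the `AuthzCaches` struct, the `(account_id, user_name)` key shape, and the method signature are hypothetical; only the method name and the rule that user_policies stays untouched come from the commit text.

```rust
use std::collections::HashMap;

/// (account_id, user_name); a hypothetical key shape for illustration.
type UserKey = (String, String);

/// Two of the per-user sub-caches, modeled as plain maps.
#[derive(Default)]
struct AuthzCaches {
    user_group_policies: HashMap<UserKey, Vec<String>>,
    user_policies: HashMap<UserKey, Vec<String>>,
}

impl AuthzCaches {
    /// Drops the group-policy entry for each member supplied by the caller
    /// and returns how many entries were actually removed. Per-user
    /// attributes (user_policies, boundary, tags) are deliberately left
    /// alone because group events do not change them.
    fn invalidate_users_group_policies(&mut self, account_id: &str, members: &[&str]) -> usize {
        let mut removed = 0;
        for member in members {
            let key = (account_id.to_string(), member.to_string());
            if self.user_group_policies.remove(&key).is_some() {
                removed += 1;
            }
        }
        removed
    }
}
```

Taking the member list from the caller, rather than scanning by group key, keeps the invalidation proportional to the group's size and makes the scope visible at each call site.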
…console page
Adds manual cache invalidation as a complement to the automatic write-
through hooks. Operators reach for this when off-instance changes have
not yet expired, when a write-through hook is suspected of being broken,
or when a test needs a deterministic flush. Documented in
docs/design/12-auth-authz-cache.md §6.1.
* POST /management/cache/invalidate — admin-authenticated, single
endpoint with a tagged-enum scope discriminator. Supported scopes:
all, account, credential, user, role, group_members, table_key_info,
resource_tags. Composite scopes (user, role) report each subcache
they touched in the response. scope=all requires confirm:true to
prevent accidental flushes; the credential, authz, and table-key-
info caches are swept independently.
* extenddb manage cache invalidate <scope> ... CLI subcommand,
matching the API's scope taxonomy 1:1. Uses the same admin Basic-
auth flow as the rest of the manage CLI; --yes is required for
`cache invalidate all` (mirrors destroy/import-access-key).
* /console/cache admin-only page and POST /console/cache/invalidate
form handler. Both delegate to the same `apply` helper as the
management API so behavior stays byte-identical regardless of how
the operator triggers it. scope=all requires typing INVALIDATE in
the confirmation field.
* Audit trail: every invocation logs at INFO with admin name + scope +
selectors. Operators correlate this with the per-subcache
`invalidations` counter on /management/auth-cache-metrics.
* Plumbing: new invalidate_all() on CachedAuthzStore (sweeps every
subcache) and CachedTableKeyInfoStore. ConsoleState gains the
concrete authz/table-key-info handles already held by
ManagementState so the console can call the shared helper.
* Tests: pytest cases covering scope=user end-to-end, scope=account,
scope=all confirmation gating, missing-selector → 400, IAM-user
→ 403, unauthenticated → 401, console form-submit success page,
and the typed-confirmation requirement on scope=all.
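The scope taxonomy and the confirmation gate on scope=all could be modeled roughly as follows. This is a std-only sketch under assumptions: the real endpoint is described as using a tagged-enum discriminator (presumably via serde), whereas `Scope::parse`, `InvalidateError`, and `check_request` here are hypothetical stand-ins, and selector validation is omitted.

```rust
/// The eight scopes listed in this commit, as a plain enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Scope {
    All,
    Account,
    Credential,
    User,
    Role,
    GroupMembers,
    TableKeyInfo,
    ResourceTags,
}

impl Scope {
    /// Canonical snake_case identifier for each scope.
    fn as_str(self) -> &'static str {
        match self {
            Scope::All => "all",
            Scope::Account => "account",
            Scope::Credential => "credential",
            Scope::User => "user",
            Scope::Role => "role",
            Scope::GroupMembers => "group_members",
            Scope::TableKeyInfo => "table_key_info",
            Scope::ResourceTags => "resource_tags",
        }
    }

    /// Inverse of as_str; hand-written here in place of a Deserialize impl.
    fn parse(s: &str) -> Option<Scope> {
        const ALL: [Scope; 8] = [
            Scope::All,
            Scope::Account,
            Scope::Credential,
            Scope::User,
            Scope::Role,
            Scope::GroupMembers,
            Scope::TableKeyInfo,
            Scope::ResourceTags,
        ];
        ALL.iter().copied().find(|scope| scope.as_str() == s)
    }
}

#[derive(Debug, PartialEq, Eq)]
enum InvalidateError {
    UnknownScope,
    /// scope=all was requested without explicit confirmation.
    ConfirmationRequired,
}

/// Gate applied before any cache is touched.
fn check_request(scope: &str, confirm: bool) -> Result<Scope, InvalidateError> {
    let scope = Scope::parse(scope).ok_or(InvalidateError::UnknownScope)?;
    if scope == Scope::All && !confirm {
        return Err(InvalidateError::ConfirmationRequired);
    }
    Ok(scope)
}
```

Driving both parsing and display from one enum is what lets the API, the CLI, and the console share the same scope names without a parallel list in each layer.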
Addresses items F1–F8 from the PR review on the manual cache
invalidation feature.
* F1: drop the broken /console/docs/12-auth-authz-cache link from
the console cache page (design docs are not published through
the docs manifest).
* F2: audit log now includes every selector via `selectors = ?s`
(Debug skips None fields), matching the §6.1 example. Forensic
value goes from `scope=user account_id=...` to the full target.
* F3: console parse_scope now drives off Scope's snake_case
Deserialize impl. New scope variants no longer require a parallel
edit in the console layer.
* F4: console success page renders Scope via a new Scope::as_str()
that returns the canonical snake_case identifier, matching the
JSON wire format and the design doc table. Audit log uses the
same helper.
* F5: AuthzCacheInvalidator and TableKeyInfoCacheInvalidator gain
`invalidate_all`, and AuthCacheRegistry gains `invalidate_all_caches`.
`apply` now takes only `&AuthCacheRegistry`; `ConsoleState` no longer
needs the concrete authz/table-key-info handles. Cleans up the
awkward triple-handle signature.
* F6: design doc §6.1 now flags that `scope=account` does not sweep
`table_key_info` (per-table cache; use `scope=all` or
`scope=table_key_info` per table).
* F7: progressive disclosure on the console form. Selector fields
are now hidden when they don't apply to the chosen scope. The
page still works without JS — every field renders, the server
ignores irrelevant ones.
* F8: test_manual_invalidate_user_increments_counters → renamed to
test_manual_invalidate_user_forces_refetch and now asserts the
`misses` counter increases on the next request, proving the
cached entry was actually dropped (not just that the
`invalidations` counter ticked, which always does).
d640a77 to c90a041
Added admin break-glass cache invalidation per the suggestion to expose this via the CLI and console.
Surface
All three paths share the same apply helper, so behavior is identical regardless of how an operator triggers it.
Scopes
Composite scopes (user, role) match what an operator usually wants ("forget everything about user X"); narrow scopes are still exposed for surgical use.
Safety
Tests
Out of scope (deferred)
Design doc: docs/design/12-auth-authz-cache.md §6.1. Branch is rebased.
jcshepherd left a comment
Thanks for clarifying the group invalidation behavior and for adding the ability to manually invalidate caches: we should move ahead with it.
Why
Each DynamoDB request issues ~6 catalog queries before dispatch can begin: credential lookup, identity policies, group policies, permissions boundary, principal tags, resource tags, plus a 7th lookup for `TableKeyInfo` on item-level operations. Queries 2–6 fan out concurrently via `tokio::try_join!`, but they each consume a catalog-pool connection and reparse policy JSON on every request. Under load this is the dominant pre-dispatch overhead, and no pool size fixes it, because the bottleneck is the round-trip itself.
This PR layers an in-memory stale-while-revalidate cache over each of those reads to drive steady-state catalog roundtrips to ~0 while keeping the pre-cache behavior reachable via a kill switch.
What
Adds an `extenddb-cache` crate (`SwrCache<K, V, E>` primitive on top of moka) plus three wrappers:
- `CachedCredentialStore` wraps any `CredentialStore`. Re-validates `expires_at` on every cache hit so cached ASIA* sessions cannot outlive their issued lifetime.
- `CachedAuthzStore` wraps any `AuthorizationStore` with 9 sub-caches (user/role policies, group policies, boundaries, principal tags, resource tags, session data). Cache values are `Arc`-wrapped so a hit is one atomic increment instead of an N-element clone.
- `CachedTableKeyInfoStore` wraps `StorageEngine.table_key_info` with negative caching for `TableNotFound`.
A new `AuthCacheRegistry` threads invalidation hooks through `ServerComponents` and `AppState` so every IAM mutation, from both the management API and the web console, invalidates the matching cache entry write-through. Self-induced changes propagate instantly within the local instance.
Configuration (`[auth.cache]` in `extenddb.toml`): `enabled`, `ttl_seconds`, `soft_ttl_seconds`, `negative_ttl_seconds`, and `max_entries`.
When `enabled = false`, the wrappers operate in pass-through mode and forward every request directly to the underlying store. The pass-through path allocates no underlying moka instance (`Option<MokaCache>`), so it is a true kill switch, not a configured-to-expire-instantly cache.
Tradeoffs
Bounded staleness for IAM data. AWS IAM itself documents policy-propagation delays of ~30s–2min; we expose the same trade-off as `auth.cache.ttl_seconds`, configurable. This is a deliberate, scoped revision of the project's "No In-Process State" rule (`docs/design/11-high-availability.md` §D1); see that doc for the full discussion. The data path remains uncached.
No verdict caching. We cache the inputs to authorization (policies, tags), not `Allow`/`Deny` decisions. Verdict caching has subtle correctness issues with conditions that depend on per-request context (`aws:CurrentTime`, IP, leading keys) and would need invalidation on every condition-relevant input change. Re-evaluating policies in CPU per request is essentially free once the inputs are cached.
Async fanout invalidation has a small window. Single-key invalidations (`DeleteAccessKey`, `PutUserPolicy`) drop the cache slot synchronously. Fanout invalidations (`DeleteAccount`, `DeleteRole` session sweep, `DeleteGroup` member fanout) use moka's `invalidate_entries_if`, which evaluates predicates asynchronously, so there is a brief (~ms) window between the API responding success and the matching entries actually being evicted. Documented in the operator guide; for hard cutover (e.g. revoking a compromised key), prefer the single-key path.
Memory is bounded. Default 10,000 entries per cache (≈90,000 across the 9-cache authz block, plus 10,000 each for the credential and table-key-info caches), LRU-evicted under pressure. Empty results from tag and policy loaders cache at the short `negative_ttl` so probes don't fill the LRU with empty positive entries.
Out of scope (deferred)
Multi-instance cache invalidation. In a multi-frontend deployment, off-instance changes (a separate process modifying the catalog directly, or a different instance) wait up to `ttl_seconds` to propagate locally. Cross-process invalidation fanout is designed in detail in Appendix B of `docs/design/12-auth-authz-cache.md` (Postgres `LISTEN`/`NOTIFY` + durable backstop table for reconnect catch-up) but not implemented in this PR. The implementation is gated on a separate review pass: failure modes, schema, ordering guarantees, and operational backpressure semantics all need engineering input before the change lands. For now, multi-instance deployments must accept `auth.cache.ttl_seconds` as the worst-case lag; tune accordingly against IAM-revocation timing requirements.
Public type changes
`StoredCredential` gains one public field, `expires_at: Option<OffsetDateTime>`: `None` for long-lived AKIA* credentials, `Some` for ASIA* sessions. Required to enforce session expiry on the cache-hit path. No out-of-tree consumers of `extenddb-auth` exist today, but third-party `CredentialStore` implementations (if any) would need to populate this field.
Observability
`/management/auth-cache-metrics` (admin-authenticated JSON) exposes per-cache hit / stale-hit / miss / negative-hit / refresh-success / refresh-failure / refresh-skipped-inflight / refresh-dropped-epoch / invalidation counters and entry counts. Each cache also reports a `pass_through: bool` so operators can distinguish "cache disabled" from "cache cold."
The endpoint is admin-only because cache hit-rate and entry-count signals are workload-fingerprinting telemetry; operators who want Prometheus scraping should expose this via reverse proxy. (A future PR could mirror these into `/metrics`.)
Test plan
- `cargo fmt --all -- --check`
- `cargo clippy --release --all-targets -- -D warnings`
- `cargo test --workspace`: 27 cache crate tests, 132 auth tests, 18 server tests, 219 core, 16 engine, 3 storage; all green.
- `extenddb init` + `serve` against a clean Postgres; `/health` and `/management/auth-cache-metrics` smoke-tested.
- `pytest -k "test_abac or test_cache_coherence or test_console_cache_coherence"`: 17/17 pass.
- `pytest`: 354 passed, 6 failed. The 6 failures are pre-existing concurrency + import/export issues unrelated to this work and are unchanged from `main`.
New tests added in this PR:
- `crates/cache/src/tests.rs`: 27 unit tests covering single-flight, soft/hard TTL, negative caching, refresh-vs-invalidate races, refresh-vs-hard-miss races (Arc identity guard), panic safety, LRU bounds, pass-through, config validation.
- `credential_cache.rs` and `authz_cache.rs`.
- `tests/test_cache_coherence.py`: end-to-end mgmt-API → SigV4 round-trips against a live server.
- `tests/test_console_cache_coherence.py`: same invariants driven via the web console (form posts with CSRF) so management / console parity stays locked in against future refactor drift.
Documentation
- `docs/design/12-auth-authz-cache.md`: full design (motivation, SWR refresh strategy, library choice, cache key shapes, error semantics, metrics, operator guide).
- `docs/design/11-high-availability.md` §D1 / §A3: D1's "No In-Process State" rule reconciled with the new cache (scoped revision for IAM data only; data path remains uncached).
- `docs/manuals/05-admin-guide.md`: `[auth.cache]` config reference + invalidation timing notes (sync vs. async + off-instance).
- `docs/getting-started.md`: operator-level summary.