authorizer: cache PreparedEvalQuery per (policy path, decisions) #724

Open
Jura-Z wants to merge 1 commit into aserto-dev:main from Jura-Z:izakipnyi/cache-prepared-eval-query

@Jura-Z Jura-Z commented May 21, 2026

authorizer: cache PreparedEvalQuery per (policy path, decisions)

Summary

Each Is() call currently rebuilds a rego.PreparedEvalQuery from the runtime's
compiler and store, including ast.ParseBody / ast.ParseRef for the query body.
The prepared query is a pure function of the policy path, the decisions list, and
the active OPA compiler — all stable between bundle reloads — so we can memoize it.

This PR adds a small preparedQueryCache (sync.Map + singleflight) on
AuthorizerServer that holds prepared queries keyed by (policy_path, decisions).
Cache invalidation is wired via plugins.Manager.RegisterCompilerTrigger, which
fires on bundle activation / discovery update / any compiler rotation, so a policy
change is reflected on the next request.
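
The get-or-prepare pattern described above can be sketched with the standard library alone. This is a minimal illustration, not the PR's code: `memoCache`, `getOrPrepare`, `invalidate`, and `countFactoryCalls` are hypothetical names, the value type is generic so the sketch does not depend on OPA, and a small mutex-guarded in-flight map stands in for `singleflight.Group`.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// call tracks one in-flight factory invocation so that concurrent misses
// on the same key wait for it instead of repeating the work.
type call[V any] struct {
	done chan struct{}
	val  V
	err  error
}

// memoCache is a hypothetical stand-in for preparedQueryCache.
type memoCache[V any] struct {
	entries  sync.Map // key (string) -> V, read-mostly
	mu       sync.Mutex
	inflight map[string]*call[V]
}

func newMemoCache[V any]() *memoCache[V] {
	return &memoCache[V]{inflight: make(map[string]*call[V])}
}

func (c *memoCache[V]) getOrPrepare(key string, factory func() (V, error)) (V, error) {
	if v, ok := c.entries.Load(key); ok { // fast path: no lock on a hit
		return v.(V), nil
	}
	c.mu.Lock()
	if v, ok := c.entries.Load(key); ok { // re-check: a leader may have just finished
		c.mu.Unlock()
		return v.(V), nil
	}
	if cl, ok := c.inflight[key]; ok { // someone else is already preparing this key
		c.mu.Unlock()
		<-cl.done
		return cl.val, cl.err
	}
	cl := &call[V]{done: make(chan struct{})}
	c.inflight[key] = cl
	c.mu.Unlock()

	cl.val, cl.err = factory()

	c.mu.Lock()
	if cl.err == nil { // errors are returned to callers but never cached
		c.entries.Store(key, cl.val)
	}
	delete(c.inflight, key)
	c.mu.Unlock()
	close(cl.done)
	return cl.val, cl.err
}

// invalidate drops every entry; it would be called from the compiler trigger.
func (c *memoCache[V]) invalidate() {
	c.entries.Range(func(k, _ any) bool {
		c.entries.Delete(k)
		return true
	})
}

// countFactoryCalls reports how often the factory runs when n goroutines
// miss on the same key at once.
func countFactoryCalls(n int) int64 {
	c := newMemoCache[string]()
	var calls atomic.Int64
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_, _ = c.getOrPrepare("k", func() (string, error) {
				calls.Add(1)
				return "prepared", nil
			})
		}()
	}
	wg.Wait()
	return calls.Load()
}

func main() {
	fmt.Println("factory calls for 200 concurrent misses:", countFactoryCalls(200))
}
```

One caveat of this simplified version: a factory call that is still running when `invalidate` fires can store a value built against the old state, so a production cache would need to guard against that (for example with a generation counter).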

Why

Under sustained concurrent load, every goroutine inside Is() repeats the same
parse + plan work and contends on the OPA compiler's internal locks. The cache
removes that duplicated work entirely on the hot path — typical Topaz deployments
see only a handful of unique (path, decisions) tuples across millions of calls,
so the cache is read-mostly and the lifetime of each entry is the lifetime of the
bundle.

Benchmark

Stub iap.egress.http policy (allow := true for two test inputs, no
directory lookups), native Go gRPC client to port 8282 (bypasses the REST
gateway), Apple M-series Mac (12 P-cores). Each row is a 3-second
fixed-time run; numbers below are the median of 3 back-to-back runs after a
100-call warmup. p50 latency is per-call, measured client-side.

    conc | before rps | after rps | gain | before p50 | after p50
    -----+------------+-----------+------+------------+----------
       1 |     7,803  |   11,517  | +48% |     113 µs |     77 µs
       2 |    12,885  |   19,545  | +52% |     121 µs |     83 µs
       4 |    18,474  |   28,831  | +56% |     153 µs |    110 µs
       8 |    23,880  |   37,572  | +57% |     284 µs |    173 µs
      16 |    29,929  |   46,980  | +57% |     522 µs |    301 µs
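
For readers who want to reproduce the methodology, the following is a rough sketch of a fixed-time load loop that reports throughput and client-side p50, plus the median-of-runs and gain arithmetic used in the table. It is not the harness used for the numbers above; `runFixedTime`, `median`, and `gainPercent` are hypothetical helpers, and the stub call stands in for a real gRPC `Is()` request.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
	"time"
)

// runFixedTime drives call from conc goroutines for roughly d and returns
// throughput in calls per second and the median per-call latency.
func runFixedTime(conc int, d time.Duration, call func()) (rps float64, p50 time.Duration) {
	start := time.Now()
	deadline := start.Add(d)
	perWorker := make([][]time.Duration, conc) // one slice per worker, no sharing
	var wg sync.WaitGroup
	for w := 0; w < conc; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for time.Now().Before(deadline) {
				t0 := time.Now()
				call()
				perWorker[w] = append(perWorker[w], time.Since(t0))
			}
		}(w)
	}
	wg.Wait()
	elapsed := time.Since(start)

	var all []time.Duration
	for _, l := range perWorker {
		all = append(all, l...)
	}
	if len(all) == 0 {
		return 0, 0
	}
	sort.Slice(all, func(i, j int) bool { return all[i] < all[j] })
	return float64(len(all)) / elapsed.Seconds(), all[len(all)/2]
}

// median returns the middle element of a sorted copy of xs
// (intended for an odd number of runs, e.g. 3).
func median(xs []float64) float64 {
	s := append([]float64(nil), xs...)
	sort.Float64s(s)
	return s[len(s)/2]
}

// gainPercent is the relative throughput change in percent.
func gainPercent(before, after float64) float64 {
	return (after/before - 1) * 100
}

func main() {
	stub := func() { time.Sleep(100 * time.Microsecond) } // stands in for an Is() call
	var runs []float64
	for i := 0; i < 3; i++ {
		rps, p50 := runFixedTime(4, 50*time.Millisecond, stub)
		fmt.Printf("run %d: %.0f rps, p50 %v\n", i+1, rps, p50)
		runs = append(runs, rps)
	}
	fmt.Printf("median rps: %.0f\n", median(runs))
	fmt.Printf("gain for the conc=1 row: %+.0f%%\n", gainPercent(7803, 11517))
}
```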

Implementation notes

  • preparedQueryCache.entries is a sync.Map keyed by a stable string
    derived from the policy path and the ordered decisions list (separators are
    ASCII \x1f / \x1e, neither of which appear in valid Rego identifiers).
  • singleflight.Group collapses concurrent misses on the same key into one
    PrepareForEval call, so a thundering herd on first use of a new
    (path, decisions) tuple doesn't multiply work.
  • ensureCompilerWatcher registers the invalidation hook lazily on first
    use — exactly once per *plugins.Manager. The hook drops the entire cache
    when the compiler is rotated; bundle reloads are rare relative to Is()
    rate, so we don't try to be more precise.
  • Errors from the factory (e.g. PrepareForEval failures on a malformed
    request) are propagated and not cached, so a transient failure doesn't
    poison the cache.
  • The factory closure captures policyPath and decisions by reference; the
    shared parsing helpers in aserto-dev/runtime (ValidateRule,
    ValidateQuery) are still invoked, just inside the factory rather than on
    every request.
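
As an illustration of the key derivation in the first bullet, here is a minimal sketch. The PR text names the two separator bytes but not which one goes where, so the assignment below (`\x1e` between the path and the decisions, `\x1f` between decisions) is an assumption, and `cacheKey` is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	recordSep = "\x1e" // assumed: separates the policy path from the decisions
	unitSep   = "\x1f" // assumed: separates individual decisions
)

// cacheKey builds a string key from the policy path and the decisions in
// the order given, so ["a","b"] and ["b","a"] produce different keys.
func cacheKey(policyPath string, decisions []string) string {
	var b strings.Builder
	b.WriteString(policyPath)
	b.WriteString(recordSep)
	for i, d := range decisions {
		if i > 0 {
			b.WriteString(unitSep)
		}
		b.WriteString(d)
	}
	return b.String()
}

func main() {
	// %q makes the control-character separators visible.
	fmt.Printf("%q\n", cacheKey("iap.egress.http", []string{"allowed", "denied"}))
	fmt.Printf("%q\n", cacheKey("iap.egress.http", []string{"denied", "allowed"}))
}
```

Because neither separator can occur in a Rego identifier, a path such as `p.a` with decisions `[b]` cannot collide with path `p` and decisions `[a, b]`, which plain concatenation would not guarantee.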

Tests

  • TestCacheKey: key derivation is order-sensitive on the decisions list,
    distinguishes paths and decision lists correctly, and is stable for inputs
    that contain edge-case characters.
  • TestGetOrPrepare_CachesAndDedupes: factory runs at most ~once across 200
    concurrent goroutines for the same key; subsequent calls hit the cache
    without invoking the factory; a fresh key still triggers exactly one
    factory call.
  • TestGetOrPrepare_FactoryError: factory errors propagate to the caller
    and are not cached.

All three pass. Existing tests in topazd/authorizer/impl/ still pass.

Compatibility

  • No API change.
  • No config change.
  • No new dependency: golang.org/x/sync/singleflight is already a transitive
    dep through OPA / grpc-go.

What this PR does NOT solve

This PR removes one of several CPU costs in the per-call path. It does
not deliver linear concurrency scaling — Topaz still scales sub-linearly
under N-way concurrent load. To document this honestly for reviewers, here
is the full investigation that informed this PR:

Profiling under sustained 16-way load (after this PR's cache)

CPU breakdown from go tool pprof on a 10s sample at 47k rps (the categories
evidently overlap, since the percentages sum to more than 100%):

    category                                      | CPU %
    ----------------------------------------------+------
    Go runtime scheduler (schedule/findRunnable)  |   34%
    gRPC server framework + middlewares           |   23%
    Cores parking/waking (runtime.usleep)         |   22%
    GC                                            |   16%
    Syscalls (kevent/netpoll)                     |   11%
    TLS write                                     |    9%
    OPA PreparedEvalQuery.Eval (the actual work)  |    9%
    Topaz authorizer code                         |   11%

The mutex profile shows 96% of contention in runtime.unlock / scheduler
internals; no application-level lock is the bottleneck after this fix.

Architectural ceiling tested empirically

To confirm the limit, I tried the obvious additional optimizations and
measured. With this PR as baseline (47,494 rps at conc=16):

    variant                                               | conc=16 rps | gain over PR
    ------------------------------------------------------+-------------+-------------
    this PR alone (cache)                                 |     47,494  |
    + plaintext gRPC (no TLS on loopback)                 |     55,067  |         +16%
    + strip RequestID/Tracing/Error/Prometheus            |     57,142  |         +20%
    + GOMAXPROCS=4 (vs default 18)                        |     68,552  |         +44%
    2 separate topazd processes, conc=16 each, summed     |     84,245  |         +77%

GOMAXPROCS=4 outperforming the default is striking — the Go runtime
scheduler thrashes when given more P's than the workload's effective
parallelism, since each P repeatedly tries to steal nonexistent work. This
is independent of this PR; tuning GOMAXPROCS is a runtime knob, not a
code change.
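
The Go runtime reads the GOMAXPROCS environment variable at startup, so the setting can be applied without touching code. For completeness, the programmatic equivalent is sketched below; `capProcs` is a hypothetical helper, and the value 4 is simply the one measured above, not a general recommendation.

```go
package main

import (
	"fmt"
	"runtime"
)

// capProcs lowers GOMAXPROCS to limit when limit is positive and below the
// current setting. It returns the previous and the resulting values.
func capProcs(limit int) (prev, now int) {
	prev = runtime.GOMAXPROCS(0) // an argument of 0 only queries the setting
	if limit > 0 && limit < prev {
		runtime.GOMAXPROCS(limit)
	}
	return prev, runtime.GOMAXPROCS(0)
}

func main() {
	prev, now := capProcs(4)
	fmt.Printf("GOMAXPROCS: %d -> %d\n", prev, now)
}
```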

The 2-process number confirms the per-process ceiling is real: doubling
the process count nearly doubles total throughput. Topaz scales out
horizontally, not up. The within-process ceiling is dominated by Go
scheduler / GC / netpoll cost at this allocation rate, not by any lock or
serialization in Topaz code.

Suggestions for further work (not in this PR)

If the maintainers want to push the per-process ceiling higher, the
remaining levers are:

  1. Skip middlewares when their effects are unobserved. E.g. don't build
    the per-request zerolog instance in TracingMiddleware when level >
    Trace; don't generate prometheus exemplars when no scraper is attached.
  2. sync.Pool the protobuf IsRequest/IsResponse and the input
    map[string]any. Profile shows ~16% CPU in GC; pooling could cut
    that meaningfully.
  3. Document GOMAXPROCS tuning for high-throughput deployments — it's
    a free 22% on the same hardware.
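
Suggestion 2 could look roughly like the sketch below for the input map. The names `inputPool`, `getInput`, and `putInput` are hypothetical, and the approach is only safe under the assumption that nothing retains the map after the request has been answered.

```go
package main

import (
	"fmt"
	"sync"
)

// inputPool recycles input maps between requests to reduce allocations.
var inputPool = sync.Pool{
	New: func() any { return make(map[string]any, 8) },
}

// getInput returns an empty map, either recycled or freshly allocated.
func getInput() map[string]any {
	return inputPool.Get().(map[string]any)
}

// putInput clears the map and hands it back to the pool. The caller must
// not use m after this call.
func putInput(m map[string]any) {
	for k := range m {
		delete(m, k)
	}
	inputPool.Put(m)
}

func main() {
	in := getInput()
	in["user"] = "alice" // illustrative field, not Topaz's actual input schema
	fmt.Println("entries while in use:", len(in))
	putInput(in)

	next := getInput()
	fmt.Println("entries after reuse:", len(next))
	putInput(next)
}
```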

Each of those is a separate PR if the maintainers are interested.

Each Is() call rebuilds a rego.PreparedEvalQuery from the runtime's
compiler and store, including ast.ParseBody / ast.ParseRef for the
query body. Under concurrent load this fights the OPA compiler's
internal locks and burns CPU on duplicate parsing.

Memoize the prepared query keyed on (policy_context.path,
policy_context.decisions). The compiler/store/policy bundle are stable
between bundle reloads, so the prepared query is reusable. Invalidation
is wired up via plugins.Manager.RegisterCompilerTrigger, which fires
on bundle activation / discovery update / any compiler rotation.
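
A sketch of how an invalidation hook can be registered lazily and exactly once per manager follows. It does not use OPA's types: `triggerRegistry` is a hypothetical, simplified interface standing in for the part of plugins.Manager involved (the real trigger callback receives a storage transaction), and `hookGuard` and `fakeManager` exist only for this illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// triggerRegistry is a simplified, hypothetical view of a manager that can
// call back when its compiler is rotated.
type triggerRegistry interface {
	RegisterTrigger(fn func())
}

// hookGuard remembers which managers already have the hook installed.
type hookGuard struct {
	mu         sync.Mutex
	registered map[triggerRegistry]bool
}

// ensure installs onRotate on r the first time it sees r and does nothing
// on later calls, so it is cheap to invoke on every request.
func (g *hookGuard) ensure(r triggerRegistry, onRotate func()) {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.registered == nil {
		g.registered = make(map[triggerRegistry]bool)
	}
	if g.registered[r] {
		return
	}
	g.registered[r] = true
	r.RegisterTrigger(onRotate)
}

// fakeManager records hooks and lets the example simulate a bundle reload.
type fakeManager struct{ hooks []func() }

func (m *fakeManager) RegisterTrigger(fn func()) { m.hooks = append(m.hooks, fn) }

func (m *fakeManager) rotate() {
	for _, h := range m.hooks {
		h()
	}
}

func main() {
	var g hookGuard
	m := &fakeManager{}
	drops := 0
	for i := 0; i < 3; i++ { // e.g. three requests arriving before any reload
		g.ensure(m, func() { drops++ }) // the hook would drop the whole cache
	}
	m.rotate()
	fmt.Println("hooks registered:", len(m.hooks), "cache drops:", drops)
}
```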

Concurrency-safe: sync.Map for read-mostly access, golang.org/x/sync
singleflight to collapse concurrent misses on the same key into one
PrepareForEval call.

Measured against a stub iap.egress.http policy, native Go gRPC client
(port 8282), Apple M-series Mac. Median of 3 back-to-back 3-second runs
after a 100-call warmup:

    conc | before rps | after rps | gain
    -----+------------+-----------+------
       1 |     7,803  |   11,517  | +48%
       4 |    18,474  |   28,831  | +56%
       8 |    23,880  |   37,572  | +57%
      16 |    29,929  |   46,980  | +57%

p50 latency drops correspondingly (e.g. conc=1: 113 µs -> 77 µs;
conc=16: 522 µs -> 301 µs). Decisions are byte-identical to the
unmodified path.

Tests:
- TestCacheKey: key derivation is order-sensitive on the decisions
  list and stable across runs.
- TestGetOrPrepare_CachesAndDedupes: factory runs at most once across
  200 concurrent goroutines for the same key; subsequent calls hit
  the cache without invoking the factory.
- TestGetOrPrepare_FactoryError: factory errors are propagated and
  not cached, so transient failures don't poison the cache.