service/builder: use grpc.NumStreamWorkers for fixed goroutine pool by Jura-Z · Pull Request #725 · aserto-dev/topaz

Jura-Z · 2026-05-21T22:13:36Z

Summary

The default grpc-go server spawns a fresh goroutine for every incoming
stream. Under sustained ~50k rps load this is the dominant CPU cost on
top of the actual policy work — measured via pprof on a local stub
iap.egress.http policy:

Go scheduler: 34%
runtime.morestack: 41%
GC mark: 16%
OPA Eval (the actual work): 9%

grpc.NumStreamWorkers(N) switches the server to a fixed pool of N
worker goroutines that process incoming streams. Stacks are reused (no
morestack), creation cost is amortized, and GC has less per-request
garbage to chase.

Setting N to runtime.NumCPU() follows the recommendation in grpc-go's
own benchmarks (see the comment in server.go:609).

Bench

Stacking on top of #724 (the prepared-query cache PR), gRPC client,
M-series Mac, stub iap.egress.http policy, median of 2 runs:

concurrency	cache only	+ StreamWorkers	gain
1	11,517	12,900	+12%
4	28,831	31,854	+10%
8	37,572	40,985	+9%
16	46,980	50,261	+7%

Why merge order matters

This stacks on #724. The branch izakipnyi/grpc-stream-workers is
based on izakipnyi/cache-prepared-eval-query so the diff against
main includes both commits. If #724 lands first, this becomes a
trivial 14-line diff.

Compatibility

No API change.
No config change.
No new dependency (grpc.NumStreamWorkers is part of the existing
google.golang.org/grpc dep, just a previously-unused option).
Marked Experimental in grpc-go but stable since 2022 and used by
production gRPC servers under high RPS.
Existing tests still pass.

Why this is the architectural ceiling

Even with this fix, single-process Topaz still scales sub-linearly
(4.4× from 1→16 way concurrent vs ideal 12×). The remaining cost is
fundamentally Go-runtime overhead: even with stack reuse, each request
still pays for protobuf alloc + GC + middleware-chain dispatch. To go
further you'd need to pool the IsRequest/IsResponse proto objects
with sync.Pool, or scale topazd horizontally (multiple processes
behind a load balancer — verified empirically as 2 processes ≈ 1.5×
total throughput of 1).

Each Is() call rebuilds a rego.PreparedEvalQuery from the runtime's compiler and store, including ast.ParseBody / ast.ParseRef for the query body. Under concurrent load this fights the OPA compiler's internal locks and burns CPU on duplicate parsing. Memoize the prepared query keyed on (policy_context.path, policy_context.decisions). The compiler/store/policy bundle are stable between bundle reloads, so the prepared query is reusable. Invalidation is wired up via plugins.Manager.RegisterCompilerTrigger, which fires on bundle activation / discovery update / any compiler rotation. Concurrency-safe: sync.Map for read-mostly access, golang.org/x/sync singleflight to collapse concurrent misses on the same key into one PrepareForEval call. Measured against a stub iap.egress.http policy, native Go gRPC client (port 8282), Apple M-series Mac. Median of 3 back-to-back 3-second runs after a 100-call warmup: conc | before rps | after rps | gain -----+------------+-----------+------ 1 | 7,803 | 11,517 | +48% 4 | 18,474 | 28,831 | +56% 8 | 23,880 | 37,572 | +57% 16 | 29,929 | 46,980 | +57% p50 latency drops correspondingly (e.g. conc=1: 113 µs -> 77 µs; conc=16: 522 µs -> 301 µs). Decisions are byte-identical to the unmodified path. Tests: - TestCacheKey: key derivation is order-sensitive on the decisions list and stable across runs. - TestGetOrPrepare_CachesAndDedupes: factory runs at most once across 200 concurrent goroutines for the same key; subsequent calls hit the cache without invoking the factory. - TestGetOrPrepare_FactoryError: factory errors are propagated and not cached, so transient failures don't poison the cache.

The default grpc-go server spawns a fresh goroutine for every incoming stream. Under sustained ~50k rps load this is the dominant CPU cost on top of the actual policy work — measured via pprof (Go scheduler 34%, runtime.morestack 41%, GC mark 16%; OPA Eval itself only 9%). `grpc.NumStreamWorkers(N)` switches the server to a fixed pool of N worker goroutines that process incoming streams. Stacks are reused (no morestack), creation cost is amortized, and GC has less per-request garbage to chase. Setting N to GOMAXPROCS-equivalent (`runtime.NumCPU`) follows the recommendation in grpc-go's own benchmarks (server.go:609 comment). Measured stacking on top of aserto-dev#724 (the prepared-query cache PR), gRPC client, M-series Mac, stub iap.egress.http policy, median of 2 runs: conc | cache only | + StreamWorkers | gain -----+------------+-----------------+------ 1 | 11,517 | 12,900 | +12% 4 | 28,831 | 31,854 | +10% 8 | 37,572 | 40,985 | +9% 16 | 46,980 | 50,261 | +7% The win is modest because the cache PR already removed the largest lever; what remains is real Go-runtime overhead. NumStreamWorkers is marked Experimental in grpc-go but has been stable since 2022 and is widely used in production gRPC servers under high RPS.

Jura-Z added 2 commits May 21, 2026 12:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

service/builder: use grpc.NumStreamWorkers for fixed goroutine pool#725

service/builder: use grpc.NumStreamWorkers for fixed goroutine pool#725
Jura-Z wants to merge 2 commits into
aserto-dev:mainfrom
Jura-Z:izakipnyi/grpc-stream-workers

Jura-Z commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Jura-Z commented May 21, 2026

Summary

Bench

Why merge order matters

Compatibility

Why this is the architectural ceiling

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant