Skip to content

service/builder: use grpc.NumStreamWorkers for fixed goroutine pool#725

Open
Jura-Z wants to merge 2 commits into
aserto-dev:mainfrom
Jura-Z:izakipnyi/grpc-stream-workers
Open

service/builder: use grpc.NumStreamWorkers for fixed goroutine pool#725
Jura-Z wants to merge 2 commits into
aserto-dev:mainfrom
Jura-Z:izakipnyi/grpc-stream-workers

Conversation

@Jura-Z
Copy link
Copy Markdown

@Jura-Z Jura-Z commented May 21, 2026

Summary

The default grpc-go server spawns a fresh goroutine for every incoming
stream. Under sustained ~50k rps load this is the dominant CPU cost on
top of the actual policy work — measured via pprof on a local stub
iap.egress.http policy:

  • Go scheduler: 34%
  • runtime.morestack: 41%
  • GC mark: 16%
  • OPA Eval (the actual work): 9%

grpc.NumStreamWorkers(N) switches the server to a fixed pool of N
worker goroutines that process incoming streams. Stacks are reused (no
morestack), creation cost is amortized, and GC has less per-request
garbage to chase.

Setting N to runtime.NumCPU() follows the recommendation in grpc-go's
own benchmarks (see the comment in server.go:609).

Bench

Stacking on top of #724 (the prepared-query cache PR), gRPC client,
M-series Mac, stub iap.egress.http policy, median of 2 runs:

concurrency cache only + StreamWorkers gain
1 11,517 12,900 +12%
4 28,831 31,854 +10%
8 37,572 40,985 +9%
16 46,980 50,261 +7%

Why merge order matters

This stacks on #724. The branch izakipnyi/grpc-stream-workers is
based on izakipnyi/cache-prepared-eval-query so the diff against
main includes both commits. If #724 lands first, this becomes a
trivial 14-line diff.

Compatibility

  • No API change.
  • No config change.
  • No new dependency (grpc.NumStreamWorkers is part of the existing
    google.golang.org/grpc dep, just a previously-unused option).
  • Marked Experimental in grpc-go but stable since 2022 and used by
    production gRPC servers under high RPS.
  • Existing tests still pass.

Why this is the architectural ceiling

Even with this fix, single-process Topaz still scales sub-linearly
(4.4× from 1→16 way concurrent vs ideal 12×). The remaining cost is
fundamentally Go-runtime overhead: even with stack reuse, each request
still pays for protobuf alloc + GC + middleware-chain dispatch. To go
further you'd need to pool the IsRequest/IsResponse proto objects
with sync.Pool, or scale topazd horizontally (multiple processes
behind a load balancer — verified empirically as 2 processes ≈ 1.5×
total throughput of 1).

Jura-Z added 2 commits May 21, 2026 12:02
Each Is() call rebuilds a rego.PreparedEvalQuery from the runtime's
compiler and store, including ast.ParseBody / ast.ParseRef for the
query body. Under concurrent load this fights the OPA compiler's
internal locks and burns CPU on duplicate parsing.

Memoize the prepared query keyed on (policy_context.path,
policy_context.decisions). The compiler/store/policy bundle are stable
between bundle reloads, so the prepared query is reusable. Invalidation
is wired up via plugins.Manager.RegisterCompilerTrigger, which fires
on bundle activation / discovery update / any compiler rotation.

Concurrency-safe: sync.Map for read-mostly access, golang.org/x/sync
singleflight to collapse concurrent misses on the same key into one
PrepareForEval call.

Measured against a stub iap.egress.http policy, native Go gRPC client
(port 8282), Apple M-series Mac. Median of 3 back-to-back 3-second runs
after a 100-call warmup:

    conc | before rps | after rps | gain
    -----+------------+-----------+------
       1 |     7,803  |   11,517  | +48%
       4 |    18,474  |   28,831  | +56%
       8 |    23,880  |   37,572  | +57%
      16 |    29,929  |   46,980  | +57%

p50 latency drops correspondingly (e.g. conc=1: 113 µs -> 77 µs;
conc=16: 522 µs -> 301 µs). Decisions are byte-identical to the
unmodified path.

Tests:
- TestCacheKey: key derivation is order-sensitive on the decisions
  list and stable across runs.
- TestGetOrPrepare_CachesAndDedupes: factory runs at most once across
  200 concurrent goroutines for the same key; subsequent calls hit
  the cache without invoking the factory.
- TestGetOrPrepare_FactoryError: factory errors are propagated and
  not cached, so transient failures don't poison the cache.
The default grpc-go server spawns a fresh goroutine for every incoming
stream. Under sustained ~50k rps load this is the dominant CPU cost on
top of the actual policy work — measured via pprof (Go scheduler 34%,
runtime.morestack 41%, GC mark 16%; OPA Eval itself only 9%).

`grpc.NumStreamWorkers(N)` switches the server to a fixed pool of N
worker goroutines that process incoming streams. Stacks are reused (no
morestack), creation cost is amortized, and GC has less per-request
garbage to chase.

Setting N to GOMAXPROCS-equivalent (`runtime.NumCPU`) follows the
recommendation in grpc-go's own benchmarks (server.go:609 comment).

Measured stacking on top of aserto-dev#724 (the prepared-query cache PR), gRPC
client, M-series Mac, stub iap.egress.http policy, median of 2 runs:

    conc | cache only | + StreamWorkers | gain
    -----+------------+-----------------+------
       1 |    11,517  |       12,900    |  +12%
       4 |    28,831  |       31,854    |  +10%
       8 |    37,572  |       40,985    |   +9%
      16 |    46,980  |       50,261    |   +7%

The win is modest because the cache PR already removed the largest
lever; what remains is real Go-runtime overhead. NumStreamWorkers is
marked Experimental in grpc-go but has been stable since 2022 and is
widely used in production gRPC servers under high RPS.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant