Motivation
As a Rust-based gateway proxying requests to vLLM's Responses API, agentic-api adds a network hop and processing overhead. The existing Criterion benchmark (crates/agentic-server/benches/proxy_bench.rs) establishes a baseline comparing direct-to-LLM vs proxied latency, but doesn't capture behavior under realistic conditions: concurrent load, varying payload sizes, or streaming TTFB. As the proxy evolves (e.g., #29's crate restructure), we need to detect regressions and understand where overhead comes from.
Current State
The existing benchmark:
- Uses Criterion with the async_tokio runtime
- Spawns a mock LLM server (fixed JSON or SSE chunks)
- Spawns the gateway pointing at the mock
- Measures four scenarios: non_stream/direct, non_stream/proxied, stream/direct, stream/proxied
This gives a good sense of raw proxy overhead in ideal conditions (serial requests, tiny payloads).
Gaps
- Concurrency: The benchmark runs one request at a time. Real deployments see 10-100+ concurrent requests. We need to measure how p50/p95/p99 latencies degrade under load, and whether the gateway introduces queuing or backpressure issues.
- Body size scaling: Current payloads are ~50 bytes. Real Responses API payloads range from 1KB (simple tool calls) to 100KB+ (long prompts, large tool schemas). We need to measure whether overhead scales linearly with body size, or whether hidden copy/allocation bottlenecks appear.
- Streaming TTFB: For SSE endpoints, what matters most is time-to-first-chunk, not total transfer time. The current benchmark measures total elapsed time, which masks TTFB degradation.
- Regression detection in CI: The benchmarks exist but their results aren't tracked. Without baseline comparison (e.g., Criterion's --save-baseline option), we won't catch regressions introduced by refactors or dependency updates.
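To make the p50/p95/p99 requirement concrete: as far as I know, Criterion's default report centers on mean and median estimates over batched iterations, so tail percentiles would likely have to be computed from raw per-request timings. Below is a minimal sketch using the nearest-rank method. The names percentile, summarize, and LatencySummary are hypothetical and do not exist in the current benchmark.

```rust
use std::time::Duration;

/// Nearest-rank percentile over latency samples (`p` in 0..=100).
/// Returns `None` for an empty sample set.
/// Hypothetical helper, not part of the existing benchmark.
fn percentile(samples: &[Duration], p: f64) -> Option<Duration> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort();
    let n = sorted.len();
    // Nearest-rank: rank = ceil(p * n / 100), clamped to 1..=n.
    let rank = (p * n as f64 / 100.0).ceil() as usize;
    let idx = rank.clamp(1, n) - 1;
    Some(sorted[idx])
}

/// p50/p95/p99 summary for one benchmark cell (hypothetical type).
#[derive(Debug, PartialEq)]
struct LatencySummary {
    p50: Duration,
    p95: Duration,
    p99: Duration,
}

/// Summarize raw per-request latencies; `None` if there are no samples.
fn summarize(samples: &[Duration]) -> Option<LatencySummary> {
    Some(LatencySummary {
        p50: percentile(samples, 50.0)?,
        p95: percentile(samples, 95.0)?,
        p99: percentile(samples, 99.0)?,
    })
}
```

Each concurrent request would push its own elapsed time into a shared Vec, and summarize would run once per scenario. Nearest-rank is only one of several percentile definitions; an interpolating variant would work equally well here.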
Proposed Approach
These micro-benchmarks are tricky: mock backends don't capture network variance, async benchmarks are noisy, and streaming TTFB measurement requires custom instrumentation. I therefore propose a phased approach:
Phase 1: Concurrency and body size variants
- Extend the existing benchmark with concurrency groups (1, 10, 50 concurrent requests)
- Add body size variants: 100B, 1KB, 10KB, 100KB
- This should reveal whether the gateway's forwarding introduces bottlenecks under load
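As a rough illustration of the Phase 1 matrix, the sketch below enumerates the concurrency and body-size combinations and builds a request body of an exact byte length. Everything here is an assumption for discussion: the constant names, BenchCase, make_body, and case_id are hypothetical, the JSON fields are placeholders (not the real Responses API schema), and 1KB is taken to mean 1024 bytes.

```rust
/// Concurrency levels from the Phase 1 proposal.
const CONCURRENCY_LEVELS: [usize; 3] = [1, 10, 50];
/// Body sizes from the Phase 1 proposal (assuming 1KB = 1024 bytes).
const BODY_SIZES: [usize; 4] = [100, 1024, 10 * 1024, 100 * 1024];

/// One cell of the benchmark matrix (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
struct BenchCase {
    concurrency: usize,
    body_bytes: usize,
}

/// Every concurrency level paired with every body size.
fn bench_matrix() -> Vec<BenchCase> {
    let mut cases = Vec::new();
    for &concurrency in &CONCURRENCY_LEVELS {
        for &body_bytes in &BODY_SIZES {
            cases.push(BenchCase { concurrency, body_bytes });
        }
    }
    cases
}

/// Build a JSON body of exactly `target_len` bytes by padding one string
/// field. The field names are placeholders, not the real request schema.
/// Returns `None` if `target_len` cannot even hold the JSON envelope.
fn make_body(target_len: usize) -> Option<String> {
    const PREFIX: &str = r#"{"model":"mock","input":""#;
    const SUFFIX: &str = r#""}"#;
    let pad = target_len.checked_sub(PREFIX.len() + SUFFIX.len())?;
    let mut body = String::with_capacity(target_len);
    body.push_str(PREFIX);
    body.push_str(&"x".repeat(pad));
    body.push_str(SUFFIX);
    Some(body)
}

/// Benchmark ID in the style of the existing scenario names,
/// e.g. "proxied/c10/1024B".
fn case_id(path: &str, case: BenchCase) -> String {
    format!("{}/c{}/{}B", path, case.concurrency, case.body_bytes)
}
```

Generating bodies once, outside the measured closure, keeps payload construction out of the timings. The 12 resulting cases would each run in both direct and proxied form.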
Phase 2: Streaming TTFB
- Add a custom measurement for time-to-first-SSE-chunk (direct vs proxied)
- Compare under varying concurrency
- Open to alternative approaches if Criterion isn't the right tool here
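One possible shape for the custom TTFB measurement is sketched below, using only the standard library and a blocking reader so it stays independent of the HTTP client. The real benchmark is async, so this shows only the timing logic. StreamTiming, time_stream, and DelayedChunks are hypothetical names; DelayedChunks is a stand-in for a mock SSE body, not the existing mock server.

```rust
use std::collections::VecDeque;
use std::io::{self, Read};
use std::thread;
use std::time::{Duration, Instant};

/// Timings for one streamed response (hypothetical type).
#[derive(Debug)]
struct StreamTiming {
    ttfb: Duration,  // time until the first non-empty read
    total: Duration, // time until end of stream
    bytes: usize,
}

/// Drain `reader`, recording when the first bytes arrive and when the
/// stream ends. `start` should be taken just before the request is sent.
fn time_stream<R: Read>(reader: &mut R, start: Instant) -> io::Result<StreamTiming> {
    let mut buf = [0u8; 4096];
    let mut ttfb = None;
    let mut bytes = 0usize;
    loop {
        let n = match reader.read(&mut buf) {
            Ok(n) => n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        if n == 0 {
            break; // end of stream
        }
        if ttfb.is_none() {
            ttfb = Some(start.elapsed());
        }
        bytes += n;
    }
    let total = start.elapsed();
    // An empty stream has no first byte; report TTFB equal to total.
    Ok(StreamTiming { ttfb: ttfb.unwrap_or(total), total, bytes })
}

/// Stand-in for an SSE body: yields each chunk after a fixed delay.
/// Chunks are assumed non-empty (an empty chunk would look like EOF).
struct DelayedChunks {
    chunks: VecDeque<Vec<u8>>,
    delay: Duration,
}

impl Read for DelayedChunks {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let mut chunk = match self.chunks.pop_front() {
            Some(c) => c,
            None => return Ok(0),
        };
        thread::sleep(self.delay);
        let n = chunk.len().min(buf.len());
        buf[..n].copy_from_slice(&chunk[..n]);
        if n < chunk.len() {
            // Keep the unread tail for the next call.
            self.chunks.push_front(chunk.split_off(n));
        }
        Ok(n)
    }
}
```

Reporting TTFB and total time separately for the direct and proxied paths would show whether the gateway buffers chunks before forwarding them. An async version would apply the same logic to the first item of the response byte stream.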
Phase 3: CI integration
- Run cargo bench -- --save-baseline main on merges to main
- Compare PR branches against the baseline
- Focus on catching large regressions, not noise (reasonable threshold TBD)
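If the CI step ends up comparing two point estimates directly instead of relying on Criterion's own change report, the gate could be as simple as the sketch below. Verdict, relative_change, and classify are hypothetical names, and the threshold is a parameter precisely because the right value is still open.

```rust
/// Outcome of comparing a PR run against the saved baseline (hypothetical type).
#[derive(Debug, PartialEq)]
enum Verdict {
    Improved,
    Unchanged,
    Regressed,
}

/// Relative change of `current_ns` against `baseline_ns`;
/// 0.25 means 25% slower, -0.25 means 25% faster.
/// Returns `None` for a non-positive or non-finite baseline.
fn relative_change(baseline_ns: f64, current_ns: f64) -> Option<f64> {
    if baseline_ns <= 0.0 || !baseline_ns.is_finite() || !current_ns.is_finite() {
        return None;
    }
    Some((current_ns - baseline_ns) / baseline_ns)
}

/// Classify a change using a symmetric relative threshold (e.g. 0.2 for 20%).
/// Changes within the threshold are treated as noise.
fn classify(baseline_ns: f64, current_ns: f64, threshold: f64) -> Option<Verdict> {
    let change = relative_change(baseline_ns, current_ns)?;
    Some(if change > threshold {
        Verdict::Regressed
    } else if change < -threshold {
        Verdict::Improved
    } else {
        Verdict::Unchanged
    })
}
```

For example, classify(1000.0, 1300.0, 0.2) is a regression, while classify(1000.0, 1100.0, 0.2) falls inside the noise band. A trend-based check over several runs would replace the single baseline value with a rolling statistic.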
Discussion Points
- Sample size: Should we increase Criterion's sample count for async benchmarks? The default (100 samples) may not smooth out runtime variance.
- Mock backend realism: Should we introduce artificial latency/jitter in the mock LLM to better simulate production?
- Alternative tools: Is Criterion the right fit for streaming TTFB, or should we consider custom harnesses (e.g., tokio::time::Instant + manual statistics)?
- CI noise: What's a reasonable regression threshold? 10%? 20%? Track trends instead of hard thresholds?
Happy to start with Phase 1 (concurrency and body size) as a concrete first step once there's alignment on approach.