Enhance micro-benchmark harness to track proxy overhead under realistic conditions #32

### Motivation

As a Rust-based gateway proxying requests to vLLM's Responses API, agentic-api adds a network hop and processing overhead. The existing Criterion benchmark (`crates/agentic-server/benches/proxy_bench.rs`) establishes a baseline comparing direct-to-LLM vs proxied latency, but doesn't capture behavior under realistic conditions: concurrent load, varying payload sizes, or streaming TTFB. As the proxy evolves (e.g., #29's crate restructure), we need to detect regressions and understand where overhead comes from.

### Current State

The existing benchmark:
- Uses Criterion with `async_tokio` runtime
- Spawns a mock LLM server (fixed JSON or SSE chunks)
- Spawns the gateway pointing at the mock
- Measures four scenarios: `non_stream/direct`, `non_stream/proxied`, `stream/direct`, `stream/proxied`

This gives a good sense of raw proxy overhead in ideal conditions (serial requests, tiny payloads).

### Gaps

1. **Concurrency**: The benchmark runs one request at a time. Real deployments see 10-100+ concurrent requests. We need to measure how p50/p95/p99 latencies degrade under load, and whether the gateway introduces queuing or backpressure issues.

2. **Body size scaling**: Current payloads are ~50 bytes. Real Responses API payloads range from 1KB (simple tool calls) to 100KB+ (long prompts, large tool schemas). We need to measure whether overhead scales linearly with body size, or whether there are hidden copy/allocation bottlenecks.

3. **Streaming TTFB**: For SSE endpoints, what matters most is time-to-first-chunk, not total transfer time. The current benchmark measures total elapsed time, which masks TTFB degradation.

4. **Regression detection in CI**: The benchmarks exist but their results aren't tracked. Without baseline comparison (e.g., `cargo bench -- --save-baseline <name>`, which Criterion uses to store a named baseline), we won't catch regressions introduced by refactors or dependency updates.
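To make the p50/p95/p99 requirement in gap 1 concrete, here is a minimal sketch of a latency summary using the nearest-rank percentile method. The names `LatencySummary`, `percentile`, and `summarize` are hypothetical and do not exist in the current benchmark; Criterion reports its own statistics, so a helper like this would only be needed if we collect per-request samples ourselves.

```rust
use std::time::Duration;

/// Hypothetical summary of per-request latencies for one scenario.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct LatencySummary {
    pub p50: Duration,
    pub p95: Duration,
    pub p99: Duration,
}

/// Nearest-rank percentile over an already sorted slice.
/// `pct` must be in 1..=100; returns `None` for empty input or an invalid `pct`.
pub fn percentile(sorted: &[Duration], pct: usize) -> Option<Duration> {
    if sorted.is_empty() || pct == 0 || pct > 100 {
        return None;
    }
    // rank = ceil(pct / 100 * n), computed in integers to avoid float rounding.
    let rank = (pct * sorted.len() + 99) / 100;
    Some(sorted[rank - 1])
}

/// Sorts the samples in place and extracts p50/p95/p99.
pub fn summarize(samples: &mut [Duration]) -> Option<LatencySummary> {
    samples.sort_unstable();
    Some(LatencySummary {
        p50: percentile(samples, 50)?,
        p95: percentile(samples, 95)?,
        p99: percentile(samples, 99)?,
    })
}
```

Nearest-rank is chosen here only because it is simple and always returns an observed sample; an interpolating estimator would be an equally reasonable choice.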

### Proposed Approach

These micro-benchmarks are tricky: mock backends don't capture network variance, async benchmarks are noisy, and streaming TTFB measurement requires custom instrumentation. I therefore propose a phased approach:

**Phase 1: Concurrency and body size variants**
- Extend the existing benchmark with concurrency groups (1, 10, 50 concurrent requests)
- Add body size variants: 100B, 1KB, 10KB, 100KB
- This should reveal whether the gateway's forwarding introduces bottlenecks under load
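As a rough sketch of how the Phase 1 matrix could be parameterized, the snippet below builds a request body padded to an exact byte size and enumerates the concurrency and body size combinations. Everything here is an assumption for illustration: the helper names, the JSON field names in the placeholder body, and the choice of decimal sizes (1KB = 1,000 bytes). The real benchmark would substitute a valid Responses API payload.

```rust
/// Body sizes proposed for Phase 1, in bytes (decimal units assumed).
pub const BODY_SIZES: [usize; 4] = [100, 1_000, 10_000, 100_000];

/// Concurrency levels proposed for Phase 1.
pub const CONCURRENCY_LEVELS: [usize; 3] = [1, 10, 50];

/// Builds a placeholder JSON body of exactly `target_bytes` bytes by padding
/// the `input` field with ASCII filler. Returns `None` if the target is
/// smaller than the fixed JSON envelope.
pub fn padded_body(target_bytes: usize) -> Option<String> {
    // Hypothetical envelope; not the real Responses API schema.
    const PREFIX: &str = r#"{"model":"mock","input":""#;
    const SUFFIX: &str = r#""}"#;

    let pad = target_bytes.checked_sub(PREFIX.len() + SUFFIX.len())?;
    let mut body = String::with_capacity(target_bytes);
    body.push_str(PREFIX);
    body.extend(std::iter::repeat('x').take(pad));
    body.push_str(SUFFIX);
    Some(body)
}

/// Every (concurrency, body_size) pair to benchmark.
pub fn scenarios() -> Vec<(usize, usize)> {
    CONCURRENCY_LEVELS
        .iter()
        .flat_map(|&c| BODY_SIZES.iter().map(move |&s| (c, s)))
        .collect()
}
```

Each pair returned by `scenarios()` could then map to one benchmark ID in a Criterion group, keeping direct and proxied runs comparable per cell.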

**Phase 2: Streaming TTFB**
- Add a custom measurement for time-to-first-SSE-chunk (direct vs proxied)
- Compare under varying concurrency
- Open to alternative approaches if Criterion isn't the right tool here
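One possible shape for the custom TTFB measurement is sketched below. It is a synchronous stand-in built on `std::io::Read` so it stays self-contained; the real harness is async and would apply the same idea to the response body stream. `StreamTiming`, `time_stream`, and `DelayedChunks` are hypothetical names, and `DelayedChunks` only imitates a mock backend that stalls before its first SSE chunk.

```rust
use std::collections::VecDeque;
use std::io::{self, Read};
use std::thread;
use std::time::{Duration, Instant};

/// Timing for one streamed response.
#[derive(Debug)]
pub struct StreamTiming {
    /// Time until the first non-empty read; `None` if the stream was empty.
    pub ttfb: Option<Duration>,
    /// Time until end of stream.
    pub total: Duration,
    /// Total bytes received.
    pub bytes: usize,
}

/// Drains `body`, recording when the first bytes arrive and when it ends.
pub fn time_stream<R: Read>(mut body: R) -> io::Result<StreamTiming> {
    let start = Instant::now();
    let mut buf = [0u8; 4096];
    let mut ttfb = None;
    let mut bytes = 0;
    loop {
        let n = match body.read(&mut buf) {
            Ok(n) => n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        if n == 0 {
            break; // end of stream
        }
        if ttfb.is_none() {
            ttfb = Some(start.elapsed());
        }
        bytes += n;
    }
    Ok(StreamTiming { ttfb, total: start.elapsed(), bytes })
}

/// Stand-in for a mock backend: sleeps once before yielding the first chunk.
pub struct DelayedChunks {
    chunks: VecDeque<Vec<u8>>,
    first_delay: Duration,
    started: bool,
}

impl DelayedChunks {
    pub fn new(chunks: Vec<Vec<u8>>, first_delay: Duration) -> Self {
        Self { chunks: chunks.into(), first_delay, started: false }
    }
}

impl Read for DelayedChunks {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        if !self.started {
            self.started = true;
            thread::sleep(self.first_delay);
        }
        match self.chunks.pop_front() {
            None => Ok(0),
            Some(chunk) => {
                let n = chunk.len().min(buf.len());
                buf[..n].copy_from_slice(&chunk[..n]);
                if n < chunk.len() {
                    // Keep the unread tail for the next call.
                    self.chunks.push_front(chunk[n..].to_vec());
                }
                Ok(n)
            }
        }
    }
}
```

Recording `ttfb` and `total` separately is the point of the sketch: a proxy that buffers the upstream response would leave `total` almost unchanged while `ttfb` grows toward it.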

**Phase 3: CI integration**
- Run `cargo bench -- --save-baseline main` on merges to `main`
- Compare PR branches against that baseline (`cargo bench -- --baseline main`)
- Focus on catching large regressions, not noise (reasonable threshold TBD)
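To illustrate the "large regressions, not noise" idea, here is a minimal sketch of a threshold check that a CI step could apply to a baseline and a PR measurement. The `Verdict` type and both functions are hypothetical, the inputs are assumed to be mean latencies in nanoseconds, and the threshold is a placeholder since the actual value is still TBD. Criterion performs its own statistical comparison against a saved baseline; this only shows the decision rule we would layer on top.

```rust
/// Outcome of comparing a PR measurement against the baseline.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Verdict {
    Improved,
    WithinNoise,
    Regressed,
}

/// Relative change of `current_ns` versus `baseline_ns`
/// (0.10 means 10% slower, -0.10 means 10% faster).
/// Returns `None` when either input is not a usable measurement.
pub fn relative_change(baseline_ns: f64, current_ns: f64) -> Option<f64> {
    if !baseline_ns.is_finite() || !current_ns.is_finite() || baseline_ns <= 0.0 {
        return None;
    }
    Some((current_ns - baseline_ns) / baseline_ns)
}

/// Classifies a change using a symmetric noise band of +/- `threshold`.
pub fn classify(baseline_ns: f64, current_ns: f64, threshold: f64) -> Option<Verdict> {
    let change = relative_change(baseline_ns, current_ns)?;
    Some(if change > threshold {
        Verdict::Regressed
    } else if change < -threshold {
        Verdict::Improved
    } else {
        Verdict::WithinNoise
    })
}
```

With a 20% threshold, `classify(100.0, 125.0, 0.20)` yields `Some(Verdict::Regressed)`, while a 5% slowdown stays `WithinNoise` and would not fail the build.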

### Discussion Points

- **Sample size**: Should we increase Criterion's sample count for async benchmarks? The default (100 samples) may not smooth out runtime variance.
- **Mock backend realism**: Should we introduce artificial latency/jitter in the mock LLM to better simulate production?
- **Alternative tools**: Is Criterion the right fit for streaming TTFB, or should we consider custom harnesses (e.g., `tokio::time::Instant` + manual statistics)?
- **CI noise**: What's a reasonable regression threshold? 10%? 20%? Track trends instead of hard thresholds?
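On the mock backend realism question, one option is a seeded jitter source so that injected latency is reproducible between runs. The sketch below is an assumption, not existing code: `Jitter` is a hypothetical helper that uses a small xorshift64 generator (chosen only to avoid a dependency) to produce delays in the range `base..=base + spread`.

```rust
use std::time::Duration;

/// Deterministic delay generator for a mock backend.
/// Each delay falls in `base..=base + spread`.
pub struct Jitter {
    state: u64,
    base: Duration,
    spread: Duration,
}

impl Jitter {
    pub fn new(seed: u64, base: Duration, spread: Duration) -> Self {
        // xorshift state must never be zero, so map a zero seed to 1.
        Self { state: seed.max(1), base, spread }
    }

    /// Returns the next delay and advances the generator.
    pub fn next_delay(&mut self) -> Duration {
        // xorshift64 step (shift triple 13, 7, 17).
        let mut x = self.state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.state = x;

        // Truncating to u64 nanoseconds is fine for benchmark-scale delays.
        let spread_ns = self.spread.as_nanos() as u64;
        let extra_ns = x % spread_ns.saturating_add(1);
        self.base + Duration::from_nanos(extra_ns)
    }
}
```

The mock handler would sleep for `next_delay()` before responding. Using the same seed for the direct and proxied runs keeps the injected latency identical across both, so the difference still reflects proxy overhead.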

Happy to start with Phase 1 (concurrency and body size) as a concrete first step once there's alignment on approach.
