Enhance micro-benchmark harness to track proxy overhead under realistic conditions #32

@ashwing

Motivation

As a Rust-based gateway proxying requests to vLLM's Responses API, agentic-api adds a network hop and processing overhead. The existing Criterion benchmark (crates/agentic-server/benches/proxy_bench.rs) establishes a baseline comparing direct-to-LLM vs proxied latency, but doesn't capture behavior under realistic conditions: concurrent load, varying payload sizes, or streaming TTFB. As the proxy evolves (e.g., #29's crate restructure), we need to detect regressions and understand where overhead comes from.

Current State

The existing benchmark:

  • Uses Criterion with its async_tokio feature (benchmarks run on a Tokio runtime)
  • Spawns a mock LLM server (fixed JSON or SSE chunks)
  • Spawns the gateway pointing at the mock
  • Measures four scenarios: non_stream/direct, non_stream/proxied, stream/direct, stream/proxied

This gives a good sense of raw proxy overhead in ideal conditions (serial requests, tiny payloads).

Gaps

  1. Concurrency: The benchmark runs one request at a time. Real deployments see 10-100+ concurrent requests. We need to measure how p50/p95/p99 latencies degrade under load, and whether the gateway introduces queuing or backpressure issues.

  2. Body size scaling: Current payloads are ~50 bytes. Real Responses API payloads range from 1KB (simple tool calls) to 100KB+ (long prompts, large tool schemas). We need to measure whether overhead scales linearly with body size, or if there are hidden copy/allocation bottlenecks.

  3. Streaming TTFB: For SSE endpoints, what matters most is time-to-first-chunk, not total transfer time. The current benchmark measures total elapsed time, which masks TTFB degradation.

  4. Regression detection in CI: The benchmarks exist but their results aren't tracked over time. Without baseline comparison (e.g., Criterion's --save-baseline option), we won't catch regressions introduced by refactors or dependency updates.
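To make the p50/p95/p99 requirement from gap 1 concrete, here is a minimal sketch of how tail latencies could be summarized from raw per-request samples. It uses the nearest-rank definition of a percentile, which is only one of several reasonable choices; `percentile`, `summarize`, and `LatencySummary` are hypothetical names, not part of the existing benchmark.

```rust
use std::time::Duration;

/// Nearest-rank percentile over an ascending-sorted slice.
/// `p` must be in (0, 100]; returns None for an empty slice or invalid `p`.
fn percentile(sorted: &[Duration], p: f64) -> Option<Duration> {
    if sorted.is_empty() || !(p > 0.0 && p <= 100.0) {
        return None;
    }
    // Nearest-rank: 1-based rank = ceil(p * n / 100).
    let rank = (p * sorted.len() as f64 / 100.0).ceil() as usize;
    Some(sorted[rank.max(1) - 1])
}

/// The three tail metrics named in the issue (hypothetical struct).
#[derive(Debug, PartialEq)]
struct LatencySummary {
    p50: Duration,
    p95: Duration,
    p99: Duration,
}

/// Sorts the samples and extracts p50/p95/p99.
/// Returns None when there are no samples.
fn summarize(mut samples: Vec<Duration>) -> Option<LatencySummary> {
    samples.sort();
    Some(LatencySummary {
        p50: percentile(&samples, 50.0)?,
        p95: percentile(&samples, 95.0)?,
        p99: percentile(&samples, 99.0)?,
    })
}
```

Collecting one `Duration` per request and summarizing at the end keeps the measurement loop cheap. With small sample counts the p99 value is effectively the maximum, so it is only meaningful once a run has a few hundred requests.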

Proposed Approach

I acknowledge these micro-benchmarks are tricky: mock backends don't capture network variance, async benchmarks have high noise, and streaming TTFB measurement requires custom instrumentation. To keep this manageable, I propose a phased approach:

Phase 1: Concurrency and body size variants

  • Extend the existing benchmark with concurrency groups (1, 10, 50 concurrent requests)
  • Add body size variants: 100B, 1KB, 10KB, 100KB
  • This should reveal whether the gateway's forwarding introduces bottlenecks under load
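As a rough illustration of the Phase 1 matrix, the sketch below builds request bodies of an exact byte size and enumerates the concurrency and body-size combinations. Everything here is an assumption for illustration: the JSON field names, the decimal interpretation of "1KB" as 1,000 bytes, and the `c{n}/body{n}` label format. The real benchmark would feed these into its Criterion groups.

```rust
/// Body size variants from Phase 1, in bytes (decimal sizes assumed).
const BODY_SIZES: [usize; 4] = [100, 1_000, 10_000, 100_000];
/// Concurrency levels from Phase 1.
const CONCURRENCY: [usize; 3] = [1, 10, 50];

/// Builds a JSON body of exactly `target_bytes` bytes by padding the
/// `input` field with ASCII filler. The field names are illustrative
/// and not taken from the real request schema.
/// Returns None if `target_bytes` is smaller than the fixed JSON envelope.
fn make_payload(target_bytes: usize) -> Option<String> {
    const PREFIX: &str = "{\"model\":\"mock\",\"input\":\"";
    const SUFFIX: &str = "\"}";
    let pad = target_bytes.checked_sub(PREFIX.len() + SUFFIX.len())?;
    let mut body = String::with_capacity(target_bytes);
    body.push_str(PREFIX);
    body.extend(std::iter::repeat('a').take(pad));
    body.push_str(SUFFIX);
    Some(body)
}

/// Cartesian product of (concurrency, body size), each with a label
/// that could serve as a benchmark id.
fn scenarios() -> Vec<(usize, usize, String)> {
    let mut out = Vec::new();
    for &c in &CONCURRENCY {
        for &b in &BODY_SIZES {
            out.push((c, b, format!("c{c}/body{b}")));
        }
    }
    out
}
```

Padding a single field keeps the payload valid JSON while making the size the only variable, so any non-linear growth in overhead can be attributed to body handling rather than parsing complexity.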

Phase 2: Streaming TTFB

  • Add a custom measurement for time-to-first-SSE-chunk (direct vs proxied)
  • Compare under varying concurrency
  • Open to alternative approaches if Criterion isn't the right tool here
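One possible shape for the custom TTFB measurement is sketched below. To stay self-contained it uses std threads and a channel as a stand-in for an SSE byte stream; the real harness would presumably read from an async HTTP response body instead. `StreamTiming`, `measure_stream`, and `spawn_mock_stream` are hypothetical names.

```rust
use std::sync::mpsc::{self, Receiver};
use std::thread;
use std::time::{Duration, Instant};

/// Timing for one streamed response.
#[derive(Debug)]
struct StreamTiming {
    ttfb: Duration,  // request start to first chunk
    total: Duration, // request start to end of stream
    chunks: usize,
}

/// Drains a chunk stream, recording when the first chunk arrives.
/// Returns None if the stream closes without yielding any chunk.
fn measure_stream(start: Instant, rx: Receiver<Vec<u8>>) -> Option<StreamTiming> {
    let mut ttfb = None;
    let mut chunks = 0;
    for _chunk in rx {
        if ttfb.is_none() {
            ttfb = Some(start.elapsed());
        }
        chunks += 1;
    }
    Some(StreamTiming { ttfb: ttfb?, total: start.elapsed(), chunks })
}

/// Stand-in for a mock SSE backend: sends `n` chunks, the first after
/// `first_delay`, the rest separated by `gap`.
fn spawn_mock_stream(n: usize, first_delay: Duration, gap: Duration) -> Receiver<Vec<u8>> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        thread::sleep(first_delay);
        for i in 0..n {
            if i > 0 {
                thread::sleep(gap);
            }
            // Stop early if the receiver has gone away.
            if tx.send(format!("data: chunk {i}\n\n").into_bytes()).is_err() {
                break;
            }
        }
    });
    rx
}
```

Recording TTFB and total time in the same pass makes it possible to report both from a single request, so the direct and proxied paths can be compared on each metric without running the workload twice.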

Phase 3: CI integration

  • Run cargo bench -- --save-baseline main on merges to main
  • Compare PR branches against that baseline (cargo bench -- --baseline main)
  • Focus on catching large regressions, not noise (reasonable threshold TBD)
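The threshold check in Phase 3 could look roughly like the sketch below, assuming the CI job can obtain a baseline and a candidate mean time in nanoseconds for each benchmark. How those numbers are extracted from Criterion's output is left open here, and `Verdict`, `relative_change`, and `classify` are hypothetical names.

```rust
/// Outcome of comparing a candidate run against the baseline.
#[derive(Debug, PartialEq)]
enum Verdict {
    Improved,
    Unchanged,
    Regressed,
}

/// Relative change of candidate vs baseline: +0.10 means 10% slower.
/// Returns None for a non-positive or non-finite input.
fn relative_change(baseline_ns: f64, candidate_ns: f64) -> Option<f64> {
    if !baseline_ns.is_finite() || !candidate_ns.is_finite() || baseline_ns <= 0.0 {
        return None;
    }
    Some((candidate_ns - baseline_ns) / baseline_ns)
}

/// Classifies a change using a symmetric relative threshold
/// (e.g. 0.2 for 20%). Changes inside the band count as noise.
fn classify(baseline_ns: f64, candidate_ns: f64, threshold: f64) -> Option<Verdict> {
    let change = relative_change(baseline_ns, candidate_ns)?;
    Some(if change > threshold {
        Verdict::Regressed
    } else if change < -threshold {
        Verdict::Improved
    } else {
        Verdict::Unchanged
    })
}
```

A symmetric band treats small movements in either direction as noise. Only `Regressed` would need to fail the job; `Improved` could be a prompt to refresh the baseline.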

Discussion Points

  • Sample size: Should we increase Criterion's sample count for async benchmarks? The default (100 samples) may not smooth out runtime variance.
  • Mock backend realism: Should we introduce artificial latency/jitter in the mock LLM to better simulate production?
  • Alternative tools: Is Criterion the right fit for streaming TTFB, or should we consider custom harnesses (e.g., tokio::time::Instant + manual statistics)?
  • CI noise: What's a reasonable regression threshold? 10%? 20%? Track trends instead of hard thresholds?
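On the mock backend realism question, one low-cost option would be to add a base delay plus bounded jitter before the mock responds. The sketch below uses a small seeded xorshift generator so runs stay reproducible without an extra RNG crate. `XorShift64` and `mock_delay` are hypothetical names, and the distribution is only approximately uniform, which is probably good enough for a first pass but is not a model of real LLM latency.

```rust
use std::time::Duration;

/// Small deterministic xorshift64 generator (shifts 13, 7, 17).
/// Seeded so that benchmark runs are reproducible.
struct XorShift64(u64);

impl XorShift64 {
    fn new(seed: u64) -> Self {
        // xorshift state must be non-zero, so substitute a fixed constant.
        Self(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Returns `base` plus jitter in [0, max_jitter], at microsecond
/// granularity. The modulo introduces a negligible bias.
fn mock_delay(rng: &mut XorShift64, base: Duration, max_jitter: Duration) -> Duration {
    let span = max_jitter.as_micros() as u64 + 1;
    base + Duration::from_micros(rng.next_u64() % span)
}
```

The mock handler would sleep for the returned duration before sending its response or first chunk. Because the same delay applies to both the direct and proxied paths, it should not bias the comparison, but it does raise the absolute times and makes the proxy's relative overhead look smaller.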

Happy to start with Phase 1 (concurrency and body size) as a concrete first step once there's alignment on approach.
