Skip to content

Concurrent requests processed sequentially with 22% per-request overhead (BS4 benchmark) #637

@Rih0z

Description

@Rih0z

Description

Foundry Local appears to process concurrent requests sequentially rather than in parallel, and introduces significant per-request overhead under concurrent load.

Benchmark Results

4 identical concurrent requests (BS4) sent to Foundry Local vs llama.cpp, both running Phi-4-mini on the same hardware:

Engine BS1 (single) BS4 (4 concurrent) Expected BS4 (sequential) Actual overhead
Foundry Local 13.59s 66.5s 54.4s (13.59 x 4) +22% (12s extra)
llama.cpp (FA=on) 12.52s 19.2s True parallel
  • Foundry Local BS4 wall time (66.5s) is worse than pure sequential (54.4s), indicating contention overhead
  • llama.cpp processes all 4 requests in 19.2s (aggregate 154.8 tok/s), close to single-request time

Steps to Reproduce

  1. Start Foundry service, load Phi-4-mini-instruct-cuda-gpu:5
  2. Send 1 warmup request (excluded from measurement)
  3. Send 4 identical POST requests concurrently to /v1/chat/completions
  4. Measure wall time until all 4 responses are received

Parameters: max_tokens=800, temperature=0, stream=false

Expected Behavior

Concurrent requests should benefit from batched inference or at minimum not degrade individual request performance.

Actual Behavior

  • 4 concurrent requests take 66.5s (vs 54.4s expected for pure sequential)
  • Each individual request under load takes ~16.6s (vs 13.6s single), a 22% slowdown
  • This suggests not only sequential processing but additional contention overhead

Environment

  • Foundry Local v0.8.119
  • Model: Phi-4-mini-instruct-cuda-gpu:5
  • GPU: NVIDIA RTX 4070 Laptop 8GB VRAM
  • OS: Windows 11 Enterprise
  • Test date: 2026-04-13

Impact

For any multi-user or server deployment scenario, this sequential processing with contention overhead makes Foundry Local significantly less practical than alternatives that support concurrent inference (e.g., llama.cpp with n_parallel).

Metadata

Metadata

Assignees

No one assigned

    Labels

    CLIIssue relates to the CLI

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions