
[Feature] Support n parameter in /v1/chat/completions and /v1/completions#4419

Open
ziyangliu-666 wants to merge 3 commits into InternLM:main from ziyangliu-666:ziyangliu-666/support-n-parameter

Conversation


@ziyangliu-666 commented Mar 17, 2026

Summary

This PR adds real support for the n parameter in the OpenAI-compatible API server.

Previously, n was already present in the protocol (ChatCompletionRequest.n and CompletionRequest.n) and validated (n > 0), but it was never actually respected at runtime. The server always generated exactly one output, regardless of the value passed.

With this change:

  • /v1/chat/completions now returns n choices when n > 1
  • /v1/completions now returns len(prompts) * n choices
  • streaming responses now carry the correct choice.index
  • seeded sampling becomes reproducible while still producing distinct outputs across choices

Closes #4073


Why

The OpenAI API's n parameter is useful for:

  • best-of-n style sampling
  • diversity-based candidate selection
  • evaluation and benchmarking pipelines
  • reducing client-side request overhead compared with issuing n separate calls

Without this patch, users could pass n > 1, but the server still produced only one output, which was both misleading and incompatible with expected OpenAI API behavior.


What changed

lmdeploy/serve/openai/api_server.py

/v1/chat/completions

  • n=1 keeps the existing behavior unchanged
    • continues using request.session_id
  • n>1
    • creates n independent sessions via get_session(-1)
    • creates n independent generators

Streaming path

  • iterates over generators sequentially
  • keeps per-generator state isolated
  • uses one GptOssChatParser instance per generator
  • emits chunks with the correct choice.index
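
The sequential streaming loop above can be sketched as follows (fake_generator and the chunk dicts are illustrative stand-ins, not lmdeploy internals):

```python
import asyncio


async def fake_generator(choice_index):
    """Stand-in for one per-choice token generator."""
    for piece in (f'choice-{choice_index}-a', f'choice-{choice_index}-b'):
        yield piece


async def stream_choices(n):
    """Drain each generator in turn, tagging every chunk with its choice index."""
    chunks = []
    for index in range(n):
        gen = fake_generator(index)   # one generator (and one parser) per choice
        async for piece in gen:
            chunks.append({'index': index, 'delta': piece})
    return chunks


chunks = asyncio.run(stream_choices(2))
```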

Non-streaming path

  • runs all generators concurrently with asyncio.gather
  • counts usage.prompt_tokens once, since the prompt is shared
  • sums usage.completion_tokens across all generated choices
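
Under the stated behavior, the non-streaming aggregation could look like this sketch (collect_choice and the dict shapes are hypothetical stand-ins for draining the real generators):

```python
import asyncio


async def collect_choice(index):
    """Stand-in for draining one generator to completion."""
    await asyncio.sleep(0)
    return {'index': index, 'completion_tokens': 80}


async def collect_all(n, prompt_tokens):
    """Run all n choices concurrently and aggregate usage."""
    results = await asyncio.gather(*(collect_choice(i) for i in range(n)))
    usage = {
        'prompt_tokens': prompt_tokens,  # counted once: the prompt is shared
        'completion_tokens': sum(r['completion_tokens'] for r in results),
    }
    usage['total_tokens'] = usage['prompt_tokens'] + usage['completion_tokens']
    return results, usage


choices, usage = asyncio.run(collect_all(3, prompt_tokens=21))
```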

Seed handling

  • when seed is provided, generator i uses seed + i
  • this makes outputs reproducible but still distinct across choices
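
A one-line sketch of the offset (choice_seed is a hypothetical helper name, not the actual function):

```python
def choice_seed(seed, i):
    """Seed for the i-th generator; an unset seed stays unset (fully random)."""
    return None if seed is None else seed + i


# Matches the tested behavior: seed=100, n=3 -> [100, 101, 102]
seeds = [choice_seed(100, i) for i in range(3)]
```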

/v1/completions

  • expands generator creation from:
    • one per prompt
  • to:
    • one per (prompt, choice) pair

This means total generators = len(prompts) * n.
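
A sketch of the fan-out (fan_out is a hypothetical helper; the real code creates sessions and generators rather than index pairs). A global choice index of p_idx * n + c_idx is consistent with the 2-prompts-times-n=2 example producing indices [0, 1, 2, 3]:

```python
from itertools import product


def fan_out(prompts, n):
    """One (prompt, choice) pair per generator: len(prompts) * n in total."""
    return list(product(range(len(prompts)), range(n)))


pairs = fan_out(['a', 'b'], 2)
```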

Also fixes a pre-existing issue in the streaming path:

  • previously, every streamed chunk used index=0
  • now each chunk reports the correct choice index

Tests

Added a new test file:

  • tests/test_lmdeploy/test_n_parameter.py

This includes 16 unit tests with mocked engine behavior, covering:

  • protocol validation
    • rejects n=0
    • rejects n<0
    • accepts valid values
  • n=1 returns exactly one choice
  • n=3 returns three choices with indices [0, 1, 2]
  • usage.completion_tokens is summed correctly across all choices
  • n=1 preserves request.session_id
  • n>1 uses auto-assigned sessions
  • seed offset behavior
    • seed=100, n=3 -> generator seeds [100, 101, 102]
  • /v1/completions
    • single prompt with n=3 -> 3 choices
    • two prompts with n=2 -> 4 choices

Backward compatibility

No breaking change.

  • n still defaults to 1
  • existing callers see no behavior change unless they explicitly pass n > 1

Example

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="your-model",
    messages=[{"role": "user", "content": "Give me a creative project name."}],
    n=3,
    temperature=1.0,
)

for choice in response.choices:
    print(f"[{choice.index}] {choice.message.content}")

…ions

- For n>1 in chat completions, create n independent sessions and generators,
  stream each output sequentially with the correct choice index, and gather
  all n results concurrently in non-streaming mode
- For n>1 in completions, expand session/generator creation to len(prompts)*n
  and fix the pre-existing streaming bug where index was hardcoded to 0
- When seed is set, use seed+i for the i-th generator to produce distinct
  but reproducible outputs
- usage.completion_tokens is correctly summed across all n choices;
  usage.prompt_tokens is counted once (shared input)
- Add 16 unit tests covering validation, choice count, usage aggregation,
  session assignment, seed offset, and multi-prompt × n combinations

Closes InternLM#4073
@ziyangliu-666
Author

The pr_functions_test failure is unrelated to this PR — it fails in lmdeploy/lite/quantization/calibration.py (NaN assertion in the activation observer) and the same failure can be seen on #4416. Our changes are limited to lmdeploy/serve/openai/api_server.py and the new test file.

Contributor

Copilot AI left a comment


Pull request overview

Adds functional support for the OpenAI-compatible n parameter so /v1/chat/completions and /v1/completions can return multiple choices per request, aligning server behavior with the existing protocol fields/validation.

Changes:

  • Implement multi-choice generation for chat_completions_v1 (multiple sessions/generators, per-choice indices, seed offsetting, concurrent non-stream aggregation).
  • Implement multi-choice generation for completions_v1 (expand generators to len(prompts) * n, fix streaming choice.index tagging).
  • Add a new unit test suite to validate n behavior and indexing/usage aggregation with a mocked engine.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File Description
lmdeploy/serve/openai/api_server.py Implements n>1 session/generator fan-out for chat/completions; updates streaming/non-streaming response construction.
tests/test_lmdeploy/test_n_parameter.py Adds mocked async-engine tests validating n defaults, rejection, indexing, seed offsetting, and basic aggregation behavior.


Comment on lines +85 to +89

    def test_completion_n_zero_rejected(self):
        ctx = self._make_server_context()
        req = CompletionRequest(model='m', prompt='hi', n=0)
        assert completion_check_request(req, ctx) != ''

Comment on lines +684 to +685

    if not all(results):
        return create_error_response(HTTPStatus.BAD_REQUEST, 'Client disconnected')
- Fix TypeError in chat streaming path: create_stream_response_json
  was returning model_dump_json() (str) but cache_block_ids injection
  subscripted it as a dict; switch to model_dump() + json.dumps()
- Fix stateful GptOssChatParser shared across concurrent asyncio.gather
  calls in non-streaming n>1 path; create a fresh instance per choice,
  consistent with the streaming path
- Fix tool-call parse exceptions being swallowed and misreported as
  "Client disconnected"; re-raise so asyncio.gather propagates them,
  wrap gather in try/except to return INTERNAL_SERVER_ERROR
- Add missing test_completion_n_negative_rejected to match the
  existing test_chat_n_negative_rejected
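
The model_dump() + json.dumps() fix can be reproduced with a minimal stand-in (FakeChunk is hypothetical and mimics only the two pydantic methods involved):

```python
import json


class FakeChunk:
    """Minimal stand-in for the pydantic response model."""

    def __init__(self, id):
        self.id = id

    def model_dump(self):
        return {'id': self.id}

    def model_dump_json(self):
        return json.dumps(self.model_dump())


chunk = FakeChunk('c1')

# Buggy path: model_dump_json() returns a str, so subscripting it raises TypeError.
try:
    payload = chunk.model_dump_json()
    payload['cache_block_ids'] = [3, 7]
except TypeError:
    pass

# Fixed path: dump to a dict, inject the extra field, then serialize once.
data = chunk.model_dump()
data['cache_block_ids'] = [3, 7]
wire = json.dumps(data)
```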
@ziyangliu-666
Author

ziyangliu-666 commented Mar 18, 2026

Follow-up: Copilot review fixes + re-verified self-test

Addressed all 5 issues raised by the Copilot review (commit 99b77cb):

Fixes

1. api_server.py, streaming path — Issue: create_stream_response_json returned model_dump_json() (a str), but the cache_block_ids injection subscripted it as a dict, raising TypeError when the cache is enabled. Fix: switch to model_dump() + json.dumps() at yield, consistent with completions_v1.
2. api_server.py, non-streaming n>1 — Issue: a single GptOssChatParser instance was shared across concurrent asyncio.gather calls; parse_full delegates to parse_streaming, which mutates internal StreamableParser state, corrupting output across choices. Fix: create a fresh GptOssChatParser() per choice, matching the streaming path.
3. api_server.py, _collect_chat_response — Issue: tool-call parse exceptions were caught, logged, and returned as False; all-False results were then reported as 400 Client disconnected, giving the wrong HTTP status and no real error message. Fix: re-raise the exception and wrap asyncio.gather in try/except to return 500 INTERNAL_SERVER_ERROR with the actual message; the client-disconnect path is unchanged.
4. test_n_parameter.py — Issue: CompletionRequest(n=-1) rejection was not tested (only the chat path had a negative-n test). Fix: added test_completion_n_negative_rejected (17 tests total).
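
The fresh-instance-per-choice fix can be illustrated with a toy stateful parser (StatefulParser is a hypothetical stand-in for GptOssChatParser):

```python
import asyncio


class StatefulParser:
    """Toy stateful parser: it accumulates every piece it is fed."""

    def __init__(self):
        self.buf = []

    def feed(self, piece):
        self.buf.append(piece)
        return ''.join(self.buf)


async def parse_choice(pieces):
    parser = StatefulParser()   # fresh instance per choice keeps state isolated
    out = ''
    for p in pieces:
        await asyncio.sleep(0)  # yield control, as real concurrent decoding would
        out = parser.feed(p)
    return out


async def main():
    # A single shared parser here would interleave buffers across choices.
    return await asyncio.gather(parse_choice(['a', 'b']), parse_choice(['c', 'd']))


results = asyncio.run(main())
```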

Self-test results after fixes (Qwen3-0.6B, PyTorch backend)

/v1/chat/completions with n=3 — non-streaming:

choices count: 3
  [0] '<think>\nOkay, the user wants a random number between 1 and 10. Let me think. The'
  [1] '<think>\nOkay, the user wants a random number between 1 and 10. Let me think. How'
  [2] '<think>\nOkay, the user wants a random number between 1 and 10. Let me think abou'
usage: prompt=21 completion=240 total=261   # completion_tokens = 3 × 80

/v1/completions with n=3, single prompt:

choices count: 3
  [0] 'Paris. The capital of Italy is Rome. The capital of Spain is Madrid'
  [1] 'Paris, and the capital of Japan is Tokyo. So, which country is'
  [2] 'Paris. The capital of the capital is the city that has no capital.'

/v1/completions with 2 prompts × n=2 (4 choices total):

choices count: 4  (expected 4)
  [0] 'above $0.8 \div \sqrt{0.8}'
  [1] 'clear. What is the name of the sky\'s name?'
  [2] 'through the right-hand side of a closed cylindrical tank...'
  [3] 'through a pipe with a diameter of 26 mm...'

Streaming n=2 — each chunk correctly tagged with its choice index:

stream choices seen: [0, 1]
  stream [0]: '<think>\nOkay, the user wants me to say "yes" or "no" only...'
  stream [1]: '<think>\nOkay, I need to answer "Yes or No" based on the given instruction...'

All four scenarios produce the correct number of choices with correct indexing. Pre-commit hooks pass.

@windreamer
Collaborator

Furthermore, the unit tests your PR added have been skipped due to missing pytest-asyncio in requirements/test.txt. You can just add it to make the unit tests work.

@ziyangliu-666
Author

Fixed in 05f1057 — added pytest-asyncio to requirements/test.txt.



Development

Successfully merging this pull request may close these issues.

[Feature] Can we support parameter n in OpenAI compatible API?
