[Feature] Support n parameter in /v1/chat/completions and /v1/completions#4419
ziyangliu-666 wants to merge 3 commits into InternLM:main
Conversation
…ions

- For n>1 in chat completions, create n independent sessions and generators, stream each output sequentially with the correct choice index, and gather all n results concurrently in non-streaming mode
- For n>1 in completions, expand session/generator creation to len(prompts)*n and fix the pre-existing streaming bug where index was hardcoded to 0
- When seed is set, use seed+i for the i-th generator to produce distinct but reproducible outputs
- usage.completion_tokens is correctly summed across all n choices; usage.prompt_tokens is counted once (shared input)
- Add 16 unit tests covering validation, choice count, usage aggregation, session assignment, seed offset, and multi-prompt × n combinations

Closes InternLM#4073
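The sequential streaming fan-out described in the commit message can be sketched as follows. This is an illustrative stand-in, not lmdeploy's actual code: `fake_generator` and `stream_choices` are hypothetical names, and the real implementation drives engine generators rather than a list of strings.

```python
import asyncio

async def fake_generator(chunks):
    # Stand-in for one engine generator; yields output pieces.
    for piece in chunks:
        yield piece

async def stream_choices(n, chunks=('Hel', 'lo')):
    """Stream n choices one after another, tagging every chunk
    with the index of the choice it belongs to."""
    events = []
    for i in range(n):  # one independent generator pass per choice
        async for piece in fake_generator(chunks):
            events.append({'index': i, 'delta': piece})
    return events

events = asyncio.run(stream_choices(2))
```

The key point is that each emitted chunk carries its own `index`, so a client can reassemble the `n` streams even though they arrive sequentially.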
Pull request overview
Adds functional support for the OpenAI-compatible n parameter so /v1/chat/completions and /v1/completions can return multiple choices per request, aligning server behavior with the existing protocol fields/validation.
Changes:
- Implement multi-choice generation for `chat_completions_v1` (multiple sessions/generators, per-choice indices, seed offsetting, concurrent non-stream aggregation).
- Implement multi-choice generation for `completions_v1` (expand generators to `len(prompts) * n`, fix streaming `choice.index` tagging).
- Add a new unit test suite to validate `n` behavior and indexing/usage aggregation with a mocked engine.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| lmdeploy/serve/openai/api_server.py | Implements n>1 session/generator fan-out for chat/completions; updates streaming/non-streaming response construction. |
| tests/test_lmdeploy/test_n_parameter.py | Adds mocked async-engine tests validating n defaults, rejection, indexing, seed offsetting, and basic aggregation behavior. |
```python
def test_completion_n_zero_rejected(self):
    ctx = self._make_server_context()
    req = CompletionRequest(model='m', prompt='hi', n=0)
    assert completion_check_request(req, ctx) != ''
```
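The test above expects a non-empty error string for `n=0`. A minimal sketch of such a validation rule, assuming the convention that an empty string means "valid" (this `check_n` helper is hypothetical, not lmdeploy's actual `completion_check_request`):

```python
def check_n(n: int) -> str:
    """Return '' when n is valid, else a non-empty error message.
    Mirrors the convention the test above relies on."""
    if n < 1:
        return f'n must be a positive integer, got {n}'
    return ''
```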
lmdeploy/serve/openai/api_server.py (Outdated)
```python
remote_token_ids_i.append(res.token_ids)
...
if gpt_oss_parser:
    message_i = gpt_oss_parser.parse_full(final_token_ids_i)
```
```python
if not all(results):
    return create_error_response(HTTPStatus.BAD_REQUEST, 'Client disconnected')
```
Follow-up: Copilot review fixes + re-verified self-test

Addressed all 5 issues raised by the Copilot review (commit 99b77cb):

- Fix TypeError in chat streaming path: create_stream_response_json was returning model_dump_json() (str) but cache_block_ids injection subscripted it as a dict; switch to model_dump() + json.dumps()
- Fix stateful GptOssChatParser shared across concurrent asyncio.gather calls in non-streaming n>1 path; create a fresh instance per choice, consistent with the streaming path
- Fix tool-call parse exceptions being swallowed and misreported as "Client disconnected"; re-raise so asyncio.gather propagates them, wrap gather in try/except to return INTERNAL_SERVER_ERROR
- Add missing test_completion_n_negative_rejected to match the existing test_chat_n_negative_rejected
Self-test results after fixes (Qwen3-0.6B, PyTorch backend)
Streaming and non-streaming: all four scenarios produce the correct number of choices with correct indexing. Pre-commit hooks pass.
Furthermore, the unit tests your PR added have been skipped due to missing …
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fixed in 05f1057: added …
Summary
This PR adds real support for the `n` parameter in the OpenAI-compatible API server.

Previously, `n` was already present in the protocol (`ChatCompletionRequest.n` and `CompletionRequest.n`) and validated (`n > 0`), but it was never actually respected at runtime. The server always generated exactly one output, regardless of the value passed.

With this change:

- `/v1/chat/completions` now returns `n` choices when `n > 1`
- `/v1/completions` now returns `len(prompts) * n` choices
- every streamed chunk and final choice carries the correct `choice.index`

Closes #4073
Why
The OpenAI API's `n` parameter is useful for sampling several candidate completions in a single request instead of making `n` separate calls.

Without this patch, users could pass `n > 1`, but the server still produced only one output, which was both misleading and incompatible with expected OpenAI API behavior.

What changed
lmdeploy/serve/openai/api_server.py

`/v1/chat/completions`:
- `n=1` keeps the existing behavior unchanged, still using `request.session_id`
- `n>1` creates `n` independent sessions via `get_session(-1)` and `n` independent generators

Streaming path:
- a fresh `GptOssChatParser` instance per generator
- every chunk is tagged with the correct `choice.index`

Non-streaming path:
- results from all generators are gathered concurrently with `asyncio.gather`
- `usage.prompt_tokens` is counted once, since the prompt is shared
- `usage.completion_tokens` is summed across all generated choices

Seed handling:
- when `seed` is provided, generator `i` uses `seed + i`

`/v1/completions`:
- one generator per `(prompt, choice)` pair

This means total generators = `len(prompts) * n`.

Also fixes a pre-existing issue in the streaming path: every chunk was tagged with a hardcoded `index=0`.
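The non-streaming fan-out, seed offsetting, and usage aggregation described above can be sketched as follows. All names here (`fake_generate`, `generate_choices`) are illustrative stand-ins for lmdeploy's internals, and the token counts are dummy values:

```python
import asyncio

async def fake_generate(prompt, seed):
    # Stand-in for one engine generator's final (non-streaming) result.
    return {'text': f'{prompt}/{seed}', 'completion_tokens': 4}

async def generate_choices(prompts, n, seed=None, prompt_tokens=7):
    tasks, indices = [], []
    for p_idx, prompt in enumerate(prompts):
        for i in range(n):  # len(prompts) * n generators in total
            tasks.append(
                fake_generate(prompt, None if seed is None else seed + i))
            indices.append(p_idx * n + i)
    results = await asyncio.gather(*tasks)  # run all choices concurrently
    choices = [dict(index=idx, **r) for idx, r in zip(indices, results)]
    usage = {
        'prompt_tokens': prompt_tokens,  # counted once: the input is shared
        'completion_tokens': sum(c['completion_tokens'] for c in choices),
    }
    return choices, usage

choices, usage = asyncio.run(generate_choices(['hi'], n=3, seed=100))
```

With one prompt and `n=3`, this yields three choices indexed `0..2`, seeds `100..102`, and a `completion_tokens` total summed across all three.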
Tests

Added a new test file: `tests/test_lmdeploy/test_n_parameter.py`

This includes 16 unit tests with mocked engine behavior, covering:

- `n=0` is rejected
- `n<0` is rejected
- `n=1` returns exactly one choice
- `n=3` returns three choices with indices `[0, 1, 2]`
- `usage.completion_tokens` is summed correctly across all choices
- `n=1` preserves `request.session_id`
- `n>1` uses auto-assigned sessions
- `seed=100`, `n=3` -> generator seeds `[100, 101, 102]`
- `/v1/completions` with `n=3` -> 3 choices
- multi-prompt with `n=2` -> 4 choices

Backward compatibility
No breaking change.
- `n` still defaults to `1`
- behavior only changes when `n > 1`

Example
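A minimal sketch of what a request with `n > 1` might look like. The model name is a placeholder (borrowed from the self-test above) and the response shape is paraphrased from the OpenAI API convention, not captured from a live server:

```python
import json

# Hypothetical request body for POST /v1/chat/completions.
payload = {
    'model': 'Qwen3-0.6B',
    'messages': [{'role': 'user', 'content': 'Name a color.'}],
    'n': 3,       # ask for three independent choices in one request
    'seed': 100,  # per this PR, choice i is generated with seed 100 + i
}
body = json.dumps(payload)

# A conforming response returns n choices indexed 0..n-1, e.g.:
# {"choices": [{"index": 0, ...}, {"index": 1, ...}, {"index": 2, ...}], ...}
```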