
[Feature] Support n parameter in /v1/chat/completions and /v1/completions#4419

Open
ziyangliu-666 wants to merge 3 commits into InternLM:main from ziyangliu-666:ziyangliu-666/support-n-parameter

Conversation


@ziyangliu-666 commented Mar 17, 2026

Summary

This PR adds real support for the n parameter in the OpenAI-compatible API server.

Previously, n was already present in the protocol (ChatCompletionRequest.n and CompletionRequest.n) and validated (n > 0), but it was never actually respected at runtime. The server always generated exactly one output, regardless of the value passed.

With this change:

  • /v1/chat/completions now returns n choices when n > 1
  • /v1/completions now returns len(prompts) * n choices
  • streaming responses now carry the correct choice.index
  • seeded sampling becomes reproducible while still producing distinct outputs across choices

Closes #4073


Why

The OpenAI API's n parameter is useful for:

  • best-of-n style sampling
  • diversity-based candidate selection
  • evaluation and benchmarking pipelines
  • reducing client-side request overhead compared with issuing n separate calls

Without this patch, users could pass n > 1, but the server still produced only one output, which was both misleading and incompatible with expected OpenAI API behavior.


What changed

lmdeploy/serve/openai/api_server.py

/v1/chat/completions

  • n=1 keeps the existing behavior unchanged
    • continues using request.session_id
  • n>1
    • creates n independent sessions via get_session(-1)
    • creates n independent generators

Streaming path

  • iterates over generators sequentially
  • keeps per-generator state isolated
  • uses one GptOssChatParser instance per generator
  • emits chunks with the correct choice.index
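
The sequential streaming loop above can be sketched as follows (fake_generator and the chunk dicts are illustrative stand-ins, not lmdeploy internals):

```python
import asyncio


async def fake_generator(choice_index):
    """Stand-in for one per-choice token generator."""
    for piece in (f'choice-{choice_index}-a', f'choice-{choice_index}-b'):
        yield piece


async def stream_choices(n):
    """Drain each generator in turn, tagging every chunk with its choice index."""
    chunks = []
    for index in range(n):
        gen = fake_generator(index)   # one generator (and one parser) per choice
        async for piece in gen:
            chunks.append({'index': index, 'delta': piece})
    return chunks


chunks = asyncio.run(stream_choices(2))
```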

Non-streaming path

  • runs all generators concurrently with asyncio.gather
  • counts usage.prompt_tokens once, since the prompt is shared
  • sums usage.completion_tokens across all generated choices
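
Under the stated behavior, the non-streaming aggregation could look like this sketch (collect_choice and the dict shapes are hypothetical stand-ins for draining the real generators):

```python
import asyncio


async def collect_choice(index):
    """Stand-in for draining one generator to completion."""
    await asyncio.sleep(0)
    return {'index': index, 'completion_tokens': 80}


async def collect_all(n, prompt_tokens):
    """Run all n choices concurrently and aggregate usage."""
    results = await asyncio.gather(*(collect_choice(i) for i in range(n)))
    usage = {
        'prompt_tokens': prompt_tokens,  # counted once: the prompt is shared
        'completion_tokens': sum(r['completion_tokens'] for r in results),
    }
    usage['total_tokens'] = usage['prompt_tokens'] + usage['completion_tokens']
    return results, usage


choices, usage = asyncio.run(collect_all(3, prompt_tokens=21))
```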

Seed handling

  • when seed is provided, generator i uses seed + i
  • this makes outputs reproducible but still distinct across choices
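
A one-line sketch of the offset (choice_seed is a hypothetical helper name, not the actual function):

```python
def choice_seed(seed, i):
    """Seed for the i-th generator; an unset seed stays unset (fully random)."""
    return None if seed is None else seed + i


# Matches the tested behavior: seed=100, n=3 -> [100, 101, 102]
seeds = [choice_seed(100, i) for i in range(3)]
```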

/v1/completions

  • expands generator creation from:
    • one per prompt
  • to:
    • one per (prompt, choice) pair

This means total generators = len(prompts) * n.
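
A sketch of the fan-out (fan_out is a hypothetical helper; the real code creates sessions and generators rather than index pairs). A global choice index of p_idx * n + c_idx is consistent with the 2-prompts-times-n=2 example producing indices [0, 1, 2, 3]:

```python
from itertools import product


def fan_out(prompts, n):
    """One (prompt, choice) pair per generator: len(prompts) * n in total."""
    return list(product(range(len(prompts)), range(n)))


pairs = fan_out(['a', 'b'], 2)
```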

Also fixes a pre-existing issue in the streaming path:

  • previously, every streamed chunk used index=0
  • now each chunk reports the correct choice index

Tests

Added a new test file:

  • tests/test_lmdeploy/test_n_parameter.py

This includes 16 unit tests with mocked engine behavior, covering:

  • protocol validation
    • rejects n=0
    • rejects n<0
    • accepts valid values
  • n=1 returns exactly one choice
  • n=3 returns three choices with indices [0, 1, 2]
  • usage.completion_tokens is summed correctly across all choices
  • n=1 preserves request.session_id
  • n>1 uses auto-assigned sessions
  • seed offset behavior
    • seed=100, n=3 -> generator seeds [100, 101, 102]
  • /v1/completions
    • single prompt with n=3 -> 3 choices
    • two prompts with n=2 -> 4 choices

Backward compatibility

No breaking change.

  • n still defaults to 1
  • existing callers see no behavior change unless they explicitly pass n > 1

Example

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="your-model",
    messages=[{"role": "user", "content": "Give me a creative project name."}],
    n=3,
    temperature=1.0,
)

for choice in response.choices:
    print(f"[{choice.index}] {choice.message.content}")

…ions

- For n>1 in chat completions, create n independent sessions and generators,
  stream each output sequentially with the correct choice index, and gather
  all n results concurrently in non-streaming mode
- For n>1 in completions, expand session/generator creation to len(prompts)*n
  and fix the pre-existing streaming bug where index was hardcoded to 0
- When seed is set, use seed+i for the i-th generator to produce distinct
  but reproducible outputs
- usage.completion_tokens is correctly summed across all n choices;
  usage.prompt_tokens is counted once (shared input)
- Add 16 unit tests covering validation, choice count, usage aggregation,
  session assignment, seed offset, and multi-prompt × n combinations

Closes InternLM#4073
@ziyangliu-666
Author

The pr_functions_test failure is unrelated to this PR — it fails in lmdeploy/lite/quantization/calibration.py (NaN assertion in the activation observer) and the same failure can be seen on #4416. Our changes are limited to lmdeploy/serve/openai/api_server.py and the new test file.

Contributor

Copilot AI left a comment


Pull request overview

Adds functional support for the OpenAI-compatible n parameter so /v1/chat/completions and /v1/completions can return multiple choices per request, aligning server behavior with the existing protocol fields/validation.

Changes:

  • Implement multi-choice generation for chat_completions_v1 (multiple sessions/generators, per-choice indices, seed offsetting, concurrent non-stream aggregation).
  • Implement multi-choice generation for completions_v1 (expand generators to len(prompts) * n, fix streaming choice.index tagging).
  • Add a new unit test suite to validate n behavior and indexing/usage aggregation with a mocked engine.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File Description
lmdeploy/serve/openai/api_server.py Implements n>1 session/generator fan-out for chat/completions; updates streaming/non-streaming response construction.
tests/test_lmdeploy/test_n_parameter.py Adds mocked async-engine tests validating n defaults, rejection, indexing, seed offsetting, and basic aggregation behavior.


Comment on lines +85 to +89

    def test_completion_n_zero_rejected(self):
        ctx = self._make_server_context()
        req = CompletionRequest(model='m', prompt='hi', n=0)
        assert completion_check_request(req, ctx) != ''

Comment on lines +684 to +685

    if not all(results):
        return create_error_response(HTTPStatus.BAD_REQUEST, 'Client disconnected')
- Fix TypeError in chat streaming path: create_stream_response_json
  was returning model_dump_json() (str) but cache_block_ids injection
  subscripted it as a dict; switch to model_dump() + json.dumps()
- Fix stateful GptOssChatParser shared across concurrent asyncio.gather
  calls in non-streaming n>1 path; create a fresh instance per choice,
  consistent with the streaming path
- Fix tool-call parse exceptions being swallowed and misreported as
  "Client disconnected"; re-raise so asyncio.gather propagates them,
  wrap gather in try/except to return INTERNAL_SERVER_ERROR
- Add missing test_completion_n_negative_rejected to match the
  existing test_chat_n_negative_rejected
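
The model_dump() + json.dumps() fix can be reproduced with a minimal stand-in (FakeChunk is hypothetical and mimics only the two pydantic methods involved):

```python
import json


class FakeChunk:
    """Minimal stand-in for the pydantic response model."""

    def __init__(self, id):
        self.id = id

    def model_dump(self):
        return {'id': self.id}

    def model_dump_json(self):
        return json.dumps(self.model_dump())


chunk = FakeChunk('c1')

# Buggy path: model_dump_json() returns a str, so subscripting it raises TypeError.
try:
    payload = chunk.model_dump_json()
    payload['cache_block_ids'] = [3, 7]
except TypeError:
    pass

# Fixed path: dump to a dict, inject the extra field, then serialize once.
data = chunk.model_dump()
data['cache_block_ids'] = [3, 7]
wire = json.dumps(data)
```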
@ziyangliu-666
Author

ziyangliu-666 commented Mar 18, 2026

Follow-up: Copilot review fixes + re-verified self-test

Addressed all 5 issues raised by the Copilot review (commit 99b77cb):

Fixes

1. api_server.py, streaming path — Issue: create_stream_response_json returned model_dump_json() (a str), but the cache_block_ids injection subscripted it as a dict, raising TypeError when the cache is enabled. Fix: switch to model_dump() + json.dumps() at yield, consistent with completions_v1.
2. api_server.py, non-streaming n>1 — Issue: a single GptOssChatParser instance was shared across concurrent asyncio.gather calls; parse_full delegates to parse_streaming, which mutates internal StreamableParser state, corrupting output across choices. Fix: create a fresh GptOssChatParser() per choice, matching the streaming path.
3. api_server.py, _collect_chat_response — Issue: tool-call parse exceptions were caught, logged, and returned as False; all-False results were then reported as 400 Client disconnected, giving the wrong HTTP status and no real error message. Fix: re-raise the exception and wrap asyncio.gather in try/except to return 500 INTERNAL_SERVER_ERROR with the actual message; the client-disconnect path is unchanged.
4. test_n_parameter.py — Issue: CompletionRequest(n=-1) rejection was not tested (only the chat path had a negative-n test). Fix: added test_completion_n_negative_rejected (17 tests total).
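
The fresh-instance-per-choice fix can be illustrated with a toy stateful parser (StatefulParser is a hypothetical stand-in for GptOssChatParser):

```python
import asyncio


class StatefulParser:
    """Toy stateful parser: it accumulates every piece it is fed."""

    def __init__(self):
        self.buf = []

    def feed(self, piece):
        self.buf.append(piece)
        return ''.join(self.buf)


async def parse_choice(pieces):
    parser = StatefulParser()   # fresh instance per choice keeps state isolated
    out = ''
    for p in pieces:
        await asyncio.sleep(0)  # yield control, as real concurrent decoding would
        out = parser.feed(p)
    return out


async def main():
    # A single shared parser here would interleave buffers across choices.
    return await asyncio.gather(parse_choice(['a', 'b']), parse_choice(['c', 'd']))


results = asyncio.run(main())
```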

Self-test results after fixes (Qwen3-0.6B, PyTorch backend)

/v1/chat/completions with n=3 — non-streaming:

choices count: 3
  [0] '<think>\nOkay, the user wants a random number between 1 and 10. Let me think. The'
  [1] '<think>\nOkay, the user wants a random number between 1 and 10. Let me think. How'
  [2] '<think>\nOkay, the user wants a random number between 1 and 10. Let me think abou'
usage: prompt=21 completion=240 total=261   # completion_tokens = 3 × 80

/v1/completions with n=3, single prompt:

choices count: 3
  [0] 'Paris. The capital of Italy is Rome. The capital of Spain is Madrid'
  [1] 'Paris, and the capital of Japan is Tokyo. So, which country is'
  [2] 'Paris. The capital of the capital is the city that has no capital.'

/v1/completions with 2 prompts × n=2 (4 choices total):

choices count: 4  (expected 4)
  [0] 'above $0.8 \div \sqrt{0.8}'
  [1] 'clear. What is the name of the sky\'s name?'
  [2] 'through the right-hand side of a closed cylindrical tank...'
  [3] 'through a pipe with a diameter of 26 mm...'

Streaming n=2 — each chunk correctly tagged with its choice index:

stream choices seen: [0, 1]
  stream [0]: '<think>\nOkay, the user wants me to say "yes" or "no" only...'
  stream [1]: '<think>\nOkay, I need to answer "Yes or No" based on the given instruction...'

All four scenarios produce the correct number of choices with correct indexing. Pre-commit hooks pass.

@windreamer
Collaborator

Furthermore, the unit tests your PR added have been skipped due to missing pytest-asyncio in requirements/test.txt. You can just add it to make the unit tests work.

@ziyangliu-666
Author

Fixed in 05f1057 — added pytest-asyncio to requirements/test.txt.



Development

Successfully merging this pull request may close these issues.

[Feature] Can we support parameter n in OpenAI compatible API?
