feat: add OpenAI /v1/completions adapter for vLLM gpt-oss-120b accuracy#308

Open
arekay-nv wants to merge 1 commit into main from arekay/openai-completions-adapter

Conversation

@arekay-nv
Collaborator

Adds APIType.OPENAI_COMPLETIONS routing to /v1/completions, which accepts pre-tokenized token ID arrays and bypasses vLLM's chat template — required for gpt-oss-120b where the Harmony format must be applied client-side.

  • Add APIType.OPENAI_COMPLETIONS with default_route "/v1/completions"
  • Add TextCompletionRequest/Response/SSE msgspec types
  • Add OpenAITextCompletionsAdapter (mirrors SGLang adapter, reuses OpenAISSEAccumulator)
  • Register adapter and accumulator in endpoint_client/config.py
  • Rename gptoss → gptoss_sglang presets; add gptoss_vllm across aime25/gpqa/livecodebench
  • Update sglang_gptoss_120b_example.yaml to use gptoss_sglang presets
  • Update vllm_gptoss_120b_example.yaml to use openai_completions + gptoss_vllm presets
  • Add 18 unit tests covering adapter, SSE, preset existence, and APIType integration
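
The client-side flow this enables can be sketched roughly as follows. Field names follow the OpenAI completions API; the helper name and token IDs are illustrative, not from the PR.

```python
import json

def build_completions_request(model: str, token_ids: list[int],
                              max_tokens: int = 256, stream: bool = True) -> dict:
    # /v1/completions accepts a list of token IDs as the prompt, so the
    # Harmony-formatted prompt can be tokenized client-side and sent as-is,
    # bypassing any server-side chat template.
    return {
        "model": model,
        "prompt": token_ids,
        "max_tokens": max_tokens,
        "stream": stream,
    }

payload = build_completions_request("gpt-oss-120b", [200006, 17360, 200008], max_tokens=8)
print(json.dumps(payload))
```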

fix: move lazy test imports to module level; fix decode_sse_message return type

  • Move all inline imports in test_completions_adapter.py to file-level
  • Add test for empty-text SSE choice path
  • Fix HttpRequestAdapter.decode_sse_message abstract annotation from str -> Any (SGLang and completions adapters both return SSEDelta structs, not str)
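
The annotation change amounts to something like this sketch (class and method names are taken from the PR description; the body and parameter name are illustrative):

```python
from abc import ABC, abstractmethod
from typing import Any

class HttpRequestAdapter(ABC):
    @abstractmethod
    def decode_sse_message(self, message: str) -> Any:
        # Annotated as Any rather than str: concrete adapters such as the
        # SGLang and completions adapters return structured SSE delta
        # objects, not plain strings.
        ...
```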

examples/04_GPTOSS120B_Example/Readme.md:

  • Replace stale chat-completions note with accurate openai_completions description
  • Update performance-only vLLM api_type reference from "openai" to "openai_completions"

What does this PR do?

Type of change

  • Bug fix
  • New feature
  • Documentation update
  • Refactor/cleanup

Related issues

Testing

  • Tests added/updated
  • All tests pass locally
  • Manual testing completed

Checklist

  • Code follows project style
  • Pre-commit hooks pass
  • Documentation updated (if needed)

github-actions Bot requested a review from nvzhihanj May 9, 2026 11:38

github-actions Bot commented May 9, 2026

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

arekay-nv requested review from nv-alicheng and viraatc May 9, 2026 11:38

gemini-code-assist Bot left a comment

Code Review

This pull request introduces a new openai_completions API type and adapter to support the OpenAI /v1/completions endpoint, enabling the use of pre-tokenized input with vLLM. This change allows users to bypass server-side chat templates, ensuring parity with SGLang results for specific models like gpt-oss-120b. The implementation includes the OpenAITextCompletionsAdapter, updated configuration templates, documentation, and new unit tests. I have no feedback to provide.

arekay-nv requested a review from tianmu-li May 9, 2026 11:40
arekay-nv marked this pull request as ready for review May 11, 2026 02:38
arekay-nv requested a review from a team May 11, 2026 02:38
"""

OPENAI = "openai"
OPENAI_COMPLETIONS = "openai_completions"
Collaborator

We might want to be explicit and say this is v1 completions (vs v1_chat_completions). If OpenAI comes up with some new template in the future, we would have to refactor here as well (also above).
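
A minimal sketch of this naming suggestion (the member names below are illustrative, not from the PR):

```python
from enum import Enum

class APIType(str, Enum):
    # Explicit v1 member names leave room for a future OpenAI API
    # revision without another rename.
    OPENAI_V1_CHAT_COMPLETIONS = "openai"
    OPENAI_V1_COMPLETIONS = "openai_completions"
```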

Comment on lines +36 to +41
def gptoss_sglang() -> list[Transform]:
    return [UserPromptFormatter(user_prompt_format=_FORMAT)]


def gptoss() -> list[Transform]:
    return [
        UserPromptFormatter(
            user_prompt_format=(
                "You are a python coding expert that solves problems step-by-step.\n"
                "You must provide the reasoning to arriving at your solution and the code to solve the problem.\n"
                "Do not try simulating the code execution. The code must be enclosed within ```python delimiters.\n\n\n"
                "{question}\n"
                "### Format: You will use the following starter code to write the solution to the problem and enclose your code within delimiters.\n"
                "```python\n"
                "{starter_code}\n"
                "```\n"
            ),
        ),
    ]


def gptoss_vllm() -> list[Transform]:
    return [UserPromptFormatter(user_prompt_format=_FORMAT)]
Collaborator

This seems like a duplicate: a different function calling the same implementation.


def gptoss() -> list[Transform]:
    return [
        # Step 1: Format the prompt from question and choices
Collaborator

Seems like a duplicate

Comment on lines +27 to +32
def gptoss_sglang() -> list[Transform]:
    return [UserPromptFormatter(user_prompt_format=_FORMAT)]


def gptoss_vllm() -> list[Transform]:
    return [UserPromptFormatter(user_prompt_format=_FORMAT)]
Collaborator

It seems like the transformation is the same between SGLang and vLLM, and the distinct behavior here is whether we pre-tokenize or not. Using a gptoss_ prefix will be misleading.

And the part which determines the pre-tokenization is the api_type: "openai_completions", not the transformation here, right? Seems like we should deduplicate these 2.
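
One way the deduplication could look, sketched with a stand-in formatter type (the real Transform and UserPromptFormatter live in the repo; this stub only mirrors their shape):

```python
from dataclasses import dataclass

@dataclass
class UserPromptFormatter:
    # Stand-in for the repo's formatter, for illustration only.
    user_prompt_format: str

_FORMAT = "{question}"

def gptoss() -> list:
    # Single implementation: the presets share the same transform, and
    # pre-tokenization is controlled by api_type, not by the transform.
    return [UserPromptFormatter(user_prompt_format=_FORMAT)]

# Aliases keep both preset names resolvable without duplicating the body.
gptoss_sglang = gptoss
gptoss_vllm = gptoss
```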

- "http://localhost:8000"
api_key: null
api_type: "openai"
api_type: "openai_completions"
Collaborator

Am I understanding correctly this flag is where we control pre-tokenization?

Comment thread AGENTS.md
│ ├── types.py # OpenAI response types (chat + text completion)
│ ├── openai_adapter.py # Chat completions adapter (/v1/chat/completions)
│ ├── openai_msgspec_adapter.py # msgspec-based chat completions adapter (fast path)
│ ├── completions_adapter.py # Text completions adapter (/v1/completions, pre-tokenized input)
Collaborator

v1 completions adapter

Comment thread AGENTS.md
│ ├── openai_msgspec_adapter.py # msgspec-based adapter (fast path)
│ ├── accumulator.py # Streaming response accumulator
│ ├── types.py # OpenAI response types (chat + text completion)
│ ├── openai_adapter.py # Chat completions adapter (/v1/chat/completions)
Collaborator

Probably want to rename this to v1_chat_completions_adapter as well (not a must-have for this PR)

# See the License for the specific language governing permissions and
# limitations under the License.

"""OpenAI /v1/completions adapter for vLLM with pre-tokenized prompts."""
Collaborator

Probably shouldn't say it's for vLLM?
