feat: add OpenAI /v1/completions adapter for vLLM gpt-oss-120b accuracy#308
Conversation
Adds APIType.OPENAI_COMPLETIONS routing to /v1/completions, which accepts pre-tokenized token ID arrays and bypasses vLLM's chat template. This is required for gpt-oss-120b, where the Harmony format must be applied client-side.

- Add APIType.OPENAI_COMPLETIONS with default_route "/v1/completions"
- Add TextCompletionRequest/Response/SSE msgspec types
- Add OpenAITextCompletionsAdapter (mirrors the SGLang adapter, reuses OpenAISSEAccumulator)
- Register the adapter and accumulator in endpoint_client/config.py
- Rename gptoss → gptoss_sglang presets; add gptoss_vllm across aime25/gpqa/livecodebench
- Update sglang_gptoss_120b_example.yaml to use gptoss_sglang presets
- Update vllm_gptoss_120b_example.yaml to use openai_completions + gptoss_vllm presets
- Add 18 unit tests covering the adapter, SSE handling, preset existence, and APIType integration

fix: move lazy test imports to module level; fix decode_sse_message return type

- Move all inline imports in test_completions_adapter.py to file level
- Add a test for the empty-text SSE choice path
- Fix the HttpRequestAdapter.decode_sse_message abstract annotation from str to Any (the SGLang and completions adapters both return SSEDelta structs, not str)

examples/04_GPTOSS120B_Example/Readme.md:

- Replace the stale chat-completions note with an accurate openai_completions description
- Update the performance-only vLLM api_type reference from "openai" to "openai_completions"
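The pre-tokenized flow described above can be sketched as follows. This is a hypothetical illustration, not the PR's actual adapter code: the field names follow the OpenAI completions schema (which vLLM's /v1/completions endpoint extends to accept token ID arrays as `prompt`), and the model name and token IDs are placeholders.

```python
def build_completions_payload(token_ids: list[int], max_tokens: int = 256) -> dict:
    """Build a /v1/completions request body with a pre-tokenized prompt.

    Passing token IDs instead of a string bypasses any server-side chat
    template, so the Harmony format can be applied client-side.
    """
    return {
        "model": "gpt-oss-120b",  # illustrative model name
        "prompt": token_ids,      # token ID array, not a string
        "max_tokens": max_tokens,
        "stream": True,           # responses arrive as SSE chunks
    }

payload = build_completions_payload([200006, 17360, 200008])
```

With a string `prompt`, the server would be free to apply its own template; the token ID array keeps full control of the prompt encoding on the client.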
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
Code Review
This pull request introduces a new openai_completions API type and adapter to support the OpenAI /v1/completions endpoint, enabling the use of pre-tokenized input with vLLM. This change allows users to bypass server-side chat templates, ensuring parity with SGLang results for specific models like gpt-oss-120b. The implementation includes the OpenAITextCompletionsAdapter, updated configuration templates, documentation, and new unit tests. I have no feedback to provide.
    OPENAI = "openai"
    OPENAI_COMPLETIONS = "openai_completions"
We might want to be explicit and call it v1_completions (vs. v1_chat_completions). If OpenAI comes up with some new endpoint template in the future, we would have to refactor here as well (also above).
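A minimal sketch of the more explicit naming this comment suggests; the enum names and values here are illustrative, not what the PR actually defines.

```python
from enum import Enum

class APIType(str, Enum):
    """Hypothetical route-explicit naming for the API types."""
    OPENAI_V1_CHAT_COMPLETIONS = "openai_v1_chat_completions"  # /v1/chat/completions
    OPENAI_V1_COMPLETIONS = "openai_v1_completions"            # /v1/completions
```

Encoding the route into the name means a future OpenAI endpoint gets a new member instead of forcing a rename of the existing ones.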
def gptoss_sglang() -> list[Transform]:
    return [UserPromptFormatter(user_prompt_format=_FORMAT)]
def gptoss() -> list[Transform]:
    return [
        UserPromptFormatter(
            user_prompt_format=(
                "You are a python coding expert that solves problems step-by-step.\n"
                "You must provide the reasoning to arriving at your solution and the code to solve the problem.\n"
                "Do not try simulating the code execution. The code must be enclosed within ```python delimiters.\n\n\n"
                "{question}\n"
                "### Format: You will use the following starter code to write the solution to the problem and enclose your code within delimiters.\n"
                "```python\n"
                "{starter_code}\n"
                "```\n"
            ),
        ),
    ]

def gptoss_vllm() -> list[Transform]:
    return [UserPromptFormatter(user_prompt_format=_FORMAT)]
This seems like a duplicate: a different function calling the same implementation.
def gptoss() -> list[Transform]:
    return [
        # Step 1: Format the prompt from question and choices

def gptoss_sglang() -> list[Transform]:
    return [UserPromptFormatter(user_prompt_format=_FORMAT)]

def gptoss_vllm() -> list[Transform]:
    return [UserPromptFormatter(user_prompt_format=_FORMAT)]
It seems like the transformation is the same between SGLang and vLLM, and the distinct behavior here is whether we pre-tokenize or not, so using gptoss_ prefixes will be misleading.
And the part that determines the pre-tokenization is the api_type: "openai_completions", not the transformation here, right? Seems like we should deduplicate these two.
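The deduplication suggested in this comment could look like the sketch below: one shared preset, with the backend-specific names kept as aliases so existing YAML configs keep resolving. _FORMAT and UserPromptFormatter are minimal stand-ins for the repo's actual symbols, not its real definitions.

```python
_FORMAT = "{question}"  # placeholder for the shared prompt template

class UserPromptFormatter:
    """Minimal stand-in for the repo's prompt-formatting transform."""
    def __init__(self, user_prompt_format: str) -> None:
        self.user_prompt_format = user_prompt_format

def gptoss() -> list:
    """Single shared preset; the backend choice lives in api_type, not here."""
    return [UserPromptFormatter(user_prompt_format=_FORMAT)]

# Backward-compatible aliases so existing preset names keep resolving.
gptoss_sglang = gptoss
gptoss_vllm = gptoss
```

Since both presets build the identical transform list, aliasing avoids two functions drifting apart while the api_type flag alone selects pre-tokenization.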
    - "http://localhost:8000"
  api_key: null
  api_type: "openai_completions"  # was: "openai"
Am I understanding correctly this flag is where we control pre-tokenization?
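As this question reads the PR, yes: the api_type flag selects the endpoint and, with it, who tokenizes. A hypothetical dispatch sketch, with function names, route strings, and the tokenize callback all illustrative rather than the repo's actual API:

```python
def build_request(api_type: str, messages: list[dict], tokenize) -> tuple[str, dict]:
    """Pick the route and request body based on api_type.

    "openai_completions": client applies the chat template (e.g. Harmony)
    and sends token IDs; anything else sends chat messages for the server
    to template itself.
    """
    if api_type == "openai_completions":
        token_ids = tokenize(messages)  # client-side chat template
        return "/v1/completions", {"prompt": token_ids}
    return "/v1/chat/completions", {"messages": messages}
```

So the preset transforms only shape the prompt text; pre-tokenization is decided entirely by this routing choice.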
│ ├── types.py                  # OpenAI response types (chat + text completion)
│ ├── openai_adapter.py         # Chat completions adapter (/v1/chat/completions)
│ ├── openai_msgspec_adapter.py # msgspec-based chat completions adapter (fast path)
│ ├── completions_adapter.py    # Text completions adapter (/v1/completions, pre-tokenized input)
│ ├── accumulator.py            # Streaming response accumulator
Probably want to rename this to v1_chat_completions_adapter as well (not a must-have for this PR).
# See the License for the specific language governing permissions and
# limitations under the License.

"""OpenAI /v1/completions adapter for vLLM with pre-tokenized prompts."""
Probably shouldn't say it's for vLLM?
What does this PR do?
Type of change
Related issues
Testing
Checklist