Support multiple prompts/token arrays in a single request by Phylliida · Pull Request #201 · runpod-workers/worker-vllm

Phylliida · 2025-07-31T00:34:39Z

OpenAI compatible API supports this already, however it doesn't allow for polling so it isn't suitable for requests that take a very long time.

This makes it so you can use the polling stuff but also send multiple requests in a single request, via:

send prompt as an array of prompts, or
send tokens as an array of array of tokens

It also adds support for tokens parameter (which can be just a single array, or array of array as mentioned), which previously wasn't supported.

TimPietruskyRunPod

Review: PR #201

The idea here is solid — supporting multiple prompts/token arrays in a single request fills a real gap, since the OpenAI-compatible route doesn't support RunPod's polling for long-running requests. However, there are several bugs and an incomplete implementation that need to be addressed before this can merge.

Bug: `is` vs `isinstance` in type checks

In src/utils.py, both helper functions use is for type checking:

# prompt_to_vllm_prompt
elif prompt is list:

# tokens_to_vllm_prompt
elif tokens[0] is list:

prompt is list checks identity against the list class object — it will never be True for an actual list instance like ["hello", "world"]. These should be:

elif isinstance(prompt, list):
# and
elif isinstance(tokens[0], list):

As-is, multi-prompt input silently falls through to the single-prompt branch every time, so the core feature doesn't actually work.

Bug: `vllm` module not imported in `utils.py`

The new functions use vllm.TextPrompt and vllm.TokensPrompt, but utils.py doesn't import the vllm module itself. The existing imports pull specific symbols (SamplingParams, random_uuid, etc.) but not the top-level module. This will raise a NameError at runtime.

Either import vllm or import the specific types:

from vllm.inputs import TextPrompt, TokensPrompt

Incomplete: multi-prompt not handled in the engine

prompt_to_vllm_prompt and tokens_to_vllm_prompt can return a list of prompt objects, but _generate_vllm passes llm_input directly to self.llm.generate() with a single request_id:

results_generator = self.llm.generate(llm_input, validated_sampling_params, request_id)

AsyncLLMEngine.generate() expects a single prompt per call. If llm_input is a list of TextPrompt objects, this will either error or produce unexpected results. The engine needs to iterate over each prompt, generate separate request_ids, and aggregate the results. This is the biggest gap — the parsing layer supports multi-prompt but the generation layer doesn't.

Breaking change: `apply_chat_template` default

# Before
self.apply_chat_template = job.get("apply_chat_template", False)

# After  
self.apply_chat_template = job.get("apply_chat_template", job.get("messages") is not None)

This changes the default behavior: users sending messages without explicitly setting apply_chat_template will now get the template applied automatically. This is arguably correct behavior (if you send messages, you probably want the template), but it's a breaking change that should be called out in the PR description. Existing users relying on the old default may see different output.

Corresponding `_generate_vllm` change looks correct

# Before
if apply_chat_template or isinstance(llm_input, list):

# After
if apply_chat_template:

With multi-prompt support, llm_input can now be a list of TextPrompt objects (not messages), so the old isinstance(llm_input, list) check would incorrectly apply the chat template to non-message list inputs. The new apply_chat_template default (True when messages present) compensates. This change is correct and necessary.

Minor: truthiness check in `get_llm_input`

if value:
    return fn(value)

This skips falsy values (empty string, empty list, 0). Probably fine since empty inputs aren't useful, but [] for messages or "" for prompt would silently fall through to the next key or return None, which is a subtle behavior change from the original job.get("messages", job.get("prompt")).

Summary

The concept is valid and addresses a real need. Main blockers:

Bug: prompt is list → isinstance(prompt, list) (and same for tokens)
Bug: Missing vllm import in utils.py
Incomplete: Engine doesn't handle list of prompts — needs iteration with separate request IDs

Once those are fixed, the smaller items (breaking change documentation, edge cases) can be wrapped up. Happy to re-review after!

Phylliida added 3 commits July 30, 2025 17:31

Support multiple prompts/token arrays in a single request

997b6a3

No more list test since we have multiple options

bc1bc2c

Auto apply chat template for messages

7161f2a

velaraptor-runpod requested a review from TimPietruskyRunPod March 10, 2026 15:26

TimPietruskyRunPod requested changes Apr 14, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support multiple prompts/token arrays in a single request#201

Support multiple prompts/token arrays in a single request#201
Phylliida wants to merge 3 commits into
runpod-workers:mainfrom
Phylliida:patch-1

Phylliida commented Jul 31, 2025 •

edited

Loading

Uh oh!

TimPietruskyRunPod left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Phylliida commented Jul 31, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

TimPietruskyRunPod left a comment

Choose a reason for hiding this comment

Review: PR #201

Bug: is vs isinstance in type checks

Bug: vllm module not imported in utils.py

Incomplete: multi-prompt not handled in the engine

Breaking change: apply_chat_template default

Corresponding _generate_vllm change looks correct

Minor: truthiness check in get_llm_input

Summary

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Phylliida commented Jul 31, 2025 •

edited

Loading

Bug: `is` vs `isinstance` in type checks

Bug: `vllm` module not imported in `utils.py`

Breaking change: `apply_chat_template` default

Corresponding `_generate_vllm` change looks correct

Minor: truthiness check in `get_llm_input`