Support multiple prompts/token arrays in a single request#201
Support multiple prompts/token arrays in a single request#201Phylliida wants to merge 3 commits into
Conversation
TimPietruskyRunPod
left a comment
There was a problem hiding this comment.
Review: PR #201
The idea here is solid — supporting multiple prompts/token arrays in a single request fills a real gap, since the OpenAI-compatible route doesn't support RunPod's polling for long-running requests. However, there are several bugs and an incomplete implementation that need to be addressed before this can merge.
Bug: is vs isinstance in type checks
In src/utils.py, both helper functions use is for type checking:
# prompt_to_vllm_prompt
elif prompt is list:
# tokens_to_vllm_prompt
elif tokens[0] is list:prompt is list checks identity against the list class object — it will never be True for an actual list instance like ["hello", "world"]. These should be:
elif isinstance(prompt, list):
# and
elif isinstance(tokens[0], list):As-is, multi-prompt input silently falls through to the single-prompt branch every time, so the core feature doesn't actually work.
Bug: vllm module not imported in utils.py
The new functions use vllm.TextPrompt and vllm.TokensPrompt, but utils.py doesn't import the vllm module itself. The existing imports pull specific symbols (SamplingParams, random_uuid, etc.) but not the top-level module. This will raise a NameError at runtime.
Either import vllm or import the specific types:
from vllm.inputs import TextPrompt, TokensPromptIncomplete: multi-prompt not handled in the engine
prompt_to_vllm_prompt and tokens_to_vllm_prompt can return a list of prompt objects, but _generate_vllm passes llm_input directly to self.llm.generate() with a single request_id:
results_generator = self.llm.generate(llm_input, validated_sampling_params, request_id)AsyncLLMEngine.generate() expects a single prompt per call. If llm_input is a list of TextPrompt objects, this will either error or produce unexpected results. The engine needs to iterate over each prompt, generate separate request_ids, and aggregate the results. This is the biggest gap — the parsing layer supports multi-prompt but the generation layer doesn't.
Breaking change: apply_chat_template default
# Before
self.apply_chat_template = job.get("apply_chat_template", False)
# After
self.apply_chat_template = job.get("apply_chat_template", job.get("messages") is not None)This changes the default behavior: users sending messages without explicitly setting apply_chat_template will now get the template applied automatically. This is arguably correct behavior (if you send messages, you probably want the template), but it's a breaking change that should be called out in the PR description. Existing users relying on the old default may see different output.
Corresponding _generate_vllm change looks correct
# Before
if apply_chat_template or isinstance(llm_input, list):
# After
if apply_chat_template:With multi-prompt support, llm_input can now be a list of TextPrompt objects (not messages), so the old isinstance(llm_input, list) check would incorrectly apply the chat template to non-message list inputs. The new apply_chat_template default (True when messages present) compensates. This change is correct and necessary.
Minor: truthiness check in get_llm_input
if value:
return fn(value)This skips falsy values (empty string, empty list, 0). Probably fine since empty inputs aren't useful, but [] for messages or "" for prompt would silently fall through to the next key or return None, which is a subtle behavior change from the original job.get("messages", job.get("prompt")).
Summary
The concept is valid and addresses a real need. Main blockers:
- Bug:
prompt is list→isinstance(prompt, list)(and same for tokens) - Bug: Missing
vllmimport inutils.py - Incomplete: Engine doesn't handle list of prompts — needs iteration with separate request IDs
Once those are fixed, the smaller items (breaking change documentation, edge cases) can be wrapped up. Happy to re-review after!
OpenAI compatible API supports this already, however it doesn't allow for polling so it isn't suitable for requests that take a very long time.
This makes it so you can use the polling stuff but also send multiple requests in a single request, via:
promptas an array of prompts, ortokensas an array of array of tokensIt also adds support for tokens parameter (which can be just a single array, or array of array as mentioned), which previously wasn't supported.