Skip to content

Support multiple prompts/token arrays in a single request#201

Open
Phylliida wants to merge 3 commits into
runpod-workers:mainfrom
Phylliida:patch-1
Open

Support multiple prompts/token arrays in a single request#201
Phylliida wants to merge 3 commits into
runpod-workers:mainfrom
Phylliida:patch-1

Conversation

@Phylliida
Copy link
Copy Markdown

@Phylliida Phylliida commented Jul 31, 2025

OpenAI compatible API supports this already, however it doesn't allow for polling so it isn't suitable for requests that take a very long time.

This makes it so you can use the polling stuff but also send multiple requests in a single request, via:

  • send prompt as an array of prompts, or
  • send tokens as an array of array of tokens

It also adds support for tokens parameter (which can be just a single array, or array of array as mentioned), which previously wasn't supported.

Copy link
Copy Markdown
Contributor

@TimPietruskyRunPod TimPietruskyRunPod left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: PR #201

The idea here is solid — supporting multiple prompts/token arrays in a single request fills a real gap, since the OpenAI-compatible route doesn't support RunPod's polling for long-running requests. However, there are several bugs and an incomplete implementation that need to be addressed before this can merge.


Bug: is vs isinstance in type checks

In src/utils.py, both helper functions use is for type checking:

# prompt_to_vllm_prompt
elif prompt is list:

# tokens_to_vllm_prompt
elif tokens[0] is list:

prompt is list checks identity against the list class object — it will never be True for an actual list instance like ["hello", "world"]. These should be:

elif isinstance(prompt, list):
# and
elif isinstance(tokens[0], list):

As-is, multi-prompt input silently falls through to the single-prompt branch every time, so the core feature doesn't actually work.


Bug: vllm module not imported in utils.py

The new functions use vllm.TextPrompt and vllm.TokensPrompt, but utils.py doesn't import the vllm module itself. The existing imports pull specific symbols (SamplingParams, random_uuid, etc.) but not the top-level module. This will raise a NameError at runtime.

Either import vllm or import the specific types:

from vllm.inputs import TextPrompt, TokensPrompt

Incomplete: multi-prompt not handled in the engine

prompt_to_vllm_prompt and tokens_to_vllm_prompt can return a list of prompt objects, but _generate_vllm passes llm_input directly to self.llm.generate() with a single request_id:

results_generator = self.llm.generate(llm_input, validated_sampling_params, request_id)

AsyncLLMEngine.generate() expects a single prompt per call. If llm_input is a list of TextPrompt objects, this will either error or produce unexpected results. The engine needs to iterate over each prompt, generate separate request_ids, and aggregate the results. This is the biggest gap — the parsing layer supports multi-prompt but the generation layer doesn't.


Breaking change: apply_chat_template default

# Before
self.apply_chat_template = job.get("apply_chat_template", False)

# After  
self.apply_chat_template = job.get("apply_chat_template", job.get("messages") is not None)

This changes the default behavior: users sending messages without explicitly setting apply_chat_template will now get the template applied automatically. This is arguably correct behavior (if you send messages, you probably want the template), but it's a breaking change that should be called out in the PR description. Existing users relying on the old default may see different output.


Corresponding _generate_vllm change looks correct

# Before
if apply_chat_template or isinstance(llm_input, list):

# After
if apply_chat_template:

With multi-prompt support, llm_input can now be a list of TextPrompt objects (not messages), so the old isinstance(llm_input, list) check would incorrectly apply the chat template to non-message list inputs. The new apply_chat_template default (True when messages present) compensates. This change is correct and necessary.


Minor: truthiness check in get_llm_input

if value:
    return fn(value)

This skips falsy values (empty string, empty list, 0). Probably fine since empty inputs aren't useful, but [] for messages or "" for prompt would silently fall through to the next key or return None, which is a subtle behavior change from the original job.get("messages", job.get("prompt")).


Summary

The concept is valid and addresses a real need. Main blockers:

  1. Bug: prompt is listisinstance(prompt, list) (and same for tokens)
  2. Bug: Missing vllm import in utils.py
  3. Incomplete: Engine doesn't handle list of prompts — needs iteration with separate request IDs

Once those are fixed, the smaller items (breaking change documentation, edge cases) can be wrapped up. Happy to re-review after!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants