feat(renderer-client): thread multimodal sidecar through rollout + transport #1346
Open
hallerite wants to merge 4 commits into
Conversation
…ansport
Surfaces the renderer's MultiModalData sidecar (pixel_values, placeholder
ranges, mm_hashes) end-to-end so multimodal renderers can drive vLLM's
/inference/v1/generate `multi_modal_data` request field and the
downstream trainer's `mm_kwargs` without going through the legacy
chat-completions / MITO multimodal path.
renderer_client.py
- `_step_multi_modal_data(step)`: recover the prior turn's mm_data from
the trajectory step (parsed-tokens or raw-message side).
- `_get_incremental_prompt_ids` now returns `RenderedTokens | None` and
forwards `previous_multi_modal_data` to `bridge_to_next_turn` so the
new turn's placeholder runs cover every earlier-turn image. Without
this carry-forward, vLLM sees mismatched placeholder counts and falls
back to hash-cache lookup or errors. Text-only renderers' raw
`list[int]` returns are normalized via `as_rendered_tokens`.
- `RendererClient.create_completion` unpacks the bridged result into
`(prompt_ids, multi_modal_data)` and forwards both to `generate`.
- `parse_response_tokens`: copies `response.multi_modal_data` onto the
emitted `ResponseTokens` so downstream consumers can read it.
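The carry-forward can be sketched roughly as below. The dataclass shapes and the offset-shift merge are illustrative assumptions, not the actual renderers API; the point the sketch makes is that prior-turn placeholder ranges and pixel values are re-emitted alongside the new turn's, so vLLM sees a placeholder run for every image in the full prompt:

```python
from dataclasses import dataclass, field

@dataclass
class PlaceholderRange:
    offset: int   # start index of the image's placeholder run in the prompt
    length: int

@dataclass
class MultiModalData:
    placeholder_ranges: list = field(default_factory=list)
    pixel_values: list = field(default_factory=list)

def bridge_to_next_turn(prev_ids, new_ids, previous_multi_modal_data, new_mm):
    # Carry prior-turn images forward so the merged sidecar covers every
    # image in the concatenated prompt, not just the newest turn's.
    merged = MultiModalData()
    if previous_multi_modal_data is not None:
        merged.placeholder_ranges += previous_multi_modal_data.placeholder_ranges
        merged.pixel_values += previous_multi_modal_data.pixel_values
    shift = len(prev_ids)  # new-turn offsets are relative to the appended suffix
    merged.placeholder_ranges += [
        PlaceholderRange(r.offset + shift, r.length)
        for r in new_mm.placeholder_ranges
    ]
    merged.pixel_values += new_mm.pixel_values
    return prev_ids + new_ids, merged
```

Without this merge step, a second-turn prompt containing a first-turn image would carry fewer placeholder runs than images, which is the mismatch described above.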
types.py
- `ResponseTokens.multi_modal_data: Any | None`
- `TrajectoryStepTokens.multi_modal_data: NotRequired[Any]`
Both typed as `Any` to avoid a hard import dependency on `renderers`.
utils/response_utils.py
- `parse_response_tokens` propagates `multi_modal_data` onto the
`TrajectoryStepTokens` output when present.
utils/save_utils.py
- `is_json_serializable` accepts torch tensors / numpy arrays / renderer
sidecar dataclasses — these aren't JSON-native but survive the
prime-rl msgpack encoder, and trajectories carrying them are excluded
from the JSONL save at the orchestrator boundary (orchestrator passes
`exclude_keys={"trajectory"}` to `save_rollouts`).
- `_strip_intermediate_mm_data(trajectory)`: drop `tokens.multi_modal_data`
from all but the last step before transport. `bridge_to_next_turn`
merges prior turns' mm_data into the new turn, so naively shipping
mm_data on every step duplicates every image O(N²) bytes for an N-turn
rollout; only the last step's sidecar is read by the trainer.
utils/serve_utils.py
- Custom msgpack encoder gains torch tensor / numpy ndarray /
dataclass support. Tensors are encoded as
`{__torch_tensor__: True, dtype, shape, data}` with raw bytes payload.
Torch is imported lazily so text-only consumers don't pay for it.
- `decode_tensor_payload` / `walk_decode_tensors` rehydrate tensor
payloads on the receiving side.
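The wire format can be illustrated with a stdlib-only sketch (using `struct` in place of torch's raw buffer; the helper names and the float32-only handling are simplifications of what a real encoder would do):

```python
import struct

def encode_tensor_payload(values: list, shape: tuple) -> dict:
    # Mirrors the {__torch_tensor__: True, dtype, shape, data} scheme:
    # a flat little-endian byte buffer plus enough metadata to rebuild it.
    data = struct.pack(f"<{len(values)}f", *values)
    return {"__torch_tensor__": True, "dtype": "float32",
            "shape": list(shape), "data": data}

def decode_tensor_payload(payload: dict) -> tuple:
    n = len(payload["data"]) // 4  # float32 = 4 bytes per element
    values = list(struct.unpack(f"<{n}f", payload["data"]))
    return values, tuple(payload["shape"])
```

In the real encoder this dict would be produced inside the msgpack hook for unknown types and rehydrated on the receiving side.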
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adopts the renderers split-Protocol surface (MultimodalRenderer +
uniform ``RenderedTokens | None`` bridge return) and resolves CI:
- ``renderer_client.py``: drop ``as_rendered_tokens`` (no longer needed
since bridge returns ``RenderedTokens | None`` uniformly).
``_get_incremental_prompt_ids`` dispatches on
``isinstance(renderer, MultimodalRenderer)`` — multimodal path passes
``previous_multi_modal_data`` so prior-turn images carry forward into
``mm_placeholders``; text-only path uses the base ``Renderer.bridge``
signature unchanged. The previous always-pass design relied on every
text-only renderer accepting and ignoring the kwarg, which spread the
multimodal contract across the whole renderer registry.
- ``save_utils.py``: cast iteration variable to ``Mapping[str, Any]``
in ``_strip_intermediate_mm_data`` so ty narrows ``step.get("tokens")``
correctly (previously ty inferred ``_KT`` as ``Never`` after the
non-Mapping branch was excluded).
- ``serve_utils.py``: replace ``import torch`` with
``importlib.import_module("torch")`` inside ``decode_tensor_payload``
— torch is a soft runtime dep here (callers that pass
``to_torch=True`` are expected to have it installed). Static type
checkers in downstream consumers without torch installed don't fail
on unresolved-import anymore.
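The soft-dependency pattern looks roughly like this (the payload keys follow the tensor scheme described earlier; the torch calls are illustrative):

```python
import importlib
from typing import Any

def decode_tensor_payload(payload: dict, to_torch: bool = False) -> Any:
    # torch is resolved via importlib only when the caller asks for a
    # tensor, so there is no module-level import for static checkers
    # (or text-only consumers without torch installed) to trip over.
    if not to_torch:
        return payload
    torch = importlib.import_module("torch")
    dtype = getattr(torch, payload["dtype"])
    flat = torch.frombuffer(bytearray(payload["data"]), dtype=dtype)
    return flat.reshape(payload["shape"])
```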
- ``pyproject.toml`` + ``uv.lock``: pin renderers to feat/multimodal-vlm
branch until that PR lands and a new PyPI release is published. The
branch provides ``MultimodalRenderer``, ``MultiModalData``,
``PlaceholderRange``, ``RenderedTokens``, and the
``previous_multi_modal_data`` kwarg this branch consumes.
- ``tests/test_renderer_client.py``: ``_BridgeRenderer`` stub now
returns ``RenderedTokens`` (matches the new Protocol). Update list
equality / slice assertions to use ``result.token_ids`` since the
uniform bridge return shape is ``RenderedTokens | None``.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…irection
- renderer_client (bugbot #3, high): the previous `isinstance(renderer, MultimodalRenderer)` check was performed against the outer `renderer` parameter, which in production is a `RendererPool`. `RendererPool` was not a `Renderer` subclass, so the multimodal branch never fired and the PR's mm carry-forward was silently broken under pooled use. The renderers PR now has `RendererPool` implement the `Renderer` protocol structurally, and this side dispatches via the cached `is_multimodal(r)` helper, which works on either a bare renderer or a pool.
- save_utils (bugbot #2, medium): `is_json_serializable` previously whitelisted torch tensors and renderer dataclasses, but `make_serializable` has no handler for them; it would stringify to "tensor(...)" garbage if anything actually hit JSON. The whitelist worked only because the orchestrator excludes "trajectory" at the JSONL boundary. Restore the honest JSON-only contract and bypass the gate explicitly in `state_to_output` for `col == "trajectory"` (where msgpack handles tensors via its custom encoder).
- save_utils (bugbot #4, low): `_strip_intermediate_mm_data` was stripping `step["tokens"]["multi_modal_data"]` but not the duplicate at `step["response"].message.tokens.multi_modal_data`. The Pydantic `Response` serialization preserves it through msgpack via `model_dump()`, so the O(N²) bloat the function targets was only halved. Now strips both.
- Also drops the pool-vs-bare-renderer branching ladder via `_maybe_offload` (`asyncio.to_thread` iff pool); the pool's checkout is now an implementation detail of the pool itself.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
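The cached dispatch can be sketched as below. The Protocol body and helper names are illustrative stand-ins for the renderers API; the point is that the runtime-checkable structural check is paid once per class rather than on every call, and works on anything that implements the protocol structurally, pool or bare renderer:

```python
from functools import lru_cache
from typing import Any, Protocol, runtime_checkable

@runtime_checkable
class MultimodalRenderer(Protocol):
    """Illustrative stand-in for the renderers Protocol."""
    def bridge_to_next_turn(self, step: Any, previous_multi_modal_data: Any) -> Any: ...

@lru_cache(maxsize=None)
def _is_multimodal_cls(cls: type) -> bool:
    # The structural issubclass walk runs once per class, then is cached.
    return issubclass(cls, MultimodalRenderer)

def is_multimodal(renderer: Any) -> bool:
    # Hot-path check: a dict lookup after the first call for each class.
    return _is_multimodal_cls(type(renderer))
```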
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 397b8aa.
willccbb
requested changes
May 12, 2026
# MultiModalData, PlaceholderRange, as_rendered_tokens, and the
# `previous_multi_modal_data` kwarg on Renderer.bridge_to_next_turn
# that this branch consumes. Drop after merge + release.
renderers = { git = "https://github.com/PrimeIntellect-ai/renderers.git", branch = "feat/multimodal-vlm" }
Member
Merged the renderers PR, can we release renderers + replace this? Should never have a git source pinned in verifiers, PyPI won't respect it + we always want vf to be releasable
eligotts
reviewed
May 12, 2026
# earlier steps before transport. Bridge accesses mm_data only
# within the env-worker rollout loop, which has already
# finished by the time state_to_output runs.
value = _strip_intermediate_mm_data(value)
Contributor
this assumes pure extension right? how do we handle branching/compaction rollouts?

Summary
Threads the renderer's `MultiModalData` sidecar (pixel_values, placeholder ranges, mm_hashes) from the renderer through `/inference/v1/generate` to vLLM and onto trajectory tokens, so the new renderer-only multimodal path in prime-rl can drop the orchestrator-side `AutoProcessor` and image cache entirely.
Companion PRs:
What changes
Rollout (`renderer_client.py`)
- `_get_incremental_prompt_ids` now returns `RenderedTokens | None` (`token_ids` + `multi_modal_data`); text-only renderers' raw `list[int]` is normalized via `as_rendered_tokens` so callers unpack uniformly
- `_step_multi_modal_data(step)` recovers the prior turn's `mm_data` from the trajectory step and forwards it to `bridge_to_next_turn` so the new turn's placeholder runs cover every earlier-turn image
- `RendererClient.create_completion` unpacks the bridged result into `(prompt_ids, multi_modal_data)` and forwards both to `generate`
- `parse_response_tokens` copies `response.multi_modal_data` onto `ResponseTokens`
- `is_multimodal(renderer)` (cached `bool`) replaces the runtime-checkable `isinstance` walk on the hot path; pool methods are called directly (the pool implements the protocol structurally) and offloaded via `asyncio.to_thread` when the renderer is a pool
Types (`types.py`)
- `ResponseTokens.multi_modal_data: Any | None`
- `TrajectoryStepTokens.multi_modal_data: NotRequired[Any]`
Typed as `Any` to avoid a hard import dep on `renderers`.
Transport (`utils/serve_utils.py`)
- Tensors are encoded as `{__torch_tensor__: True, dtype, shape, data}` with a raw bytes payload (torch imported lazily, so text-only consumers don't pay for it)
- `decode_tensor_payload` / `walk_decode_tensors` rehydrate on the receiving side
Save (`utils/save_utils.py`)
- `is_json_serializable` keeps its honest JSON-only contract; `state_to_output` bypasses the gate for the `trajectory` column (msgpack-transported)
- `_strip_intermediate_mm_data` drops `tokens.multi_modal_data` from all but the last trajectory step before transport; bridge merges prior turns' `mm_data` into each new turn, so naively shipping it all is O(N²) bytes per N-turn rollout
Test plan
- no `mm_data` → fast path unchanged
- `mm_data` reaches vLLM via `multi_modal_data` and the trainer via `TrajectoryStepTokens["multi_modal_data"]`
- `multi_modal_data` accepted, no spurious "not JSON-serializable"; no O(N²) duplication