
feat(renderer-client): thread multimodal sidecar through rollout + transport#1346

Open
hallerite wants to merge 4 commits into main from feat/renderer-multimodal-passthrough

Conversation

hallerite (Member) commented May 11, 2026

Summary

Threads the renderer's MultiModalData sidecar (pixel_values, placeholder ranges, mm_hashes) through /inference/v1/generate to vLLM and onto trajectory tokens, so that the new renderer-only multimodal path in prime-rl can drop the orchestrator-side AutoProcessor and image cache entirely.

Companion PRs:

What changes

Rollout (renderer_client.py)

  • _get_incremental_prompt_ids now returns RenderedTokens | None (token_ids + multi_modal_data); text-only renderers' raw list[int] is normalized via as_rendered_tokens so callers unpack uniformly
  • _step_multi_modal_data(step) recovers the prior turn's mm_data from the trajectory step and forwards it to bridge_to_next_turn so the new turn's placeholder runs cover every earlier-turn image
  • RendererClient.create_completion unpacks the bridged result into (prompt_ids, multi_modal_data) and forwards both to generate
  • parse_response_tokens copies response.multi_modal_data onto ResponseTokens
  • Pool dispatch: is_multimodal(renderer) (cached bool) replaces the runtime-checkable isinstance walk on the hot path; pool methods are called directly (the pool implements the protocol structurally) and offloaded via asyncio.to_thread when the renderer is a pool

Types (types.py)

  • ResponseTokens.multi_modal_data: Any | None
  • TrajectoryStepTokens.multi_modal_data: NotRequired[Any]

Typed as Any to avoid a hard import dependency on renderers.

Transport (utils/serve_utils.py)

  • Custom msgpack encoder gains torch tensor / numpy ndarray / dataclass support; tensors encode as {__torch_tensor__: True, dtype, shape, data} with raw bytes payload (torch imported lazily — text-only consumers don't pay for it)
  • decode_tensor_payload / walk_decode_tensors rehydrate on the receiving side

Save (utils/save_utils.py)

  • is_json_serializable keeps its honest JSON-only contract; state_to_output bypasses the gate for the trajectory column (msgpack-transported)
  • _strip_intermediate_mm_data drops tokens.multi_modal_data from all but the last trajectory step before transport; the bridge merges prior turns' mm_data into each new turn, so naively shipping it is O(N²) bytes for an N-turn rollout

Test plan

  • Text-only RL: no mm_data → fast path unchanged
  • Multimodal RL: mm_data reaches vLLM via multi_modal_data and the trainer via TrajectoryStepTokens["multi_modal_data"]
  • Bridge: previous-turn images carried forward, placeholder count matches the combined token sequence
  • Transport: msgpack round-trip preserves tensor shapes / dtypes
  • JSONL save: trajectories with multi_modal_data accepted, no spurious "not JSON-serializable"; no O(N²) duplication

…ansport

Surfaces the renderer's MultiModalData sidecar (pixel_values, placeholder
ranges, mm_hashes) end-to-end so multimodal renderers can drive vLLM's
/inference/v1/generate `multi_modal_data` field and the downstream
trainer's `mm_kwargs` without going through the legacy chat-completions /
MITO multimodal path.

renderer_client.py
- `_step_multi_modal_data(step)`: recover the prior turn's mm_data from
  the trajectory step (parsed-tokens or raw-message side).
- `_get_incremental_prompt_ids` now returns `RenderedTokens | None` and
  forwards `previous_multi_modal_data` to `bridge_to_next_turn` so the
  new turn's placeholder runs cover every earlier-turn image. Without
  this carry-forward, vLLM sees mismatched placeholder counts and falls
  back to hash-cache lookup or errors. Text-only renderers' raw
  `list[int]` returns are normalized via `as_rendered_tokens`.
- `RendererClient.create_completion` unpacks the bridged result into
  `(prompt_ids, multi_modal_data)` and forwards both to `generate`.
- `parse_response_tokens`: copies `response.multi_modal_data` onto the
  emitted `ResponseTokens` so downstream consumers can read it.

types.py
- `ResponseTokens.multi_modal_data: Any | None`
- `TrajectoryStepTokens.multi_modal_data: NotRequired[Any]`
Both typed as `Any` to avoid a hard import dependency on `renderers`.

utils/response_utils.py
- `parse_response_tokens` propagates `multi_modal_data` onto the
  `TrajectoryStepTokens` output when present.

utils/save_utils.py
- `is_json_serializable` accepts torch tensors / numpy arrays / renderer
  sidecar dataclasses — these aren't JSON-native but survive the
  prime-rl msgpack encoder, and trajectories carrying them are excluded
  from the JSONL save at the orchestrator boundary (orchestrator passes
  `exclude_keys={"trajectory"}` to `save_rollouts`).
- `_strip_intermediate_mm_data(trajectory)`: drop `tokens.multi_modal_data`
  from all but the last step before transport. `bridge_to_next_turn`
  merges prior turns' mm_data into the new turn, so naively shipping
  mm_data on every step duplicates every image O(N²) bytes for an N-turn
  rollout; only the last step's sidecar is read by the trainer.

utils/serve_utils.py
- Custom msgpack encoder gains torch tensor / numpy ndarray /
  dataclass support. Tensors are encoded as
  `{__torch_tensor__: True, dtype, shape, data}` with raw bytes payload.
  Torch is imported lazily so text-only consumers don't pay for it.
- `decode_tensor_payload` / `walk_decode_tensors` rehydrate tensor
  payloads on the receiving side.
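The wire shape `{__torch_tensor__: True, dtype, shape, data}` can be illustrated torch-free with the stdlib `array` module. The real encoder hooks into msgpack and uses torch dtypes; this sketch shows only the payload round-trip and the recursive rehydration walk:

```python
import array
from typing import Any


def encode_tensor_payload(buf: array.array, shape: tuple[int, ...]) -> dict[str, Any]:
    # Same wire shape as the PR's encoder: raw bytes plus enough
    # metadata (dtype, shape) to rebuild the tensor on the other side.
    return {
        "__torch_tensor__": True,
        "dtype": buf.typecode,
        "shape": list(shape),
        "data": buf.tobytes(),
    }


def decode_tensor_payload(payload: dict[str, Any]) -> tuple[array.array, tuple[int, ...]]:
    buf = array.array(payload["dtype"])
    buf.frombytes(payload["data"])
    return buf, tuple(payload["shape"])


def walk_decode_tensors(obj: Any) -> Any:
    # Recursively rehydrate tensor payloads nested inside dicts/lists,
    # leaving everything else untouched.
    if isinstance(obj, dict):
        if obj.get("__torch_tensor__"):
            return decode_tensor_payload(obj)
        return {k: walk_decode_tensors(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [walk_decode_tensors(v) for v in obj]
    return obj
```

The walk is needed because the sidecar arrives nested inside trajectory dicts, so a single top-level decode hook would miss tensors buried several levels down.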

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adopts the renderers split-Protocol surface (MultimodalRenderer +
uniform ``RenderedTokens | None`` bridge return) and resolves CI:

- ``renderer_client.py``: drop ``as_rendered_tokens`` (no longer needed
  since bridge returns ``RenderedTokens | None`` uniformly).
  ``_get_incremental_prompt_ids`` dispatches on
  ``isinstance(renderer, MultimodalRenderer)`` — multimodal path passes
  ``previous_multi_modal_data`` so prior-turn images carry forward into
  ``mm_placeholders``; text-only path uses the base ``Renderer.bridge``
  signature unchanged. The previous always-pass design relied on every
  text-only renderer accepting and ignoring the kwarg, which spread the
  multimodal contract across the whole renderer registry.
- ``save_utils.py``: cast iteration variable to ``Mapping[str, Any]``
  in ``_strip_intermediate_mm_data`` so ty narrows ``step.get("tokens")``
  correctly (previously ty inferred ``_KT`` as ``Never`` after the
  non-Mapping branch was excluded).
- ``serve_utils.py``: replace ``import torch`` with
  ``importlib.import_module("torch")`` inside ``decode_tensor_payload``
  — torch is a soft runtime dep here (callers that pass
  ``to_torch=True`` are expected to have it installed). Static type
  checkers in downstream consumers without torch installed don't fail
  on unresolved-import anymore.
- ``pyproject.toml`` + ``uv.lock``: pin renderers to feat/multimodal-vlm
  branch until that PR lands and a new PyPI release is published. The
  branch provides ``MultimodalRenderer``, ``MultiModalData``,
  ``PlaceholderRange``, ``RenderedTokens``, and the
  ``previous_multi_modal_data`` kwarg this branch consumes.
- ``tests/test_renderer_client.py``: ``_BridgeRenderer`` stub now
  returns ``RenderedTokens`` (matches the new Protocol). Update list
  equality / slice assertions to use ``result.token_ids`` since the
  uniform bridge return shape is ``RenderedTokens | None``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
hallerite and others added 2 commits May 12, 2026 00:02
…irection

renderer_client (bugbot #3, high): the previous isinstance(renderer,
MultimodalRenderer) check was performed against the outer renderer
parameter, which in production is a RendererPool. RendererPool was not a
Renderer subclass, so the multimodal branch never fired and the PR's mm
carry-forward was silently broken under pooled use. Renderers PR now has
RendererPool implement the Renderer protocol structurally, and this side
dispatches via the cached is_multimodal(r) helper which works on either
a bare renderer or a pool.
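The cached-predicate dispatch can be sketched like this; `is_multimodal` is named in the PR, but the `MultimodalRenderer` Protocol body and the per-class caching strategy shown here are assumptions:

```python
from functools import lru_cache
from typing import Protocol, runtime_checkable


@runtime_checkable
class MultimodalRenderer(Protocol):
    # Illustrative stand-in for the renderers split-Protocol surface.
    def bridge_to_next_turn(self, *args, previous_multi_modal_data=None): ...


@lru_cache(maxsize=None)
def _is_multimodal_cls(cls: type) -> bool:
    # Structural Protocol checks walk attribute lookups on every call;
    # caching the answer per class keeps the hot rollout path cheap.
    return issubclass(cls, MultimodalRenderer)


def is_multimodal(renderer: object) -> bool:
    # Works on either a bare renderer or a pool, as long as the pool
    # implements the Protocol structurally.
    return _is_multimodal_cls(type(renderer))
```

This is also why the pool fix matters: a structural check answers for whatever object it is handed, so a pool that does not expose the renderer surface looks text-only regardless of what it wraps.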

save_utils (bugbot #2, medium): is_json_serializable previously
whitelisted torch tensors and renderer dataclasses, but make_serializable
has no handler for them — it would stringify to "tensor(...)" garbage if
anything actually hit JSON. The whitelist worked only because the
orchestrator excludes "trajectory" at the JSONL boundary. Restore the
honest JSON-only contract and bypass the gate explicitly in
state_to_output for col == "trajectory" (where msgpack handles tensors
via its custom encoder).

save_utils (bugbot #4, low): _strip_intermediate_mm_data was stripping
step["tokens"]["multi_modal_data"] but not the duplicate at
step["response"].message.tokens.multi_modal_data. The Pydantic Response
serialization preserves it through msgpack via model_dump(), so the
O(N²) bloat the function targets was only halved. Now strips both.

Also drop the pool-vs-bare-renderer branching ladder via _maybe_offload
(asyncio.to_thread iff pool); pool's checkout is now an implementation
detail of the pool itself.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 397b8aa.

@hallerite hallerite self-assigned this May 12, 2026
@hallerite hallerite requested review from eligotts and willccbb May 12, 2026 01:09
Comment thread pyproject.toml
# MultiModalData, PlaceholderRange, as_rendered_tokens, and the
# `previous_multi_modal_data` kwarg on Renderer.bridge_to_next_turn
# that this branch consumes. Drop after merge + release.
renderers = { git = "https://github.com/PrimeIntellect-ai/renderers.git", branch = "feat/multimodal-vlm" }
Member

Merged the renderers PR, can we release renderers + replace this? Should never have a git source pinned in verifiers, PyPI won't respect it + we always want vf to be releasable

# earlier steps before transport. Bridge accesses mm_data only
# within the env-worker rollout loop, which has already
# finished by the time state_to_output runs.
value = _strip_intermediate_mm_data(value)
Contributor
this assumes pure extension right? how do we handle branching/compaction rollouts?
