
Support RL VLM of rl_design branch #1598

Open
hhaAndroid wants to merge 3 commits into InternLM:rl_design from hhaAndroid:new_rl_vl

Conversation

@hhaAndroid (Collaborator)

No description provided.

@hhaAndroid (Collaborator, Author)

@claude review

1 similar comment
@HAOCHENYE (Collaborator)

@claude review


claude bot commented Mar 19, 2026

Claude: ## Review Summary

This PR adds vision-language (VL) model support to the RL training pipeline, specifically for Qwen3-VL. It introduces a new tokenize function, moves position_ids to RolloutState, changes pixel_values storage from torch.Tensor to np.ndarray (for Ray shared memory), and updates the training data preparation to handle multimodal inputs.

Issues

Critical

  • worker.py:539-549: When pixel_values is an np.ndarray (the new default type), the conversion branch is skipped and the model receives a numpy array instead of a torch.Tensor. Missing else branch to convert ndarray to tensor.

Warning

  • collator.py:21-27: fake_collator signature changed from list[DataItem] to list[RolloutState] (breaking change), and unconditional import ray added inside the function.
  • rl_colocate_trainer.py:527-531: VLM detection via "train_prompt_ids" in extra_fields is fragile — consider making it an explicit RolloutState field.

Nit

  • Misleading function name remove_consecutive_twos — should reflect its actual purpose (removing consecutive image context tokens).
  • New public classes in qwen3_vl_tokenize_fn.py missing type hints and docstrings per CLAUDE.md standards.
  • Import ordering in rl_data.py: the numpy import is not properly separated from the other imports.
  • Trailing whitespace on several lines.

Verdict

REQUEST_CHANGES

@hhaAndroid changed the title from "support vl" to "Support RL VLM of rl_design branch" on Mar 19, 2026
@hhaAndroid (Collaborator, Author)

@claude review

```diff
 def get_train_seq_ctx(
-    input_ids: torch.LongTensor, multimodal_train_info: dict | None = None, len_response_ids: int = 0
+    input_ids: torch.LongTensor,
+    position_ids: torch.Tensor,
```

Claude: Warning: Type hint says torch.Tensor but position_ids can be None — the body immediately checks if position_ids is not None. Should be torch.Tensor | None to match the actual contract.

Suggested change
```diff
-    position_ids: torch.Tensor,
+    position_ids: torch.Tensor | None,
```

Comment on lines +123 to +125
```python
@field_serializer("mm_info")
def _serialize_mm_info(self, value: MultimodalInfo | None) -> MultimodalInfo | None:
    return None
```

Claude: Warning: This serializer unconditionally drops mm_info during serialization. This means any model_dump() / model_dump_json() call silently loses multimodal data. The return type annotation also says MultimodalInfo | None but only None is ever returned.

Please add a brief docstring explaining why this is intentional (e.g., mm_info contains numpy arrays / Ray ObjectRefs that can't be serialized), similar to the _serialize_routed_experts docstring above. This prevents future contributors from seeing the always-None return and thinking it's a bug.

```python
from ..mllm_tokenize_fn.qwen3_vl_tokenize_fn import Qwen3VLTokenizeFnConfig, Qwen3VLTokenizeFunction, QwenVL3DataItem
from ..utils import replace_image_context_and_collect_media_data
```



Claude: Warning: Missing type hints per project standards (CLAUDE.md: "All new code must include type hints for function signatures"). tokens and img_context_id have no type annotations, and the return type is missing.

Suggested change
```diff
+def remove_consecutive_img_context_tokens(tokens: list[int], img_context_id: int) -> list[int]:
```
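
For reference, an implementation matching that signature could look like this sketch, assuming the intended behavior is to collapse each run of consecutive image-context tokens down to a single occurrence:

```python
def remove_consecutive_img_context_tokens(tokens: list[int], img_context_id: int) -> list[int]:
    """Collapse each run of consecutive img_context_id tokens to one occurrence."""
    result: list[int] = []
    for tok in tokens:
        # Skip a token only when it repeats the image-context token just kept.
        if tok == img_context_id and result and result[-1] == img_context_id:
            continue
        result.append(tok)
    return result
```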


```python
def fake_collator(instances: list[RolloutState], **kwargs):
    for rollout_state in instances:
        if hasattr(rollout_state, "mm_info") and rollout_state.mm_info is not None:
```

Claude: Nit: RolloutState is a Pydantic BaseModel, so mm_info is always present as an attribute (with default None). The hasattr check is redundant — you can simplify to just if rollout_state.mm_info is not None.

Also, this function is missing type hints and a return type annotation per CLAUDE.md standards.

Comment on lines +542 to +551
```python
if not isinstance(pixel_values, np.ndarray):
    assert isinstance(pixel_values, list), (
        f"pixel_values should be list of tensor, got {type(pixel_values)}"
    )
    pixel_values = [ray.get(pixel_obf) for pixel_obf in pixel_values]
    pixel_values = [torch.as_tensor(pixel_value) for pixel_value in pixel_values]
    pixel_values = torch.cat(pixel_values, dim=0)
    seq_ctx.pixel_values = pixel_values
else:
    raise NotImplementedError("The case where pixel_values is a numpy array is not implemented yet.")
```

Claude: Warning: The logic here only handles two cases: list (of Ray ObjectRefs) and np.ndarray (raises NotImplementedError). But a single Ray ObjectRef (not wrapped in a list) would fail the isinstance(pixel_values, list) assertion.

Tracing the flow: fake_collator stores each RolloutState's pixel_values as a single ray.put() ObjectRef. Unless there's an intermediate packing step that collects these into a list, seq_ctx.pixel_values arriving here as a single ObjectRef would hit the assertion error. Could you verify that the packing path always wraps ObjectRefs into a list before they reach this point?

Also, the NotImplementedError for the np.ndarray case seems like it should be straightforward to implement (just torch.as_tensor(pixel_values)) — is this intentionally deferred?
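
A sketch of how the two already-materialized cases could be unified (`to_tensor_pixel_values` is a hypothetical helper name; resolving Ray ObjectRefs via `ray.get` is assumed to have happened before this point):

```python
import numpy as np
import torch


def to_tensor_pixel_values(pixel_values) -> torch.Tensor:
    """Normalize already-materialized pixel_values to a single torch.Tensor."""
    if isinstance(pixel_values, np.ndarray):
        # The new default storage type; torch.as_tensor avoids a copy when possible.
        return torch.as_tensor(pixel_values)
    if isinstance(pixel_values, list):
        # List of per-sample arrays/tensors: convert each, then concatenate.
        return torch.cat([torch.as_tensor(p) for p in pixel_values], dim=0)
    raise TypeError(f"unsupported pixel_values type: {type(pixel_values)}")
```

This would also surface the single-ObjectRef case as an explicit `TypeError` rather than a misleading assertion failure.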


claude bot commented Mar 19, 2026

Claude: ## Review Summary

This PR adds VLM (Qwen3-VL) support to the RL training pipeline. Changes since the last review address several previously flagged issues (function naming, import ordering, fake_collator moved to ray_utils). The overall architecture — storing pixel_values as np.ndarray for Ray shared memory, moving position_ids to RolloutState, and adding a dedicated RL tokenize function — is sound.

Remaining Issues

Warning

  • [rl_colocate_trainer.py:68] position_ids typed as torch.Tensor but can be None — should be torch.Tensor | None.
  • [rl_data.py:123-125] _serialize_mm_info unconditionally returns None with no docstring explaining why. Needs documentation like the sibling _serialize_routed_experts.
  • [worker.py:542-551] pixel_values handling: single Ray ObjectRef (from fake_collator) may not be wrapped in a list before reaching this code — please verify the packing path.
  • [qwen3_vl_tokenize_fn.py:9] remove_consecutive_img_context_tokens missing type hints (CLAUDE.md requirement).

Nit

  • [ray_utils.py:191] Redundant hasattr check on Pydantic model — mm_info is always an attribute.

Verdict

REQUEST_CHANGES — mostly type hint and documentation gaps, plus one data flow question on the pixel_values path in worker.py.
