
Support RL VLM of rl_design branch #1598

Open
hhaAndroid wants to merge 3 commits into InternLM:rl_design from hhaAndroid:new_rl_vl

Conversation

@hhaAndroid (Collaborator)

No description provided.

@hhaAndroid (Collaborator, Author)

@claude review

1 similar comment
@HAOCHENYE (Collaborator)

@claude review


claude bot commented Mar 19, 2026

Claude: ## Review Summary

This PR adds vision-language (VL) model support to the RL training pipeline, specifically for Qwen3-VL. It introduces a new tokenize function, moves position_ids to RolloutState, changes pixel_values storage from torch.Tensor to np.ndarray (for Ray shared memory), and updates the training data preparation to handle multimodal inputs.

Issues

Critical

  • worker.py:539-549: When pixel_values is an np.ndarray (the new default type), the conversion branch is skipped and the model receives a numpy array instead of a torch.Tensor. Missing else branch to convert ndarray to tensor.

Warning

  • collator.py:21-27: fake_collator signature changed from list[DataItem] to list[RolloutState] (breaking change), and unconditional import ray added inside the function.
  • rl_colocate_trainer.py:527-531: VLM detection via "train_prompt_ids" in extra_fields is fragile — consider making it an explicit RolloutState field.

Nit

  • Misleading function name remove_consecutive_twos — should reflect its actual purpose (removing consecutive image context tokens).
  • New public classes in qwen3_vl_tokenize_fn.py missing type hints and docstrings per CLAUDE.md standards.
  • Import ordering in rl_data.py: the numpy import is not properly separated from the other imports.
  • Trailing whitespace on several lines.

Verdict

REQUEST_CHANGES

@hhaAndroid changed the title from "support vl" to "Support RL VLM of rl_design branch" on Mar 19, 2026
@hhaAndroid (Collaborator, Author)

@claude review

```diff
 def get_train_seq_ctx(
-    input_ids: torch.LongTensor, multimodal_train_info: dict | None = None, len_response_ids: int = 0
+    input_ids: torch.LongTensor,
+    position_ids: torch.Tensor,
```

Claude: Warning: Type hint says torch.Tensor but position_ids can be None — the body immediately checks if position_ids is not None. Should be torch.Tensor | None to match the actual contract.

Suggested change
```diff
-    position_ids: torch.Tensor,
+    position_ids: torch.Tensor | None,
```

Comment on lines +123 to +125
```python
@field_serializer("mm_info")
def _serialize_mm_info(self, value: MultimodalInfo | None) -> MultimodalInfo | None:
    return None
```

Claude: Warning: This serializer unconditionally drops mm_info during serialization. This means any model_dump() / model_dump_json() call silently loses multimodal data. The return type annotation also says MultimodalInfo | None but only None is ever returned.

Please add a brief docstring explaining why this is intentional (e.g., mm_info contains numpy arrays / Ray ObjectRefs that can't be serialized), similar to the _serialize_routed_experts docstring above. This prevents future contributors from seeing the always-None return and thinking it's a bug.

```python
from ..mllm_tokenize_fn.qwen3_vl_tokenize_fn import Qwen3VLTokenizeFnConfig, Qwen3VLTokenizeFunction, QwenVL3DataItem
from ..utils import replace_image_context_and_collect_media_data
```



Claude: Warning: Missing type hints per project standards (CLAUDE.md: "All new code must include type hints for function signatures"). tokens and img_context_id have no type annotations, and the return type is missing.

Suggested change
```diff
+def remove_consecutive_img_context_tokens(tokens: list[int], img_context_id: int) -> list[int]:
```
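
For reference, an implementation matching that signature could look like this sketch, assuming the intended behavior is to collapse each run of consecutive image-context tokens down to a single occurrence:

```python
def remove_consecutive_img_context_tokens(tokens: list[int], img_context_id: int) -> list[int]:
    """Collapse each run of consecutive img_context_id tokens to one occurrence."""
    result: list[int] = []
    for tok in tokens:
        # Skip a token only when it repeats the image-context token just kept.
        if tok == img_context_id and result and result[-1] == img_context_id:
            continue
        result.append(tok)
    return result
```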


```python
def fake_collator(instances: list[RolloutState], **kwargs):
    for rollout_state in instances:
        if hasattr(rollout_state, "mm_info") and rollout_state.mm_info is not None:
```

Claude: Nit: RolloutState is a Pydantic BaseModel, so mm_info is always present as an attribute (with default None). The hasattr check is redundant — you can simplify to just if rollout_state.mm_info is not None.

Also, this function is missing type hints and a return type annotation per CLAUDE.md standards.

Comment on lines +542 to +551
```python
if not isinstance(pixel_values, np.ndarray):
    assert isinstance(pixel_values, list), (
        f"pixel_values should be list of tensor, got {type(pixel_values)}"
    )
    pixel_values = [ray.get(pixel_obf) for pixel_obf in pixel_values]
    pixel_values = [torch.as_tensor(pixel_value) for pixel_value in pixel_values]
    pixel_values = torch.cat(pixel_values, dim=0)
    seq_ctx.pixel_values = pixel_values
else:
    raise NotImplementedError("The case where pixel_values is a numpy array is not implemented yet.")
```

Claude: Warning: The logic here only handles two cases: list (of Ray ObjectRefs) and np.ndarray (raises NotImplementedError). But a single Ray ObjectRef (not wrapped in a list) would fail the isinstance(pixel_values, list) assertion.

Tracing the flow: fake_collator stores each RolloutState's pixel_values as a single ray.put() ObjectRef. Unless there's an intermediate packing step that collects these into a list, seq_ctx.pixel_values arriving here as a single ObjectRef would hit the assertion error. Could you verify that the packing path always wraps ObjectRefs into a list before they reach this point?

Also, the NotImplementedError for the np.ndarray case seems like it should be straightforward to implement (just torch.as_tensor(pixel_values)) — is this intentionally deferred?
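
A sketch of how the two already-materialized cases could be unified (`to_tensor_pixel_values` is a hypothetical helper name; resolving Ray ObjectRefs via `ray.get` is assumed to have happened before this point):

```python
import numpy as np
import torch


def to_tensor_pixel_values(pixel_values) -> torch.Tensor:
    """Normalize already-materialized pixel_values to a single torch.Tensor."""
    if isinstance(pixel_values, np.ndarray):
        # The new default storage type; torch.as_tensor avoids a copy when possible.
        return torch.as_tensor(pixel_values)
    if isinstance(pixel_values, list):
        # List of per-sample arrays/tensors: convert each, then concatenate.
        return torch.cat([torch.as_tensor(p) for p in pixel_values], dim=0)
    raise TypeError(f"unsupported pixel_values type: {type(pixel_values)}")
```

This would also surface the single-ObjectRef case as an explicit `TypeError` rather than a misleading assertion failure.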


claude bot commented Mar 19, 2026

Claude: ## Review Summary

This PR adds VLM (Qwen3-VL) support to the RL training pipeline. Changes since the last review address several previously flagged issues (function naming, import ordering, fake_collator moved to ray_utils). The overall architecture — storing pixel_values as np.ndarray for Ray shared memory, moving position_ids to RolloutState, and adding a dedicated RL tokenize function — is sound.

Remaining Issues

Warning

  • [rl_colocate_trainer.py:68] position_ids typed as torch.Tensor but can be None — should be torch.Tensor | None.
  • [rl_data.py:123-125] _serialize_mm_info unconditionally returns None with no docstring explaining why. Needs documentation like the sibling _serialize_routed_experts.
  • [worker.py:542-551] pixel_values handling: single Ray ObjectRef (from fake_collator) may not be wrapped in a list before reaching this code — please verify the packing path.
  • [qwen3_vl_tokenize_fn.py:9] remove_consecutive_img_context_tokens missing type hints (CLAUDE.md requirement).

Nit

  • [ray_utils.py:191] Redundant hasattr check on Pydantic model — mm_info is always an attribute.

Verdict

REQUEST_CHANGES — mostly type hint and documentation gaps, plus one data flow question on the pixel_values path in worker.py.
