Skip to content

Add GPU placement validation before starting rollout engines#1934

Open
fmh66 wants to merge 1 commit into
THUDM:mainfrom
fmh66:fix-rollout-gpu-placement-validation
Open

Add GPU placement validation before starting rollout engines#1934
fmh66 wants to merge 1 commit into
THUDM:mainfrom
fmh66:fix-rollout-gpu-placement-validation

Conversation

@fmh66
Copy link
Copy Markdown

@fmh66 fmh66 commented May 21, 2026

Summary

Fixes #1896.

This PR adds an explicit GPU placement boundary check in ServerGroup.start_engines() before creating SGLang Ray actors.

Previously, when the placement group GPU slots and SGLang engine configuration were inconsistent, start_engines() could fail with an internal:

IndexError: list index out of range

at reordered_gpu_ids[gpu_index].

With this change, invalid configurations fail earlier with a ValueError that includes the relevant rollout configuration context.

Changes

  • Add a lightweight validation helper for rollout server group GPU indices.
  • Validate gpu_offset + num_engines * num_gpu_per_engine <= len(reordered_gpu_ids) before indexing placement group GPU IDs.
  • Include actionable fields in the error message:
    • worker_type
    • gpu_offset
    • num_gpus_per_engine
    • num_engines
    • required_gpu_slots
    • len(reordered_gpu_ids)
    • rollout_num_gpus
    • rollout_num_gpus_per_engine
  • Add unit coverage for valid configs, empty groups, and invalid configs with clear error context.

Testing

  • python -m pytest tests/test_rollout_validation.py
  • Manual validation of the helper behavior for both valid and invalid configurations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] test_qwen2.5_0.5B_ppo_critic_only_short.py fails with IndexError at start_engines after #1866 (multi-role megatron config + stale sglang defaults)

1 participant