
feat(eval): support NeMo-Gym multi-turn rollouts #2453

Open
taivu1998 wants to merge 1 commit into NVIDIA-NeMo:main from taivu1998:tdv/issue-1089-nemo-gym-eval

@taivu1998

Problem

Standalone eval currently supports only the single-turn environment step path. NeMo-Gym already owns the multi-turn rollout loop used by training, but examples/run_eval.py cannot route eval datasets through that Gym-backed rollout path.

Closes #1089.

Root Cause

The eval driver always loads eval datasets, creates a single scoring environment, generates one assistant response, and calls env.step(...). NeMo-Gym data and environments require a different path:

  • NemoGymDataset and nemo_gym_data_processor preserve Gym row metadata in extra_env_info
  • rl_collate_fn preserves the metadata and training-style fields expected by the Gym rollout helper
  • vLLM must expose OpenAI-compatible HTTP server URLs for Gym to call the policy
  • run_async_nemo_gym_rollout owns the multi-turn rollout, reward extraction, and result postprocessing
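
To illustrate why a single env.step(...) call cannot score a multi-turn task, here is a toy sketch. ToyEnv, toy_policy, single_turn_eval, and multi_turn_eval are hypothetical names invented for this example; they are not NeMo-RL or NeMo-Gym APIs. In this PR the real loop lives inside run_async_nemo_gym_rollout, not in the eval driver.

```python
from dataclasses import dataclass, field


@dataclass
class ToyEnv:
    """Hypothetical environment: pays reward 1.0 once the policy says 'done'."""

    max_turns: int = 3
    turns: int = field(default=0, init=False)

    def step(self, reply: str) -> tuple[str, float, bool]:
        """Return (observation, reward, terminated) for one assistant reply."""
        self.turns += 1
        if reply == "done":
            return "", 1.0, True
        return "keep going", 0.0, self.turns >= self.max_turns


def toy_policy(history: list[str]) -> str:
    """Hypothetical policy: needs one observation before it can finish."""
    return "done" if "keep going" in history else "thinking"


def single_turn_eval(env: ToyEnv) -> float:
    # One generation and one env.step call: later turns are never reached.
    _, reward, _ = env.step(toy_policy(["prompt"]))
    return reward


def multi_turn_eval(env: ToyEnv) -> float:
    # Loop until the environment terminates, feeding observations back in.
    history, total, terminated = ["prompt"], 0.0, False
    while not terminated:
        reply = toy_policy(history)
        obs, reward, terminated = env.step(reply)
        history += [reply, obs]
        total += reward
    return total
```

Under these toy assumptions the single-turn path stops after the first reply and never observes the reward that a second turn would earn.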

Changes

  • Add eval.rollout_mode with single_turn and nemo_gym modes.
  • Add eval config validation for Gym requirements:
    • max_rollout_turns: null
    • num_tests_per_prompt: 1
    • async vLLM engine with HTTP server exposure
    • top_k: null
    • no stop strings or stop token IDs
  • Route rollout_mode=nemo_gym through NemoGymDataset, rl_collate_fn, NemoGym, and run_async_nemo_gym_rollout.
  • Add mean_reward scoring for Gym eval while keeping pass@k available when Gym rewards are binary.
  • Preserve existing single-turn eval behavior and explicitly reject ignored multi-turn limits in single-turn mode.
  • Save structured JSON eval outputs instead of stringifying message logs and env metadata.
  • Add examples/configs/evals/nemo_gym_eval.yaml and update existing eval exemplars with the new required config keys.
  • Extend eval unit tests for rollout-mode validation, collator selection, mean-reward scoring, and Gym result saving.
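
The validation and scoring rules listed above could look roughly like the sketch below. It assumes a flat config dict, and the key names async_engine, expose_http_server, stop_strings, and stop_token_ids, as well as the function names, are illustrative guesses rather than the schema or code in this PR. The single-turn check is likewise one reading of "reject ignored multi-turn limits".

```python
from typing import Any

ROLLOUT_MODES = ("single_turn", "nemo_gym")


def validate_eval_config(cfg: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the config is valid.

    Key names are assumptions for illustration, not the PR's actual schema.
    """
    mode = cfg.get("rollout_mode")
    if mode not in ROLLOUT_MODES:
        return [f"rollout_mode must be one of {ROLLOUT_MODES}, got {mode!r}"]

    errors: list[str] = []
    if mode == "single_turn":
        # Single-turn eval would silently ignore a turn limit, so reject it.
        if cfg.get("max_rollout_turns") is not None:
            errors.append("max_rollout_turns is ignored in single_turn mode")
        return errors

    # mode == "nemo_gym": mirror the requirements from the bullet list.
    if cfg.get("max_rollout_turns") is not None:
        errors.append("max_rollout_turns must be null")
    if cfg.get("num_tests_per_prompt") != 1:
        errors.append("num_tests_per_prompt must be 1")
    if not (cfg.get("async_engine") and cfg.get("expose_http_server")):
        errors.append("vLLM must use the async engine and expose its HTTP server")
    if cfg.get("top_k") is not None:
        errors.append("top_k must be null")
    if cfg.get("stop_strings") or cfg.get("stop_token_ids"):
        errors.append("stop strings and stop token IDs must be unset")
    return errors


def score_rewards(rewards: list[float]) -> dict[str, float]:
    """Report mean reward always; add pass@1 only when rewards are all 0 or 1."""
    if not rewards:
        raise ValueError("no rewards to score")
    scores = {"mean_reward": sum(rewards) / len(rewards)}
    if all(r in (0.0, 1.0) for r in rewards):
        # With one sample per prompt, pass@1 is the fraction of solved prompts.
        scores["pass@1"] = scores["mean_reward"]
    return scores
```

Returning every problem at once, rather than raising on the first, is a design choice of this sketch; it lets a user fix a config in one pass.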

Validation

  • tests/unit/evals/test_eval.py: 16 passed
  • Focused eval YAML schema validation over all examples/configs/evals/*.yaml: passed
  • uvx ruff check examples/run_eval.py nemo_rl/evals/eval.py nemo_rl/environments/nemo_gym.py nemo_rl/experience/rollouts.py nemo_rl/data/__init__.py tests/unit/evals/test_eval.py: passed
  • uvx ruff format --check examples/run_eval.py nemo_rl/evals/eval.py nemo_rl/environments/nemo_gym.py nemo_rl/experience/rollouts.py nemo_rl/data/__init__.py tests/unit/evals/test_eval.py: passed
  • python -m py_compile on changed Python files: passed
  • git diff --check: passed

Note: the repo-native uv run path is blocked on this macOS host because /usr/local/bin/python3.13 reports an empty platform.mac_ver() to uv. The focused pytest run was therefore executed in a temporary Python 3.13 uv environment containing the repo's import dependencies. A temporary decord import stub, kept outside the repository, stood in for decord, which has no usable macOS arm64 CPython 3.13 wheel here; this test file does not exercise video decoding.

Signed-off-by: taivu1998 <46636857+taivu1998@users.noreply.github.com>
@copy-pr-bot

copy-pr-bot Bot commented May 10, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@taivu1998 taivu1998 marked this pull request as ready for review May 11, 2026 03:07
@taivu1998 taivu1998 requested review from a team as code owners May 11, 2026 03:07

Development

Successfully merging this pull request may close these issues.

Support multi-turn rollout in eval

2 participants