refactor: split train and val dataset in response dataset #1649

yuki-97 · 2025-12-17T16:41:13Z

Related issue: #1050

Split train and val in the built-in datasets to unblock multiple-dataset support.
Unify the built-in datasets under nemo_rl/data/datasets/response_datasets/ into a similar format.
Remove the duplicated dataset names clevr_cogent and openmathinstruct2.

New Param
Add a new param split_validation_size to handle the case where one dataset is used for both training and validation (e.g., OpenMathInstruct-2 in examples/configs/grpo_math_1B.yaml).

If data.train.split_validation_size > 0 and data.validation is None, part of the training dataset is used as the validation dataset.
If data.train.split_validation_size > 0 and data.validation is not None, both that part of the training dataset and the provided validation dataset are used as the validation dataset.
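
The two rules above can be sketched in plain Python. This is only an illustration, not the code in this PR: the helper name `split_train_validation`, the list-of-rows representation, and the seeded shuffle are assumptions, and it further assumes that the rows carved out for validation are removed from the training set, which the description does not state explicitly.

```python
import random


def split_train_validation(train_rows, split_validation_size, seed=42,
                           provided_val_rows=None):
    """Hypothetical helper (not a NeMo RL API) mirroring the two rules above."""
    if not 0 <= split_validation_size < 1:
        raise ValueError("split_validation_size must be in [0, 1)")

    train_rows = list(train_rows)
    # Start from the provided validation dataset, if any (data.validation).
    val_rows = list(provided_val_rows or [])

    if split_validation_size > 0:
        # Seeded shuffle so the same seed always yields the same split.
        shuffled = list(train_rows)
        random.Random(seed).shuffle(shuffled)
        n_val = int(len(shuffled) * split_validation_size)
        # Assumption: the carved-out rows no longer appear in training.
        carved_out, train_rows = shuffled[:n_val], shuffled[n_val:]
        # Validation = part of the training data + provided validation data.
        val_rows = carved_out + val_rows

    return train_rows, val_rows
```

With 100 training rows and a split size of 0.25, this sketch yields 75 training rows and 25 validation rows; passing `provided_val_rows` appends those rows to the validation set, and a split size of 0 leaves both inputs unchanged.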

Usage

data:
  # other data settings, see `examples/configs/sft.yaml` for more details
  ...
  # dataset settings
  train:
    # this dataset will override input_key and use the default values for other vars
    data_path: /path/to/local/train_dataset.jsonl  # local file or hf_org/hf_dataset_name (HuggingFace)
    input_key: question
    split: train  # used for HuggingFace datasets
    split_validation_size: 0.05  # use 5% of the training data as validation data
    seed: 42  # seed for train/validation split when split_validation_size > 0
  validation:
    # this dataset will use the default values for other vars except data_path
    data_path: /path/to/local/val_dataset.jsonl
  default:
    # the values below are used as defaults when a dataset doesn't specify them
    dataset_name: ResponseDataset
    input_key: input
    output_key: output
    prompt_file: null
    system_prompt_file: null
    processor: "sft_processor"
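
The comments in the config describe a simple overlay: each dataset block starts from `default` and overrides only the keys it sets itself. A minimal sketch of that lookup, assuming the config has already been loaded into a plain dict; `resolve_dataset_config` is a hypothetical name for illustration, not a function from this PR.

```python
def resolve_dataset_config(data_cfg, section):
    """Hypothetical helper: overlay data[section] on top of data["default"]."""
    resolved = dict(data_cfg.get("default") or {})
    # Keys set on the dataset itself win over the defaults.
    resolved.update(data_cfg.get(section) or {})
    return resolved


# The usage example above, written as a dict (only the keys needed here).
data_cfg = {
    "train": {
        "data_path": "/path/to/local/train_dataset.jsonl",
        "input_key": "question",
    },
    "validation": {"data_path": "/path/to/local/val_dataset.jsonl"},
    "default": {
        "dataset_name": "ResponseDataset",
        "input_key": "input",
        "output_key": "output",
    },
}

train_cfg = resolve_dataset_config(data_cfg, "train")
val_cfg = resolve_dataset_config(data_cfg, "validation")
```

Here `train_cfg["input_key"]` is the overridden `question`, while `val_cfg["input_key"]` falls back to the default `input`; both inherit `dataset_name` and `output_key`.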

Migration Guide

For datasets loaded from a local JSONL file or HuggingFace (openai_format and ResponseDataset)

# old
data:
  dataset_name: ResponseDataset
  train_data_path: <PathToTrainingDataset>
  val_data_path: <PathToValidationDataset>
  input_key: <QuestionKey>
  output_key: <AnswerKey>
  train_split: <TrainSplit>
  val_split: <ValSplit>

# new
data:
  # other data settings, see `examples/configs/sft.yaml` for more details
  ...
  # dataset settings
  train:
    # this dataset will override input_key and use the default values for other vars
    data_path: /path/to/local/train_dataset.jsonl  # local file or hf_org/hf_dataset_name (HuggingFace)
    input_key: question
    split: train  # used for HuggingFace datasets
    split_validation_size: 0.05  # use 5% of the training data as validation data
    seed: 42  # seed for train/validation split when split_validation_size > 0
  validation:
    # this dataset will use the default values for other vars except data_path
    data_path: /path/to/local/val_dataset.jsonl
  default:
    # the values below are used as defaults when a dataset doesn't specify them
    dataset_name: ResponseDataset
    input_key: input
    output_key: output
    prompt_file: null
    system_prompt_file: null
    processor: "sft_processor"
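
For anyone with many configs to convert, the old-to-new mapping can be sketched as a small dict transform. This is a rough illustration under assumptions, not a tool shipped with this PR: `migrate_flat_data_config` is a hypothetical name, and placing `dataset_name`, `input_key`, and `output_key` under `default` is one reasonable reading of the example above rather than the only valid layout.

```python
def migrate_flat_data_config(old_data_cfg):
    """Hypothetical helper: turn the old flat `data` block into the nested layout."""
    remaining = dict(old_data_cfg)
    nested = {"train": {}, "validation": {}, "default": {}}

    # old key -> (new section, new key); this mapping is an assumption.
    moves = {
        "train_data_path": ("train", "data_path"),
        "val_data_path": ("validation", "data_path"),
        "train_split": ("train", "split"),
        "val_split": ("validation", "split"),
        "dataset_name": ("default", "dataset_name"),
        "input_key": ("default", "input_key"),
        "output_key": ("default", "output_key"),
    }
    for old_key, (section, new_key) in moves.items():
        if old_key in remaining:
            nested[section][new_key] = remaining.pop(old_key)

    # Other data settings stay at the top level; empty sections are dropped.
    remaining.update({name: cfg for name, cfg in nested.items() if cfg})
    return remaining


old = {
    "dataset_name": "ResponseDataset",
    "train_data_path": "train.jsonl",
    "val_data_path": "val.jsonl",
    "input_key": "question",
    "output_key": "answer",
    "train_split": "train",
    "val_split": "validation",
    "shuffle": True,  # stand-in for any other data setting
}
new = migrate_flat_data_config(old)
```

The result keeps `shuffle` at the top level and moves the paths and splits under `train` and `validation`. Built-in datasets that changed names or splits (listed below) still need manual edits.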

For built-in datasets that need changes

DAPOMath17K

# old
data:
  dataset_name: DAPOMath17K

# new
data:
  train:
    dataset_name: DAPOMath17K
  validation:
    dataset_name: DAPOMathAIME2024

DeepScaler

# old
data:
  dataset_name: DeepScaler

# new
data:
  train:
    dataset_name: DeepScaler
  validation:
    dataset_name: AIME2024
    repeat: 16

clevr-cogent

# old
data:
  dataset_name: clevr-cogent
  split: trainA

# new
data:
  train:
    dataset_name: clevr-cogent
    split: train
  validation:
    dataset_name: clevr-cogent
    split: valA

HelpSteer3

# old
data:
  dataset_name: HelpSteer3
  split: preference

# new
data:
  train:
    dataset_name: HelpSteer3
    split: train
  validation:
    dataset_name: HelpSteer3
    split: validation

For other built-in datasets, you only need to move them under train and validation and set the correct split, e.g.

# old
data:
  dataset_name: "squad"

# new
data:
  train:
    dataset_name: "squad"
    split: "train"
  validation:
    dataset_name: "squad"
    split: "validation"

Test Result

algo	result
sft
sft-vlm
grpo
grpo-vlm
distillation

Summary by CodeRabbit

Release Notes

New Features
- Added support for separate training and validation dataset configuration with new train and validation blocks in data settings
- Introduced new datasets: AIME2024, DAPOMath variants with automatic validation split capability
- Enhanced dataset framework with improved flexibility for processor selection and environment configuration
Documentation
- Updated guides with new data configuration structure and examples for train/validation dataset setup
- Clarified supported dataset listings and configuration format for multi-dataset training scenarios
Bug Fixes & Improvements
- Improved dataset loading workflow with better support for shared datasets and per-task processing
- Streamlined configuration migration from flat to nested dataset structure across all example configs

terrykong

some initial thoughts

since it's a big PR @ashors1 could you help as a second review?

examples/configs/recipes/llm/sft-llama3.1-8b-1n8g-fsdp2tp2.yaml

examples/configs/recipes/llm/sft-llama3.1-8b-1n8g-megatron.yaml

nemo_rl/data/__init__.py

examples/run_grpo.py

nemo_rl/data/datasets/response_datasets/__init__.py

nemo_rl/data/datasets/response_datasets/oasst.py

tests/unit/data/datasets/test_response_dataset.py

nemo_rl/data/datasets/response_datasets/deepscaler.py

nemo_rl/data/datasets/response_datasets/tulu3.py

Signed-off-by: Yuki Huang <yukih@nvidia.com>

update all run_xxx and recipe of response dataset to use default

fix missing default

yuki-97 · 2026-01-20T16:26:34Z

~~Running nightly test now and need some minor fix, will push it later.~~
Done, all good.

yuki-97 added the CI:L0 Run doctests and unit tests label Dec 17, 2025

yuki-97 force-pushed the yukih/split-train-val-dataset branch from f8dcf7c to 2f78c84 Compare December 18, 2025 05:05

yuki-97 added CI:L0 Run doctests and unit tests and removed CI:L0 Run doctests and unit tests labels Dec 18, 2025

yuki-97 force-pushed the yukih/split-train-val-dataset branch from 2f78c84 to fd448be Compare December 18, 2025 05:23

github-actions bot added the documentation Improvements or additions to documentation label Dec 18, 2025

yuki-97 force-pushed the yukih/split-train-val-dataset branch from 2aa7ce0 to 6a093d1 Compare December 18, 2025 07:08

yuki-97 added CI:L0 Run doctests and unit tests and removed CI:L0 Run doctests and unit tests labels Dec 18, 2025

terrykong reviewed Dec 18, 2025

yuki-97 changed the title ~~feat: split train val dataset and refactor for response dataset~~ refactor: split train val dataset in response dataset Dec 18, 2025

yuki-97 changed the title ~~refactor: split train val dataset in response dataset~~ refactor: split train and val dataset in response dataset Dec 18, 2025

yuki-97 commented Dec 18, 2025

tests/unit/data/datasets/test_response_dataset.py