
Conversation

@yuki-97
Contributor

@yuki-97 yuki-97 commented Dec 17, 2025

Related issue: #1050

  1. Split train and validation in the built-in datasets, which unblocks multiple-dataset support.
  2. Unify the built-in datasets under nemo_rl/data/datasets/response_datasets/ into a common format.
  3. Remove the duplicated dataset names clevr_cogent and openmathinstruct2.

New Param
Adds a new parameter, split_validation_size, for the case where one dataset is used for both training and validation (e.g., OpenMathInstruct-2 in examples/configs/grpo_math_1B.yaml).

  1. If data.train.split_validation_size > 0 and data.validation is None, part of the training dataset is used as the validation dataset.
  2. If data.train.split_validation_size > 0 and data.validation is not None, the validation dataset is the held-out part of the training dataset combined with the provided validation dataset.
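
The two rules above can be sketched in plain Python. This is a minimal illustration, not the code in this PR: the helper name split_train_validation is hypothetical, and it assumes the held-out rows are removed from the training set and that the split is a seeded shuffle.

```python
import random


def split_train_validation(train_rows, validation_rows=None,
                           split_validation_size=0.0, seed=42):
    """Hypothetical helper mirroring the two split_validation_size rules.

    Assumption: held-out rows are removed from the training set.
    """
    if not 0.0 <= split_validation_size < 1.0:
        raise ValueError("split_validation_size must be in [0, 1)")
    provided = list(validation_rows) if validation_rows is not None else []
    rows = list(train_rows)
    if split_validation_size == 0:
        # No split requested: validation is whatever was provided (maybe empty).
        return rows, provided
    # Seeded shuffle so the same seed always yields the same split.
    random.Random(seed).shuffle(rows)
    n_val = round(len(rows) * split_validation_size)
    # Rule 1: held-out rows alone; rule 2: held-out rows plus provided rows.
    return rows[n_val:], rows[:n_val] + provided


# Rule 1: no validation dataset provided, 5% of training is held out.
train, val = split_train_validation(range(100), None, 0.05, seed=42)
print(len(train), len(val))  # 95 5

# Rule 2: a validation dataset is provided, so both parts are combined.
train2, val2 = split_train_validation(range(100), ["extra_a", "extra_b"], 0.05)
print(len(val2))  # 7
```

Fixing the seed keeps the validation subset stable across runs, which is the role the seed field plays in the config.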

Usage

data:
  # other data settings, see `examples/configs/sft.yaml` for more details
  ...
  # dataset settings
  train:
    # this dataset will override input_key and use the default values for other vars
    data_path: /path/to/local/train_dataset.jsonl  # local file or hf_org/hf_dataset_name (HuggingFace)
    input_key: question
    split: train  # used for HuggingFace datasets
    split_validation_size: 0.05  # use 5% of the training data as validation data
    seed: 42  # seed for train/validation split when split_validation_size > 0
  validation:
    # this dataset will use the default values for other vars except data_path
    data_path: /path/to/local/val_dataset.jsonl
  default:
    # will use below vars as default values if dataset doesn't specify it
    dataset_name: ResponseDataset
    input_key: input
    output_key: output
    prompt_file: null
    system_prompt_file: null
    processor: "sft_processor"
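
The fallback behaviour described in the comments above (each dataset block overrides only the keys it sets and takes the rest from default) can be sketched as a dictionary merge. This is an illustration under stated assumptions: resolve_dataset_config is a hypothetical name, not a function from nemo_rl, and the real loader may resolve defaults differently.

```python
def resolve_dataset_config(data_cfg, name):
    """Hypothetical helper: merge data_cfg['default'] under data_cfg[name]."""
    defaults = data_cfg.get("default") or {}
    specific = data_cfg.get(name)
    if specific is None:
        # The block (e.g. validation) was not configured at all.
        return None
    # Keys set on the dataset win; everything else comes from default.
    return {**defaults, **specific}


data_cfg = {
    "train": {
        "data_path": "/path/to/local/train_dataset.jsonl",
        "input_key": "question",
        "split": "train",
        "split_validation_size": 0.05,
        "seed": 42,
    },
    "validation": {"data_path": "/path/to/local/val_dataset.jsonl"},
    "default": {
        "dataset_name": "ResponseDataset",
        "input_key": "input",
        "output_key": "output",
        "prompt_file": None,
        "system_prompt_file": None,
        "processor": "sft_processor",
    },
}

train_cfg = resolve_dataset_config(data_cfg, "train")
val_cfg = resolve_dataset_config(data_cfg, "validation")
print(train_cfg["input_key"], val_cfg["input_key"])  # question input
```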

Migration Guide

  1. For datasets loaded from a local JSONL file or HuggingFace (openai_format and ResponseDataset)
    # old
    data:
      dataset_name: ResponseDataset
      train_data_path: <PathToTrainingDataset>
      val_data_path: <PathToValidationDataset>
      input_key: <QuestionKey>
      output_key: <AnswerKey>
      train_split: <TrainSplit>
      val_split: <ValSplit>
    
    # new
    data:
      # other data settings, see `examples/configs/sft.yaml` for more details
      ...
      # dataset settings
      train:
        # this dataset will override input_key and use the default values for other vars
        data_path: /path/to/local/train_dataset.jsonl  # local file or hf_org/hf_dataset_name (HuggingFace)
        input_key: question
        split: train  # used for HuggingFace datasets
        split_validation_size: 0.05  # use 5% of the training data as validation data
        seed: 42  # seed for train/validation split when split_validation_size > 0
      validation:
        # this dataset will use the default values for other vars except data_path
        data_path: /path/to/local/val_dataset.jsonl
      default:
        # will use below vars as default values if dataset doesn't specify it
        dataset_name: ResponseDataset
        input_key: input
        output_key: output
        prompt_file: null
        system_prompt_file: null
        processor: "sft_processor"
  2. For built-in datasets whose configuration changes
    1. DAPOMath17K
      # old
      data:
        dataset_name: DAPOMath17K
      
      # new
      data:
        train:
          dataset_name: DAPOMath17K
        validation:
          dataset_name: DAPOMathAIME2024
    2. DeepScaler
      # old
      data:
        dataset_name: DeepScaler
      
      # new
      data:
        train:
          dataset_name: DeepScaler
        validation:
          dataset_name: AIME2024
          repeat: 16
    3. clevr-cogent
      # old
      data:
        dataset_name: clevr-cogent
        split: trainA
      
      # new
      data:
        train:
          dataset_name: clevr-cogent
          split: train
        validation:
          dataset_name: clevr-cogent
          split: valA
    4. HelpSteer3
      # old
      data:
        dataset_name: HelpSteer3
        split: preference
      # new
      data:
        train:
          dataset_name: HelpSteer3
          split: train
        validation:
          dataset_name: HelpSteer3
          split: validation
  3. For other built-in datasets, you only need to move dataset_name under train and validation and set the correct split for each, e.g.
    # old
    data:
      dataset_name: "squad"
    
    # new
    data:
      train:
        dataset_name: "squad"
        split: "train"
      validation:
        dataset_name: "squad"
        split: "validation"
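
For configs of the first kind (flat ResponseDataset keys), the move from the old layout to the new one could be scripted roughly as follows. This is a sketch only: migrate_flat_response_config is a hypothetical helper that does not ship with this PR, and it assumes val_split maps to validation.split, which the guide above does not state explicitly.

```python
def migrate_flat_response_config(old):
    """Hypothetical converter from the old flat data config to the nested one."""
    # Keys shared by both datasets move into the default block.
    default = {k: old[k] for k in ("dataset_name", "input_key", "output_key")
               if k in old}
    train = {"data_path": old["train_data_path"]}
    if "train_split" in old:
        train["split"] = old["train_split"]
    new = {"train": train, "default": default}
    if old.get("val_data_path"):
        validation = {"data_path": old["val_data_path"]}
        if "val_split" in old:
            # Assumption: the validation block accepts the same split key.
            validation["split"] = old["val_split"]
        new["validation"] = validation
    return new


old_cfg = {
    "dataset_name": "ResponseDataset",
    "train_data_path": "/path/to/local/train_dataset.jsonl",
    "val_data_path": "/path/to/local/val_dataset.jsonl",
    "input_key": "question",
    "output_key": "answer",
    "train_split": "train",
}
new_cfg = migrate_flat_response_config(old_cfg)
print(sorted(new_cfg))  # ['default', 'train', 'validation']
```

Settings outside the dataset block, such as those in examples/configs/sft.yaml, are left out of this sketch and would need to be carried over unchanged.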

Test Result

Result plots (images not preserved) were attached for each algorithm: sft, sft-vlm, grpo, grpo-vlm, distillation.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for separate training and validation dataset configuration with new train and validation blocks in data settings
    • Introduced new datasets: AIME2024, DAPOMath variants with automatic validation split capability
    • Enhanced dataset framework with improved flexibility for processor selection and environment configuration
  • Documentation

    • Updated guides with new data configuration structure and examples for train/validation dataset setup
    • Clarified supported dataset listings and configuration format for multi-dataset training scenarios
  • Bug Fixes & Improvements

    • Improved dataset loading workflow with better support for shared datasets and per-task processing
    • Streamlined configuration migration from flat to nested dataset structure across all example configs


@yuki-97 yuki-97 added the CI:L0 Run doctests and unit tests label Dec 17, 2025
@yuki-97 yuki-97 force-pushed the yukih/split-train-val-dataset branch from f8dcf7c to 2f78c84 Compare December 18, 2025 05:05
@yuki-97 yuki-97 added CI:L0 Run doctests and unit tests and removed CI:L0 Run doctests and unit tests labels Dec 18, 2025
@yuki-97 yuki-97 force-pushed the yukih/split-train-val-dataset branch from 2f78c84 to fd448be Compare December 18, 2025 05:23
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Dec 18, 2025
@yuki-97 yuki-97 force-pushed the yukih/split-train-val-dataset branch from 2aa7ce0 to 6a093d1 Compare December 18, 2025 07:08
@yuki-97 yuki-97 added CI:L0 Run doctests and unit tests and removed CI:L0 Run doctests and unit tests labels Dec 18, 2025
Contributor

@terrykong terrykong left a comment


some initial thoughts

since it's a big PR @ashors1 could you help as a second review?

@yuki-97 yuki-97 changed the title feat: split train val dataset and refactor for response dataset refactor: split train val dataset in response dataset Dec 18, 2025
@yuki-97 yuki-97 changed the title refactor: split train val dataset in response dataset refactor: split train and val dataset in response dataset Dec 18, 2025
@yuki-97 yuki-97 force-pushed the yukih/split-train-val-dataset branch 2 times, most recently from 6b34af3 to fea258d Compare December 19, 2025 15:50
@yuki-97 yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L0 Run doctests and unit tests labels Dec 19, 2025
Signed-off-by: Yuki Huang <yukih@nvidia.com>

update all run_xxx and recipe of response dataset to use default

fix missing default

@yuki-97 yuki-97 force-pushed the yukih/split-train-val-dataset branch from f9def0d to ec862a3 Compare January 20, 2026 10:40
@yuki-97
Contributor Author

yuki-97 commented Jan 20, 2026

Running the nightly tests now; they need some minor fixes, which I will push later.
Done, all good.

@yuki-97 yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Jan 21, 2026
@terrykong terrykong enabled auto-merge (squash) January 21, 2026 23:17
@terrykong terrykong merged commit 967d2c8 into main Jan 22, 2026
54 of 58 checks passed
@terrykong terrykong deleted the yukih/split-train-val-dataset branch January 22, 2026 00:17

Labels

CI:L1 Run doctests, unit tests, and functional tests documentation Improvements or additions to documentation

6 participants