This repository contains the source code for the experiments in the paper "Not All Rollouts are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning" by Yixuan Even Xu*, Yash Savani*, Fei Fang, and Zico Kolter. We implemented GRPO-PODS (Policy Optimization with Down-Sampling) and compared its performance with vanilla GRPO. Our implementation is based on Unsloth and OpenR1.
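To give a rough idea of what down-sampling rollouts can look like, here is a minimal sketch, not the code in this repository. It assumes one possible selection rule: keep the `m` rollouts whose rewards have the largest variance. The function name `max_variance_downsample` and the example rewards are hypothetical and chosen for illustration only; see `train.py` for the actual implementation.

```python
from statistics import pvariance


def max_variance_downsample(rewards, m):
    """Return sorted indices of m rollouts whose rewards have maximal variance.

    Illustrative assumption: the selection rule is "maximize reward variance".
    A variance-maximizing subset of size m can always be built from the k
    lowest and the (m - k) highest rewards for some k, so only m + 1
    candidate subsets need to be checked after sorting.
    """
    n = len(rewards)
    if not 0 < m <= n:
        raise ValueError("m must satisfy 0 < m <= len(rewards)")

    # Indices ordered by reward, ascending (stable for ties).
    order = sorted(range(n), key=lambda i: rewards[i])

    best_var, best_subset = -1.0, None
    for k in range(m + 1):
        # k lowest-reward rollouts plus (m - k) highest-reward rollouts.
        subset = order[:k] + order[n - (m - k):]
        var = pvariance([rewards[i] for i in subset])
        if var > best_var:
            best_var, best_subset = var, subset
    return sorted(best_subset)


if __name__ == "__main__":
    # Six hypothetical rollouts; keep four of them for the policy update.
    rewards = [0.0, 1.0, 0.0, 1.0, 0.5, 0.5]
    print(max_variance_downsample(rewards, 4))
```

In this sketch, only the selected rollouts would be passed to the GRPO update. The remaining rollouts are still generated but are discarded before the update.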
- To install the relevant dependencies, install `uv` and run `uv sync`.
- To re-run the single-GPU experiments, edit `config/train.yaml`, then run `mkdir -p checkpoints` followed by `uv run python3 train.py`.
- To evaluate the saved checkpoints of a single-GPU experiment run, edit `config/test.yaml` and run `uv run python3 evaluate-run.py`.
- To run the multi-GPU experiments, `cd` into the `open-r1` directory, follow the install instructions in the `README.md` file within that directory, and then run `bash exp.sh`. The data can be collected and downloaded from the corresponding wandb runs and plotted using the plotting scripts.
- To generate the plots in the paper, run `uv run python3 scripts/plot.py` and then `uv run python3 scripts/plot-h100s.py`.
This repository's source code is available under the Apache-2.0 License.
If you use this code in your research, please cite our paper:
@article{xu2025not,
title={Not All Rollouts are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning},
author={Xu, Yixuan Even and Savani, Yash and Fang, Fei and Kolter, Zico},
journal={arXiv preprint arXiv:2504.13818},
year={2025}
}

For any questions or issues, please contact us via email:
- Yixuan Even Xu: yixuanx@cs.cmu.edu
- Yash Savani: ysavani@cs.cmu.edu