Stepwise DPO Implementation

This project implements stepwise Direct Preference Optimization (DPO) for math problem solving using the PRM800K dataset.
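At its core, DPO optimizes a pairwise preference loss over a chosen and a rejected continuation; in a stepwise setting the pairs are built from individual solution steps. The exact stepwise pairing used here lives in the preprocessing and training scripts, but as a minimal PyTorch sketch, the standard per-pair DPO loss looks like this:

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss on per-sequence log-probabilities.

    Each argument is a 1-D tensor of summed token log-probs for the
    chosen / rejected step under the policy or the frozen reference model.
    """
    # Log-ratio of policy vs. reference for each side of the pair.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected rewards.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()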

Installation

git clone <your-repo-url>
cd RewardChain
pip install -r requirements.txt
export PYTHONPATH="$(pwd)/src:${PYTHONPATH:-}"

Quickstart

1. Data Preprocessing

python src/scripts/process_data.py --split train --output_dir data/processed/prm800k --max_samples 1000
python src/scripts/process_data.py --split test --output_dir data/processed/prm800k --max_samples 100
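The JSONL schema written by process_data.py is defined in that script. As an illustration only, one common way to derive stepwise preference pairs from PRM800K-style step ratings (each candidate step is rated -1, 0, or +1) is sketched below; the field names here are hypothetical, not taken from the script:

def build_step_pairs(problem, prefix_steps, candidates):
    """Hypothetical sketch: pair positively and negatively rated steps
    that share the same problem and solution prefix.

    `candidates` is a list of (step_text, rating) with rating in {-1, 0, 1}.
    """
    prompt = problem + "\n" + "\n".join(prefix_steps)
    good = [step for step, rating in candidates if rating == 1]
    bad = [step for step, rating in candidates if rating == -1]
    return [
        {"prompt": prompt, "chosen": g, "rejected": b}
        for g in good
        for b in bad
    ]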

2. Training

python src/scripts/train.py \
    --train_data data/processed/prm800k/train.jsonl \
    --output_dir ./dpo_model \
    --model_name microsoft/DialoGPT-medium
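Assuming train.py saves the fine-tuned model with save_pretrained (an assumption, not verified against the script), the --output_dir directory is a standard Hugging Face checkpoint and can be sanity-checked with a quick generation:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumes ./dpo_model holds a save_pretrained() checkpoint.
tokenizer = AutoTokenizer.from_pretrained("./dpo_model")
model = AutoModelForCausalLM.from_pretrained("./dpo_model")

prompt = "Solve step by step: What is 17 * 24?"
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))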

3. Evaluation

python src/scripts/evaluate.py \
    --model_path ./dpo_model \
    --test_data data/processed/prm800k/test.jsonl \
    --output_path ./evaluation_results.json
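The metrics written to evaluation_results.json are whatever evaluate.py computes. A typical answer-level metric for math problem solving is exact match on the final answer; a sketch of that idea follows, with the final-answer extraction convention assumed rather than taken from the script:

def extract_final_answer(solution):
    """Assumed convention: the final answer is the last non-empty line."""
    lines = [line.strip() for line in solution.splitlines() if line.strip()]
    return lines[-1] if lines else ""

def exact_match_accuracy(predictions, references):
    """Fraction of predictions whose final answer matches the reference."""
    correct = sum(
        extract_final_answer(pred) == extract_final_answer(ref)
        for pred, ref in zip(predictions, references)
    )
    return correct / max(len(references), 1)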
