- Team Name: GRPO
- Members:
- Timothy Wang (tjw2145)
- Manush Kalwari (mmk2266)
- Troy Daniels (td2847)
We fine-tune and run inference on two instruction-tuned LLMs, DeepSeek-Coder-7B-Instruct-v1.5 (6.91B parameters) and Phi-3.5-Mini-Instruct (3.82B parameters), for math reasoning under hardware constraints on a single NVIDIA T4 GPU. The task involves solving two math benchmarks:
- Countdown: easier, requires generating equations under constraints.
- AIME 2024: harder, advanced competition-style math problems.
We apply Group Relative Policy Optimization (GRPO), Supervised Fine-Tuning (SFT), and quantization to ensure feasibility and performance.
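GRPO optimizes the policy directly from a scalar reward per sampled completion, so no step-by-step labels are needed. Purely as an illustration (the real reward functions live in our training scripts and may differ), a Countdown-style reward could check that a generated equation uses each provided number at most once and evaluates to the target:

```python
# Hypothetical sketch of a Countdown-style reward for GRPO; the actual reward
# logic in grpo_training.py may differ.
import re

def countdown_reward(completion: str, numbers: list[int], target: int) -> float:
    """Return 1.0 if the completion ends with a valid equation, else 0.0."""
    # Take the last run of characters that looks like an arithmetic expression.
    candidates = re.findall(r"[\d\+\-\*/\(\) ]+", completion)
    if not candidates:
        return 0.0
    expr = candidates[-1]
    # Each provided number may be used at most once.
    pool = list(numbers)
    for n in (int(tok) for tok in re.findall(r"\d+", expr)):
        if n in pool:
            pool.remove(n)
        else:
            return 0.0
    try:
        value = eval(expr)  # acceptable here: expr contains only digits, operators, parens
    except Exception:
        return 0.0
    return 1.0 if value == target else 0.0
```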
- Models Used: DeepSeek-Coder-7B-Instruct-v1.5, Phi-3.5-Mini-Instruct
- Framework: PyTorch
- Custom Methods:
- Applied GRPO for fine-tuning on task-specific JSONL datasets.
- Used SFT prior to GRPO for improved reward shaping.
- Employed 4-bit and 8-bit quantization via bitsandbytes.
- Integrated LoRA adapters uploaded to HuggingFace for modular fine-tuning.
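The sketch below shows one way the quantization and LoRA pieces fit together with transformers, bitsandbytes, and peft; the LoRA rank, alpha, and target modules are illustrative assumptions rather than the exact settings used in our scripts.

```python
# Sketch: load Phi-3.5 Mini in 4-bit and attach a LoRA adapter for fine-tuning.
# Hyperparameters (r, lora_alpha, target_modules) are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "microsoft/Phi-3.5-mini-instruct"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA weights are trainable
```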
The following results are from the Countdown dataset. The highest score of 20% was obtained by the non-quantized Phi model, trained for 3 epochs on 100 samples with a max-token length of 1024.
On the AIME 2024 dataset, which is significantly more challenging, our lightweight models (Phi-3.5 Mini and quantized DeepSeek-Coder) could not keep up, achieving near-zero scores (~0.05 accuracy), highlighting the difficulty of solving advanced competition problems even with fine-tuning. However, the outputs produced by our adapters still reflected the models' capacity to learn reasoning, despite the poor scores.
| Metric | Value |
|---|---|
| Final Score | 20% |
| Inference Latency | ~42 s |
| Model Size | 3.82B parameters (Phi-3.5 Mini, non-quantized) |
| Peak Memory Use | Fits within 16 GB T4 VRAM |
| Training Epochs | 3 |
| Device | NVIDIA T4 GPU |

Table: Experiments with the Phi-3.5 Mini model on the Countdown task

pip install -r requirements.txt

View training and evaluation metrics here:
https://wandb.ai/mtt-hpml/projects

GRPO only (baseline):
python3 grpo_training.py --model phi --quantize --task aime \
--train_dataset_path ../tasks/aime/aime_train_dataset.jsonl \
--eval_dataset_path ../tasks/aime/aime_eval_dataset.jsonl

SFT (run before GRPO):
python3 sft_training.py --model phi --task countdown \
--train_dataset_path ../tasks/countdown/countdown_train_dataset_distilled.jsonl

Note that we use the distilled dataset for SFT training. The distilled dataset contains the original prompts of the training dataset together with responses from GPT-o1-mini, transferring knowledge from GPT to our smaller models.
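As a rough sketch of how such a distilled file could be assembled (the field names `prompt` and `response`, and the helper itself, are hypothetical; the real schema is defined by our dataset scripts):

```python
# Hypothetical sketch: pair each original prompt with a teacher (GPT-o1-mini)
# response to produce a distilled SFT dataset. Field names are assumptions.
import json

def build_distilled_dataset(prompt_file: str, teacher_responses: dict[str, str], out_file: str) -> None:
    with open(prompt_file) as fin, open(out_file, "w") as fout:
        for line in fin:
            example = json.loads(line)
            prompt = example["prompt"]
            if prompt not in teacher_responses:
                continue  # skip prompts the teacher was not queried on
            fout.write(json.dumps({
                "prompt": prompt,
                "response": teacher_responses[prompt],
            }) + "\n")
```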
python3 upload_adapter.py --adapter_path ./outputs/... \
--repo_id YOUR_HF_REPO_ID
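Under the hood, uploading an adapter amounts to pushing the saved LoRA weight directory to a Hugging Face repo; a minimal sketch with huggingface_hub (paths and repo id are placeholders, and upload_adapter.py may do additional bookkeeping):

```python
# Sketch of an adapter upload with huggingface_hub; upload_adapter.py wraps
# something like this behind a CLI. Paths and repo id are placeholders.
from huggingface_hub import HfApi

api = HfApi()
api.create_repo(repo_id="YOUR_HF_REPO_ID", exist_ok=True)
api.upload_folder(
    folder_path="./outputs/my_adapter",  # directory with adapter weights and config
    repo_id="YOUR_HF_REPO_ID",
)
```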
Then run GRPO starting from the uploaded adapter:
python3 grpo_training.py --model deepseek --adapter YOUR_HF_REPO_ID ...

To evaluate a checkpoint:
python eval.py --weights checkpoints/best_model.pth

For inference:
python3 query_model.py \
--model deepseek-ai/deepseek-coder-7b-instruct-v1.5 \
--adapter_repo TDani/deepseek_aime_n100_mcl_256_quantized \
--prompt_file ../tasks/aime/aime_eval_dataset.jsonl \
--output_file deepseek_outputs/output.jsonl
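Roughly, query_model.py loads the base model, attaches the fine-tuned adapter, and generates a completion per prompt, as in the sketch below (the sample prompt and generation settings are illustrative, not the script's exact defaults):

```python
# Simplified sketch of adapter-based inference; generation parameters and the
# example prompt are illustrative, not those hard-coded in query_model.py.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "deepseek-ai/deepseek-coder-7b-instruct-v1.5"
adapter_id = "TDani/deepseek_aime_n100_mcl_256_quantized"

tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(base_model, adapter_id)  # attach LoRA adapter

prompt = "Find the remainder when 2^10 is divided by 7."  # placeholder prompt
inputs = tokenizer(prompt, return_tensors="pt").to(base_model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```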
There are two ways to enable profiling:
- In the main branch, uncomment the two relevant lines of code: on_trace_ready on line 203 and callbacks=[ProfilerCallback(profiler)] on line 213 (not recommended).
- Switch to TroysBranch and run from there; this is where most profiling was done, to keep the overhead out of the main branch.
Profiling currently captures only one step because of the heavy memory pressure it creates (it writes 2-3 GB of trace data per step).
- This can be changed by updating the profiler schedule (see the sketch below).

Run the GRPO script normally and the profiling traces will be saved to the ./logs/profiler directory.
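The underlying torch.profiler setup looks roughly like the following sketch; the exact arguments in grpo_training.py may differ, and widening the schedule (more active steps or repeats) captures more steps at the cost of additional multi-GB traces:

```python
# Rough sketch of the profiler setup; the actual wiring lives in grpo_training.py
# (the on_trace_ready / ProfilerCallback lines mentioned above).
from torch.profiler import profile, schedule, tensorboard_trace_handler, ProfilerActivity

profiler = profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    # One active step keeps the trace size manageable (~2-3 GB per step);
    # increase `active` or `repeat` to capture more steps.
    schedule=schedule(wait=1, warmup=1, active=1, repeat=1),
    on_trace_ready=tensorboard_trace_handler("./logs/profiler"),
    profile_memory=True,
    with_stack=True,
)

profiler.start()
# ... run training steps; call profiler.step() once per step,
# e.g. from a trainer callback such as ProfilerCallback(profiler) ...
profiler.step()
profiler.stop()
```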
All profiles can be viewed on running VMs until May 16th (after which they will be taken down) at the following URLs:
- http://34.42.25.189:6006/#pytorch_profiler (Compares running GRPO against T4 and L4 with and without quantization)
- http://34.170.77.214:6006/#pytorch_profiler (Compares running GRPO with deepseek and Phi with and without quantization)

Notes
- All GRPO datasets are in tasks/, preprocessed via get_aime_dataset.py or get_countdown_dataset.py.
- Fine-tuned adapters are uploaded via upload_adapter.py.
- Final scores are located in inference/scores/.
- Profiling logs are saved in ./logs/profiler.