posttrain-quant-serve

Study repo for RL post-training a small Qwen model with GRPO, serving it with vLLM, quantizing the trained checkpoint, and measuring whether RL post-training changes quantization behavior.

Research question

Does RL post-training change how cleanly a model quantizes?

Concretely, this repo will compare:

Base model
GRPO-trained model
Quantized base model
Quantized GRPO-trained model

Metrics will include perplexity or accuracy on a small eval set, latency, throughput, memory use, and weight/outlier diagnostics.
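As one illustration of what a weight/outlier diagnostic could look like, the sketch below computes an outlier ratio and the relative error of group-wise round-to-nearest 4-bit quantization on a flat list of weights. This is an assumption about the kind of diagnostic meant, not the repo's actual code: the function names, the group size, and the choice of plain round-to-nearest (rather than AWQ) are all hypothetical.

```python
import math


def quantize_group_rtn(weights, bits=4):
    """Symmetric round-to-nearest quantize/dequantize of one weight group."""
    qmax = 2 ** (bits - 1) - 1  # 7 for signed 4-bit
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / qmax
    # Round to the integer grid, clamp, then map back to floats.
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def outlier_ratio(weights):
    """Max |w| divided by mean |w|; larger values mean heavier outliers."""
    abs_w = [abs(w) for w in weights]
    mean_abs = sum(abs_w) / len(abs_w)
    return max(abs_w) / mean_abs if mean_abs > 0 else 0.0


def relative_quant_error(weights, group_size=4, bits=4):
    """Relative L2 error of group-wise round-to-nearest quantization."""
    norm_sq = sum(w * w for w in weights)
    err_sq = 0.0
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        dequantized = quantize_group_rtn(group, bits)
        err_sq += sum((w - q) ** 2 for w, q in zip(group, dequantized))
    return math.sqrt(err_sq / norm_sq) if norm_sq > 0 else 0.0
```

Running the same diagnostic on matching layers of the base and GRPO-trained checkpoints would give a per-layer comparison; a group dominated by a single large weight loses its small weights to rounding, which is the effect this kind of diagnostic is meant to surface.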

Scope

Target models:

Day 0 smoke: Qwen2.5-0.5B-Instruct on GSM8K
Scale target: Qwen3-1.7B or similar small reasoning model
Stretch: larger multi-GPU GRPO only after the smoke path is stable

Core stack:

TRL GRPOTrainer for RL post-training
GSM8K answer checker for verifiable rewards
PyTorch FSDP / Accelerate for later distributed scaling
vLLM for serving and throughput/latency benchmarking
AWQ or local quantization diagnostics for the quantization study
Slurm for Michigan GPU cluster runs
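A GSM8K answer checker used as a verifiable reward might look roughly like the sketch below. It relies on GSM8K reference solutions ending in "#### <number>" and treats the last number in a completion as the model's answer. The function names are hypothetical, and the sketch assumes plain-string completions and that the dataset's answer column reaches the reward function as the `answer` keyword argument, which should be checked against the TRL version in use.

```python
import re

# Matches integers and decimals, optionally negative, with thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_gold(answer_field):
    """Pull the final number out of a GSM8K reference solution."""
    return answer_field.split("####")[-1].strip().replace(",", "")


def extract_prediction(completion):
    """Treat the last number in the completion as the predicted answer."""
    matches = _NUMBER.findall(completion)
    return matches[-1].replace(",", "") if matches else None


def _same_number(a, b):
    try:
        return abs(float(a) - float(b)) < 1e-6
    except (TypeError, ValueError):
        return False


def gsm8k_reward(completions, answer, **kwargs):
    """Return one reward per completion: 1.0 if the final number is correct."""
    rewards = []
    for completion, gold in zip(completions, answer):
        pred = extract_prediction(completion)
        correct = pred is not None and _same_number(pred, extract_gold(gold))
        rewards.append(1.0 if correct else 0.0)
    return rewards
```

A binary reward keeps the signal easy to verify during the smoke test; partial credit for formatting could be added later as a separate reward function.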

Milestones

Run a tiny single-GPU GRPO smoke test on 10 GSM8K problems
Confirm reward, KL, and completion-length logs are produced
Save and reload a GRPO checkpoint cleanly
Run a short scaled GRPO job and record memory, reward, KL, and wall-clock time
Serve base and GRPO-trained checkpoints with vLLM
Quantize base and GRPO-trained checkpoints
Compare FP16-base vs FP16-GRPO vs W4-base vs W4-GRPO
Publish results table, plots, and reproducibility notes
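The four-way comparison above can be reduced to one number per metric: how much more (or less) the GRPO-trained model degrades under quantization than the base model does. The sketch below is one hypothetical way to compute it for a higher-is-better metric such as eval accuracy; the dictionary keys and function names are made up for illustration, and the example values are placeholders, not measured results.

```python
def quantization_gap(fp16_score, w4_score):
    """Score lost going from FP16 to W4 (positive means degradation)."""
    return fp16_score - w4_score


def rl_effect_on_quantization(scores):
    """Compare the GRPO model's quantization gap with the base model's."""
    base_gap = quantization_gap(scores["fp16_base"], scores["w4_base"])
    grpo_gap = quantization_gap(scores["fp16_grpo"], scores["w4_grpo"])
    return {
        "base_gap": base_gap,
        "grpo_gap": grpo_gap,
        # Positive: the GRPO-trained model quantizes less cleanly than base.
        "extra_gap_from_rl": grpo_gap - base_gap,
    }


# Placeholder values only, NOT measured results.
placeholder_scores = {
    "fp16_base": 0.50,
    "w4_base": 0.25,
    "fp16_grpo": 0.75,
    "w4_grpo": 0.25,
}
summary = rl_effect_on_quantization(placeholder_scores)
```

For lower-is-better metrics such as perplexity or latency, the sign of the gap would need to be flipped.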

Layout

configs/      Training, serving, and quantization configs
notes/        Study log, reading notes, and understanding checks
scripts/      Runnable training, serving, benchmark, and analysis scripts
slurm/        sbatch templates for cluster runs
src/          Reusable project code
results/      Curated result tables and plots

Large local artifacts such as datasets, checkpoints, raw logs, and model outputs are ignored by git.

Day 0 Target

Do not touch larger models yet. First prove the smallest GRPO path works:

Create and verify the cluster Python/CUDA environment.
Download the smoke-test model and GSM8K data.
Read enough about GRPO to understand rollouts, rewards, advantages, and KL.
Run a tiny single-GPU Qwen2.5-0.5B-Instruct GRPO run for 5 steps, then for 100 steps.
Confirm the run saves a checkpoint and can reload it.
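For the reading step above, the core idea behind GRPO advantages is that each rollout's reward is normalized against the other rollouts sampled for the same prompt. The sketch below shows that idea in isolation; the function name, the use of population standard deviation, and the epsilon value are assumptions, and real implementations may differ in those details.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one prompt's group of rollouts."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps keeps the division finite when every rollout gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# Four rollouts for one prompt: two correct (1.0), two incorrect (0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every rollout in a group receives the same reward, all advantages are zero and that prompt contributes no learning signal, which is worth keeping in mind when reading the reward logs from a 10-problem smoke test.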

Start with notes/project-overview.md for the plain-English explanation of the project and learning path.
