This project provides:
- Exact speculative sampling (SpS).
- AutoJudge (paper-aligned judge decoding: Algorithm 1 mining + Logistic Regression classifier).
- Top-K lossy baseline for paper-style comparisons.
- SpecExec (exact target sampling with draft-branch cache prefill and pruning).
- A Hugging Face adapter with KV cache and optional quantization.
- A benchmark harness on MT-Bench with JSONL metrics.
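For intuition, the acceptance rule that makes exact speculative sampling (SpS) lossless can be sketched in a few lines. This is a minimal single-step illustration under simplified assumptions (explicit probability lists, pure Python), not this repo's implementation:

```python
import random

def sps_accept(draft_tokens, p_target, q_draft, rng=random):
    """Exact speculative sampling acceptance (illustrative sketch).

    draft_tokens: k token ids proposed by the draft model.
    p_target[i], q_draft[i]: target/draft probability lists at position i.
    Accept token x with prob min(1, p(x)/q(x)); on rejection, resample once
    from the residual max(0, p - q) and stop. This keeps the output
    distribution exactly equal to the target model's.
    """
    accepted = []
    for i, x in enumerate(draft_tokens):
        p, q = p_target[i], q_draft[i]
        # q[x] > 0 is guaranteed because the draft actually sampled x
        if rng.random() < min(1.0, p[x] / q[x]):
            accepted.append(x)
            continue
        # rejected: sample one token from the normalized residual, then stop
        residual = [max(0.0, pj - qj) for pj, qj in zip(p, q)]
        total = sum(residual)
        r, acc = rng.random() * total, 0.0
        for token, w in enumerate(residual):
            acc += w
            if r < acc:
                accepted.append(token)
                break
        break
    return accepted
```

When the draft and target distributions agree, every draft token is accepted; when the draft is certain about a token the target rules out, the residual resample corrects it.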
## Features
- Baseline, speculative, AutoJudge, Top-K, and SpecExec decoding in one benchmark entrypoint.
- MT-Bench loader (JSON/JSONL).
- Benchmark runner with median timing, resume support, and method-specific metrics.
- Preset configs for models, methods, and paired experiments.
- Makefile shortcuts for local and Docker workflows.
- Docker support for CPU and GPU.
- CI pipeline (GitHub Actions) for checks/tests + benchmark JSONL schema validation.
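The MT-Bench loader above reads one JSON object per line. A minimal sketch of producing a compatible file; the field names (`question_id`, `category`, `turns`) follow the public MT-Bench question format and are an assumption about what this loader requires:

```python
import json
import pathlib
import tempfile

# A minimal MT-Bench-style question record; field names follow the public
# MT-Bench question format and are assumed, not taken from this repo's code.
record = {
    "question_id": 81,
    "category": "writing",
    "turns": ["Compose an engaging travel blog post about a recent trip to Hawaii."],
}

path = pathlib.Path(tempfile.gettempdir()) / "mt_bench_sample.jsonl"
with path.open("w") as f:
    f.write(json.dumps(record) + "\n")  # JSONL: one JSON object per line

# Reading it back mirrors what a JSONL loader does.
with path.open() as f:
    loaded = [json.loads(line) for line in f]
```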
## Getting Started (From Zero)
- Bootstrap dependencies on a clean Ubuntu host (safe mode; does not touch the NVIDIA driver):

  ```bash
  bash scripts/install_dependencies.sh
  ```

  The recommended Python version is 3.11 (see `.python-version` in the repo). Dependencies are pinned in `requirements*.txt` for reproducible runs.
- For GPU Python extras (bitsandbytes, accelerate):

  ```bash
  bash scripts/install_dependencies.sh --gpu
  ```

- On EOL Ubuntu (for example Ubuntu 17), the script stops by default. Continue only if you explicitly accept the risks:

  ```bash
  bash scripts/install_dependencies.sh --allow-eol-ubuntu
  ```

- Install Docker Engine (Ubuntu 24.04 example):
  ```bash
  sudo apt-get update
  sudo apt-get install -y ca-certificates curl gnupg
  sudo install -m 0755 -d /etc/apt/keyrings
  sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
  sudo chmod a+r /etc/apt/keyrings/docker.asc
  echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo ${UBUNTU_CODENAME:-$VERSION_CODENAME}) stable" | sudo tee /etc/apt/sources.list.d/docker.list >/dev/null
  sudo apt-get update
  sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
  sudo systemctl enable --now docker
  ```

- For GPU runs, keep your existing NVIDIA driver and install the NVIDIA Container Toolkit:
  ```bash
  curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
  curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
  sudo apt-get update
  sudo apt-get install -y nvidia-container-toolkit
  sudo nvidia-ctk runtime configure --runtime=docker
  sudo systemctl restart docker
  ```

- Put an MT-Bench dataset file (JSON/JSONL) into the project folder `datasets/`, for example `datasets/mt_bench.jsonl`.
- Build a CPU image:
  ```bash
  docker build -t sp-samp .
  ```

- Run tests (CPU):

  ```bash
  docker run --rm sp-samp
  ```

- Run a CPU benchmark (toy models):
  ```bash
  docker run --rm sp-samp \
    python -m benchmarks.bench_speculative \
    --method both \
    --runs 1 \
    --max-samples 5 \
    --max-new-tokens 32 \
    --vocab-size 2048
  ```

- Build a GPU image (CUDA example):
  ```bash
  docker build -f Dockerfile.gpu \
    --build-arg BASE_IMAGE=nvidia/cuda:12.8.1-cudnn-runtime-ubuntu22.04 \
    --build-arg TORCH_INDEX_URL=https://download.pytorch.org/whl/cu128 \
    --build-arg TORCH_VERSION=2.9.1 \
    -t sp-samp-gpu .
  ```

- Run a GPU benchmark (HF model, results saved to JSONL):
  ```bash
  docker run --rm --gpus all -v "$(pwd)/datasets:/data" sp-samp-gpu \
    python -m benchmarks.bench_speculative \
    --dataset /data/mt_bench.jsonl \
    --hf-model RedHatAI/gpt-oss-20b \
    --device cuda \
    --use-chat-template \
    --max-samples 50 \
    --max-new-tokens 128 \
    --k 4 \
    --runs 5 \
    --out /data/results.jsonl
  ```

- Run all methods in one launch (baseline + speculative + autojudge + topk + specexec):
  ```bash
  docker run --rm --gpus all -v "$(pwd)/datasets:/data" sp-samp-gpu \
    python -m benchmarks.bench_speculative \
    --dataset /data/mt_bench.jsonl \
    --hf-model Qwen/Qwen2.5-3B-Instruct \
    --hf-draft-model Qwen/Qwen2.5-0.5B-Instruct \
    --tokenizer Qwen/Qwen2.5-0.5B-Instruct \
    --draft-tokenizer Qwen/Qwen2.5-0.5B-Instruct \
    --device cuda \
    --use-chat-template \
    --method all \
    --k 4 \
    --runs 5 \
    --out /data/results_all.jsonl
  ```

- Run SpecExec only (branch execution parameters included):
  ```bash
  docker run --rm --gpus all -v "$(pwd)/datasets:/data" sp-samp-gpu \
    python -m sp_samp.cli specexec \
    --config-dir configs \
    --experiment qwen25_3b_target_qwen25_0p5b_specexec_k4 \
    --dataset /data/mt_bench.jsonl \
    --parallel-branches 8 \
    --branch-prune-threshold 0.0 \
    --out /data/results_specexec.jsonl
  ```

- Run AutoJudge only with checkpoint reuse (paper-aligned):
  ```bash
  docker run --rm --gpus all -v "$(pwd)/datasets:/data" sp-samp-gpu \
    python -m benchmarks.bench_speculative \
    --dataset /data/gsm8k_train.jsonl \
    --hf-model Qwen/Qwen2.5-3B-Instruct \
    --hf-draft-model Qwen/Qwen2.5-0.5B-Instruct \
    --tokenizer Qwen/Qwen2.5-0.5B-Instruct \
    --draft-tokenizer Qwen/Qwen2.5-0.5B-Instruct \
    --device cuda \
    --use-chat-template \
    --method autojudge \
    --autojudge-task gsm8k \
    --autojudge-train-dataset /data/gsm8k_train.jsonl \
    --autojudge-train-samples 4000 \
    --autojudge-recall-target 0.9 \
    --autojudge-train-split 0.9 \
    --autojudge-checkpoint /data/autojudge_llama3.pt \
    --out /data/results_autojudge.jsonl
  ```

## Make Targets

Defaults for benchmark paths:

```
DATASET=datasets/mt_bench.jsonl
OUT=datasets/results.jsonl
```
- Show all commands: `make help`
- Install/upgrade dependencies in safe mode: `make setup`
- Install/upgrade including GPU Python extras: `make setup-gpu`
- Syntax check: `make check`
- Validate benchmark JSONL schema: `make validate-results RESULTS=datasets/results.jsonl`
- List presets: `make list-presets`
- Validate config logic: `make validate-configs`
- Quick toy benchmark (no HF models): `make bench-toy OUT=/tmp/bench_toy.jsonl`
- Quick HF smoke run (needs torch + transformers; downloads a tiny model): `make smoke-hf OUT=/tmp/smoke_hf.jsonl`
- GPU host variant:

  ```bash
  make smoke-hf-gpu OUT=/tmp/smoke_hf_gpu.jsonl
  python scripts/validate_results_jsonl.py --path /tmp/smoke_hf_gpu.jsonl --strict
  ```

  Expected result for the GPU variant: one run summary in the console and two JSONL records (run + summary) in the output file.
- Run an experiment on MT-Bench:

  ```bash
  make bench DATASET=datasets/mt_bench.jsonl OUT=datasets/results.jsonl
  ```

- Run the AutoJudge preset:

  ```bash
  make autojudge DATASET=datasets/mt_bench.jsonl OUT=datasets/results_autojudge.jsonl
  ```

- Run the SpecExec preset:

  ```bash
  make specexec DATASET=datasets/mt_bench.jsonl OUT=datasets/results_specexec.jsonl
  ```

- Build and run the GPU Docker flow:
  ```bash
  make docker-build-gpu
  make docker-bench DATASET=datasets/mt_bench.jsonl OUT=datasets/results.jsonl
  make docker-specexec DATASET=datasets/mt_bench.jsonl OUT=datasets/results_specexec.jsonl
  ```

- For one-off sudo usage without docker-group setup:

  ```bash
  make docker-gpu-check DOCKER_CMD="sudo docker"
  make docker-build-gpu-safe DOCKER_CMD="sudo docker"
  ```

- If your Docker host hits BuildKit snapshot/export errors, use:
  ```bash
  make docker-build-gpu-safe
  ```

  or clean the builder cache:

  ```bash
  make docker-prune-builder
  ```

- Check GPU passthrough before long runs:

  ```bash
  make docker-gpu-check
  make docker-gpu-check-image
  ```

  `docker-gpu-check` first tries `nvidia-smi` in a clean CUDA container. If NVML fails there, it falls back to a `torch.cuda` check in your built image. If the fallback reports the image as missing, build it first with `make docker-build-gpu-safe`.
- Enforce headless GPU mode for long runs:

  ```bash
  make bench DATASET=datasets/mt_bench.jsonl OUT=datasets/results.jsonl HEADLESS=1
  ```

- Short load run in Docker (`gpt_oss_20b_4bit` preset):
  ```bash
  sudo docker run --rm --gpus all -v "$(pwd)/datasets:/data" sp-samp-gpu \
    python -m sp_samp.cli bench \
    --config-dir configs \
    --experiment gptoss20b_target_gptoss20b_draft_k4 \
    --dataset /data/mt_bench.jsonl \
    --runs 1 \
    --max-samples 20 \
    --max-new-tokens 128 \
    --out /data/results_gptoss20b_load.jsonl
  ```

  If gpt-oss-20b fails to load because of a local bitsandbytes/triton runtime mismatch, fall back to:
  ```bash
  sudo docker run --rm --gpus all -v "$(pwd)/datasets:/data" sp-samp-gpu \
    python -m sp_samp.cli bench \
    --config-dir configs \
    --experiment mistral_target_mistral_draft_k4 \
    --dataset /data/mt_bench.jsonl \
    --runs 1 \
    --max-samples 20 \
    --max-new-tokens 128 \
    --out /data/results_gptoss20b_load.jsonl
  ```

  Validate the output:

  ```bash
  python scripts/validate_results_jsonl.py --path datasets/results_gptoss20b_load.jsonl --strict
  ```

- Paper-style GSM8K AutoJudge evaluation (Qwen2.5 0.5B -> 3B, medium sweep):
  ```bash
  mkdir -p datasets
  curl -fsSL https://raw.githubusercontent.com/openai/grade-school-math/master/grade_school_math/data/train.jsonl -o datasets/gsm8k_train.jsonl
  curl -fsSL https://raw.githubusercontent.com/openai/grade-school-math/master/grade_school_math/data/test.jsonl -o datasets/gsm8k_test.jsonl
  make paper-eval
  ```

  Raw JSONL stays in `datasets/`; report artifacts are written to `reports/` as:

  - `reports/autojudge_qwen25_paper_<date>.md`
  - `reports/autojudge_qwen25_paper_<date>.csv`
  - `reports/autojudge_qwen25_paper_<date>.json`

  The benchmark supports dedicated paper-eval switches:

  - `--eval-task mtbench|gsm8k`
  - `--gsm8k-eval-mode zero_shot_cot|plain`
  - `--topk-rank <int|all>`
  - `--topk-grid <csv>`

- GSM8K output fields in JSONL: `gsm8k_exact_match`, `gsm8k_correct`, `gsm8k_total`
## Troubleshooting (RTX 50xx / Blackwell)
- Symptom: `sm_120 is not compatible with the current PyTorch installation` or `no kernel image is available for execution on the device`.
- Cause: a torch build without native Blackwell kernels.
- Fix: rebuild with a cu128+ torch:

  ```bash
  make docker-build-gpu-safe \
    CUDA_BASE_IMAGE=nvidia/cuda:12.8.1-cudnn-runtime-ubuntu22.04 \
    TORCH_INDEX_URL=https://download.pytorch.org/whl/cu128 \
    TORCH_VERSION=2.9.1
  ```

## Presets
- Models: `configs/models.json`
- Methods: `configs/methods.json`
- Experiments (target/draft pairings): `configs/experiments.json`
- Method templates (AutoJudge/Top-K/SpecExec): `configs/method_templates.json`
## CLI Runner
- List presets:

  ```bash
  python -m sp_samp.cli list-presets --config-dir configs
  ```

- Direct method selection:

  ```bash
  python -m benchmarks.bench_speculative \
    --method specexec \
    --dataset datasets/mt_bench.jsonl \
    --hf-model meta-llama/Meta-Llama-3-8B-Instruct \
    --hf-draft-model meta-llama/Meta-Llama-3-8B-Instruct \
    --parallel-branches 8 \
    --branch-prune-threshold 0.0
  ```

- Run a benchmark using presets:
  ```bash
  python -m sp_samp.cli bench \
    --config-dir configs \
    --model-preset gpt_oss_20b_4bit \
    --method-preset speculative_k4 \
    --dataset datasets/mt_bench.jsonl \
    --out datasets/results.jsonl
  ```

- Run a benchmark using an experiment preset:
  ```bash
  python -m sp_samp.cli bench \
    --config-dir configs \
    --experiment qwen25_3b_target_qwen25_0p5b_all_methods \
    --dataset datasets/mt_bench.jsonl \
    --out datasets/results.jsonl
  ```

- Run the paper-method bundle only (baseline + speculative + autojudge + topk):
  ```bash
  python -m sp_samp.cli bench \
    --config-dir configs \
    --experiment qwen25_3b_target_qwen25_0p5b_all_paper \
    --dataset datasets/gsm8k_test.jsonl \
    --eval-task gsm8k \
    --gsm8k-eval-mode zero_shot_cot \
    --out datasets/results_all_paper.jsonl
  ```

- Run the AutoJudge shortcut command:
  ```bash
  python -m sp_samp.cli autojudge \
    --config-dir configs \
    --experiment qwen25_3b_target_qwen25_0p5b_autojudge_k4 \
    --dataset datasets/mt_bench.jsonl \
    --out datasets/results_autojudge.jsonl
  ```

- Run the SpecExec shortcut command:
  ```bash
  python -m sp_samp.cli specexec \
    --config-dir configs \
    --experiment qwen25_3b_target_qwen25_0p5b_specexec_k4 \
    --dataset datasets/mt_bench.jsonl \
    --out datasets/results_specexec.jsonl
  ```

- Require headless GPU mode (fail-fast when a display is active):
  ```bash
  python -m sp_samp.cli bench \
    --config-dir configs \
    --experiment qwen25_3b_target_qwen25_0p5b_all_methods \
    --dataset datasets/mt_bench.jsonl \
    --require-headless \
    --out datasets/results.jsonl
  ```

## Metrics Output

The benchmark writes JSONL records with per-run metrics and a summary record per method. Fields include:
- `status` (ok/error/skipped)
- `resume_key` (used to skip completed runs on re-launch)
- `tokens_per_sec`
- `acceptance_rate`
- `avg_tokens_per_step`
- `proposed`, `accepted`, `rejections`
- `judge_accept_rate` (AutoJudge only)
- `target_fallback_rate` (AutoJudge only)
- `autojudge_train_samples`, `autojudge_val_auc`, `autojudge_val_recall` (AutoJudge only)
- `autojudge_threshold_calibrated`, `autojudge_threshold_used` (AutoJudge only)
- Legacy aliases: `autojudge_threshold_selected` (= calibrated), `autojudge_threshold` (= used)
- `topk_accept_rate`, `topk_rank_effective`, `topk_mismatches`, `topk_accepted_mismatches` (Top-K only)
- `branch_prune_rate` (SpecExec only)
- `effective_parallelism` (SpecExec only)
- `target_calls_per_token` (AutoJudge, Top-K, and SpecExec)
- `draft_calls_per_token` (SpecExec only)
- `cache_hit_rate` (SpecExec only)
- `max_active_branches` (SpecExec only)
- `gsm8k_exact_match`, `gsm8k_correct`, `gsm8k_total` (when `--eval-task gsm8k`)
- `error_type`, `error_message`, `traceback` (for failed runs)
- System metadata: `git_sha`, `hostname`, `gpu_name`, `gpu_driver`, `cuda_runtime`, `torch_version`, `transformers_version`, `display_active`
Validate the output schema with:

```bash
python scripts/validate_results_jsonl.py --path datasets/results.jsonl --strict
```

## Project Layout
- `sp_samp/`: core library, AutoJudge, SpecExec, HF adapter.
- `benchmarks/`: benchmark runner.
- `configs/`: preset configs.
- `tests/`: tests.
- `papers/`: source research papers (AutoJudge, SpecExec, Speculative Sampling).
- `file_changes/`: dated change records and audit notes.
- `Dockerfile`, `Dockerfile.gpu`: containers.
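The per-run JSONL records described under Metrics Output are easy to post-process. A toy aggregation sketch (`status` and `tokens_per_sec` come from the field list above; the `method` field name is an assumption, as are the inline sample records):

```python
import json
from collections import defaultdict
from statistics import median

# Toy lines shaped like the per-run JSONL records; values are illustrative.
lines = [
    '{"status": "ok", "method": "speculative", "tokens_per_sec": 41.0}',
    '{"status": "ok", "method": "speculative", "tokens_per_sec": 43.0}',
    '{"status": "error", "method": "baseline", "tokens_per_sec": null}',
]

per_method = defaultdict(list)
for line in lines:
    rec = json.loads(line)
    if rec["status"] == "ok":  # skip failed/skipped runs
        per_method[rec["method"]].append(rec["tokens_per_sec"])

# Median throughput per method, mirroring the runner's median timing.
summary = {method: median(values) for method, values in per_method.items()}
```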
## Algorithm Variants vs. Papers
The implementation is paper-aligned with the following documented deviations:
- AutoJudge mining: the initial response is generated by the target model (matching the paper's mathematical definition of I(x)), whereas the Algorithm 1 pseudocode shows the draft model. No correctness impact; see `file_changes/2026-02-25-paper-alignment.md`.
- AutoJudge decoding: uses greedy (argmax) decoding throughout, whereas the paper's Appendix A describes Gumbel-max stochastic sampling. This is a valid deterministic variant.
- SpecExec tree search: uses level-by-level BFS with `parallel_branches`/`branch_prune_threshold` instead of the paper's SSSP / modified Dijkstra priority-queue algorithm. The output distribution is preserved; this is a known, intentional simplification.
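The level-by-level BFS simplification can be sketched as follows. This is an illustrative toy, not this repo's SpecExec code; `draft_topk` is a hypothetical callable standing in for draft-model continuations:

```python
import heapq

def expand_tree_bfs(draft_topk, depth, parallel_branches, prune_threshold):
    """Level-by-level BFS draft-tree expansion (illustrative sketch).

    draft_topk(prefix) -> list of (token, prob) continuations for a prefix.
    Branches whose cumulative probability falls below prune_threshold are
    dropped, and at most parallel_branches highest-probability branches
    survive each level.
    """
    branches = [((), 1.0)]  # root: empty prefix with probability 1
    for _ in range(depth):
        candidates = []
        for prefix, prob in branches:
            for token, q in draft_topk(prefix):
                cum = prob * q
                if cum >= prune_threshold:  # branch pruning
                    candidates.append((prefix + (token,), cum))
        # cap parallelism: keep only the most probable branches
        branches = heapq.nlargest(parallel_branches, candidates, key=lambda b: b[1])
        if not branches:
            break
    return branches
```

Exactness then comes from verifying the surviving branches against the target model with the speculative acceptance rule, not from the tree search itself.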
Full audit notes: `file_changes/2026-02-25-paper-alignment.md`.
## Notes
- SpecExec is implemented as an exact (distribution-preserving) decoder with speculative cache prefill and configurable `parallel_branches`/`branch_prune_threshold`.
- HF SpecExec uses KV-cache reuse and depth-wise tree passes for faster cache construction.
- Draft and target models must share an identical tokenizer vocabulary mapping for speculative, AutoJudge, Top-K, and SpecExec correctness.
- AutoJudge training in this implementation is GSM8K-specific. If `--autojudge-checkpoint` is absent, provide a GSM8K-compatible train dataset (question + answer) via `--autojudge-train-dataset` or `--dataset`.
- AutoJudge fails fast with an actionable message when the train dataset path is missing or not GSM8K-compatible (for example, MT-Bench JSONL is not valid for judge training).
- `scripts/install_dependencies.sh` is idempotent for apt/pip dependencies and never modifies NVIDIA driver packages.
- `scripts/install_dependencies.sh` prefers `python3.11` when available and warns when running with lower versions.
- Ubuntu 17 is EOL. The script blocks by default on EOL Ubuntu unless `--allow-eol-ubuntu` is set explicitly.
- Re-running a benchmark with the same `OUT` file automatically skips completed runs (resume mode).
- Failed runs are written to JSONL and do not stop the whole benchmark method loop.
- `make validate-configs` checks config references and tokenizer compatibility in `configs/*.json`.
- `make validate-results` enforces JSONL schema compatibility for downstream analysis.
- Current Make defaults are paper-aligned for compatible open models: `Qwen2.5-0.5B-Instruct` draft -> `Qwen2.5-3B-Instruct` target.
- `Qwen2.5-0.5B -> Qwen2.5-7B` is kept as a legacy preset pair, but it fails strict speculative methods because the model vocab sizes differ.
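The tokenizer requirement in the Notes can be verified up front. A minimal sketch (the repo's own check lives behind `make validate-configs`; the commented `AutoTokenizer` usage and model names are illustrative):

```python
def tokenizers_compatible(draft_vocab: dict[str, int], target_vocab: dict[str, int]) -> bool:
    """True when both tokenizers map every token string to the same id.

    Speculative acceptance compares draft and target probabilities for the
    same token ids, so an identical token -> id mapping is required.
    """
    return draft_vocab == target_vocab

# Illustrative usage with Hugging Face tokenizers (downloads models):
#   from transformers import AutoTokenizer
#   draft = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
#   target = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
#   assert tokenizers_compatible(draft.get_vocab(), target.get_vocab())
```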
## Docker Troubleshooting
- The error `parent snapshot ... does not exist` during `docker build` is typically a Docker BuildKit cache/snapshot issue on the host.
- Recovery sequence:

  ```bash
  make docker-prune-builder
  make docker-build-gpu-safe
  ```

  `docker-build-gpu-safe` first tries BuildKit, then automatically retries with the legacy builder (`DOCKER_BUILDKIT=0`) if BuildKit fails.
- The error `RuntimeError: No CUDA GPUs are available` inside the container means GPU runtime passthrough is not working (host-side setup), not model code.
- The error `Could not find the bitsandbytes CUDA binary` or `No module named 'triton.ops'` during GPT-OSS load indicates a bitsandbytes/triton mismatch in the container runtime path.
- Quick diagnostics:

  ```bash
  make docker-gpu-check
  make docker-gpu-check-image
  ```

- If `docker-gpu-check` fails, (re)configure `nvidia-container-toolkit` and restart the Docker daemon.
- For deterministic progress while investigating GPT-OSS runtime issues, run the short load fallback preset `mistral_target_mistral_draft_k4`.