Speculative Sampling Playground

Adaptive speculative decoding for LLM inference latency optimization.

This project provides:

  • Exact speculative sampling (SpS); a reference sketch of the acceptance rule follows this list.
  • AutoJudge (paper-aligned judge decoding: Algorithm 1 mining + Logistic Regression classifier).
  • Top-K lossy baseline for paper-style comparisons.
  • SpecExec (exact target sampling with draft-branch cache prefill and pruning).
  • A Hugging Face adapter with KV cache and optional quantization.
  • A benchmark harness on MT-Bench with JSONL metrics.
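
For orientation, the exact SpS step is the standard accept-or-resample rule from the speculative sampling literature. The sketch below is illustrative only; the function and variable names are not this repo's API.

import torch

# Exact speculative sampling acceptance step (illustrative sketch; the
# names are not this repo's API). Preserves the target distribution.
def accept_or_resample(p_target, q_draft, token):
    """p_target, q_draft: probability vectors for one position;
    token: the draft-sampled token id. Returns (accepted, token_id)."""
    # Accept the draft token with probability min(1, p/q) ...
    if torch.rand(()) < (p_target[token] / q_draft[token]).clamp(max=1.0):
        return True, token
    # ... otherwise resample from the normalized residual max(0, p - q),
    # which makes the overall output distribution exactly the target's.
    residual = (p_target - q_draft).clamp(min=0.0)
    return False, int(torch.multinomial(residual / residual.sum(), 1))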

Features

  1. Baseline, speculative, AutoJudge, Top-K, and SpecExec decoding in one benchmark entrypoint.
  2. MT-Bench loader (JSON/JSONL).
  3. Benchmark runner with median timing, resume support, and method-specific metrics.
  4. Preset configs for models, methods, and paired experiments.
  5. Makefile shortcuts for local and Docker workflows.
  6. Docker support for CPU and GPU.
  7. CI pipeline (GitHub Actions) for checks/tests + benchmark JSONL schema validation.

Getting Started (From Zero)

  1. Bootstrap dependencies on a clean Ubuntu host (safe mode; the script does not touch the NVIDIA driver):
bash scripts/install_dependencies.sh

The recommended Python version is 3.11 (see .python-version in the repo). Dependencies are pinned in requirements*.txt for reproducible runs. For GPU Python extras (bitsandbytes, accelerate), add the --gpu flag:

bash scripts/install_dependencies.sh --gpu

On EOL Ubuntu releases (for example Ubuntu 17), the script stops by default. Continue only if you explicitly accept the risks:

bash scripts/install_dependencies.sh --allow-eol-ubuntu
  2. Install Docker Engine (Ubuntu 24.04 example):
sudo apt-get update
sudo apt-get install -y ca-certificates curl gnupg
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo ${UBUNTU_CODENAME:-$VERSION_CODENAME}) stable" | sudo tee /etc/apt/sources.list.d/docker.list >/dev/null
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo systemctl enable --now docker

For GPU runs, keep your existing NVIDIA driver and install the NVIDIA Container Toolkit:

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
  3. Put the MT-Bench dataset file (JSON/JSONL) into the project's datasets/ folder, for example datasets/mt_bench.jsonl.
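The loader accepts the standard MT-Bench question schema. A minimal example of appending one record (field names follow the public MT-Bench question file and are assumptions, not checked against this repo's loader):

import json

# Append one MT-Bench-style record per line (JSONL). Field names follow
# the public MT-Bench question file and are assumed, not enforced here.
record = {
    "question_id": 1,
    "category": "writing",
    "turns": ["First-turn user prompt goes here.",
              "Optional follow-up turn goes here."],
}
with open("datasets/mt_bench.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")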
  4. Build a CPU image:
docker build -t sp-samp .
  5. Run tests (CPU):
docker run --rm sp-samp
  6. Run a CPU benchmark (toy models):
docker run --rm sp-samp \
  python -m benchmarks.bench_speculative \
  --method both \
  --runs 1 \
  --max-samples 5 \
  --max-new-tokens 32 \
  --vocab-size 2048
  7. Build a GPU image (CUDA example):
docker build -f Dockerfile.gpu \
  --build-arg BASE_IMAGE=nvidia/cuda:12.8.1-cudnn-runtime-ubuntu22.04 \
  --build-arg TORCH_INDEX_URL=https://download.pytorch.org/whl/cu128 \
  --build-arg TORCH_VERSION=2.9.1 \
  -t sp-samp-gpu .
  8. Run a GPU benchmark (HF model, results saved to JSONL):
docker run --rm --gpus all -v "$(pwd)/datasets:/data" sp-samp-gpu \
  python -m benchmarks.bench_speculative \
  --dataset /data/mt_bench.jsonl \
  --hf-model RedHatAI/gpt-oss-20b \
  --device cuda \
  --use-chat-template \
  --max-samples 50 \
  --max-new-tokens 128 \
  --k 4 \
  --runs 5 \
  --out /data/results.jsonl
  9. Run all methods in one launch (baseline + speculative + autojudge + topk + specexec):
docker run --rm --gpus all -v "$(pwd)/datasets:/data" sp-samp-gpu \
  python -m benchmarks.bench_speculative \
  --dataset /data/mt_bench.jsonl \
  --hf-model Qwen/Qwen2.5-3B-Instruct \
  --hf-draft-model Qwen/Qwen2.5-0.5B-Instruct \
  --tokenizer Qwen/Qwen2.5-0.5B-Instruct \
  --draft-tokenizer Qwen/Qwen2.5-0.5B-Instruct \
  --device cuda \
  --use-chat-template \
  --method all \
  --k 4 \
  --runs 5 \
  --out /data/results_all.jsonl
  10. Run SpecExec only (branch execution parameters included):
docker run --rm --gpus all -v "$(pwd)/datasets:/data" sp-samp-gpu \
  python -m sp_samp.cli specexec \
  --config-dir configs \
  --experiment qwen25_3b_target_qwen25_0p5b_specexec_k4 \
  --dataset /data/mt_bench.jsonl \
  --parallel-branches 8 \
  --branch-prune-threshold 0.0 \
  --out /data/results_specexec.jsonl
  11. Run AutoJudge only with checkpoint reuse (paper-aligned):
docker run --rm --gpus all -v "$(pwd)/datasets:/data" sp-samp-gpu \
  python -m benchmarks.bench_speculative \
  --dataset /data/gsm8k_train.jsonl \
  --hf-model Qwen/Qwen2.5-3B-Instruct \
  --hf-draft-model Qwen/Qwen2.5-0.5B-Instruct \
  --tokenizer Qwen/Qwen2.5-0.5B-Instruct \
  --draft-tokenizer Qwen/Qwen2.5-0.5B-Instruct \
  --device cuda \
  --use-chat-template \
  --method autojudge \
  --autojudge-task gsm8k \
  --autojudge-train-dataset /data/gsm8k_train.jsonl \
  --autojudge-train-samples 4000 \
  --autojudge-recall-target 0.9 \
  --autojudge-train-split 0.9 \
  --autojudge-checkpoint /data/autojudge_llama3.pt \
  --out /data/results_autojudge.jsonl
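
For intuition about --autojudge-recall-target: the judge is a Logistic Regression classifier whose decision threshold is calibrated on a validation split. A hedged sketch of recall-targeted calibration on synthetic stand-in data (names and data are illustrative, not this repo's mining pipeline):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins for mined judge features and accept/reject labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 8))
y = (X[:, 0] + 0.5 * rng.normal(size=2000) > 0).astype(int)
X_train, X_val, y_train, y_val = X[:1800], X[1800:], y[:1800], y[1800:]

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_val)[:, 1]

# Recall-targeted calibration: the (1 - target) quantile of the positive-
# class validation scores accepts ~90% of true positives at the threshold.
recall_target = 0.9
threshold = float(np.quantile(scores[y_val == 1], 1.0 - recall_target))
print(f"calibrated threshold: {threshold:.3f}")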

Make Targets

Defaults for benchmark paths:

  • DATASET=datasets/mt_bench.jsonl
  • OUT=datasets/results.jsonl
  1. Show all commands:
make help
  2. Install/upgrade dependencies in safe mode:
make setup
  3. Install/upgrade including GPU Python extras:
make setup-gpu
  4. Syntax check:
make check
  5. Validate benchmark JSONL schema:
make validate-results RESULTS=datasets/results.jsonl
  6. List presets:
make list-presets
  7. Validate config logic:
make validate-configs
  8. Quick toy benchmark (no HF models):
make bench-toy OUT=/tmp/bench_toy.jsonl
  9. Quick HF smoke run (needs torch + transformers; downloads a tiny model):
make smoke-hf OUT=/tmp/smoke_hf.jsonl

GPU host variant:

make smoke-hf-gpu OUT=/tmp/smoke_hf_gpu.jsonl
python scripts/validate_results_jsonl.py --path /tmp/smoke_hf_gpu.jsonl --strict

Expected result for the GPU variant: one run summary in the console and two JSONL records (run + summary) in the output file.

  10. Run an experiment on MT-Bench:

make bench DATASET=datasets/mt_bench.jsonl OUT=datasets/results.jsonl
  11. Run the AutoJudge preset:
make autojudge DATASET=datasets/mt_bench.jsonl OUT=datasets/results_autojudge.jsonl
  12. Run the SpecExec preset:
make specexec DATASET=datasets/mt_bench.jsonl OUT=datasets/results_specexec.jsonl
  13. Build and run the GPU Docker flow:
make docker-build-gpu
make docker-bench DATASET=datasets/mt_bench.jsonl OUT=datasets/results.jsonl
make docker-specexec DATASET=datasets/mt_bench.jsonl OUT=datasets/results_specexec.jsonl

For one-off sudo usage without docker-group setup:

make docker-gpu-check DOCKER_CMD="sudo docker"
make docker-build-gpu-safe DOCKER_CMD="sudo docker"

If your Docker host hits BuildKit snapshot/export errors, use:

make docker-build-gpu-safe

or clean builder cache:

make docker-prune-builder

Check GPU passthrough before long runs:

make docker-gpu-check
make docker-gpu-check-image

docker-gpu-check first tries nvidia-smi in a clean CUDA container. If NVML fails there, it falls back to a torch.cuda check in your built image. If the fallback reports that the image is missing, build it first with make docker-build-gpu-safe.

  14. Enforce headless GPU mode for long runs:

make bench DATASET=datasets/mt_bench.jsonl OUT=datasets/results.jsonl HEADLESS=1
  15. Short load run in Docker (gpt_oss_20b_4bit preset):
sudo docker run --rm --gpus all -v "$(pwd)/datasets:/data" sp-samp-gpu \
  python -m sp_samp.cli bench \
  --config-dir configs \
  --experiment gptoss20b_target_gptoss20b_draft_k4 \
  --dataset /data/mt_bench.jsonl \
  --runs 1 \
  --max-samples 20 \
  --max-new-tokens 128 \
  --out /data/results_gptoss20b_load.jsonl

If gpt-oss-20b fails to load because of a local bitsandbytes/triton runtime mismatch, fall back to:

sudo docker run --rm --gpus all -v "$(pwd)/datasets:/data" sp-samp-gpu \
  python -m sp_samp.cli bench \
  --config-dir configs \
  --experiment mistral_target_mistral_draft_k4 \
  --dataset /data/mt_bench.jsonl \
  --runs 1 \
  --max-samples 20 \
  --max-new-tokens 128 \
  --out /data/results_gptoss20b_load.jsonl

Validate the output:

python scripts/validate_results_jsonl.py --path datasets/results_gptoss20b_load.jsonl --strict
  16. Paper-style GSM8K AutoJudge evaluation (Qwen2.5 0.5B -> 3B, medium sweep):
mkdir -p datasets
curl -fsSL https://raw.githubusercontent.com/openai/grade-school-math/master/grade_school_math/data/train.jsonl -o datasets/gsm8k_train.jsonl
curl -fsSL https://raw.githubusercontent.com/openai/grade-school-math/master/grade_school_math/data/test.jsonl -o datasets/gsm8k_test.jsonl
make paper-eval

Raw JSONL stays in datasets/; report artifacts are written to reports/ as:

  • reports/autojudge_qwen25_paper_<date>.md
  • reports/autojudge_qwen25_paper_<date>.csv
  • reports/autojudge_qwen25_paper_<date>.json

The benchmark supports dedicated paper-eval switches:

  • --eval-task mtbench|gsm8k
  • --gsm8k-eval-mode zero_shot_cot|plain
  • --topk-rank <int|all>
  • --topk-grid <csv>
  • GSM8K output fields in JSONL: gsm8k_exact_match, gsm8k_correct, gsm8k_total

Troubleshooting (RTX 50xx / Blackwell)

  • Symptom: errors like sm_120 is not compatible with the current PyTorch installation or no kernel image is available for execution on the device.
  • Cause: a torch build without native Blackwell kernels.
  • Fix: rebuild with a cu128+ torch:
make docker-build-gpu-safe \
  CUDA_BASE_IMAGE=nvidia/cuda:12.8.1-cudnn-runtime-ubuntu22.04 \
  TORCH_INDEX_URL=https://download.pytorch.org/whl/cu128 \
  TORCH_VERSION=2.9.1

Presets

  • Models: configs/models.json
  • Methods: configs/methods.json
  • Experiments (target/draft pairings): configs/experiments.json
  • Method templates (AutoJudge/Top-K/SpecExec): configs/method_templates.json

CLI Runner

  1. List presets:
python -m sp_samp.cli list-presets --config-dir configs
  2. Direct method selection:
python -m benchmarks.bench_speculative \
  --method specexec \
  --dataset datasets/mt_bench.jsonl \
  --hf-model meta-llama/Meta-Llama-3-8B-Instruct \
  --hf-draft-model meta-llama/Meta-Llama-3-8B-Instruct \
  --parallel-branches 8 \
  --branch-prune-threshold 0.0
  3. Run a benchmark using presets:
python -m sp_samp.cli bench \
  --config-dir configs \
  --model-preset gpt_oss_20b_4bit \
  --method-preset speculative_k4 \
  --dataset datasets/mt_bench.jsonl \
  --out datasets/results.jsonl
  4. Run a benchmark using an experiment preset:
python -m sp_samp.cli bench \
  --config-dir configs \
  --experiment qwen25_3b_target_qwen25_0p5b_all_methods \
  --dataset datasets/mt_bench.jsonl \
  --out datasets/results.jsonl
  5. Run the paper-method bundle only (baseline + speculative + autojudge + topk):
python -m sp_samp.cli bench \
  --config-dir configs \
  --experiment qwen25_3b_target_qwen25_0p5b_all_paper \
  --dataset datasets/gsm8k_test.jsonl \
  --eval-task gsm8k \
  --gsm8k-eval-mode zero_shot_cot \
  --out datasets/results_all_paper.jsonl
  6. Run the AutoJudge shortcut command:
python -m sp_samp.cli autojudge \
  --config-dir configs \
  --experiment qwen25_3b_target_qwen25_0p5b_autojudge_k4 \
  --dataset datasets/mt_bench.jsonl \
  --out datasets/results_autojudge.jsonl
  7. Run the SpecExec shortcut command:
python -m sp_samp.cli specexec \
  --config-dir configs \
  --experiment qwen25_3b_target_qwen25_0p5b_specexec_k4 \
  --dataset datasets/mt_bench.jsonl \
  --out datasets/results_specexec.jsonl
  8. Require headless GPU mode (fail fast when a display is active):
python -m sp_samp.cli bench \
  --config-dir configs \
  --experiment qwen25_3b_target_qwen25_0p5b_all_methods \
  --dataset datasets/mt_bench.jsonl \
  --require-headless \
  --out datasets/results.jsonl

Metrics Output

The benchmark writes JSONL records with per-run metrics and a summary record per method. Fields include:

  • status (ok/error/skipped)
  • resume_key (used to skip completed runs on re-launch)
  • tokens_per_sec
  • acceptance_rate
  • avg_tokens_per_step
  • proposed, accepted, rejections
  • judge_accept_rate (AutoJudge only)
  • target_fallback_rate (AutoJudge only)
  • autojudge_train_samples, autojudge_val_auc, autojudge_val_recall (AutoJudge only)
  • autojudge_threshold_calibrated, autojudge_threshold_used (AutoJudge only)
  • legacy aliases: autojudge_threshold_selected (= calibrated), autojudge_threshold (= used)
  • topk_accept_rate, topk_rank_effective, topk_mismatches, topk_accepted_mismatches (Top-K only)
  • branch_prune_rate (SpecExec only)
  • effective_parallelism (SpecExec only)
  • target_calls_per_token (AutoJudge, Top-K, and SpecExec)
  • draft_calls_per_token (SpecExec only)
  • cache_hit_rate (SpecExec only)
  • max_active_branches (SpecExec only)
  • gsm8k_exact_match, gsm8k_correct, gsm8k_total (when --eval-task gsm8k)
  • error_type, error_message, traceback (for failed runs)
  • System metadata: git_sha, hostname, gpu_name, gpu_driver, cuda_runtime, torch_version, transformers_version, display_active

Validate the output schema with:
python scripts/validate_results_jsonl.py --path datasets/results.jsonl --strict
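
For quick inspection beyond schema validation, a minimal reader over the documented fields (the method key is an assumption about the record layout; adjust if the schema differs):

import json

# Print per-record throughput and acceptance from the benchmark output.
with open("datasets/results.jsonl") as f:
    for line in f:
        rec = json.loads(line)
        if rec.get("status") == "ok" and "tokens_per_sec" in rec:
            print(rec.get("method"), rec["tokens_per_sec"],
                  rec.get("acceptance_rate"))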

Project Layout

  • sp_samp/: core library, AutoJudge, SpecExec, HF adapter.
  • benchmarks/: benchmark runner.
  • configs/: preset configs.
  • tests/: tests.
  • papers/: source research papers (AutoJudge, SpecExec, Speculative Sampling).
  • file_changes/: dated change records and audit notes.
  • Dockerfile, Dockerfile.gpu: containers.

Algorithm Variants vs. Papers

The implementation is paper-aligned with the following documented deviations:

  • AutoJudge mining: the initial response is generated by the target model (matching the paper's mathematical definition of I(x)), whereas the Algorithm 1 pseudocode uses the draft model. No correctness impact; see file_changes/2026-02-25-paper-alignment.md.
  • AutoJudge decoding: uses greedy (argmax) decoding throughout, whereas Appendix A of the paper describes Gumbel-max stochastic sampling. This is a valid deterministic variant.
  • SpecExec tree search: uses a level-by-level BFS with parallel_branches / branch_prune_threshold instead of the paper's SSSP / modified-Dijkstra priority-queue algorithm. The output distribution is preserved; this is a known, intentional simplification (see the schematic below).

Full audit notes: file_changes/2026-02-25-paper-alignment.md.
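
To make the BFS deviation concrete, here is a schematic of one level of tree growth under parallel_branches and branch_prune_threshold (illustrative data structures, not the repo's implementation):

# One BFS level: expand every surviving branch by its draft continuations,
# prune by joint probability, and cap the frontier width.
def grow_level(branches, expand, parallel_branches, prune_threshold):
    """branches: list of (token_prefix, joint_prob) pairs;
    expand(prefix) yields (token_id, prob) draft continuations."""
    children = [(prefix + [tok], prob * p)
                for prefix, prob in branches
                for tok, p in expand(prefix)]
    children = [c for c in children if c[1] >= prune_threshold]
    children.sort(key=lambda c: c[1], reverse=True)
    return children[:parallel_branches]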

Notes

  • SpecExec is implemented as an exact (distribution-preserving) decoder with speculative cache prefill and configurable parallel_branches / branch_prune_threshold.
  • HF SpecExec uses KV-cache reuse and depth-wise tree passes for faster cache construction.
  • Draft and target models must share an identical tokenizer vocabulary mapping for speculative, AutoJudge, Top-K, and SpecExec correctness (a quick check follows this list).
  • AutoJudge training in this implementation is GSM8K-specific. If --autojudge-checkpoint is absent, provide a GSM8K-compatible train dataset (question + answer) via --autojudge-train-dataset or --dataset.
  • AutoJudge now fails fast with an actionable message when the train dataset path is missing or not GSM8K-compatible (for example, MT-Bench JSONL is not valid for judge training).
  • scripts/install_dependencies.sh is idempotent for apt/pip dependencies and never modifies NVIDIA driver packages.
  • scripts/install_dependencies.sh prefers python3.11 when available and warns when running with lower versions.
  • Ubuntu 17 is EOL. The script blocks by default on EOL Ubuntu unless --allow-eol-ubuntu is set explicitly.
  • Re-running a benchmark with the same OUT file automatically skips completed runs (resume mode).
  • Failed runs are written to JSONL and do not stop the whole benchmark method loop.
  • make validate-configs checks config references and tokenizer compatibility in configs/*.json.
  • make validate-results enforces JSONL schema compatibility for downstream analysis.
  • Current Make defaults are paper-aligned for compatible open models: Qwen2.5-0.5B-Instruct draft -> Qwen2.5-3B-Instruct target.
  • Qwen2.5-0.5B -> Qwen2.5-7B is kept as a legacy preset pair, but it fails the strict speculative methods because the model vocab sizes differ.
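
A quick way to verify the shared-vocabulary requirement before a long run (assumes transformers is installed; the model names are the current Make defaults):

from transformers import AutoTokenizer

# Compare draft and target vocab mappings; a mismatch breaks the strict
# speculative methods (see the Qwen2.5-7B legacy pair above).
target = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
draft = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
assert target.get_vocab() == draft.get_vocab(), "tokenizer vocabularies differ"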

Docker Troubleshooting

  • The error parent snapshot ... does not exist during docker build is typically a Docker BuildKit cache/snapshot issue on the host.
  • Recovery sequence:
make docker-prune-builder
make docker-build-gpu-safe
  • docker-build-gpu-safe first tries BuildKit and automatically retries with the legacy builder (DOCKER_BUILDKIT=0) if BuildKit fails.
  • RuntimeError: No CUDA GPUs are available inside the container means GPU runtime passthrough is not working (a host-side setup issue), not a model-code problem.
  • Could not find the bitsandbytes CUDA binary or No module named 'triton.ops' during GPT-OSS load indicates a bitsandbytes/triton mismatch in the container runtime path.
  • Quick diagnostics:
make docker-gpu-check
make docker-gpu-check-image
  • If docker-gpu-check fails, (re)configure nvidia-container-toolkit and restart Docker daemon.
  • For deterministic progress while investigating GPT-OSS runtime issues, run the short load fallback preset mistral_target_mistral_draft_k4.
