fix(distill): default Q4K teacher to CudaTrainerTeacher (cuBLAS) — revert Bug B's slow path (PMAT-704) by noahgift · Pull Request #1879 · paiml/aprender

noahgift · 2026-05-22T10:37:31Z

Summary

Post-mortem of the PMAT-701 cascade revealed PR #1869 (Bug B) was a wrong turn. It routed Q4K teachers to `RealizarQ4KTeacher`, which runs layer-norm + attention + softmax on CPU (only Q4K matmuls dispatch to GPU). The 7B vocab-aligned 500-step validation hung at step 0 for 1.5 h with GPU at 0% utilization — empirical proof the path is unusable as a real distillation teacher.

Five-whys

1. 7B 500-step validation hung at step 0 → `RealizarQ4KTeacher` forward is CPU-bound; GPU stayed at 0% the entire run.
2. `OwnedQuantizedModelCuda::forward_cuda` is mostly CPU SIMD → only individual Q4K matmuls dispatch to GPU.
3. PR #1869 picked that path → to avoid an F32 dequant at upload (claimed 28 GB inflation → Linux OOM-kill).
4. That claim was wrong post-PMAT-701 → it rested on a single SIGKILL observation with `MANAGED_MEMORY=1` that was never verified as the OOM-killer via dmesg; the cuBLAS path was never re-tested under PMAT-701 Bug A's autodetect default.
5. Why I committed before verifying → cascade momentum; a one-line dispatch flip would have rejected the hypothesis in 5 minutes; a multi-PR architectural detour shipped instead.

Root cause: conflated the cuMemAlloc 30 GB ceiling (real, fixed by Bug A) with a step-0 SIGKILL on the explicit-managed path (phantom, never verified). The F32 dequant fits in 128 GB unified memory; cuBLAS is the fast path and the right default.
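The memory arithmetic behind the root cause can be checked in a few lines. This is a back-of-envelope sketch, not code from the PR: the round 7,000,000,000 parameter count is an assumption for a "7B" teacher, the Q4_K figure assumes the GGML layout of 144 bytes per 256-weight super-block, and the helper names are hypothetical. It counts teacher weights only (no student, activations, or optimizer state).

```rust
/// Bytes needed to hold `params` weights as F32 (4 bytes each).
fn f32_bytes(params: u64) -> u64 {
    params * 4
}

/// Bytes for the same weights in a Q4_K-style layout.
/// Assumption: 144 bytes per 256-weight super-block (GGML Q4_K).
fn q4k_bytes(params: u64) -> u64 {
    params / 256 * 144
}

fn main() {
    const GB: u64 = 1_000_000_000; // decimal gigabytes
    let params: u64 = 7_000_000_000; // assumed round figure for a "7B" teacher

    let dequantized = f32_bytes(params);
    let quantized = q4k_bytes(params);

    println!("F32 teacher weights:  {} GB", dequantized / GB);
    println!("Q4_K teacher weights: {:.1} GB", quantized as f64 / GB as f64);
    println!("F32 fits in 128 GB unified memory: {}", dequantized < 128 * GB);
}
```

Under these assumptions the F32 copy is the 28 GB quoted in the original Bug B claim, well below the 128 GB of unified memory cited above.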

Fix

`crates/apr-cli/src/commands/distill.rs::run_cuda_backend`:

- New env var `APR_DISTILL_TEACHER_BACKEND` ∈ {`auto` (default), `cudatrainer`, `realizar-q4k`}.
- Default dispatch: `CudaTrainerTeacher` (cuBLAS, F32 dequant) for all teacher types.
- `RealizarQ4KTeacher` retained as opt-in fallback for memory-constrained dGPUs.
- Generic `TruncatingTeacher` wrapper applies PMAT-703 vocab alignment uniformly to both backends (supersedes PR #1877's planned per-backend truncation).
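The dispatch rule described in this list can be sketched as follows. This is an illustration only, not the code in `run_cuda_backend`: the `TeacherBackend` enum and the `parse_backend` / `backend_from_env` helpers are hypothetical names, and rejecting unknown values with an error is an assumption about behavior the PR does not describe.

```rust
use std::env;

/// Hypothetical enum naming the two teacher backends from the PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TeacherBackend {
    CudaTrainer, // cuBLAS, F32 dequant at upload
    RealizarQ4K, // opt-in fallback for memory-constrained dGPUs
}

/// Map the raw env value to a backend. Unset and `auto` both resolve
/// to the cuBLAS path; erroring on unknown values is an assumption.
fn parse_backend(raw: Option<&str>) -> Result<TeacherBackend, String> {
    match raw {
        None | Some("auto") | Some("cudatrainer") => Ok(TeacherBackend::CudaTrainer),
        Some("realizar-q4k") => Ok(TeacherBackend::RealizarQ4K),
        Some(other) => Err(format!(
            "unknown APR_DISTILL_TEACHER_BACKEND value: {other}"
        )),
    }
}

/// Read the env var named in the PR and resolve it.
fn backend_from_env() -> Result<TeacherBackend, String> {
    let raw = env::var("APR_DISTILL_TEACHER_BACKEND").ok();
    parse_backend(raw.as_deref())
}

fn main() {
    match backend_from_env() {
        Ok(backend) => println!("teacher backend = {backend:?}"),
        Err(msg) => eprintln!("{msg}"),
    }
}
```

Keeping the parsing in a pure function that takes `Option<&str>` makes the "default routes to CudaTrainer" and "env override reaches Realizar" checks testable without mutating process environment.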

Contract

`contracts/apr-distill-teacher-backend-selection-v1.yaml` (validates clean):

- 3 equations: backend dispatch, forward latency invariant, Bug B demotion
- 4 falsifiers: default routes to CudaTrainer; env override reaches Realizar; 7B 500-step completes < 30 min; forward parity within Q4K noise floor
- 2 Kani harnesses; qa_gate F-BACKEND-SELECT-001
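Two of the falsifiers reduce to simple numeric checks. The sketch below is not taken from the contract file: `per_step_budget` and `max_abs_diff` are hypothetical helpers, and the Q4K noise-floor tolerance is left to the caller because the PR text does not state a value.

```rust
use std::time::Duration;

/// Average wall-clock budget per step implied by a total-time bound.
fn per_step_budget(total: Duration, steps: u32) -> Duration {
    total / steps
}

/// Largest element-wise absolute difference between two logit vectors.
/// Returns `None` when the lengths differ, since parity is then undefined.
fn max_abs_diff(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() {
        return None;
    }
    Some(
        a.iter()
            .zip(b)
            .map(|(x, y)| (x - y).abs())
            .fold(0.0, f32::max),
    )
}

fn main() {
    // "7B 500-step completes < 30 min" implies this average per-step budget.
    let budget = per_step_budget(Duration::from_secs(30 * 60), 500);
    println!("per-step budget: {budget:?}");

    // Parity check between two backends' logits; the values are made up.
    let cublas = [0.10_f32, -1.25, 3.00];
    let realizar = [0.11_f32, -1.20, 3.00];
    println!("max |diff| = {:?}", max_abs_diff(&cublas, &realizar));
}
```

The 30-minute bound works out to 3.6 s per step on average, which is the figure a per-step timing log would be compared against.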

The original Bug B contract (`cuda-q4k-frozen-teacher-v1.yaml`) is demoted, not retracted — its math holds as a memory-constrained fallback path; its DEFAULT-PATH claim was wrong on unified-memory devices.

Verification on gx10

Dispatch log:

```
[PMAT-704] backend=auto → CudaTrainerTeacher (cuBLAS) [override with APR_DISTILL_TEACHER_BACKEND=realizar-q4k for memory-constrained dGPU]
[PMAT-703] vocab alignment: teacher native=152064, student=151936 → truncating teacher logits to student vocab
[PMAT-704] teacher backend = CudaTrainerTeacher [Q4K/Q6K (dequant to F32 at GPU upload; cuBLAS GEMM)]
[CUDA] cuBLAS initialized — forward TF32 tensor cores (41x vs SIMD)
✓ 28 transformer blocks uploaded to GPU
```

GPU utilization observed at 96% during training (was 0% on the Realizar path). cuBLAS is dispatching correctly; the cascade is no longer hanging at the first step.
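The `[PMAT-703]` line in the log above reports teacher logits being truncated from the native vocab to the student vocab. A minimal sketch of a generic wrapper that does this for any backend is shown here. The `Teacher` trait, its method names, the row-major logits layout, and the assumption that the shared token IDs are the leading entries of each row are all illustrative; the PR does not show the real `TruncatingTeacher` signature.

```rust
/// Hypothetical teacher interface; the real trait is not shown in the PR.
trait Teacher {
    fn vocab_size(&self) -> usize;
    /// Row-major logits: one row of `vocab_size()` values per input token.
    fn forward(&self, tokens: &[u32]) -> Vec<f32>;
}

/// Wraps any teacher and keeps only the first `target_vocab` logits per row.
struct TruncatingTeacher<T: Teacher> {
    inner: T,
    target_vocab: usize,
}

impl<T: Teacher> TruncatingTeacher<T> {
    fn new(inner: T, target_vocab: usize) -> Result<Self, String> {
        if target_vocab == 0 || target_vocab > inner.vocab_size() {
            return Err(format!(
                "target vocab {target_vocab} not in 1..={}",
                inner.vocab_size()
            ));
        }
        Ok(Self { inner, target_vocab })
    }
}

impl<T: Teacher> Teacher for TruncatingTeacher<T> {
    fn vocab_size(&self) -> usize {
        self.target_vocab
    }

    fn forward(&self, tokens: &[u32]) -> Vec<f32> {
        let native = self.inner.vocab_size();
        let full = self.inner.forward(tokens);
        // Slice each native-width row down to the student width.
        full.chunks_exact(native)
            .flat_map(|row| row[..self.target_vocab].iter().copied())
            .collect()
    }
}

/// Stand-in teacher for the example: logit = 10 * row index + column index.
struct FakeTeacher {
    vocab: usize,
}

impl Teacher for FakeTeacher {
    fn vocab_size(&self) -> usize {
        self.vocab
    }
    fn forward(&self, tokens: &[u32]) -> Vec<f32> {
        (0..tokens.len())
            .flat_map(|r| (0..self.vocab).map(move |c| (10 * r + c) as f32))
            .collect()
    }
}

fn main() {
    let teacher = TruncatingTeacher::new(FakeTeacher { vocab: 5 }, 3).unwrap();
    println!("{:?}", teacher.forward(&[7, 8]));
}
```

Because the wrapper only depends on the trait, the same truncation applies to either backend, which matches the PR's stated reason for replacing per-backend truncation.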

Cascade context

Fifth fix in the PMAT-701 family:

1. #1863, feat(cuda): autodetect Grace Blackwell + Q4K frozen-teacher contract (PMAT-701). Bug A: allocator autodetect Grace Blackwell
2. #1869, fix(distill): RealizarQ4KTeacher — Q4K-native frozen-teacher path (PMAT-701 Bug B). Bug B: RealizarQ4KTeacher (now demoted to opt-in fallback)
3. #1874, fix(eval): apr eval no longer reports fake pass@1=1.0 on broken models (PMAT-702). Defect 3 / PMAT-702: apr eval no-fake-pass on broken models
4. #1877, fix(distill): vocab-align teacher logits for Qwen2.5-Coder 7B → 0.5B KD (PMAT-703). Bug B's vocab alignment (superseded by TruncatingTeacher here)
5. This PR: cuBLAS default + opt-in Realizar fallback (PMAT-704)

Test plan

- `cargo build --release --features cuda -p apr-cli`: clean
- `cargo fmt -p apr-cli --check`: clean
- `pv validate contracts/apr-distill-teacher-backend-selection-v1.yaml`: clean
- Dispatch markers fire correctly on gx10
- GPU utilization 96% during training (vs 0% on Bug B's path)
- CI: `ci / gate` + `workspace-test` green
- Validation run completion + per-step loss capture as evidence (in flight; Monitor task beaokloec)

🤖 Generated with Claude Code

…LAS) — revert Bug B's slow path (PMAT-704)

Post-mortem of the PMAT-701 cascade revealed PR #1869 (Bug B) routed Q4K teachers to `RealizarQ4KTeacher`, a CPU-heavy forward path (layer-norm + attention + softmax all on CPU; only Q4K matmuls dispatch to GPU). The 7B vocab-aligned 500-step validation hung at step 0 for 1.5 h with GPU at 0% utilization: empirical proof the path is unusable as a real distillation teacher.

Five-whys (evidence/distill-7b-cublas-cudatrainer/findings.json):

* Why 1: 7B 500-step validation hung at step 0 → RealizarQ4KTeacher's forward runs most ops on CPU; GPU stayed at 0%.
* Why 2: realizar's `OwnedQuantizedModelCuda::forward_cuda` is mostly CPU SIMD; only individual Q4K matmuls dispatch to GPU.
* Why 3: PR #1869 picked that path to avoid an F32 dequant at upload (claimed "28 GB inflation + student → Linux OOM-kill").
* Why 4: That claim was based on a single SIGKILL observation with MANAGED_MEMORY=1 explicit, never verified via dmesg as actual OOM-killer, never re-tested under PMAT-701 Bug A's autodetect default.
* Why 5: Cascade momentum + incomplete root-cause discipline. The cheap experiment (one-line dispatch flip) would have rejected the hypothesis in 5 minutes; a multi-PR architectural detour shipped instead.

Root cause: conflated the cuMemAlloc 30 GB ceiling (real, fixed by Bug A) with a step-0 SIGKILL on the explicit-managed path (phantom, never verified). The F32 dequant fits in 128 GB unified memory; cuBLAS is the fast path and the right default.

## Fix

`crates/apr-cli/src/commands/distill.rs::run_cuda_backend`:

* New env var `APR_DISTILL_TEACHER_BACKEND` with values:
  - `auto` (default) → CudaTrainerTeacher (cuBLAS, F32 dequant)
  - `cudatrainer` → CudaTrainerTeacher (explicit)
  - `realizar-q4k` → RealizarQ4KTeacher (memory-constrained-device fallback)
* Q4K detection still happens, but only controls the fallback path's availability; the DEFAULT dispatch is cuBLAS for all teacher types.
* Generic `TruncatingTeacher` wrapper applies PMAT-703 vocab alignment uniformly to both backends. Replaces the per-backend truncation that PR #1877 was planning to add inside `RealizarQ4KTeacher::from_apr_path_with_target_vocab`.

`contracts/apr-distill-teacher-backend-selection-v1.yaml`:

* 3 equations (backend_dispatch, forward_latency_invariant, bug_b_demotion)
* 4 falsifiers: default routes to CudaTrainer; env override reaches Realizar; 7B 500-step completes < 30 min; forward parity within Q4K noise floor
* 2 Kani harnesses; qa_gate F-BACKEND-SELECT-001
* Validates clean: `pv validate` reports 0 errors, 0 warnings.

## Verification on gx10

Dispatch log shows the new path firing:

    [PMAT-704] backend=auto → CudaTrainerTeacher (cuBLAS) [override with APR_DISTILL_TEACHER_BACKEND=realizar-q4k for memory-constrained dGPU]
    [PMAT-703] vocab alignment: teacher native=152064, student=151936 → truncating teacher logits to student vocab
    [PMAT-704] teacher backend = CudaTrainerTeacher [Q4K/Q6K (dequant to F32 at GPU upload; cuBLAS GEMM)]
    [CUDA] cuBLAS initialized — forward TF32 tensor cores (41x vs SIMD)
    ✓ Loaded pre-trained weights successfully (APR)
    ✓ 28 transformer blocks uploaded to GPU

GPU utilization observed at 96% during training (was 0% on the RealizarQ4KTeacher path). cuBLAS is dispatching correctly; the cascade is no longer hanging at the first step.

## Cascade context

This is the fifth fix in the PMAT-701 family:

- #1863 Bug A: allocator autodetect Grace Blackwell
- #1869 Bug B: RealizarQ4KTeacher (now demoted to opt-in fallback)
- #1874 Defect 3 / PMAT-702: apr eval no-fake-pass on broken models
- #1877 Bug B's vocab alignment (superseded by TruncatingTeacher in this PR)
- This PR: cuBLAS default + opt-in Realizar fallback (PMAT-704)

The original Bug B contract (cuda-q4k-frozen-teacher-v1.yaml) is **demoted, not retracted**: its math is correct as a memory-constrained fallback path; its DEFAULT-PATH claim was wrong on unified-memory devices.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

noahgift enabled auto-merge (squash) May 22, 2026 10:37

This was referenced May 22, 2026

docs(spec): SPEC-DISTILL-001 §87 — PMAT-704 post-mortem on Bug B wrong turn #1880

Open

feat(distill): wire ProgressCallback into Pipeline — close training-monitoring gap (PMAT-705) #1881

Open

noahgift wants to merge 1 commit into main from fix/distill-teacher-backend-selection-pmat-704
