fix(distill): default Q4K teacher to CudaTrainerTeacher (cuBLAS), revert Bug B's slow path (PMAT-704) #1879
Open
noahgift wants to merge 1 commit into
…LAS), revert Bug B's slow path (PMAT-704)

Post-mortem of the PMAT-701 cascade revealed PR #1869 (Bug B) routed Q4K teachers to `RealizarQ4KTeacher`, a CPU-heavy forward path (layer-norm + attention + softmax all on CPU; only Q4K matmuls dispatch to GPU). The 7B vocab-aligned 500-step validation hung at step 0 for 1.5 h with GPU at 0% utilization: empirical proof the path is unusable as a real distillation teacher.

Five-whys (evidence/distill-7b-cublas-cudatrainer/findings.json):

* Why 1: 7B 500-step validation hung at step 0 → RealizarQ4KTeacher's forward runs most ops on CPU; GPU stayed at 0%.
* Why 2: realizar's `OwnedQuantizedModelCuda::forward_cuda` is mostly CPU SIMD; only individual Q4K matmuls dispatch to GPU.
* Why 3: PR #1869 picked that path to avoid an F32 dequant at upload (claimed "28 GB inflation + student → Linux OOM-kill").
* Why 4: That claim was based on a single SIGKILL observation with MANAGED_MEMORY=1 explicit, never verified via dmesg as actual OOM-killer, never re-tested under PMAT-701 Bug A's autodetect default.
* Why 5: Cascade momentum + incomplete root-cause discipline. The cheap experiment (one-line dispatch flip) would have rejected the hypothesis in 5 minutes; a multi-PR architectural detour shipped instead.

Root cause: conflated the cuMemAlloc 30 GB ceiling (real, fixed by Bug A) with a step-0 SIGKILL on the explicit-managed path (phantom, never verified). The F32 dequant fits in 128 GB unified memory; cuBLAS is the fast path and the right default.

## Fix

`crates/apr-cli/src/commands/distill.rs::run_cuda_backend`:

* New env var `APR_DISTILL_TEACHER_BACKEND` with values:
  - `auto` (default) → CudaTrainerTeacher (cuBLAS, F32 dequant)
  - `cudatrainer` → CudaTrainerTeacher (explicit)
  - `realizar-q4k` → RealizarQ4KTeacher (memory-constrained-device fallback)
* Q4K detection still happens, but only controls the fallback path's availability; the DEFAULT dispatch is cuBLAS for all teacher types.
* Generic `TruncatingTeacher` wrapper applies PMAT-703 vocab alignment uniformly to both backends. Replaces the per-backend truncation that PR #1877 was planning to add inside `RealizarQ4KTeacher::from_apr_path_with_target_vocab`.

`contracts/apr-distill-teacher-backend-selection-v1.yaml`:

* 3 equations (backend_dispatch, forward_latency_invariant, bug_b_demotion)
* 4 falsifiers: default routes to CudaTrainer; env override reaches Realizar; 7B 500-step completes < 30 min; forward parity within Q4K noise floor
* 2 Kani harnesses; qa_gate F-BACKEND-SELECT-001
* Validates clean: `pv validate` reports 0 errors, 0 warnings.

## Verification on gx10

Dispatch log shows the new path firing:

    [PMAT-704] backend=auto → CudaTrainerTeacher (cuBLAS) [override with APR_DISTILL_TEACHER_BACKEND=realizar-q4k for memory-constrained dGPU]
    [PMAT-703] vocab alignment: teacher native=152064, student=151936 → truncating teacher logits to student vocab
    [PMAT-704] teacher backend = CudaTrainerTeacher [Q4K/Q6K (dequant to F32 at GPU upload; cuBLAS GEMM)]
    [CUDA] cuBLAS initialized — forward TF32 tensor cores (41x vs SIMD)
    ✓ Loaded pre-trained weights successfully (APR)
    ✓ 28 transformer blocks uploaded to GPU

GPU utilization observed at 96% during training (was 0% on the RealizarQ4KTeacher path). cuBLAS is dispatching correctly; the cascade is no longer hanging at the first step.

## Cascade context

This is the fifth fix in the PMAT-701 family:

- #1863 Bug A: allocator autodetect Grace Blackwell
- #1869 Bug B: RealizarQ4KTeacher (now demoted to opt-in fallback)
- #1874 Defect 3 / PMAT-702: apr eval no-fake-pass on broken models
- #1877 Bug B's vocab alignment (superseded by TruncatingTeacher in this PR)
- This PR: cuBLAS default + opt-in Realizar fallback (PMAT-704)

The original Bug B contract (cuda-q4k-frozen-teacher-v1.yaml) is **demoted, not retracted**: its math is correct as a memory-constrained fallback path; its DEFAULT-PATH claim was wrong on unified-memory devices.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Post-mortem of the PMAT-701 cascade revealed PR #1869 (Bug B) was a wrong turn. It routed Q4K teachers to `RealizarQ4KTeacher`, which runs layer-norm + attention + softmax on CPU (only Q4K matmuls dispatch to GPU). The 7B vocab-aligned 500-step validation hung at step 0 for 1.5 h with GPU at 0% utilization — empirical proof the path is unusable as a real distillation teacher.
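Five-whys
- Why 1: 7B 500-step validation hung at step 0 → RealizarQ4KTeacher's forward runs most ops on CPU; GPU stayed at 0%.
- Why 2: realizar's `OwnedQuantizedModelCuda::forward_cuda` is mostly CPU SIMD; only individual Q4K matmuls dispatch to GPU.
- Why 3: PR #1869 picked that path to avoid an F32 dequant at upload (claimed "28 GB inflation + student → Linux OOM-kill").
- Why 4: That claim was based on a single SIGKILL observation with MANAGED_MEMORY=1 explicit, never verified via dmesg as actual OOM-killer, never re-tested under PMAT-701 Bug A's autodetect default.
- Why 5: Cascade momentum + incomplete root-cause discipline. The cheap experiment (one-line dispatch flip) would have rejected the hypothesis in 5 minutes; a multi-PR architectural detour shipped instead.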
Root cause: conflated the cuMemAlloc 30 GB ceiling (real, fixed by Bug A) with a step-0 SIGKILL on the explicit-managed path (phantom, never verified). The F32 dequant fits in 128 GB unified memory; cuBLAS is the fast path and the right default.
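As a rough sanity check on the memory claim, the weights-only F32 footprint can be estimated at 4 bytes per parameter. The sketch below assumes a nominal 7 × 10^9 parameters and decimal gigabytes. The helper names are hypothetical and not taken from the PR, and the estimate deliberately ignores the student model, activations, and optimizer state.

```rust
/// Decimal gigabyte, used only for this back-of-the-envelope estimate.
pub const GB: u64 = 1_000_000_000;

/// Bytes needed to hold `params` weights as F32 (4 bytes each).
/// Hypothetical helper; saturates instead of overflowing.
pub fn f32_weight_bytes(params: u64) -> u64 {
    params.saturating_mul(4)
}

/// True when the F32 weights alone fit in `capacity_bytes`.
/// Assumption: nothing else competes for the same memory pool.
pub fn weights_fit(params: u64, capacity_bytes: u64) -> bool {
    f32_weight_bytes(params) <= capacity_bytes
}
```

Under these assumptions `f32_weight_bytes(7_000_000_000)` is 28 × 10^9 bytes, which matches the 28 GB figure quoted in the commit message and sits well below the 128 GB unified-memory capacity.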
Fix
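`crates/apr-cli/src/commands/distill.rs::run_cuda_backend`:
- New env var `APR_DISTILL_TEACHER_BACKEND` with values:
  - `auto` (default) → CudaTrainerTeacher (cuBLAS, F32 dequant)
  - `cudatrainer` → CudaTrainerTeacher (explicit)
  - `realizar-q4k` → RealizarQ4KTeacher (memory-constrained-device fallback)
- Q4K detection still happens, but only controls the fallback path's availability; the DEFAULT dispatch is cuBLAS for all teacher types.
- Generic `TruncatingTeacher` wrapper applies PMAT-703 vocab alignment uniformly to both backends, replacing the per-backend truncation that PR #1877 was planning to add inside `RealizarQ4KTeacher::from_apr_path_with_target_vocab`.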
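The `APR_DISTILL_TEACHER_BACKEND` dispatch described in the commit message could be sketched roughly as follows. This is a minimal illustration, not the code in `run_cuda_backend`: the `TeacherBackend` enum and both function names are hypothetical, and treating an unset variable as `auto` and rejecting unknown values with an error are assumptions about behavior the PR text does not spell out.

```rust
use std::env;

/// Hypothetical enum naming the two teacher backends mentioned in the PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TeacherBackend {
    /// cuBLAS path: Q4K weights dequantized to F32 at GPU upload.
    CudaTrainer,
    /// Opt-in fallback for memory-constrained devices.
    RealizarQ4K,
}

/// Maps a raw env value to a backend.
/// Assumption: `None` (variable unset) behaves like `auto`.
pub fn parse_backend(raw: Option<&str>) -> Result<TeacherBackend, String> {
    match raw {
        None | Some("auto") | Some("cudatrainer") => Ok(TeacherBackend::CudaTrainer),
        Some("realizar-q4k") => Ok(TeacherBackend::RealizarQ4K),
        // Assumption: unknown values are an error rather than a silent default.
        Some(other) => Err(format!(
            "unknown APR_DISTILL_TEACHER_BACKEND value: {other}"
        )),
    }
}

/// Reads the variable from the process environment and parses it.
pub fn backend_from_env() -> Result<TeacherBackend, String> {
    parse_backend(env::var("APR_DISTILL_TEACHER_BACKEND").ok().as_deref())
}
```

Keeping the parsing in a pure function that takes `Option<&str>` makes the mapping testable without mutating the process environment; only `backend_from_env` touches `std::env`.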
Contract
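`contracts/apr-distill-teacher-backend-selection-v1.yaml` (validates clean):
- 3 equations (backend_dispatch, forward_latency_invariant, bug_b_demotion)
- 4 falsifiers: default routes to CudaTrainer; env override reaches Realizar; 7B 500-step completes < 30 min; forward parity within Q4K noise floor
- 2 Kani harnesses; qa_gate F-BACKEND-SELECT-001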
The original Bug B contract (`cuda-q4k-frozen-teacher-v1.yaml`) is demoted, not retracted — its math holds as a memory-constrained fallback path; its DEFAULT-PATH claim was wrong on unified-memory devices.
Verification on gx10
Dispatch log:
```
[PMAT-704] backend=auto → CudaTrainerTeacher (cuBLAS) [override with APR_DISTILL_TEACHER_BACKEND=realizar-q4k for memory-constrained dGPU]
[PMAT-703] vocab alignment: teacher native=152064, student=151936 → truncating teacher logits to student vocab
[PMAT-704] teacher backend = CudaTrainerTeacher [Q4K/Q6K (dequant to F32 at GPU upload; cuBLAS GEMM)]
[CUDA] cuBLAS initialized — forward TF32 tensor cores (41x vs SIMD)
✓ 28 transformer blocks uploaded to GPU
```
GPU utilization observed at 96% during training (was 0% on the Realizar path). cuBLAS is dispatching correctly; the cascade is no longer hanging at the first step.
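The PMAT-703 line in the dispatch log (teacher native=152064, student=151936) reflects a backend-agnostic truncation of teacher logits. A minimal sketch of such a wrapper is below. The `Teacher` trait, its method signatures, and `FixedTeacher` are hypothetical stand-ins, not the crate's actual types, and the sketch assumes the student vocabulary is a prefix of the teacher's, so that dropping the trailing entries of each row is a valid alignment.

```rust
/// Hypothetical minimal teacher interface: returns row-major logits,
/// one row of `vocab_size()` values per input token.
pub trait Teacher {
    fn vocab_size(&self) -> usize;
    fn logits(&self, token_ids: &[u32]) -> Vec<f32>;
}

/// Wraps any teacher and keeps only the first `target_vocab` logits per row.
pub struct TruncatingTeacher<T: Teacher> {
    inner: T,
    target_vocab: usize,
}

impl<T: Teacher> TruncatingTeacher<T> {
    pub fn new(inner: T, target_vocab: usize) -> Result<Self, String> {
        let native = inner.vocab_size();
        if target_vocab == 0 || target_vocab > native {
            return Err(format!("target vocab {target_vocab} must be in 1..={native}"));
        }
        Ok(Self { inner, target_vocab })
    }
}

impl<T: Teacher> Teacher for TruncatingTeacher<T> {
    fn vocab_size(&self) -> usize {
        self.target_vocab
    }

    fn logits(&self, token_ids: &[u32]) -> Vec<f32> {
        let native = self.inner.vocab_size();
        let full = self.inner.logits(token_ids);
        // Slice each native-width row down to the student's vocab width.
        let out: Vec<f32> = full
            .chunks_exact(native)
            .flat_map(|row| row[..self.target_vocab].iter().copied())
            .collect();
        out
    }
}

/// Illustrative stand-in teacher: every row is 0.0, 1.0, ..., vocab-1.
pub struct FixedTeacher {
    pub vocab: usize,
}

impl Teacher for FixedTeacher {
    fn vocab_size(&self) -> usize {
        self.vocab
    }

    fn logits(&self, token_ids: &[u32]) -> Vec<f32> {
        token_ids
            .iter()
            .flat_map(|_| (0..self.vocab).map(|i| i as f32))
            .collect()
    }
}
```

Because the wrapper only depends on the trait, the same truncation applies to either backend without per-backend code, which is the design point the PR makes about replacing the truncation planned in #1877.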
Cascade context
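Fifth fix in the PMAT-701 family:
- #1863 Bug A: allocator autodetect Grace Blackwell
- #1869 Bug B: RealizarQ4KTeacher (now demoted to opt-in fallback)
- #1874 Defect 3 / PMAT-702: apr eval no-fake-pass on broken models
- #1877 Bug B's vocab alignment (superseded by TruncatingTeacher in this PR)
- This PR: cuBLAS default + opt-in Realizar fallback (PMAT-704)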
Test plan
🤖 Generated with Claude Code