
fix(distill): default Q4K teacher to CudaTrainerTeacher (cuBLAS) — revert Bug B's slow path (PMAT-704) #1879

Open
noahgift wants to merge 1 commit into main from fix/distill-teacher-backend-selection-pmat-704



Summary

Post-mortem of the PMAT-701 cascade revealed PR #1869 (Bug B) was a wrong turn. It routed Q4K teachers to `RealizarQ4KTeacher`, which runs layer-norm + attention + softmax on CPU (only Q4K matmuls dispatch to GPU). The 7B vocab-aligned 500-step validation hung at step 0 for 1.5 h with GPU at 0% utilization — empirical proof the path is unusable as a real distillation teacher.

Five-whys

  1. 7B 500-step validation hung at step 0 → `RealizarQ4KTeacher` forward is CPU-bound; GPU stayed at 0% the entire run.
  2. `OwnedQuantizedModelCuda::forward_cuda` is mostly CPU SIMD → only individual Q4K matmuls dispatch to GPU.
  3. PR #1869 (RealizarQ4KTeacher, the Q4K-native frozen-teacher path from PMAT-701 Bug B) picked that path → to avoid F32 dequant at upload (claimed 28 GB inflation → Linux OOM-kill).
  4. That claim was wrong post-PMAT-701 → single SIGKILL observation with `MANAGED_MEMORY=1` was never verified as OOM-killer via dmesg; the cuBLAS path was never re-tested under PMAT-701 Bug A's autodetect default.
  5. Why I committed before verifying → cascade momentum; a one-line dispatch flip would have rejected the hypothesis in 5 minutes; a multi-PR architectural detour shipped instead.

Root cause: conflated the cuMemAlloc 30 GB ceiling (real, fixed by Bug A) with a step-0 SIGKILL on the explicit-managed path (phantom, never verified). The F32 dequant fits in 128 GB unified memory; cuBLAS is the fast path and the right default.
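
The memory claim above can be checked with back-of-envelope arithmetic. The sketch below is illustrative only: it assumes a nominal 7 × 10⁹ parameters (the real count for a "7B" model differs somewhat), 4 bytes per F32 weight, and decimal gigabytes, and it counts teacher weights only, not the student, activations, or optimizer state. The function and constant names are hypothetical.

```rust
/// Nominal parameter count for a "7B" teacher (assumption; the real count differs).
pub const NOMINAL_7B_PARAMS: u64 = 7_000_000_000;

/// Unified memory size quoted in the root-cause paragraph, in decimal bytes.
pub const UNIFIED_MEMORY_BYTES: u64 = 128 * 1_000_000_000;

/// Bytes needed to hold `params` weights after dequantizing to F32 (4 bytes each).
pub fn f32_weight_bytes(params: u64) -> u64 {
    params * 4
}

/// True when the dequantized teacher weights alone fit in the given budget.
/// Weights only: student, activations, and optimizer state are not counted.
pub fn teacher_weights_fit(params: u64, budget_bytes: u64) -> bool {
    f32_weight_bytes(params) <= budget_bytes
}
```

Under these assumptions the dequantized teacher occupies 28 GB, which matches the "28 GB inflation" figure and is well under 128 GB.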

Fix

`crates/apr-cli/src/commands/distill.rs::run_cuda_backend`:

  • New env var `APR_DISTILL_TEACHER_BACKEND` ∈ {`auto` (default), `cudatrainer`, `realizar-q4k`}.
  • Default dispatch: `CudaTrainerTeacher` (cuBLAS, F32 dequant) for all teacher types.
  • `RealizarQ4KTeacher` retained as opt-in fallback for memory-constrained dGPUs.
  • Generic `TruncatingTeacher` wrapper applies PMAT-703 vocab alignment uniformly to both backends (supersedes the per-backend truncation planned in #1877, the PMAT-703 vocab-alignment fix for Qwen2.5-Coder 7B → 0.5B KD).
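
The dispatch rule in the list above could look roughly like the following. This is a minimal sketch, not the code in `run_cuda_backend`: the `TeacherBackend` enum and both function names are hypothetical, and treating an empty value as `auto`, matching case-insensitively, and rejecting unknown values are assumptions the PR text does not confirm.

```rust
/// Hypothetical mirror of the two teacher backends named in this PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TeacherBackend {
    /// cuBLAS path with F32 dequant at upload (the default).
    CudaTrainer,
    /// Q4K-native path, kept as an opt-in fallback.
    RealizarQ4K,
}

/// Resolve a backend from the raw value of `APR_DISTILL_TEACHER_BACKEND`.
/// `None` models an unset variable.
pub fn select_backend(raw: Option<&str>) -> Result<TeacherBackend, String> {
    match raw.map(|s| s.trim().to_ascii_lowercase()).as_deref() {
        // Unset, empty, `auto`, and `cudatrainer` all resolve to the cuBLAS path.
        None | Some("") | Some("auto") | Some("cudatrainer") => Ok(TeacherBackend::CudaTrainer),
        Some("realizar-q4k") => Ok(TeacherBackend::RealizarQ4K),
        // Assumption: unknown values are an error rather than a silent default.
        Some(other) => Err(format!(
            "unknown APR_DISTILL_TEACHER_BACKEND value: {other:?}"
        )),
    }
}

/// Convenience wrapper that reads the variable from the process environment.
pub fn select_backend_from_env() -> Result<TeacherBackend, String> {
    let raw = std::env::var("APR_DISTILL_TEACHER_BACKEND").ok();
    select_backend(raw.as_deref())
}
```

Keeping the parsing in a pure function that takes `Option<&str>` makes the "default routes to CudaTrainer" and "env override reaches Realizar" checks testable without mutating the process environment.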

Contract

`contracts/apr-distill-teacher-backend-selection-v1.yaml` (validates clean):

  • 3 equations: backend dispatch, forward latency invariant, Bug B demotion
  • 4 falsifiers: default routes to CudaTrainer; env override reaches Realizar; 7B 500-step completes < 30 min; forward parity within Q4K noise floor
  • 2 Kani harnesses; qa_gate F-BACKEND-SELECT-001
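
As an illustration of the "forward parity within Q4K noise floor" falsifier, one possible check compares the logits of the two backends element-wise against a tolerance. This is a sketch under assumptions: the contract's actual metric and threshold are not stated here, so the max-absolute-difference metric, the function names, and the caller-supplied `tol` are all hypothetical.

```rust
/// Largest absolute element-wise difference between two logit vectors.
/// Returns `None` on a length mismatch or if any difference is NaN.
pub fn max_abs_diff(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() {
        return None;
    }
    let mut worst = 0.0f32;
    for (x, y) in a.iter().zip(b) {
        let d = (x - y).abs();
        if d.is_nan() {
            return None;
        }
        worst = worst.max(d);
    }
    Some(worst)
}

/// True when both vectors have the same length, contain no NaN differences,
/// and differ by at most `tol` everywhere. `tol` is a hypothetical noise floor.
pub fn logits_within_tolerance(a: &[f32], b: &[f32], tol: f32) -> bool {
    matches!(max_abs_diff(a, b), Some(d) if d <= tol)
}
```

Treating a length mismatch or NaN as a failure rather than a pass keeps a broken forward from being reported as parity.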

The original Bug B contract (`cuda-q4k-frozen-teacher-v1.yaml`) is demoted, not retracted — its math holds as a memory-constrained fallback path; its DEFAULT-PATH claim was wrong on unified-memory devices.

Verification on gx10

Dispatch log:

```
[PMAT-704] backend=auto → CudaTrainerTeacher (cuBLAS) [override with APR_DISTILL_TEACHER_BACKEND=realizar-q4k for memory-constrained dGPU]
[PMAT-703] vocab alignment: teacher native=152064, student=151936 → truncating teacher logits to student vocab
[PMAT-704] teacher backend = CudaTrainerTeacher [Q4K/Q6K (dequant to F32 at GPU upload; cuBLAS GEMM)]
[CUDA] cuBLAS initialized — forward TF32 tensor cores (41x vs SIMD)
✓ 28 transformer blocks uploaded to GPU
```

GPU utilization observed at 96% during training (was 0% on the Realizar path). cuBLAS is dispatching correctly; the cascade is no longer hanging at the first step.

Cascade context

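Fifth fix in the PMAT-701 family:

  • #1863 Bug A: allocator autodetect Grace Blackwell
  • #1869 Bug B: RealizarQ4KTeacher (now demoted to opt-in fallback)
  • #1874 Defect 3 / PMAT-702: apr eval no-fake-pass on broken models
  • #1877 Bug B's vocab alignment (superseded by TruncatingTeacher in this PR)
  • This PR: cuBLAS default + opt-in Realizar fallback (PMAT-704)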

Test plan

  • `cargo build --release --features cuda -p apr-cli` — clean
  • `cargo fmt -p apr-cli --check` — clean
  • `pv validate contracts/apr-distill-teacher-backend-selection-v1.yaml` — clean
  • Dispatch markers fire correctly on gx10
  • GPU utilization 96% during training (vs 0% on Bug B's path)
  • CI: `ci / gate` + `workspace-test` green
  • Validation run completion + per-step loss capture as evidence (in flight; Monitor task beaokloec)

🤖 Generated with Claude Code

…LAS) — revert Bug B's slow path (PMAT-704)

Post-mortem of the PMAT-701 cascade revealed PR #1869 (Bug B) routed Q4K
teachers to `RealizarQ4KTeacher`, a CPU-heavy forward path (layer-norm +
attention + softmax all on CPU; only Q4K matmuls dispatch to GPU). The
7B vocab-aligned 500-step validation hung at step 0 for 1.5 h with GPU
at 0% utilization — empirical proof the path is unusable as a real
distillation teacher.

Five-whys (evidence/distill-7b-cublas-cudatrainer/findings.json):

* Why 1: 7B 500-step validation hung at step 0 → RealizarQ4KTeacher's
  forward runs most ops on CPU; GPU stayed at 0%.
* Why 2: realizar's `OwnedQuantizedModelCuda::forward_cuda` is mostly
  CPU SIMD; only individual Q4K matmuls dispatch to GPU.
* Why 3: PR #1869 picked that path to avoid an F32 dequant at upload
  (claimed "28 GB inflation + student → Linux OOM-kill").
* Why 4: That claim was based on a single SIGKILL observation with
  MANAGED_MEMORY=1 explicit, never verified via dmesg as actual OOM-killer,
  never re-tested under PMAT-701 Bug A's autodetect default.
* Why 5: Cascade momentum + incomplete root-cause discipline. The cheap
  experiment (one-line dispatch flip) would have rejected the hypothesis
  in 5 minutes; a multi-PR architectural detour shipped instead.

Root cause: conflated the cuMemAlloc 30 GB ceiling (real, fixed by Bug A)
with a step-0 SIGKILL on the explicit-managed path (phantom, never
verified). The F32 dequant fits in 128 GB unified memory; cuBLAS is the
fast path and the right default.

## Fix

`crates/apr-cli/src/commands/distill.rs::run_cuda_backend`:

* New env var `APR_DISTILL_TEACHER_BACKEND` with values:
  - `auto` (default) → CudaTrainerTeacher (cuBLAS, F32 dequant)
  - `cudatrainer` → CudaTrainerTeacher (explicit)
  - `realizar-q4k` → RealizarQ4KTeacher (memory-constrained-device fallback)
* Q4K detection still happens, but only controls the fallback path's
  availability — the DEFAULT dispatch is cuBLAS for all teacher types.
* Generic `TruncatingTeacher` wrapper applies PMAT-703 vocab alignment
  uniformly to both backends. Replaces the per-backend truncation that
  PR #1877 was planning to add inside `RealizarQ4KTeacher::from_apr_path_with_target_vocab`.
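
A generic truncating wrapper of the kind described above might be shaped like this. It is a sketch, not the PR's implementation: the `Teacher` trait, its method signatures, and the `RampTeacher` stand-in are hypothetical, and it assumes the student vocabulary is the leading prefix of the teacher's (for example 151936 of 152064 ids), so alignment reduces to dropping the trailing logits of each row.

```rust
/// Hypothetical minimal teacher interface: one row of logits per input position.
pub trait Teacher {
    fn vocab_size(&self) -> usize;
    /// Returns `tokens.len() * vocab_size()` logits in row-major order.
    fn forward(&mut self, tokens: &[u32]) -> Vec<f32>;
}

/// Wraps any teacher and keeps only the first `target_vocab` logits per row.
pub struct TruncatingTeacher<T: Teacher> {
    inner: T,
    target_vocab: usize,
}

impl<T: Teacher> TruncatingTeacher<T> {
    pub fn new(inner: T, target_vocab: usize) -> Result<Self, String> {
        if target_vocab == 0 || target_vocab > inner.vocab_size() {
            return Err(format!(
                "target vocab {} not in 1..={}",
                target_vocab,
                inner.vocab_size()
            ));
        }
        Ok(Self { inner, target_vocab })
    }
}

impl<T: Teacher> Teacher for TruncatingTeacher<T> {
    fn vocab_size(&self) -> usize {
        self.target_vocab
    }

    fn forward(&mut self, tokens: &[u32]) -> Vec<f32> {
        let native = self.inner.vocab_size();
        let target = self.target_vocab;
        let logits = self.inner.forward(tokens);
        // Slice each native-width row down to the student-width prefix.
        logits
            .chunks_exact(native)
            .flat_map(|row| row[..target].iter().copied())
            .collect()
    }
}

/// Stand-in teacher for demonstration: the logit at (position i, id j) is i * vocab + j.
pub struct RampTeacher {
    pub vocab: usize,
}

impl Teacher for RampTeacher {
    fn vocab_size(&self) -> usize {
        self.vocab
    }

    fn forward(&mut self, tokens: &[u32]) -> Vec<f32> {
        (0..tokens.len() * self.vocab).map(|k| k as f32).collect()
    }
}
```

Because the wrapper itself implements the same trait, either backend can be wrapped without the downstream distillation loop knowing whether truncation happened.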

`contracts/apr-distill-teacher-backend-selection-v1.yaml`:
* 3 equations (backend_dispatch, forward_latency_invariant, bug_b_demotion)
* 4 falsifiers: default routes to CudaTrainer; env override reaches Realizar;
  7B 500-step completes < 30 min; forward parity within Q4K noise floor
* 2 Kani harnesses; qa_gate F-BACKEND-SELECT-001
* Validates clean: `pv validate` reports 0 errors, 0 warnings.

## Verification on gx10

Dispatch log shows the new path firing:

  [PMAT-704] backend=auto → CudaTrainerTeacher (cuBLAS) [override with APR_DISTILL_TEACHER_BACKEND=realizar-q4k for memory-constrained dGPU]
  [PMAT-703] vocab alignment: teacher native=152064, student=151936 → truncating teacher logits to student vocab
  [PMAT-704] teacher backend = CudaTrainerTeacher [Q4K/Q6K (dequant to F32 at GPU upload; cuBLAS GEMM)]
  [CUDA] cuBLAS initialized — forward TF32 tensor cores (41x vs SIMD)
  ✓ Loaded pre-trained weights successfully (APR)
  ✓ 28 transformer blocks uploaded to GPU

GPU utilization observed at 96% during training (was 0% on the
RealizarQ4KTeacher path). cuBLAS is dispatching correctly; the cascade
is no longer hanging at the first step.

## Cascade context

This is the fifth fix in the PMAT-701 family:
  - #1863 Bug A: allocator autodetect Grace Blackwell
  - #1869 Bug B: RealizarQ4KTeacher (now demoted to opt-in fallback)
  - #1874 Defect 3 / PMAT-702: apr eval no-fake-pass on broken models
  - #1877 Bug B's vocab alignment (superseded by TruncatingTeacher in this PR)
  - This PR: cuBLAS default + opt-in Realizar fallback (PMAT-704)

The original Bug B contract (cuda-q4k-frozen-teacher-v1.yaml) is
**demoted, not retracted**: its math is correct as a memory-constrained
fallback path; its DEFAULT-PATH claim was wrong on unified-memory
devices.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>