fix(distill): vocab-align teacher logits for Qwen2.5-Coder 7B → 0.5B KD (PMAT-703) by noahgift · Pull Request #1877 · paiml/aprender

noahgift · 2026-05-22T08:44:42Z

Summary

Fourth fix in the PMAT-701 family. After memory + Q4K teacher + eval false-positive were closed, attempting to actually distill the MODEL-1 7B teacher (`paiml/qwen2.5-coder-7b-apache-q4k-v1`, vocab=152064) into the 0.5B student (vocab=151936) surfaced a new defect: vocab mismatch.

The 15-min "stable training" observed in #1869's Bug B verification almost certainly never executed a KD step: it was stuck either at the first `assert_eq!` in `kd_step.rs:103-107` on the dimension mismatch, or in the KL compute on misaligned shapes. The cascade looked OK because we never got past kernel JIT.

Fix: truncate at teacher-provider boundary

- `RealizarQ4KTeacher::from_apr_path_with_target_vocab(path, target)`: optional truncation target. Validates `target <= native_vocab` and `target > 0` at construction.
- `logits_for_batch` truncates each returned vector to `effective_vocab_size` BEFORE the pipeline sees it. Softmax in `kd_step.rs` renormalizes over the shared support (standard vocab-mismatch handling per Hinton 2015 §2).
- `vocab_size()` reports `effective_vocab_size` so the pipeline's shape check is consistent.
- `run_cuda_backend` reads student + teacher `vocab_size` from APR metadata and dispatches to the new constructor with `Some(student_vocab)` when teacher > student. Hard-fails when teacher < student (the student would need to predict tokens the teacher has no embeddings for).
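The validation and truncation rules above can be sketched roughly as follows. This is a minimal illustration, not the code in `distill_q4k_teacher.rs`: the `TeacherVocab` type, its `new` constructor, and `truncate_logits` are hypothetical stand-ins, and it assumes the dropped entries are the trailing token IDs (that is, the tokenizer prefix is shared).

```rust
/// Hypothetical stand-in for the vocab bookkeeping on the teacher;
/// NOT the real `RealizarQ4KTeacher`.
#[derive(Debug)]
struct TeacherVocab {
    native_vocab_size: usize,
    effective_vocab_size: usize,
}

impl TeacherVocab {
    /// `None` keeps the native vocab; `Some(t)` requires `0 < t <= native`.
    fn new(native_vocab_size: usize, target: Option<usize>) -> Result<Self, String> {
        let effective_vocab_size = match target {
            None => native_vocab_size,
            Some(0) => return Err("target vocab must be > 0".to_string()),
            Some(t) if t > native_vocab_size => {
                return Err(format!(
                    "target vocab {} exceeds teacher native vocab {}",
                    t, native_vocab_size
                ));
            }
            Some(t) => t,
        };
        Ok(Self { native_vocab_size, effective_vocab_size })
    }

    /// The size the pipeline's shape check should see (effective, not native).
    fn vocab_size(&self) -> usize {
        self.effective_vocab_size
    }

    /// Drop the trailing logits so the row has `effective_vocab_size` entries.
    /// Assumes the input row has the teacher's native length.
    fn truncate_logits(&self, mut logits: Vec<f32>) -> Vec<f32> {
        debug_assert_eq!(logits.len(), self.native_vocab_size);
        logits.truncate(self.effective_vocab_size);
        logits
    }
}
```

With the sizes from this PR, `TeacherVocab::new(152064, Some(151936))` would succeed and report `vocab_size() == 151936`, while a target of `0` or anything above `152064` would be rejected at construction.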

Contract

`contracts/apr-distill-teacher-vocab-alignment-v1.yaml` (validates clean):

- 3 equations: dispatch logic, KL-invariance under truncation, CLI plumbing
- 4 falsifiers: 7B→0.5B succeeds, vocab_size reports effective, oversize errors, logits truncated to target
- 2 Kani harnesses; qa_gate F-VOCAB-ALIGN-001
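The contract's equations are not reproduced on this page, so the following is only one plausible reading of why truncate-then-softmax is well defined. The symbols are assumptions introduced for illustration: $z \in \mathbb{R}^{V_t}$ are the teacher logits, $V_s \le V_t$ is the student vocab size, $T$ is the distillation temperature, and $p$ is the teacher's full softmax.

$$
\operatorname{softmax}\!\left(\frac{z_{1:V_s}}{T}\right)_i
= \frac{e^{z_i/T}}{\sum_{j=1}^{V_s} e^{z_j/T}}
= \frac{p_i}{\sum_{j=1}^{V_s} p_j},
\qquad
p_i = \frac{e^{z_i/T}}{\sum_{k=1}^{V_t} e^{z_k/T}},
\quad 1 \le i \le V_s .
$$

In words, applying softmax to the truncated logits gives the teacher's full distribution conditioned on the first $V_s$ token IDs. Here $V_t = 152064$ and $V_s = 151936$, so the dropped tail has $V_t - V_s = 128$ entries. This only matches the intended targets if the first $V_s$ IDs mean the same thing in both tokenizers.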

Verification on gx10

Dispatch log:

```
[PMAT-703] vocab alignment: teacher native=152064, student=151936 → truncating teacher logits to student vocab
[PMAT-701] Q4K/Q6K teacher detected → RealizarQ4KTeacher (Q4K-native forward, no F32 dequant)
[GH-175] OwnedQuantizedModel::from_apr: 28 layers loaded in 1891.9ms
```

The previous failure mode (silent hang at the first KD step due to dimension mismatch) is gone. A 500-step validation run is in flight; per-step completion verification deferred to follow-up — the cascade itself is now correctly dispatching.
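The three-way decision behind the `[PMAT-703]` log line could be sketched as below. This is an illustrative approximation, not the body of `run_cuda_backend`: the `VocabPlan` enum and `plan_vocab_alignment` function are hypothetical names, and errors are plain strings rather than the crate's own error type.

```rust
use std::cmp::Ordering;

/// Hypothetical outcome of comparing teacher and student vocab sizes.
#[derive(Debug, PartialEq)]
enum VocabPlan {
    /// Sizes match; use the teacher's logits as-is.
    NoTruncation,
    /// Teacher is larger; truncate its logits to this many entries.
    TruncateTeacherTo(usize),
}

/// Decide how to align vocab sizes read from model metadata.
/// Both inputs are `Option` because the metadata field may be absent.
fn plan_vocab_alignment(
    teacher: Option<usize>,
    student: Option<usize>,
) -> Result<VocabPlan, String> {
    let teacher = teacher.ok_or("teacher metadata is missing vocab_size")?;
    let student = student.ok_or("student metadata is missing vocab_size")?;
    match teacher.cmp(&student) {
        Ordering::Equal => Ok(VocabPlan::NoTruncation),
        Ordering::Greater => Ok(VocabPlan::TruncateTeacherTo(student)),
        // The teacher cannot produce logits for tokens it does not have.
        Ordering::Less => Err(format!(
            "teacher vocab {} is smaller than student vocab {}",
            teacher, student
        )),
    }
}
```

For the pair in this PR, `plan_vocab_alignment(Some(152064), Some(151936))` yields `TruncateTeacherTo(151936)`; swapping the arguments yields an error rather than a silent mismatch.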

Cascade context

PMAT-701 family of fixes for the MODEL-1 7B teacher → 0.5B student distillation:

- #1863 `feat(cuda): autodetect Grace Blackwell + Q4K frozen-teacher contract (PMAT-701)`. Bug A: allocator autodetect Grace Blackwell (cuMemAllocManaged for unified)
- #1869 `fix(distill): RealizarQ4KTeacher — Q4K-native frozen-teacher path (PMAT-701 Bug B)`. Bug B: RealizarQ4KTeacher (Q4K-native forward, no F32 dequant)
- #1874 `fix(eval): apr eval no longer reports fake pass@1=1.0 on broken models (PMAT-702)`. Defect 3 / PMAT-702: `apr eval` no-fake-pass on broken models
- This PR: vocab alignment when teacher > student

Test plan

- `cargo build --release --features cuda -p apr-cli`: clean
- `cargo fmt -p apr-cli --check`: clean
- `pv validate contracts/apr-distill-teacher-vocab-alignment-v1.yaml`: clean
- Unit tests pass: `distill_q4k_teacher::tests::oversized_target_errors_logic`, `truncation_length_math`
- `[PMAT-703] vocab alignment` log line fires on gx10 with 7B teacher + 0.5B student
- CI: `ci / gate` + `workspace-test` green
- Follow-up: capture per-step loss trajectory from the in-flight 500-step run as evidence

🤖 Generated with Claude Code

…KD (PMAT-703)

PMAT-701 unblocked the memory side of the MODEL-1 7B teacher (`paiml/qwen2.5-coder-7b-apache-q4k-v1`) on Grace Blackwell GB10, but attempting to actually distill from it surfaced a new defect: the 7B Coder's vocab is 152064 while the 0.5B / 1.5B Coder vocab is 151936. The 7B adds 128 code-specific tokens (fim_*, repo_name, etc.) that the smaller variants don't have.

aprender-train-distill's `kd_step` asserts that student and teacher logits have the same vocab length (kd_step.rs:103-107). The 15-min "stable training" observed in the PR #1869 Bug B verification never actually executed a KD step: it was either silently hung in the first kd_step.rs assert or in the dimension-mismatched KL compute.

## Fix: truncate at the teacher-provider boundary

`crates/apr-cli/src/commands/distill_q4k_teacher.rs`:

- `RealizarQ4KTeacher` gets `native_vocab_size` + `effective_vocab_size` fields. New constructor `from_apr_path_with_target_vocab(path, target)` accepts an optional truncation target. Validation: `target > native` is rejected (the teacher has no embeddings to synthesize those logits); `target == 0` is rejected.
- `logits_for_batch` truncates each returned vector to `effective_vocab_size` BEFORE returning to the pipeline. Softmax in `kd_step.rs` then renormalizes over the shared support (standard vocab-mismatch handling per Hinton 2015 §2).
- `vocab_size()` returns `effective_vocab_size` so the pipeline's shape check (kd_step.rs:218) sees a consistent value.

`crates/apr-cli/src/commands/distill.rs::run_cuda_backend`:

- Reads `student_meta.vocab_size` and `teacher_meta.vocab_size` (both `Option<usize>`; fail hard with a helpful error if either is missing).
- Routes to `from_apr_path_with_target_vocab(path, Some(student_vocab))` when teacher_native > student. When equal, no truncation. When teacher < student, return ValidationFailed (the student cannot have more vocab than the teacher in this design).

## Contract

`contracts/apr-distill-teacher-vocab-alignment-v1.yaml`:

- 3 equations: dispatch logic, KD-loss invariance under truncation, CLI plumbing.
- 4 falsifiers: FT-VOCAB-ALIGN-001 (7B→0.5B dispatch succeeds), -002 (vocab_size reports effective, not native), -003 (oversize target errors at construction), -004 (logits_for_batch returns truncated vectors).
- 2 Kani harnesses; qa_gate F-VOCAB-ALIGN-001.
- Validates clean: `pv validate` reports 0 errors, 0 warnings.

## Unit tests

`distill_q4k_teacher::tests::oversized_target_errors_logic` and `truncation_length_math` exercise the validation arms without requiring CUDA hardware.

## Verification on gx10

Dispatch log shows the new behavior firing:

    [PMAT-703] vocab alignment: teacher native=152064, student=151936 → truncating teacher logits to student vocab
    [PMAT-701] Q4K/Q6K teacher detected → RealizarQ4KTeacher (Q4K-native forward, no F32 dequant)

(Full per-step verification deferred to follow-up: the 7B teacher forward via realizar takes long enough that completing 500 steps exceeds the test budget. The cascade no longer hangs at the first KD step, which was the failure mode.)

## Cascade

This is the fourth fix in the PMAT-701 family:

- #1863 Bug A: allocator autodetect Grace Blackwell
- #1869 Bug B: RealizarQ4KTeacher (Q4K-native forward)
- #1874 Defect 3: apr eval no-fake-pass on broken models (PMAT-702)
- This PR: vocab alignment for teacher > student (PMAT-703)

With all four landed, the full pipeline from `apr distill` → trained checkpoint → `apr eval` is honest end-to-end for any (teacher, student) pair where teacher vocab >= student vocab and the tokenizer prefix is shared.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…g turn

Adds a §87 amendment to SPEC-DISTILL-001 documenting the root cause of the PMAT-704 cascade fix: PR #1869 (Bug B / RealizarQ4KTeacher) was a wrong turn. The realizar `_cuda` forward path is CPU-bound and unusable as a distillation teacher on Grace Blackwell GB10. The 7B vocab-aligned 500-step validation hung at step 0 for 1.5 h with GPU at 0% utilization, which is empirical proof of the defect.

The amendment includes:

* Full five-whys chain (cuMemAlloc 30 GB ceiling vs phantom OOM-killer SIGKILL on the explicit-managed path), with file/line citations pointing to the CPU-heavy ops in crates/aprender-serve/src/gguf/cuda/cuda.rs:18
* Root cause: conflated two failures, missed the cheap dispatch-flip experiment that would have rejected Bug B's hypothesis in 5 minutes.
* Fix references: PR #1879 (PMAT-704): cuBLAS default, RealizarQ4KTeacher demoted to APR_DISTILL_TEACHER_BACKEND=realizar-q4k opt-in fallback.
* Contract changes: new `apr-distill-teacher-backend-selection-v1.yaml`, `cuda-q4k-frozen-teacher-v1.yaml` demoted (not retracted).
* Methodology lesson: cheap-experiment-before-design discipline.
* Cascade closure table covering PRs #1863, #1869, #1871, #1874, #1877, #1879.

Spec version bumped 1.1.0 → 1.3.0 with changelog entries for both §86 (via PR #1871, also pending merge) and §87 (this PR). The amendment notes the §86 cross-reference and explains the order-of-operations in case readers see this on a build of main that predates #1871.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

noahgift enabled auto-merge (squash) May 22, 2026 08:44

noahgift added 2 commits May 22, 2026 15:21

Merge branch 'main' into fix/teacher-vocab-alignment-pmat-703

f2746f4

Merge branch 'main' into fix/teacher-vocab-alignment-pmat-703

3815477

noahgift wants to merge 3 commits into main from fix/teacher-vocab-alignment-pmat-703
