fix(distill): vocab-align teacher logits for Qwen2.5-Coder 7B → 0.5B KD (PMAT-703)#1877
Open
noahgift wants to merge 3 commits into
…KD (PMAT-703)

PMAT-701 unblocked the memory side of the MODEL-1 7B teacher (paiml/qwen2.5-coder-7b-apache-q4k-v1) on Grace Blackwell GB10, but attempting to actually distill from it surfaced a new defect: the 7B Coder's vocab is 152064 while the 0.5B / 1.5B Coder vocab is 151936. The 7B adds 128 code-specific tokens (fim_*, repo_name, etc.) that the smaller variants don't have.

aprender-train-distill's `kd_step` asserts that student and teacher logits have the same vocab length (kd_step.rs:103-107). The 15-min "stable training" observed in the PR #1869 Bug B verification never actually executed a KD step: it was either silently hung in the first kd_step.rs assert or in the dimension-mismatched KL compute.

## Fix: truncate at the teacher-provider boundary

`crates/apr-cli/src/commands/distill_q4k_teacher.rs`:

- `RealizarQ4KTeacher` gets `native_vocab_size` + `effective_vocab_size` fields. New constructor `from_apr_path_with_target_vocab(path, target)` accepts an optional truncation target. Validation: `target > native` is rejected (the teacher has no embeddings to synthesize those logits); `target == 0` is rejected.
- `logits_for_batch` truncates each returned vector to `effective_vocab_size` BEFORE returning to the pipeline. Softmax in `kd_step.rs` then renormalizes over the shared support, which is standard vocab-mismatch handling per Hinton 2015 §2.
- `vocab_size()` returns `effective_vocab_size` so the pipeline's shape check (kd_step.rs:218) sees a consistent value.

`crates/apr-cli/src/commands/distill.rs::run_cuda_backend`:

- Reads `student_meta.vocab_size` and `teacher_meta.vocab_size` (both Option<usize>; fail hard with a helpful error if either is missing).
- Routes to `from_apr_path_with_target_vocab(path, Some(student_vocab))` when teacher_native > student. When equal, no truncation. When teacher < student, return ValidationFailed (the student cannot have more vocab than the teacher in this design).
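The validation, truncation, and dispatch rules described above can be sketched in isolation. This is a minimal illustration, not the code in `distill_q4k_teacher.rs` or `distill.rs`: the function names `resolve_effective_vocab`, `truncate_logits`, and `truncation_target`, the `String` error type, and the `Vec<Vec<f32>>` batch layout are all assumptions made for the example.

```rust
use std::cmp::Ordering;

/// Hypothetical helper: resolve the vocab size the teacher should report.
/// Mirrors the described validation: a zero target and a target larger
/// than the native vocab are both rejected.
pub fn resolve_effective_vocab(native: usize, target: Option<usize>) -> Result<usize, String> {
    match target {
        None => Ok(native),
        Some(0) => Err("target vocab must be non-zero".to_string()),
        Some(t) if t > native => Err(format!(
            "target vocab {} exceeds teacher native vocab {}",
            t, native
        )),
        Some(t) => Ok(t),
    }
}

/// Hypothetical helper: cut every logit vector down to `effective` entries
/// before handing the batch back to the pipeline. Vectors that are already
/// short enough are left unchanged.
pub fn truncate_logits(batch: &mut [Vec<f32>], effective: usize) {
    for logits in batch.iter_mut() {
        logits.truncate(effective);
    }
}

/// Hypothetical helper: choose the truncation target from the two vocab
/// sizes. Teacher larger: truncate to the student. Equal: no truncation.
/// Teacher smaller: error (the source returns ValidationFailed here; a
/// plain String stands in for that type).
pub fn truncation_target(teacher_native: usize, student: usize) -> Result<Option<usize>, String> {
    match teacher_native.cmp(&student) {
        Ordering::Greater => Ok(Some(student)),
        Ordering::Equal => Ok(None),
        Ordering::Less => Err(format!(
            "teacher vocab {} is smaller than student vocab {}",
            teacher_native, student
        )),
    }
}
```

For the pair in this PR, `truncation_target(152064, 151936)` yields `Ok(Some(151936))`, and passing that target to `resolve_effective_vocab(152064, Some(151936))` yields `Ok(151936)`, which is the length every vector has after `truncate_logits`.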
## Contract

`contracts/apr-distill-teacher-vocab-alignment-v1.yaml`:

- 3 equations: dispatch logic, KD-loss invariance under truncation, CLI plumbing.
- 4 falsifiers: FT-VOCAB-ALIGN-001 (7B→0.5B dispatch succeeds), -002 (vocab_size reports effective, not native), -003 (oversize target errors at construction), -004 (logits_for_batch returns truncated vectors).
- 2 Kani harnesses; qa_gate F-VOCAB-ALIGN-001.
- Validates clean: `pv validate` reports 0 errors, 0 warnings.

## Unit tests

`distill_q4k_teacher::tests::oversized_target_errors_logic` and `truncation_length_math` exercise the validation arms without requiring CUDA hardware.

## Verification on gx10

Dispatch log shows the new behavior firing:

    [PMAT-703] vocab alignment: teacher native=152064, student=151936 → truncating teacher logits to student vocab
    [PMAT-701] Q4K/Q6K teacher detected → RealizarQ4KTeacher (Q4K-native forward, no F32 dequant)

(Full per-step verification deferred to follow-up: the 7B teacher forward via realizar takes long enough that completing 500 steps exceeds the test budget, but the cascade no longer hangs at the first KD step, which was the failure mode.)

## Cascade

This is the fourth fix in the PMAT-701 family:

- #1863 Bug A: allocator autodetect Grace Blackwell
- #1869 Bug B: RealizarQ4KTeacher (Q4K-native forward)
- #1874 Defect 3: apr eval no-fake-pass on broken models (PMAT-702)
- This PR: vocab alignment for teacher > student (PMAT-703)

With all four landed, the full pipeline from `apr distill` → trained checkpoint → `apr eval` is honest end-to-end for any (teacher, student) pair where teacher vocab >= student vocab and the tokenizer prefix is shared.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
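The claim that softmax "renormalizes over the shared support" after truncation can be checked numerically. The sketch below is an illustration only: it does not reproduce the contract's equation or the code in `kd_step.rs`, it omits the distillation temperature, and the names `softmax` and `renormalized_prefix` are hypothetical. It shows that a softmax over the truncated logits equals the full-vocab softmax restricted to the kept tokens and rescaled to sum to 1.

```rust
/// Numerically stable softmax over a slice of logits.
pub fn softmax(logits: &[f32]) -> Vec<f32> {
    // Subtract the maximum before exponentiating to avoid overflow.
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&z| (z - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// Full-vocab softmax, restricted to the first `keep` tokens and rescaled
/// so the kept probabilities sum to 1. Panics if `keep > logits.len()`.
pub fn renormalized_prefix(logits: &[f32], keep: usize) -> Vec<f32> {
    let full = softmax(logits);
    let mass: f32 = full[..keep].iter().sum();
    full[..keep].iter().map(|&p| p / mass).collect()
}
```

Comparing `softmax(&logits[..keep])` with `renormalized_prefix(&logits, keep)` element by element should agree up to floating-point rounding, because the normalizing constant cancels. The dropped tokens only remove probability mass; they do not change the relative weights of the tokens both models share.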
This was referenced May 22, 2026
noahgift added a commit that referenced this pull request May 22, 2026
…g turn

Adds a §87 amendment to SPEC-DISTILL-001 documenting the root cause of the PMAT-704 cascade fix: PR #1869 (Bug B / RealizarQ4KTeacher) was a wrong turn. The realizar `_cuda` forward path is CPU-bound and unusable as a distillation teacher on Grace Blackwell GB10. The 7B vocab-aligned 500-step validation hung at step 0 for 1.5 h with GPU at 0% utilization, which is empirical proof of the defect.

The amendment includes:

* Full five-whys chain (cuMemAlloc 30 GB ceiling vs phantom OOM-killer SIGKILL on the explicit-managed path), with file/line citations pointing to the CPU-heavy ops in crates/aprender-serve/src/gguf/cuda/cuda.rs:18
* Root cause: conflated two failures, missed the cheap dispatch-flip experiment that would have rejected Bug B's hypothesis in 5 minutes.
* Fix references: PR #1879 (PMAT-704): cuBLAS default, RealizarQ4KTeacher demoted to APR_DISTILL_TEACHER_BACKEND=realizar-q4k opt-in fallback.
* Contract changes: new `apr-distill-teacher-backend-selection-v1.yaml`, `cuda-q4k-frozen-teacher-v1.yaml` demoted (not retracted).
* Methodology lesson: cheap-experiment-before-design discipline.
* Cascade closure table covering PRs #1863, #1869, #1871, #1874, #1877, #1879.

Spec version bumped 1.1.0 → 1.3.0 with changelog entries for both §86 (via PR #1871, also pending merge) and §87 (this PR). The amendment notes the §86 cross-reference and explains the order-of-operations in case readers see this on a build of main that predates #1871.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
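The backend selection that this commit message attributes to PR #1879 (cuBLAS by default, `RealizarQ4KTeacher` only when `APR_DISTILL_TEACHER_BACKEND=realizar-q4k` is set) might look roughly like the sketch below. Only the environment variable name and the `realizar-q4k` value come from the text above. The `TeacherBackend` enum, the `select_teacher_backend` function, and the choice to fall back to the default on unrecognized values are assumptions for illustration; the real code may reject unknown values instead.

```rust
/// Hypothetical enum naming the two teacher backends mentioned above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TeacherBackend {
    /// Default path according to the commit message.
    CuBlas,
    /// Opt-in fallback, selected with `realizar-q4k`.
    RealizarQ4K,
}

/// Hypothetical selector. Takes the already-read variable value so the
/// logic can be tested without touching the process environment.
/// Assumption: unset or unrecognized values select the default.
pub fn select_teacher_backend(env_value: Option<&str>) -> TeacherBackend {
    match env_value {
        Some("realizar-q4k") => TeacherBackend::RealizarQ4K,
        _ => TeacherBackend::CuBlas,
    }
}
```

A caller could pass the live value with `select_teacher_backend(std::env::var("APR_DISTILL_TEACHER_BACKEND").ok().as_deref())`. Keeping the environment read outside the function makes the opt-in rule a pure mapping that is cheap to unit-test.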
Summary
Fourth fix in the PMAT-701 family. After memory + Q4K teacher + eval false-positive were closed, attempting to actually distill the MODEL-1 7B teacher (`paiml/qwen2.5-coder-7b-apache-q4k-v1`, vocab=152064) into the 0.5B student (vocab=151936) surfaced a new defect: vocab mismatch.
The 15-minute "stable training" observed in #1869's Bug B verification almost certainly never ran a KD step: it was hung either at the first `assert_eq!` in `kd_step.rs:103-107` on the dimension mismatch or in the KL compute on misaligned shapes. The cascade looked OK because we never got past kernel JIT.
Fix: truncate at teacher-provider boundary
Contract
`contracts/apr-distill-teacher-vocab-alignment-v1.yaml` (validates clean):
Verification on gx10
Dispatch log:
```
[PMAT-703] vocab alignment: teacher native=152064, student=151936 → truncating teacher logits to student vocab
[PMAT-701] Q4K/Q6K teacher detected → RealizarQ4KTeacher (Q4K-native forward, no F32 dequant)
[GH-175] OwnedQuantizedModel::from_apr: 28 layers loaded in 1891.9ms
```
The previous failure mode (silent hang at the first KD step due to dimension mismatch) is gone. A 500-step validation run is in flight, and per-step completion verification is deferred to a follow-up. The cascade itself now dispatches correctly.
Cascade context
PMAT-701 family of fixes for the MODEL-1 7B teacher → 0.5B student distillation:
Test plan
🤖 Generated with Claude Code