fix(distill): vocab-align teacher logits for Qwen2.5-Coder 7B → 0.5B KD (PMAT-703) #1877

Open
noahgift wants to merge 3 commits into main from fix/teacher-vocab-alignment-pmat-703

Summary

Fourth fix in the PMAT-701 family. With the memory, Q4K-teacher, and eval false-positive defects closed, an attempt to distill the MODEL-1 7B teacher (`paiml/qwen2.5-coder-7b-apache-q4k-v1`, vocab=152064) into the 0.5B student (vocab=151936) surfaced a new defect: the teacher and student vocab sizes differ by 128 tokens.

The 15-min "stable training" observed in #1869's Bug B verification almost certainly never executed a KD step: it was stuck either at the first `kd_step.rs:103-107` `assert_eq!` on the dimension mismatch or in the KL compute on misaligned shapes. The cascade looked OK only because the run never got past kernel JIT.

Fix: truncate at teacher-provider boundary

  • `RealizarQ4KTeacher::from_apr_path_with_target_vocab(path, target)`: optional truncation target. Validates `target <= native_vocab` and `target > 0` at construction.
  • `logits_for_batch` truncates each returned vector to `effective_vocab_size` BEFORE the pipeline sees it. Softmax in `kd_step.rs` renormalizes over the shared support — standard vocab-mismatch handling per Hinton 2015 §2.
  • `vocab_size()` reports `effective_vocab_size` so the pipeline's shape check is consistent.
  • `run_cuda_backend` reads student + teacher `vocab_size` from APR metadata, dispatches to the new constructor with `Some(student_vocab)` when teacher > student. Hard-fails when teacher < student (the student would need to predict tokens the teacher has no embeddings for).
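
The construction-time validation and per-batch truncation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `distill_q4k_teacher.rs`: the `TruncatingTeacher` type, its `new` and `truncate_batch` methods, and the `String` error type are hypothetical stand-ins for `RealizarQ4KTeacher` and its real error handling.

```rust
/// Hypothetical stand-in for a teacher provider that truncates its logits.
/// Names and error type are illustrative, not taken from the PR's code.
#[derive(Debug)]
pub struct TruncatingTeacher {
    native_vocab_size: usize,
    effective_vocab_size: usize,
}

impl TruncatingTeacher {
    /// `target = None` keeps the native vocab; `Some(t)` truncates to `t`.
    /// Rejects `t == 0` and `t > native_vocab_size` at construction.
    pub fn new(native_vocab_size: usize, target: Option<usize>) -> Result<Self, String> {
        let effective_vocab_size = match target {
            None => native_vocab_size,
            Some(0) => return Err("target vocab must be > 0".to_string()),
            Some(t) if t > native_vocab_size => {
                return Err(format!(
                    "target vocab {t} exceeds teacher native vocab {native_vocab_size}"
                ));
            }
            Some(t) => t,
        };
        Ok(Self { native_vocab_size, effective_vocab_size })
    }

    /// Vocab size of the underlying model, before truncation.
    pub fn native_vocab_size(&self) -> usize {
        self.native_vocab_size
    }

    /// Reports the effective (post-truncation) size so that downstream
    /// shape checks compare against what the provider actually returns.
    pub fn vocab_size(&self) -> usize {
        self.effective_vocab_size
    }

    /// Cuts every logit row down to `effective_vocab_size` entries.
    pub fn truncate_batch(&self, mut logits: Vec<Vec<f32>>) -> Vec<Vec<f32>> {
        for row in &mut logits {
            row.truncate(self.effective_vocab_size);
        }
        logits
    }
}
```

With the sizes from this PR, `TruncatingTeacher::new(152064, Some(151936))` would succeed and report `vocab_size() == 151936`, while a target of `0` or anything above `152064` would be rejected before any forward pass runs.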

Contract

`contracts/apr-distill-teacher-vocab-alignment-v1.yaml` (validates clean):

  • 3 equations: dispatch logic, KL-invariance under truncation, CLI plumbing
  • 4 falsifiers: 7B→0.5B succeeds, vocab_size reports effective, oversize errors, logits truncated to target
  • 2 Kani harnesses; qa_gate F-VOCAB-ALIGN-001

Verification on gx10

Dispatch log:

```
[PMAT-703] vocab alignment: teacher native=152064, student=151936 → truncating teacher logits to student vocab
[PMAT-701] Q4K/Q6K teacher detected → RealizarQ4KTeacher (Q4K-native forward, no F32 dequant)
[GH-175] OwnedQuantizedModel::from_apr: 28 layers loaded in 1891.9ms
```

The previous failure mode (silent hang at the first KD step due to dimension mismatch) is gone. A 500-step validation run is in flight; per-step completion verification deferred to follow-up — the cascade itself is now correctly dispatching.

Cascade context

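PMAT-701 family of fixes for the MODEL-1 7B teacher → 0.5B student distillation:

  • #1863 Bug A: allocator autodetect Grace Blackwell
  • #1869 Bug B: RealizarQ4KTeacher (Q4K-native forward)
  • #1874 Defect 3: apr eval no-fake-pass on broken models (PMAT-702)
  • This PR: vocab alignment for teacher > student (PMAT-703)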

Test plan

  • `cargo build --release --features cuda -p apr-cli` — clean
  • `cargo fmt -p apr-cli --check` — clean
  • `pv validate contracts/apr-distill-teacher-vocab-alignment-v1.yaml` — clean
  • Unit tests pass: `distill_q4k_teacher::tests::oversized_target_errors_logic`, `truncation_length_math`
  • `[PMAT-703] vocab alignment` log line fires on gx10 with 7B teacher + 0.5B student
  • CI: `ci / gate` + `workspace-test` green
  • Follow-up: capture per-step loss trajectory from the in-flight 500-step run as evidence

🤖 Generated with Claude Code

…KD (PMAT-703)

PMAT-701 unblocked the memory side of the MODEL-1 7B teacher
(paiml/qwen2.5-coder-7b-apache-q4k-v1) on Grace Blackwell GB10, but
attempting to actually distill from it surfaced a new defect: the 7B
Coder's vocab is 152064 while the 0.5B / 1.5B Coder vocab is 151936.
The 7B adds 128 code-specific tokens (fim_*, repo_name, etc.) that
the smaller variants don't have.

aprender-train-distill's `kd_step` asserts that student and teacher
logits have the same vocab length (kd_step.rs:103-107). The 15-min
"stable training" observed in the PR #1869 Bug B verification never
actually executed a KD step — it was either silently hung in the
first kd_step.rs assert or in the dimension-mismatched KL compute.

## Fix: truncate at the teacher-provider boundary

`crates/apr-cli/src/commands/distill_q4k_teacher.rs`:
- `RealizarQ4KTeacher` gets `native_vocab_size` + `effective_vocab_size`
  fields. New constructor `from_apr_path_with_target_vocab(path, target)`
  accepts an optional truncation target. Validation: `target > native`
  is rejected (the teacher has no embeddings to synthesize those
  logits); `target == 0` is rejected.
- `logits_for_batch` truncates each returned vector to
  `effective_vocab_size` BEFORE returning to the pipeline. Softmax in
  `kd_step.rs` then renormalizes over the shared support — standard
  vocab-mismatch handling per Hinton 2015 §2.
- `vocab_size()` returns `effective_vocab_size` so the pipeline's
  shape check (kd_step.rs:218) sees a consistent value.
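
The renormalization claim can be checked numerically. The sketch below is illustrative only and assumes nothing about the internals of `kd_step.rs`; the function names are hypothetical. It shows that a softmax over the first `k` logits equals the full-vocab softmax restricted to those `k` tokens and rescaled to sum to 1. The probability mass the teacher placed on the dropped tail is discarded, so the result is a conditional distribution over the shared tokens, not the original one.

```rust
/// Numerically stable softmax over a slice of logits.
pub fn softmax(logits: &[f32]) -> Vec<f32> {
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&z| (z - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// Softmax computed only over the first `k` logits (truncate, then normalize).
pub fn softmax_truncated(logits: &[f32], k: usize) -> Vec<f32> {
    softmax(&logits[..k])
}

/// Full-vocab softmax, restricted to the first `k` tokens and renormalized.
pub fn renormalized_prefix(logits: &[f32], k: usize) -> Vec<f32> {
    let full = softmax(logits);
    let mass: f32 = full[..k].iter().sum();
    full[..k].iter().map(|&p| p / mass).collect()
}
```

For any logit vector and any `0 < k <= len`, the two functions agree up to floating-point error, because the shared factor `exp(-max)` and the tail terms cancel in the ratio.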

`crates/apr-cli/src/commands/distill.rs::run_cuda_backend`:
- Reads `student_meta.vocab_size` and `teacher_meta.vocab_size` (both
  Option<usize> — fail hard with a helpful error if either is missing).
- Routes to `from_apr_path_with_target_vocab(path, Some(student_vocab))`
  when teacher_native > student. When equal, no truncation. When
  teacher < student, return ValidationFailed (the student cannot have
  more vocab than the teacher in this design).
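
The three-way routing above can be sketched as a small pure function. This is a hedged illustration, not the body of `run_cuda_backend`: `VocabDispatch`, `plan_vocab_alignment`, and the `String` errors are hypothetical names standing in for the real dispatch and the `ValidationFailed` error.

```rust
use std::cmp::Ordering;

/// Hypothetical outcome of comparing teacher and student vocab sizes.
#[derive(Debug, PartialEq, Eq)]
pub enum VocabDispatch {
    /// Vocab sizes match: use the teacher as-is.
    NoTruncation,
    /// Teacher vocab is larger: truncate teacher logits to this size.
    TruncateTeacherTo(usize),
}

/// Decides how to align vocab sizes read from model metadata.
/// Missing metadata and teacher < student are both hard errors.
pub fn plan_vocab_alignment(
    student_vocab: Option<usize>,
    teacher_vocab: Option<usize>,
) -> Result<VocabDispatch, String> {
    let student = student_vocab
        .ok_or_else(|| "student metadata is missing vocab_size".to_string())?;
    let teacher = teacher_vocab
        .ok_or_else(|| "teacher metadata is missing vocab_size".to_string())?;

    match teacher.cmp(&student) {
        Ordering::Greater => Ok(VocabDispatch::TruncateTeacherTo(student)),
        Ordering::Equal => Ok(VocabDispatch::NoTruncation),
        Ordering::Less => Err(format!(
            "teacher vocab {teacher} is smaller than student vocab {student}"
        )),
    }
}
```

Keeping the decision separate from model loading means the arms can be unit-tested without CUDA hardware, in the same spirit as the tests listed under "Unit tests".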

## Contract

`contracts/apr-distill-teacher-vocab-alignment-v1.yaml`:
- 3 equations: dispatch logic, KD-loss invariance under truncation,
  CLI plumbing.
- 4 falsifiers: FT-VOCAB-ALIGN-001 (7B→0.5B dispatch succeeds),
  -002 (vocab_size reports effective, not native), -003 (oversize
  target errors at construction), -004 (logits_for_batch returns
  truncated vectors).
- 2 Kani harnesses; qa_gate F-VOCAB-ALIGN-001.
- Validates clean: `pv validate` reports 0 errors, 0 warnings.

## Unit tests

`distill_q4k_teacher::tests::oversized_target_errors_logic` and
`truncation_length_math` exercise the validation arms without
requiring CUDA hardware.

## Verification on gx10

Dispatch log shows the new behavior firing:

  [PMAT-703] vocab alignment: teacher native=152064, student=151936 →
    truncating teacher logits to student vocab
  [PMAT-701] Q4K/Q6K teacher detected → RealizarQ4KTeacher
    (Q4K-native forward, no F32 dequant)

(Full per-step verification deferred to follow-up: the 7B teacher
forward via realizar takes long enough that completing 500 steps
exceeds the test budget — but the cascade no longer hangs at the
first KD step, which was the failure mode.)

## Cascade

This is the fourth fix in the PMAT-701 family:
  - #1863 Bug A: allocator autodetect Grace Blackwell
  - #1869 Bug B: RealizarQ4KTeacher (Q4K-native forward)
  - #1874 Defect 3: apr eval no-fake-pass on broken models (PMAT-702)
  - This PR: vocab alignment for teacher > student (PMAT-703)

With all four landed, the full pipeline from `apr distill` →
trained checkpoint → `apr eval` is honest end-to-end for any
(teacher, student) pair where teacher vocab >= student vocab and
the tokenizer prefix is shared.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 22, 2026
…g turn

Adds a §87 amendment to SPEC-DISTILL-001 documenting the root cause of
the PMAT-704 cascade fix: PR #1869 (Bug B / RealizarQ4KTeacher) was a
wrong turn — the realizar `_cuda` forward path is CPU-bound and
unusable as a distillation teacher on Grace Blackwell GB10. The 7B
vocab-aligned 500-step validation hung at step 0 for 1.5 h with GPU
at 0% utilization — empirical proof of the defect.

The amendment includes:

* Full five-whys chain (cuMemAlloc 30 GB ceiling vs phantom OOM-killer
  SIGKILL on the explicit-managed path), with file/line citations
  pointing to the CPU-heavy ops in
  crates/aprender-serve/src/gguf/cuda/cuda.rs:18
* Root cause: conflated two failures, missed the cheap dispatch-flip
  experiment that would have rejected Bug B's hypothesis in 5 minutes.
* Fix references: PR #1879 (PMAT-704) — cuBLAS default,
  RealizarQ4KTeacher demoted to APR_DISTILL_TEACHER_BACKEND=realizar-q4k
  opt-in fallback.
* Contract changes: new `apr-distill-teacher-backend-selection-v1.yaml`,
  `cuda-q4k-frozen-teacher-v1.yaml` demoted (not retracted).
* Methodology lesson: cheap-experiment-before-design discipline.
* Cascade closure table covering PRs #1863, #1869, #1871, #1874, #1877,
  #1879.

Spec version bumped 1.1.0 → 1.3.0 with changelog entries for both §86
(via PR #1871, also pending merge) and §87 (this PR). The amendment
notes the §86 cross-reference and explains the order-of-operations
in case readers see this on a build of main that predates #1871.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>