chore(distill): default to MODEL-1 7B teacher + SPEC-DISTILL-001 §86 (PMAT-701 follow-up) #1871
Open
noahgift wants to merge 2 commits into
…(PMAT-701 follow-up)

The Phase 4 Stage D 50K + 10K runs (2026-05-20/21) silently inherited the Phase 3 smoke workaround of TEACHER_REPO == STUDENT_INIT == 0.5B. Result: no KD signal, 30 h of compute that fine-tuned the base model toward gibberish on a small corpus. Documented in `evidence/distill-7b-teacher-loadtest-gx10/findings.json` + this spec amendment.

Now that PMAT-701 Bug A (PR #1863) and Bug B (PR #1869) have landed, the 7B Q4K teacher is feasible on Grace Blackwell GB10:

* PR #1863: trueno-gpu allocator autodetects unified-memory devices (Grace, Tegra) and routes to cuMemAllocManaged so the full 128 GB pool is reachable.
* PR #1869: new RealizarQ4KTeacher keeps Q4K teacher weights quantized on the GPU (no F32 dequant at upload), eliminating the OOM-kill that was killing the first training step.

This PR flips the dispatch script's default and codifies the why in spec §86:

* `scripts/dispatch-distill-phase-3-gx10.sh`: TEACHER_REPO default changes from `Qwen/Qwen2.5-Coder-0.5B-Instruct` (smoke fallback) to `paiml/qwen2.5-coder-7b-apache-q4k-v1` (the MODEL-1 teacher the spec was designed around). Smoke-only callers override with the env var.
* `docs/specifications/aprender-train/distillation-epic-spec.md`: adds §86 documenting the 5-whys, the fix references, and a new falsifier F-DISTILL-V2-001-TEACHER-DIVERGENCE that rejects future Phase-4-class dispatches where teacher == student unless an explicit override is set.
* Spec version bumped to 1.2.0 with changelog entry.

The §86 amendment also notes that the existing 50K + 10K Stage D runs do NOT count toward AC-DISTILL-003. They are discharged as no-KD baselines, and a re-dispatched 50K run with the 7B teacher is required for a real Phase 4 verdict.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
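For reviewers who want a concrete picture of the default flip and the teacher-divergence guard, here is a minimal bash sketch. It is not the contents of `scripts/dispatch-distill-phase-3-gx10.sh`: the function name `check_teacher_divergence`, the `ALLOW_SAME_TEACHER` override variable, and the `STUDENT_INIT` default value are assumptions made for illustration. Only `TEACHER_REPO`, `STUDENT_INIT`, and the two repo names come from this PR.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; not the actual dispatch script.

# Default to the MODEL-1 7B teacher. Smoke-only callers override via the env var.
TEACHER_REPO="${TEACHER_REPO:-paiml/qwen2.5-coder-7b-apache-q4k-v1}"
# Assumption: the student initialises from the 0.5B model named in the PR text.
STUDENT_INIT="${STUDENT_INIT:-Qwen/Qwen2.5-Coder-0.5B-Instruct}"

# Hypothetical guard in the spirit of F-DISTILL-V2-001-TEACHER-DIVERGENCE:
# reject teacher == student unless the caller passes an explicit override.
#   $1 = teacher repo, $2 = student init, $3 = "1" to allow identical models
check_teacher_divergence() {
  local teacher="$1" student="$2" allow_same="${3:-0}"
  if [ "$teacher" = "$student" ] && [ "$allow_same" != "1" ]; then
    echo "refusing dispatch: teacher == student ($teacher), no KD signal" >&2
    return 1
  fi
  return 0
}
```

A dispatch script could then call `check_teacher_divergence "$TEACHER_REPO" "$STUDENT_INIT" "${ALLOW_SAME_TEACHER:-0}" || exit 1` before launching training, so a smoke run has to opt in to the identical-model configuration explicitly.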
Summary
Now that PMAT-701 Bug A (#1863) and Bug B (#1869) have landed, the MODEL-1 7B teacher (`paiml/qwen2.5-coder-7b-apache-q4k-v1`) is feasible on Grace Blackwell GB10. This PR is the small `chore` change that flips the dispatch script default + records the why in SPEC-DISTILL-001 §86.
Changes
Why this matters
The Phase 4 Stage D 50K (25 h) and 10K (5 h) runs on 2026-05-20/21 silently inherited the Phase 3 smoke workaround of TEACHER_REPO == STUDENT_INIT == 0.5B. With the student initialised from the same weights as the teacher, the KD signal was ~zero (KL between identical distributions), so 30 hours of compute fine-tuned the base model toward gibberish on a synthetic-ish corpus. The §86 amendment makes that mistake hard to repeat.
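The "~zero" claim follows directly from the definition of KL divergence. As a sketch, assuming the distillation term is a KL divergence between the teacher's and student's next-token distributions (the exact loss used by the trainer is not stated in this PR; $p_T$ and $p_S$ are notation introduced here):

$$
D_{\mathrm{KL}}(p_T \,\|\, p_S) \;=\; \sum_{v} p_T(v)\,\log\frac{p_T(v)}{p_S(v)},
\qquad
p_T = p_S \;\Rightarrow\; D_{\mathrm{KL}} = \sum_{v} p_T(v)\,\log 1 = 0 .
$$

When teacher and student are the same checkpoint, every ratio inside the logarithm is 1 at initialisation, so this term contributes no loss for the student to learn from.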
Test plan
🤖 Generated with Claude Code