chore(distill): Stage D dispatch wrapper with PMAT-701 lessons baked in #1883
Open
noahgift wants to merge 2 commits into
…s baked in

`scripts/dispatch-distill-stage-d.sh` is the operator entrypoint for Phase 4 Stage D production training. Captures the PMAT-701 cascade post-mortem lessons in a single dispatchable wrapper:

* **cuBLAS default** (PMAT-704 / #1879). `APR_DISTILL_TEACHER_BACKEND=auto` by default; operators can opt into the slower memory-constrained Realizar path via `APR_DISTILL_TEACHER_BACKEND=realizar-q4k`.
* **Per-step monitoring** (PMAT-705 / #1881). `APR_DISTILL_LOG_EVERY=50` default: visible loss progress without log spam. Operators can set `APR_DISTILL_LOG_EVERY=1` for verbose mode or `APR_DISTILL_LOG_EVERY=0` to silence.
* **PMAT-699 P0 checkpointing** every 5000 steps (durability: survives kill / crash).
* **PMAT-703 vocab alignment** auto-applies inside the cuda backend when teacher.vocab > student.vocab (no operator config needed).
* **Disk preflight**: requires ≥ 15 GB free on /home/noah (Stage D 50K writes ~12 GB of checkpoints; PMAT-704 cascade post-mortem caught gx10 at 98 % full). Fails fast with cleanup candidates listed.
* **Teacher / student validation**: requires stamped APR metadata (apr-leaderboard checkpoint by default; the dispatch-script's `apr import --preserve-q4k` path fails the cuda backend's metadata-required check, surfaced by PMAT-704 incident).
* **Process-alive check**: 10 s post-dispatch verification catches early validation errors so the operator doesn't walk away from a failed dispatch.

The wrapper is intentionally separate from `dispatch-distill-phase-3-gx10.sh`, which remains the Phase 3 smoke entrypoint. Stage D is production scope and shouldn't inherit smoke defaults (see SPEC-DISTILL-001 §86 + memory `feedback_smoke_defaults_leak_into_production.md`).
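The disk preflight and the process-alive check described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function names `free_gb`, `has_enough_space`, `preflight_disk`, and `still_alive` are hypothetical, and the sketch omits the listing of cleanup candidates.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; function names are hypothetical, not taken from the PR.

# free_gb DIR: print the whole gigabytes available on the filesystem holding DIR.
free_gb() {
    # `df -P -k` prints one POSIX-format row per filesystem in 1024-byte blocks;
    # column 4 of the data row is the available space.
    df -P -k "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# has_enough_space AVAILABLE_GB REQUIRED_GB: succeed when available >= required.
has_enough_space() {
    [ "$1" -ge "$2" ]
}

# preflight_disk DIR REQUIRED_GB: report free space, return 1 if below the floor.
preflight_disk() {
    local dir="$1" required="$2" avail
    avail="$(free_gb "$dir")"
    if ! has_enough_space "$avail" "$required"; then
        echo "preflight: ${avail} GB free on ${dir}, need >= ${required} GB" >&2
        return 1
    fi
    echo "preflight: ${avail} GB free on ${dir}"
}

# still_alive PID GRACE_SECONDS: wait, then succeed only if PID is still running.
still_alive() {
    sleep "$2"
    # Signal 0 delivers nothing; it only checks that the process exists
    # and can be signalled by the current user.
    kill -0 "$1" 2>/dev/null
}
```

Under these assumptions, a dispatcher could call `preflight_disk /home/noah "${DISK_FREE_REQUIRED_GB:-15}"` before launching and `still_alive "$pid" 10` right after backgrounding the training process, aborting with a non-zero exit status if either check fails.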
## Override env vars

* `STEPS` (default 50000)
* `BATCH_SIZE` (default 32)
* `LR` (default 1.5e-5)
* `T` (default 4.0)
* `ALPHA` (default 0.3)
* `DATASET_DIR` (unset → synthetic; set to a `.bin` shard dir for real corpus)
* `APR_DISTILL_LOG_EVERY` (default 50)
* `APR_DISTILL_CHECKPOINT_EVERY` (default 5000)
* `APR_DISTILL_TEACHER_BACKEND` (default `auto`)
* `DISK_FREE_REQUIRED_GB` (default 15)
* `DRY_RUN=1` to plan only

## QA

* `bash -n scripts/dispatch-distill-stage-d.sh`: syntax-ok
* `bashrs lint scripts/dispatch-distill-stage-d.sh`: 0 errors (warnings are df-non-determinism + path-traversal-ln, both expected for an operator-supplied path dispatcher)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
`scripts/dispatch-distill-stage-d.sh` is the operator entrypoint for Phase 4 Stage D production training. Captures the PMAT-701 cascade post-mortem lessons in a single dispatchable wrapper.
What it bakes in
Override env vars
`STEPS`, `BATCH_SIZE`, `LR`, `T`, `ALPHA`, `DATASET_DIR`, `APR_DISTILL_LOG_EVERY`, `APR_DISTILL_CHECKPOINT_EVERY`, `APR_DISTILL_TEACHER_BACKEND`, `DISK_FREE_REQUIRED_GB`, `DRY_RUN`.
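One common way to make such variables overridable in bash is the `${VAR:-default}` expansion. The sketch below uses the default values quoted in the commit message; the `describe_plan` helper and the treatment of `DRY_RUN` as `0` when unset are assumptions made for illustration, and the actual script may be organised differently.

```shell
#!/usr/bin/env bash
# Illustrative sketch; describe_plan is a hypothetical helper, not the PR's code.

# Each variable keeps a value already set by the operator,
# otherwise it falls back to the default stated in the commit message.
STEPS="${STEPS:-50000}"
BATCH_SIZE="${BATCH_SIZE:-32}"
LR="${LR:-1.5e-5}"
T="${T:-4.0}"
ALPHA="${ALPHA:-0.3}"
DATASET_DIR="${DATASET_DIR:-}"   # empty means the synthetic corpus
APR_DISTILL_LOG_EVERY="${APR_DISTILL_LOG_EVERY:-50}"
APR_DISTILL_CHECKPOINT_EVERY="${APR_DISTILL_CHECKPOINT_EVERY:-5000}"
APR_DISTILL_TEACHER_BACKEND="${APR_DISTILL_TEACHER_BACKEND:-auto}"
DISK_FREE_REQUIRED_GB="${DISK_FREE_REQUIRED_GB:-15}"
DRY_RUN="${DRY_RUN:-0}"          # assumption: only the value 1 means "plan only"

# describe_plan: print the resolved settings without launching anything.
describe_plan() {
    local corpus="synthetic"
    if [ -n "$DATASET_DIR" ]; then
        corpus="$DATASET_DIR"
    fi
    echo "steps=${STEPS} batch=${BATCH_SIZE} lr=${LR} T=${T} alpha=${ALPHA} corpus=${corpus}"
    echo "log_every=${APR_DISTILL_LOG_EVERY} checkpoint_every=${APR_DISTILL_CHECKPOINT_EVERY}"
    echo "backend=${APR_DISTILL_TEACHER_BACKEND} min_free_gb=${DISK_FREE_REQUIRED_GB}"
}

if [ "$DRY_RUN" = "1" ]; then
    describe_plan
fi
```

With this pattern an override is supplied on the command line, for example `DRY_RUN=1 STEPS=1000 bash scripts/dispatch-distill-stage-d.sh`, and every variable left unset resolves to its default.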
Intentionally separate from `dispatch-distill-phase-3-gx10.sh` (smoke). SPEC-DISTILL-001 §86 + `feedback_smoke_defaults_leak_into_production.md` codified why these should NOT share defaults.
QA
Cascade context
Companion to the PMAT-701 family of fixes. Ready to dispatch once #1879 (PMAT-704 cuBLAS default) and #1881 (PMAT-705 ProgressCallback) land — without those, this wrapper would default to the slow / silent path.
🤖 Generated with Claude Code