feat(distill): wire ProgressCallback into Pipeline — close training-monitoring gap (PMAT-705)#1881
Open
noahgift wants to merge 3 commits into
Open
feat(distill): wire ProgressCallback into Pipeline — close training-monitoring gap (PMAT-705)#1881noahgift wants to merge 3 commits into
noahgift wants to merge 3 commits into
Conversation
…onitoring gap (PMAT-705)
The PMAT-704 cascade post-mortem revealed a hard observability gap: long
distill dispatches were silent between "blocks uploaded" and
"Distillation complete." There was no way to distinguish "training
progressing silently" from "training hung." A 1.5 h gx10 hang at step 0
was invisible until terminal state.
Root cause: `aprender-train` ships a full callback infrastructure
(`ProgressCallback`, `MonitorCallback`, `CheckpointCallback`,
`EarlyStoppingCallback`, etc.) but `aprender-train-distill::Pipeline`
had zero callback references. Per-step loss was computed in `kd_step`
and discarded.
## Fix
`crates/aprender-train-distill/src/pipeline.rs`:
* `Pipeline` gains `callbacks: Vec<Box<dyn entrenar::train::TrainerCallback>>`
field and `with_callback()` builder method.
* Training loop fires the full lifecycle:
- `on_train_begin` once before the first step.
- `on_epoch_begin` at each epoch boundary.
- `on_step_end` after grad application with current loss + step
counter (the load-bearing observability point).
- `on_epoch_end` at each epoch boundary.
- `on_train_end` once after the loop.
* `CallbackAction::Stop` from any callback terminates the training
loop within one step (early-stopping support).
`crates/apr-cli/src/commands/distill.rs::run_cuda_backend`:
* Attaches a default `ProgressCallback` honoring the env var
`APR_DISTILL_LOG_EVERY` (default 10). `APR_DISTILL_LOG_EVERY=0`
disables per-step logs but keeps epoch boundary lines.
## Contract
`contracts/distill-pipeline-observability-v1.yaml`:
* 3 equations: callback_lifecycle, progress_log_format, default_attachment.
* 4 falsifiers (FT-OBSERV-001..004): per-step callback fires; log_every
honored; APR_DISTILL_LOG_EVERY=0 silences per-step; callback overhead
bounded.
* 2 Kani harnesses (callback_ordering, Stop_terminates_loop).
* qa_gate F-OBSERV-001.
Validates clean: `pv validate` reports 0 errors, 0 warnings.
## Unit tests
`crates/aprender-train-distill/src/pipeline.rs` tests module:
* `pmat_705_observability::falsify_observ_001_per_step_callback_fires`
* `pmat_705_observability::falsify_observ_001_step_count_matches`
* `pmat_705_observability::callback_ordering_preserved`
All 3 PASS. Use a recording callback that captures every lifecycle
event and verify event counts match the expected (epochs ×
steps_per_epoch) shape.
## gx10 dispatch verification
Dispatch on gx10 emits:
[PMAT-705] ProgressCallback attached (log every 1 steps)
…confirming the wiring fires. Full per-step output on the 7B teacher
path requires PMAT-704 (#1879) to also land — current main routes Q4K
teachers through Bug B's CPU-bound `RealizarQ4KTeacher` which hangs at
step 0 (no progress to log). The PMAT-705 wiring is correct
independent of teacher backend; the unit tests prove it.
## Cascade context
Sixth fix in the PMAT-701 family:
- #1863 PMAT-701 Bug A: allocator autodetect Grace Blackwell
- #1869 PMAT-701 Bug B: RealizarQ4KTeacher (demoted by #1879)
- #1871 SPEC-DISTILL §86 + dispatch script default
- #1874 PMAT-702: apr eval no-fake-pass on broken models
- #1877 PMAT-703: teacher vocab alignment (superseded by #1879's TruncatingTeacher)
- #1879 PMAT-704: cuBLAS default + opt-in Realizar fallback
- #1880 SPEC-DISTILL §87: PMAT-704 post-mortem
- **This PR**: PMAT-705 — ProgressCallback wired into Pipeline
Together: the distill → eval pipeline is honest end-to-end AND
observable end-to-end.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This was referenced May 22, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes the training-monitoring gap surfaced by the PMAT-704 cascade. A 1.5 h gx10 hang at step 0 went unnoticed because the distill pipeline emits zero output between "blocks uploaded" and "Distillation complete."
Root cause: `aprender-train` ships a full callback infrastructure (`ProgressCallback`, `MonitorCallback`, `CheckpointCallback`, etc.) but `aprender-train-distill::Pipeline` had zero callback references. Per-step loss was computed in `kd_step` and discarded.
Fix
`crates/aprender-train-distill/src/pipeline.rs`:
`crates/apr-cli/src/commands/distill.rs::run_cuda_backend`:
Contract
`contracts/distill-pipeline-observability-v1.yaml` (validates clean):
Unit tests
3 new tests in `pipeline::tests::pmat_705_observability`, all PASS:
```
falsify_observ_001_per_step_callback_fires .... ok
falsify_observ_001_step_count_matches ......... ok
callback_ordering_preserved ................... ok
```
Use a recording callback that captures every lifecycle event and verify counts match (epochs × steps_per_epoch).
gx10 verification
Dispatch emits:
```
[PMAT-705] ProgressCallback attached (log every 1 steps)
```
Full per-step output on the 7B teacher path needs #1879 (PMAT-704) to land too — current main routes Q4K through Bug B's CPU-bound path which hangs at step 0 (no steps → no logs). The PMAT-705 wiring is correct independent of teacher backend; unit tests prove it.
Cascade
Sixth fix in the PMAT-701 family:
Together the distill → eval pipeline is honest AND observable end-to-end.
Test plan
🤖 Generated with Claude Code