Skip to content

feat(distill): wire ProgressCallback into Pipeline — close training-monitoring gap (PMAT-705)#1881

Open
noahgift wants to merge 3 commits into
mainfrom
feat/distill-pipeline-progress-callback-pmat-705
Open

feat(distill): wire ProgressCallback into Pipeline — close training-monitoring gap (PMAT-705)#1881
noahgift wants to merge 3 commits into
mainfrom
feat/distill-pipeline-progress-callback-pmat-705

Conversation

@noahgift
Copy link
Copy Markdown
Contributor

Summary

Closes the training-monitoring gap surfaced by the PMAT-704 cascade. A 1.5 h gx10 hang at step 0 went unnoticed because the distill pipeline emits zero output between "blocks uploaded" and "Distillation complete."

Root cause: `aprender-train` ships a full callback infrastructure (`ProgressCallback`, `MonitorCallback`, `CheckpointCallback`, etc.) but `aprender-train-distill::Pipeline` had zero callback references. Per-step loss was computed in `kd_step` and discarded.

Fix

`crates/aprender-train-distill/src/pipeline.rs`:

  • `Pipeline` gains `callbacks: Vec<Box>` field + `with_callback()` builder.
  • Training loop wires full lifecycle: `on_train_begin` → `on_epoch_begin` → `on_step_end` × N (with current loss + step counter) → `on_epoch_end` → `on_train_end`.
  • `CallbackAction::Stop` terminates the loop within one step (early-stopping support).

`crates/apr-cli/src/commands/distill.rs::run_cuda_backend`:

  • Attaches default `ProgressCallback` honoring `APR_DISTILL_LOG_EVERY` env (default 10). Set to 0 to disable per-step logs (epoch boundaries still log).

Contract

`contracts/distill-pipeline-observability-v1.yaml` (validates clean):

  • 3 equations: callback lifecycle, progress log format, default attachment
  • 4 falsifiers: per-step callback fires; log_every honored; APR_DISTILL_LOG_EVERY=0 silences; overhead bounded
  • 2 Kani harnesses (callback ordering, Stop terminates loop)
  • qa_gate F-OBSERV-001

Unit tests

3 new tests in `pipeline::tests::pmat_705_observability`, all PASS:

```
falsify_observ_001_per_step_callback_fires .... ok
falsify_observ_001_step_count_matches ......... ok
callback_ordering_preserved ................... ok
```

Use a recording callback that captures every lifecycle event and verify counts match (epochs × steps_per_epoch).

gx10 verification

Dispatch emits:

```
[PMAT-705] ProgressCallback attached (log every 1 steps)
```

Full per-step output on the 7B teacher path needs #1879 (PMAT-704) to land too — current main routes Q4K through Bug B's CPU-bound path which hangs at step 0 (no steps → no logs). The PMAT-705 wiring is correct independent of teacher backend; unit tests prove it.

Cascade

Sixth fix in the PMAT-701 family:

Together the distill → eval pipeline is honest AND observable end-to-end.

Test plan

  • `cargo build --release --features cuda -p apr-cli` — clean
  • `cargo test -p aprender-train-distill --lib pipeline::tests::pmat_705` — 3/3 PASS
  • `cargo fmt -p aprender-train-distill -p apr-cli --check` — clean
  • `pv validate contracts/distill-pipeline-observability-v1.yaml` — clean
  • gx10 dispatch confirms `[PMAT-705] ProgressCallback attached` marker fires
  • CI: `ci / gate` + `workspace-test` green
  • Follow-up: capture per-step loss trajectory from a 7B run once fix(distill): default Q4K teacher to CudaTrainerTeacher (cuBLAS) — revert Bug B's slow path (PMAT-704) #1879 lands

🤖 Generated with Claude Code

…onitoring gap (PMAT-705)

The PMAT-704 cascade post-mortem revealed a hard observability gap: long
distill dispatches were silent between "blocks uploaded" and
"Distillation complete." There was no way to distinguish "training
progressing silently" from "training hung." A 1.5 h gx10 hang at step 0
was invisible until terminal state.

Root cause: `aprender-train` ships a full callback infrastructure
(`ProgressCallback`, `MonitorCallback`, `CheckpointCallback`,
`EarlyStoppingCallback`, etc.) but `aprender-train-distill::Pipeline`
had zero callback references. Per-step loss was computed in `kd_step`
and discarded.

## Fix

`crates/aprender-train-distill/src/pipeline.rs`:

* `Pipeline` gains `callbacks: Vec<Box<dyn entrenar::train::TrainerCallback>>`
  field and `with_callback()` builder method.
* Training loop fires the full lifecycle:
  - `on_train_begin` once before the first step.
  - `on_epoch_begin` at each epoch boundary.
  - `on_step_end` after grad application with current loss + step
    counter (the load-bearing observability point).
  - `on_epoch_end` at each epoch boundary.
  - `on_train_end` once after the loop.
* `CallbackAction::Stop` from any callback terminates the training
  loop within one step (early-stopping support).

`crates/apr-cli/src/commands/distill.rs::run_cuda_backend`:

* Attaches a default `ProgressCallback` honoring the env var
  `APR_DISTILL_LOG_EVERY` (default 10). `APR_DISTILL_LOG_EVERY=0`
  disables per-step logs but keeps epoch boundary lines.

## Contract

`contracts/distill-pipeline-observability-v1.yaml`:

* 3 equations: callback_lifecycle, progress_log_format, default_attachment.
* 4 falsifiers (FT-OBSERV-001..004): per-step callback fires; log_every
  honored; APR_DISTILL_LOG_EVERY=0 silences per-step; callback overhead
  bounded.
* 2 Kani harnesses (callback_ordering, Stop_terminates_loop).
* qa_gate F-OBSERV-001.

Validates clean: `pv validate` reports 0 errors, 0 warnings.

## Unit tests

`crates/aprender-train-distill/src/pipeline.rs` tests module:

* `pmat_705_observability::falsify_observ_001_per_step_callback_fires`
* `pmat_705_observability::falsify_observ_001_step_count_matches`
* `pmat_705_observability::callback_ordering_preserved`

All 3 PASS. Use a recording callback that captures every lifecycle
event and verify event counts match the expected (epochs ×
steps_per_epoch) shape.

## gx10 dispatch verification

Dispatch on gx10 emits:

  [PMAT-705] ProgressCallback attached (log every 1 steps)

…confirming the wiring fires. Full per-step output on the 7B teacher
path requires PMAT-704 (#1879) to also land — current main routes Q4K
teachers through Bug B's CPU-bound `RealizarQ4KTeacher` which hangs at
step 0 (no progress to log). The PMAT-705 wiring is correct
independent of teacher backend; the unit tests prove it.

## Cascade context

Sixth fix in the PMAT-701 family:

  - #1863 PMAT-701 Bug A: allocator autodetect Grace Blackwell
  - #1869 PMAT-701 Bug B: RealizarQ4KTeacher (demoted by #1879)
  - #1871 SPEC-DISTILL §86 + dispatch script default
  - #1874 PMAT-702: apr eval no-fake-pass on broken models
  - #1877 PMAT-703: teacher vocab alignment (superseded by #1879's TruncatingTeacher)
  - #1879 PMAT-704: cuBLAS default + opt-in Realizar fallback
  - #1880 SPEC-DISTILL §87: PMAT-704 post-mortem
  - **This PR**: PMAT-705 — ProgressCallback wired into Pipeline

Together: the distill → eval pipeline is honest end-to-end AND
observable end-to-end.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 22, 2026 11:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant