fix(validate): gate --quality threshold on implemented checks (closes #1866) by noahgift · Pull Request #1870 · paiml/aprender

noahgift · 2026-05-22T07:10:42Z

Summary

Fixes #1866: `apr validate --quality` reported Grade F (3/100, exit 5) on every valid APR file. Root cause: 22 of 25 checks in the 100-point QA checklist are stubbed `Skip("Not implemented")`, but the threshold gate compared `total_score` against the full 100-point denominator. Working models were mathematically incapable of clearing 50/100 until every stub was filled in.

Fix

New `ValidationReport` methods (`validation.rs`)

- `implemented_max() -> u8`: count of checks whose status is not `Skip`
- `implemented_score_pct() -> Option<f64>`: percentage of implemented checks that passed; `None` when the entire suite is stubbed (treat as informational)
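The two methods can be sketched roughly as follows. This is a minimal illustration, not the code in `validation.rs`: the `CheckStatus` variants and the `checks` field are simplified assumptions, and only the method names and return types come from the description above.

```rust
/// Simplified stand-in for the status of one QA check (assumed shape).
#[derive(Debug, Clone, PartialEq)]
enum CheckStatus {
    Pass,
    Fail(String),
    Skip(String),
}

/// Simplified stand-in for the report; the real struct carries more fields.
struct ValidationReport {
    checks: Vec<CheckStatus>,
}

impl ValidationReport {
    /// Count of checks whose status is not `Skip`.
    /// The cast assumes a small checklist (25 checks in this PR).
    fn implemented_max(&self) -> u8 {
        self.checks
            .iter()
            .filter(|s| !matches!(s, CheckStatus::Skip(_)))
            .count() as u8
    }

    /// Percentage of implemented checks that passed,
    /// or `None` when every check is stubbed.
    fn implemented_score_pct(&self) -> Option<f64> {
        let max = self.implemented_max();
        if max == 0 {
            return None;
        }
        let passed = self
            .checks
            .iter()
            .filter(|s| matches!(s, CheckStatus::Pass))
            .count();
        Some(passed as f64 * 100.0 / f64::from(max))
    }
}
```

Returning `Option<f64>` rather than `0.0` for a fully stubbed suite keeps "nothing was measured" distinct from "everything failed", which is what lets the gate treat the former as informational.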

Updated threshold gate (`validate.rs`)

- `Some(pct)` with `pct < 50` → `ValidationFailed` (clear breakage signal)
- `Some(pct)` with `pct >= 50` → pass
- `None` (fully stubbed) → pass as informational; `apr qa` is the canonical pass/fail gate
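The three branches map naturally onto a `match` over the `Option<f64>`. The sketch below is an assumption-laden illustration: `GateError`, `quality_gate`, `THRESHOLD_PCT`, and the error wording are hypothetical names chosen here, not identifiers taken from `validate.rs`.

```rust
/// Hypothetical error type standing in for the CLI's `ValidationFailed` case.
#[derive(Debug, PartialEq)]
enum GateError {
    ValidationFailed(String),
}

/// Minimum percentage of implemented checks that must pass.
const THRESHOLD_PCT: f64 = 50.0;

/// Gate on the implemented-check percentage instead of the 100-point total.
/// `implemented` is only used to name the denominator in the error message.
fn quality_gate(pct: Option<f64>, implemented: u8) -> Result<(), GateError> {
    match pct {
        // Fully stubbed suite: informational only, never a hard failure.
        None => Ok(()),
        // Below threshold: fail, naming the implemented denominator.
        Some(p) if p < THRESHOLD_PCT => Err(GateError::ValidationFailed(format!(
            "{:.0}% of {} implemented checks passed (below {:.0}% threshold)",
            p, implemented, THRESHOLD_PCT
        ))),
        // At or above threshold: pass.
        Some(_) => Ok(()),
    }
}
```

With this shape, exactly 50% passes (the comparison is strict), matching the `pct >= 50` branch above.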

Contract

New contracts/apr-validate-quality-threshold-v1.yaml:

equation implemented_denominator
equation implemented_score_pct
equation threshold_gate_on_implemented
FALSIFY-VALIDATE-QUALITY-001: None-when-all-stubbed (PASS)
FALSIFY-VALIDATE-QUALITY-002: 100-when-all-pass (PASS, matches the #1866 reproducer)
FALSIFY-VALIDATE-QUALITY-003: no bare report.total_score < 50 in gate (PASS)
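The contract file itself is not reproduced here, so the following is only a reading of its three equations from the prose in this description. The symbols $D$ and $P$ are labels introduced for illustration and may not match the YAML.

```latex
% D: number of implemented (non-Skip) checks; P: number of those that passed.
D = \bigl|\{\, c : \mathrm{status}(c) \neq \mathrm{Skip} \,\}\bigr|
\qquad
\mathrm{pct} =
\begin{cases}
  \mathrm{None} & \text{if } D = 0 \\
  \mathrm{Some}(100 \cdot P / D) & \text{if } D > 0
\end{cases}
\qquad
\mathrm{fail} \iff \mathrm{pct} = \mathrm{Some}(p) \ \text{with}\ p < 50
```

For the #1866 reproducer, $P = D = 3$, so the percentage is 100 and the gate passes; the old comparison set the score of 3 against the 100-point denominator, and $3 < 50$ failed.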

Tests

4 new unit tests, all pass:

test_implemented_score_pct_none_when_all_stubbed
test_implemented_score_pct_100_when_all_pass (replicates the #1866 1.5B reproducer in unit form)
test_implemented_score_pct_mixed (4 runnable, 2 pass = exactly 50%)
test_implemented_score_pct_below_threshold (4 runnable, 1 pass = 25% < 50%)

End-to-end verification

$ apr validate /home/noah/models/qwen2.5-coder-1.5b-instruct-q4k.apr --quality --strict
[3/3 implemented checks pass, 22 stubbed]
TOTAL: 3/100  Grade: F   ← informational display unchanged
$ echo $?
0   ← was 5 ("Score 3/100 (below 50% threshold)")

The display still shows the 3/100 aspirational score (so users know how much of the QA checklist is implemented), but the threshold gate no longer fails on stubbed checks.

Out of scope

The `--strict is not yet implemented` warning at validate.rs:294,454 remains; tracked for follow-up.

Test plan

4 new unit tests pass
FALSIFY-VALIDATE-QUALITY-{001,002,003} all PASS
Contract YAML parses
Real reproducer: 1.5B Q4K APR exits 0 (was 5)
apr qa remains the canonical pass/fail gate (unchanged)
CI: workspace-test, fmt, contracts, deny

🤖 Generated with Claude Code

…1866)

`apr validate --quality --strict` returned `Grade F 3/100 exit=5` on every valid APR file, including models that `apr qa` says ✓ ALL GATES PASSED and that produce correct inference.

Root cause: 22 of 25 checks in the 100-point QA checklist are stubbed `Skip("Not implemented")`, but the pass/fail gate compared `total_score` against the full 100-point denominator. Working models were thus mathematically incapable of clearing 50/100 until every stub was filled in.

Fix: gate on the percentage of *implemented* (non-Skip) checks instead of the aspirational 100-point ceiling.

New methods on `ValidationReport`:
- `implemented_max()`: count of checks whose status is not `Skip`
- `implemented_score_pct() -> Option<f64>`: pct of implemented checks that passed; `None` when all are stubbed (treat as informational)

Threshold gate in `validate.rs`: when `implemented_score_pct() == None` (fully stubbed) the suite is informational, not a hard fail. When it returns `Some(pct)`, the < 50% gate fires with an error message that names the implemented denominator, not the aspirational 100.

Contract: new `contracts/apr-validate-quality-threshold-v1.yaml`:
- equation `implemented_denominator`
- equation `implemented_score_pct`
- equation `threshold_gate_on_implemented`
- 3 falsifiers (None-when-all-stubbed, 100-when-all-pass, no bare `report.total_score < 50` comparison in the gate).

Tests: 4 new unit tests on `ValidationReport`:
- `test_implemented_score_pct_none_when_all_stubbed`
- `test_implemented_score_pct_100_when_all_pass` (matches #1866 repro)
- `test_implemented_score_pct_mixed`
- `test_implemented_score_pct_below_threshold`

Verified end-to-end:

    $ apr validate /home/noah/models/qwen2.5-coder-1.5b-instruct-q4k.apr \
        --quality --strict
    [3/3 implemented checks pass, 22 stubbed]
    TOTAL: 3/100 Grade: F ← informational display unchanged
    $ echo $?
    0 ← was 5 (was: "Score 3/100 (below 50% threshold)")

`apr qa` remains the canonical pass/fail gate per CLAUDE.md; `apr validate --quality` complements it as a structural-integrity audit.

Note: `--strict is not yet implemented` warning at validate.rs:294,454 remains for follow-up (separate ticket).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…itstream-io) (#1878)

`cargo deny check advisories` started failing on every PR (and on main) 2026-05-22 with:

    error[unmaintained]: core2 is unmaintained, all versions yanked
    ├ ID: RUSTSEC-2026-0105
    ├ Advisory: https://rustsec.org/advisories/RUSTSEC-2026-0105

The dep is pulled in transitively via `bitstream-io` (image/media decoding stack; `cargo tree` shows `bitstream-io v4.9.0 → core2 v0.4.0`). No first-party use; no drop-in replacement until upstream `bitstream-io` migrates off core2.

This commit unblocks the in-flight PR cascade (#1867 #1868 #1870 #1873 #1875 #1876) which all failed CI's `ci / lint` step on this advisory. The deny entry is structured per the existing pattern in this file (id + human reason mentioning the transitive path) so revisiting the ignore in 6-12 months is straightforward.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

…ly cron (#1875)

Adds an end-to-end "Qwen story" that exercises every core apr command group against the Qwen scale ladder (0.5B → 1.5B → 7B → 30B-MoE). The story is the single canonical demo in README.md AND a regression gate via runnable script + falsification contract + nightly cron.

## Beats

1. **Discover** (Registry): pull, list
2. **Trust** (QA): qa, validate, lint
3. **Explore** (Inspection): inspect, tensors, tree
4. **Adapt** (Model ops): export, diff, convert/quantize
5. **Use** (Inference): run, chat, code
6. **Serve** (REST): serve run + curl /v1/chat/completions OpenAI-compat
7. **Operate** (Profiling): profile, gpu, serve plan (7B Q4K GGUF)
8. **Scale** (MoE): inspect, tensors on 30B-MoE qwen3moe

## Pmat bug-hunt layer

When run with `PMAT_HUNT=1` (default), each beat emits a structured manifest of high-risk untested code in the command-handler modules it just exercised:

    -- pmat bug-hunt manifest (run chat code) --
    gap crates/apr-cli/src/commands/run.rs:resolve_model_alias (impact=42.3)
    churn crates/apr-cli/src/commands/code.rs:dispatch_agent (commits=11)
    fault crates/aprender-serve/src/api/cuda_chat_backend.rs:try_qwen3_moe (unwrap)

The nightly cron uploads this manifest as an artifact, compares against the previous successful run, and opens (or comments on) a tracking issue when growth exceeds 5 lines, so untested branches in command handlers can't accumulate quietly.

## Files

- `scripts/qwen-story.sh` (336 LOC): runnable story with proper exit-code capture (`OUT=$(cmd); EC=$?` everywhere; no pipe-then-`$?` per memory rule)
- `contracts/qwen-story-v1.yaml`: 3 equations + 8 falsifiers, all PASS locally (script exists+executable, 8 beats, run_cmd helper, pmat_hunt per beat, README link, daily cron file, bashrs clean, Beat 7 skips `apr qa` on 7B Q4K due to #1864)
- `README.md`: new `## A Qwen story` section replacing the flat `## CLI examples` block. Fixes two README bugs surfaced during dogfood: `apr profile --roofline` (no such flag; just `apr profile <file>`) and `apr bench --assert-tps` (flag is on `apr qa`, not `bench`).
- `.github/workflows/qwen-story-daily.yml`: self-hosted GPU runner, 04:17 UTC cron + workflow_dispatch, uploads pmat manifest + story log artifacts, files tracking issue when story regresses or manifest grows.

## Verification

    $ bash scripts/qwen-story.sh # local smoke
    -- Beat 1: Discover (Registry) --
    ✓ PASS B1 list
    -- Beat 2: Trust (QA gates) --
    ✓ PASS B2 apr qa
    ✗ FAIL B2 apr validate --quality - exit=5 (after #1866 fix this should be 0)
    -- Beat 3: Explore (Inspection) --
    ✓ PASS B3 apr inspect --json (arch=qwen2)
    ✓ PASS B3 apr tensors --json (339 tensors)
    ✓ PASS B3 apr tree
    -- Beat 4: Adapt (Model ops) --
    ✗ FAIL B4 apr export - PANIC (exit=101) - #1865 regression
    -- Beat 5: Use (Inference) --
    ✓ PASS B5 apr run (Rust code completion)
    ✓ PASS B5 apr code -p
    -- Beat 6: Serve (REST API) --
    ✓ PASS B6 apr serve run (port=22915)
    ✓ PASS B6 /v1/chat/completions (got OK...)
    -- Beat 7: Operate (Profiling) --
    ✓ PASS B7 apr profile
    ✓ PASS B7 apr gpu --json
    ✓ PASS B7 apr serve plan -- 7B VRAM budget
    -- Beat 8: Scale (MoE introspection) --
    ✓ PASS B8 apr inspect --json (arch=qwen3moe)
    ✓ PASS B8 apr tensors --json (579 tensors)
    14 PASS / 2 FAIL / 0 SKIP

The 2 FAILs are EXPECTED until the in-flight fixes land:
- B2 validate --quality: closed by #1870
- B4 export panic: closed by #1868

Once those PRs merge, this story will be 16 PASS / 0 FAIL / 0 SKIP on a host with all 4 Qwen models cached.

## Follow-up

A separate PR will add `/dogfood` Gate 18 that invokes this script (kept separate to avoid conflict with PR #1872 which is already adding Gates 13-17 to the dogfood skill).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

noahgift enabled auto-merge (squash) May 22, 2026 07:10

Merge branch 'main' into fix/apr-validate-quality-threshold-1866

ad994cd

noahgift mentioned this pull request May 22, 2026

feat(qwen-story): 8-beat E2E narrative + pmat bug-hunt + daily cron #1875

Merged

7 tasks

Merge branch 'main' into fix/apr-validate-quality-threshold-1866

173f0a9

noahgift mentioned this pull request May 22, 2026

chore(deny): ignore RUSTSEC-2026-0105 (core2 yanked, transitive via bitstream-io) #1878

Merged

3 tasks

noahgift added 3 commits May 22, 2026 11:46

Merge branch 'main' into fix/apr-validate-quality-threshold-1866

40001b4

Merge branch 'main' into fix/apr-validate-quality-threshold-1866

4bdb1a4

Merge branch 'main' into fix/apr-validate-quality-threshold-1866

a4d2044

noahgift added 2 commits May 22, 2026 13:28

Merge branch 'main' into fix/apr-validate-quality-threshold-1866

db77da1

Merge branch 'main' into fix/apr-validate-quality-threshold-1866

a298834

noahgift merged commit 41b1a99 into main May 22, 2026
10 checks passed

noahgift deleted the fix/apr-validate-quality-threshold-1866 branch May 22, 2026 12:14

This was referenced May 22, 2026

spec(SPEC-CUBLAS-FP8-7B-FIX-001): epic to root-cause cuBLAS FP8 7B gibberish (holds v0.35.0) #1882

Closed

Qwen2.5-7B Q4_K GPU inference produces gibberish — 'ampiezza' (wgpu) / '<|im_start|>' (cuBLAS) — regression vs #374 / #559 #1864

Open

noahgift merged 8 commits into main from fix/apr-validate-quality-threshold-1866
