fix(validate): gate --quality threshold on implemented checks (closes #1866) #1870

Merged: noahgift merged 8 commits into main from fix/apr-validate-quality-threshold-1866 on May 22, 2026

@noahgift

Summary

Fixes #1866: apr validate --quality reported Grade F (3/100, exit 5) on every valid APR file. Root cause: 22 of 25 checks in the 100-point QA checklist are stubbed as Skip("Not implemented"), but the threshold gate compared total_score against the full 100-point denominator. Working models were mathematically incapable of clearing 50/100 until every stub was filled in.

Fix

New ValidationReport methods (validation.rs)

  • implemented_max() -> u8 — count of checks whose status is not Skip
  • implemented_score_pct() -> Option<f64> — pct of implemented checks that passed; None when the entire suite is stubbed (treat as informational)
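The two methods above could look roughly like the sketch below. The `CheckStatus` enum and the `checks` field are assumptions made for illustration; the real `ValidationReport` in validation.rs may store checks and points differently. Only the method names and return types come from this PR.

```rust
/// Outcome of a single QA check (hypothetical shape, not the real enum).
#[derive(Debug, Clone, PartialEq)]
pub enum CheckStatus {
    Pass,
    Fail(String),
    Skip(String),
}

/// Minimal stand-in for the report type; only what this sketch needs.
#[derive(Debug, Default)]
pub struct ValidationReport {
    pub checks: Vec<CheckStatus>,
}

impl ValidationReport {
    /// Number of checks whose status is not `Skip`, saturating at `u8::MAX`.
    pub fn implemented_max(&self) -> u8 {
        let n = self
            .checks
            .iter()
            .filter(|c| !matches!(c, CheckStatus::Skip(_)))
            .count();
        n.min(usize::from(u8::MAX)) as u8
    }

    /// Percentage of implemented checks that passed.
    /// Returns `None` when every check is stubbed, so callers can treat
    /// the suite as informational instead of dividing by zero.
    pub fn implemented_score_pct(&self) -> Option<f64> {
        let implemented = self.implemented_max();
        if implemented == 0 {
            return None;
        }
        let passed = self
            .checks
            .iter()
            .filter(|c| matches!(c, CheckStatus::Pass))
            .count();
        Some(passed as f64 * 100.0 / f64::from(implemented))
    }
}
```

With three passing checks and the rest skipped, this sketch yields `Some(100.0)`, which is the situation described in #1866.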

Updated threshold gate (validate.rs)

  • Some(pct) with pct < 50 → ValidationFailed (clear breakage signal)
  • Some(pct) with pct >= 50 → pass
  • None (fully stubbed) → pass as informational; apr qa is the canonical pass/fail gate
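The three cases above map naturally onto a single `match`. The following is a minimal sketch, not the code in validate.rs: `GateError`, `quality_gate`, the constant name, and the error text are all hypothetical, and only the 50% threshold and the three outcomes are taken from this PR.

```rust
/// Hypothetical error type standing in for the CLI's real one.
#[derive(Debug, PartialEq)]
pub enum GateError {
    ValidationFailed(String),
}

/// Pass/fail threshold, as a percentage of implemented checks.
pub const QUALITY_THRESHOLD_PCT: f64 = 50.0;

/// Decide the gate from the implemented-checks percentage.
pub fn quality_gate(implemented_pct: Option<f64>) -> Result<(), GateError> {
    match implemented_pct {
        // Implemented checks ran and too few passed: hard failure.
        Some(pct) if pct < QUALITY_THRESHOLD_PCT => Err(GateError::ValidationFailed(format!(
            "{:.0}% of implemented checks passed (below {:.0}% threshold)",
            pct, QUALITY_THRESHOLD_PCT
        ))),
        // At or above the threshold: pass.
        Some(_) => Ok(()),
        // Fully stubbed suite: informational only, never a hard fail.
        None => Ok(()),
    }
}
```

Keeping `None` as an explicit arm, rather than folding it into a default value, makes the "informational" case visible at the call site.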

Contract

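New contracts/apr-validate-quality-threshold-v1.yaml, with three equations (implemented_denominator, implemented_score_pct, threshold_gate_on_implemented) and three falsifiers: None when all checks are stubbed, 100 when all implemented checks pass, and no bare report.total_score < 50 comparison in the gate.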

Tests

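4 new unit tests on ValidationReport, all pass:

  • test_implemented_score_pct_none_when_all_stubbed
  • test_implemented_score_pct_100_when_all_pass
  • test_implemented_score_pct_mixed
  • test_implemented_score_pct_below_threshold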

End-to-end verification

$ apr validate /home/noah/models/qwen2.5-coder-1.5b-instruct-q4k.apr --quality --strict
[3/3 implemented checks pass, 22 stubbed]
TOTAL: 3/100  Grade: F   ← informational display unchanged
$ echo $?
0   ← was 5 ("Score 3/100 (below 50% threshold)")

The display still shows the 3/100 aspirational score (so users know how much of the QA checklist is implemented), but the threshold gate no longer fails on stubbed checks.

Out of scope

The "--strict is not yet implemented" warning at validate.rs:294,454 remains; it is tracked for follow-up.

Test plan

  • 4 new unit tests pass
  • FALSIFY-VALIDATE-QUALITY-{001,002,003} all PASS
  • Contract YAML parses
  • Real reproducer: 1.5B Q4K APR exits 0 (was 5)
  • apr qa remains the canonical pass/fail gate (unchanged)
  • CI: workspace-test, fmt, contracts, deny

🤖 Generated with Claude Code

…1866)

`apr validate --quality --strict` returned `Grade F 3/100 exit=5` on
every valid APR file, including models that `apr qa` says ✓ ALL GATES
PASSED and that produce correct inference. Root cause: 22 of 25 checks
in the 100-point QA checklist are stubbed `Skip("Not implemented")`, but
the pass/fail gate compared `total_score` against the full 100-point
denominator. Working models were thus mathematically incapable of
clearing 50/100 until every stub was filled in.

Fix: gate on the percentage of *implemented* (non-Skip) checks instead
of the aspirational 100-point ceiling.

New methods on `ValidationReport`:
- `implemented_max()` — count of checks whose status is not `Skip`
- `implemented_score_pct() -> Option<f64>` — pct of implemented checks
  that passed; `None` when all are stubbed (treat as informational)

Threshold gate in `validate.rs`: when `implemented_score_pct() == None`
(fully stubbed) the suite is informational, not a hard fail. When it
returns `Some(pct)`, the < 50% gate fires with an error message that
names the implemented denominator, not the aspirational 100.

Contract: new `contracts/apr-validate-quality-threshold-v1.yaml`:
- equation `implemented_denominator`
- equation `implemented_score_pct`
- equation `threshold_gate_on_implemented`
- 3 falsifiers (None-when-all-stubbed, 100-when-all-pass, no bare
  `report.total_score < 50` comparison in the gate).
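The third falsifier is a source-level check: the gate must not contain the old comparison. One way such a check might be written is sketched below. The function name and the line-by-line scanning strategy are assumptions; the contract YAML itself is not shown here and may implement the falsifier differently (for example with a grep command).

```rust
/// Returns true if `source` still contains a bare
/// `report.total_score < 50` comparison outside of `//` comments.
/// Whitespace is ignored, so `report.total_score<50` also matches.
pub fn has_bare_total_score_gate(source: &str) -> bool {
    source.lines().any(|line| {
        // Drop a trailing `//` comment so prose about the old gate is not flagged.
        let code = line.split("//").next().unwrap_or("");
        let compact: String = code.chars().filter(|c| !c.is_whitespace()).collect();
        compact.contains("report.total_score<50")
    })
}
```

This is a plain substring match, so it would also flag a comparison such as `report.total_score < 500`; a stricter check would need a real parser.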

Tests: 4 new unit tests on `ValidationReport`:
- `test_implemented_score_pct_none_when_all_stubbed`
- `test_implemented_score_pct_100_when_all_pass` (matches #1866 repro)
- `test_implemented_score_pct_mixed`
- `test_implemented_score_pct_below_threshold`
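The test bodies are not shown in this commit message. The sketch below is a guess at what each test name implies, written against a simplified free function instead of the real `ValidationReport`; the `Status` enum, the function signature, and the specific check counts in the mixed and below-threshold cases are assumptions.

```rust
/// Simplified check status (hypothetical; the real type carries messages).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Status {
    Pass,
    Fail,
    Skip,
}

/// Percentage of non-skipped checks that passed, or `None` if all are skipped.
pub fn implemented_score_pct(checks: &[Status]) -> Option<f64> {
    let implemented = checks.iter().filter(|s| **s != Status::Skip).count();
    if implemented == 0 {
        return None;
    }
    let passed = checks.iter().filter(|s| **s == Status::Pass).count();
    Some(passed as f64 * 100.0 / implemented as f64)
}

#[cfg(test)]
mod tests {
    use super::implemented_score_pct;
    use super::Status::{Fail, Pass, Skip};

    #[test]
    fn test_implemented_score_pct_none_when_all_stubbed() {
        assert_eq!(implemented_score_pct(&[Skip; 25]), None);
    }

    #[test]
    fn test_implemented_score_pct_100_when_all_pass() {
        // Mirrors the #1866 reproducer: 3 implemented checks pass, 22 are stubbed.
        let mut checks = vec![Pass; 3];
        checks.extend(vec![Skip; 22]);
        assert_eq!(implemented_score_pct(&checks), Some(100.0));
    }

    #[test]
    fn test_implemented_score_pct_mixed() {
        // 2 of 3 implemented checks pass: above the 50% threshold.
        let pct = implemented_score_pct(&[Pass, Pass, Fail, Skip]).unwrap();
        assert!((pct - 200.0 / 3.0).abs() < 1e-9);
    }

    #[test]
    fn test_implemented_score_pct_below_threshold() {
        // 1 of 3 implemented checks pass: below the 50% threshold.
        let pct = implemented_score_pct(&[Pass, Fail, Fail, Skip]).unwrap();
        assert!(pct < 50.0);
    }
}
```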

Verified end-to-end:
    $ apr validate /home/noah/models/qwen2.5-coder-1.5b-instruct-q4k.apr \
        --quality --strict
    [3/3 implemented checks pass, 22 stubbed]
    TOTAL: 3/100  Grade: F   ← informational display unchanged
    $ echo $?
    0   ← was 5 (was: "Score 3/100 (below 50% threshold)")

`apr qa` remains the canonical pass/fail gate per CLAUDE.md; `apr
validate --quality` complements it as a structural-integrity audit.

Note: `--strict is not yet implemented` warning at validate.rs:294,454
remains for follow-up (separate ticket).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 22, 2026 07:10
noahgift added a commit that referenced this pull request May 22, 2026
…itstream-io) (#1878)

`cargo deny check advisories` started failing on every PR (and on main)
2026-05-22 with:

    error[unmaintained]: core2 is unmaintained, all versions yanked
    ├ ID: RUSTSEC-2026-0105
    ├ Advisory: https://rustsec.org/advisories/RUSTSEC-2026-0105

The dep is pulled in transitively via `bitstream-io` (image/media decoding
stack — `cargo tree` shows `bitstream-io v4.9.0 → core2 v0.4.0`). No
first-party use; no drop-in replacement until upstream `bitstream-io`
migrates off core2.

This commit unblocks the in-flight PR cascade (#1867 #1868 #1870 #1873
#1875 #1876) which all failed CI's `ci / lint` step on this advisory.
The deny entry is structured per the existing pattern in this file (id +
human reason mentioning the transitive path) so revisiting the ignore in
6-12 months is straightforward.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 22, 2026
…ly cron (#1875)

Adds an end-to-end "Qwen story" that exercises every core apr command
group against the Qwen scale ladder (0.5B → 1.5B → 7B → 30B-MoE). The
story is the single canonical demo in README.md AND a regression gate
via runnable script + falsification contract + nightly cron.

## Beats

1. **Discover** (Registry) — pull, list
2. **Trust** (QA) — qa, validate, lint
3. **Explore** (Inspection) — inspect, tensors, tree
4. **Adapt** (Model ops) — export, diff, convert/quantize
5. **Use** (Inference) — run, chat, code
6. **Serve** (REST) — serve run + curl /v1/chat/completions OpenAI-compat
7. **Operate** (Profiling) — profile, gpu, serve plan (7B Q4K GGUF)
8. **Scale** (MoE) — inspect, tensors on 30B-MoE qwen3moe

## Pmat bug-hunt layer

When run with `PMAT_HUNT=1` (default), each beat emits a structured
manifest of high-risk untested code in the command-handler modules it
just exercised:

    -- pmat bug-hunt manifest (run chat code) --
        gap   crates/apr-cli/src/commands/run.rs:resolve_model_alias (impact=42.3)
        churn crates/apr-cli/src/commands/code.rs:dispatch_agent (commits=11)
        fault crates/aprender-serve/src/api/cuda_chat_backend.rs:try_qwen3_moe (unwrap)

The nightly cron uploads this manifest as an artifact, compares against
the previous successful run, and opens (or comments on) a tracking issue
when growth exceeds 5 lines — so untested branches in command handlers
can't accumulate quietly.
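The growth rule described above can be sketched as a comparison of entry counts between two manifests. The workflow's actual comparison logic is not shown in this commit message, so everything below is an assumption: the function names, the choice to ignore blank lines and `--` header lines, and counting growth as a simple difference in entry counts. Only the "more than 5 lines" threshold comes from the text.

```rust
/// Growth (in manifest lines) above which a tracking issue is filed.
pub const MAX_MANIFEST_GROWTH_LINES: usize = 5;

/// Count manifest entries, skipping blank lines and `-- ... --` header lines.
pub fn manifest_entries(manifest: &str) -> usize {
    manifest
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty() && !line.starts_with("--"))
        .count()
}

/// True when the current manifest has grown by more than the allowed
/// number of entries relative to the previous successful run.
/// A manifest that shrank or stayed flat never triggers an issue.
pub fn should_file_issue(previous: &str, current: &str) -> bool {
    let growth = manifest_entries(current).saturating_sub(manifest_entries(previous));
    growth > MAX_MANIFEST_GROWTH_LINES
}
```

A count-based comparison is cheap but cannot tell whether entries were replaced; a stricter version would diff the entry sets.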

## Files

- `scripts/qwen-story.sh` (336 LOC) — runnable story with proper exit-code
  capture (`OUT=$(cmd); EC=$?` everywhere; no pipe-then-`$?` per memory rule)
- `contracts/qwen-story-v1.yaml` — 3 equations + 8 falsifiers, all PASS
  locally (script exists+executable, 8 beats, run_cmd helper, pmat_hunt
  per beat, README link, daily cron file, bashrs clean, Beat 7 skips
  `apr qa` on 7B Q4K due to #1864)
- `README.md` — new `## A Qwen story` section replacing the flat
  `## CLI examples` block. Fixes two README bugs surfaced during dogfood:
  `apr profile --roofline` (no such flag; just `apr profile <file>`)
  and `apr bench --assert-tps` (flag is on `apr qa`, not `bench`).
- `.github/workflows/qwen-story-daily.yml` — self-hosted GPU runner,
  04:17 UTC cron + workflow_dispatch, uploads pmat manifest + story log
  artifacts, files tracking issue when story regresses or manifest grows.

## Verification

    $ bash scripts/qwen-story.sh   # local smoke
    -- Beat 1: Discover (Registry) --
    ✓ PASS  B1 list
    -- Beat 2: Trust (QA gates) --
    ✓ PASS  B2 apr qa
    ✗ FAIL  B2 apr validate --quality  -  exit=5 (after #1866 fix this should be 0)
    -- Beat 3: Explore (Inspection) --
    ✓ PASS  B3 apr inspect --json (arch=qwen2)
    ✓ PASS  B3 apr tensors --json (339 tensors)
    ✓ PASS  B3 apr tree
    -- Beat 4: Adapt (Model ops) --
    ✗ FAIL  B4 apr export  -  PANIC (exit=101)  -  #1865 regression
    -- Beat 5: Use (Inference) --
    ✓ PASS  B5 apr run (Rust code completion)
    ✓ PASS  B5 apr code -p
    -- Beat 6: Serve (REST API) --
    ✓ PASS  B6 apr serve run (port=22915)
    ✓ PASS  B6 /v1/chat/completions (got OK...)
    -- Beat 7: Operate (Profiling) --
    ✓ PASS  B7 apr profile
    ✓ PASS  B7 apr gpu --json
    ✓ PASS  B7 apr serve plan -- 7B VRAM budget
    -- Beat 8: Scale (MoE introspection) --
    ✓ PASS  B8 apr inspect --json (arch=qwen3moe)
    ✓ PASS  B8 apr tensors --json (579 tensors)
    14 PASS / 2 FAIL / 0 SKIP

The 2 FAILs are EXPECTED until the in-flight fixes land:
- B2 validate --quality: closed by #1870
- B4 export panic: closed by #1868

Once those PRs merge, this story will be 16 PASS / 0 FAIL / 0 SKIP on a
host with all 4 Qwen models cached.

## Follow-up

A separate PR will add `/dogfood` Gate 18 that invokes this script (kept
separate to avoid conflict with PR #1872 which is already adding Gates
13-17 to the dogfood skill).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift merged commit 41b1a99 into main May 22, 2026
10 checks passed
@noahgift noahgift deleted the fix/apr-validate-quality-threshold-1866 branch May 22, 2026 12:14

Development

Successfully merging this pull request may close these issues.

apr validate --quality: 22/25 checks 'Pending — Not implemented' → working models score 3/100, exit 5
