Phase 0: per-tier composite eval + first GuitarSet baseline #11
Open
pgil256 wants to merge 9 commits into
added 9 commits
May 19, 2026 14:25
First Phase 0 chunk per docs/plans/2026-05-13-tab-f1-phase-0-implementation.md §1.1.
Foundations for the composite-eval workflow; no production behavior changes.
- tabvision.eval.parsers.registry: ParserFn protocol + register_parser / get_parser / list_parsers. Each source-specific annotation format gets a parser that registers itself at import time; composite-eval dispatches by Manifest.clip.annotation_format.
- tabvision.eval.parsers.guitarset_jams: thin wrapper exposing the existing tabvision.eval.guitarset_audio.parse_guitarset_jams under the new uniform interface. No logic duplication.
- tabvision.eval.bootstrap: bootstrap_ci() returning a BootstrapResult (statistic, lower, upper, n_observations, n_bootstrap, confidence). Implements the per-tier acceptance gate from the strategy doc §5 (lower_95_CI >= target, not just mean >= target).
- 21 unit tests, all passing. Existing test_guitarset_audio_eval.py unchanged and still green. Ruff + mypy clean on the new files.
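The acceptance gate in the bootstrap bullet above (lower_95_CI >= target, not just mean >= target) can be illustrated with a small percentile-bootstrap sketch. The BootstrapResult field names come from the commit message; the function signature, the defaults, the percentile method, and the passes_gate helper are assumptions for illustration and may not match tabvision.eval.bootstrap.

```python
import random
from dataclasses import dataclass
from statistics import fmean
from typing import Sequence


@dataclass(frozen=True)
class BootstrapResult:
    # Field names follow the commit message; everything else is assumed.
    statistic: float
    lower: float
    upper: float
    n_observations: int
    n_bootstrap: int
    confidence: float


def bootstrap_ci(
    values: Sequence[float],
    n_bootstrap: int = 1000,   # assumed default
    confidence: float = 0.95,
    seed: int = 0,             # assumed default
) -> BootstrapResult:
    """Percentile bootstrap CI of the mean (one possible method)."""
    if not values:
        raise ValueError("need at least one observation")
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(fmean(rng.choices(values, k=n)) for _ in range(n_bootstrap))
    alpha = (1.0 - confidence) / 2.0
    lower = means[int(alpha * (n_bootstrap - 1))]
    upper = means[int((1.0 - alpha) * (n_bootstrap - 1))]
    return BootstrapResult(fmean(values), lower, upper, n, n_bootstrap, confidence)


def passes_gate(result: BootstrapResult, target: float) -> bool:
    # Hypothetical helper: the gate compares the lower CI bound, not the mean.
    return result.lower >= target
```

With a small, noisy sample such as [0.95, 0.95, 0.95, 0.6] the mean (0.8625) clears a 0.85 target while the lower bound does not, which is the case the gate is meant to catch.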
…tar-techs parser
Phase 0 items 1-2 per docs/plans/2026-05-13-tab-f1-phase-0-implementation.md.
Manifest (tabvision/tabvision/eval/manifest.py):
- Add 'annotation_format' to REQUIRED_CLIP_FIELDS so composite-eval can route each clip to the correct parser via the registry.
- Add SYNTHETIC_SOURCE_PREFIXES + cross-contamination guard: clips whose source starts with 'synthtab/', 'dadagp/', or 'synthetic/' are rejected in 'validation' and 'test' splits. Permitted in 'train'. Implements R8 from the strategy doc §7.
Guitar-TECHS parser (tabvision/tabvision/eval/parsers/guitar_techs_midi.py):
- Parses 6-track MIDI (one track per string, low E first) into list[TabEvent] via pretty_midi. Per-string fret derived from MIDI pitch minus open-string pitch. Drops out-of-range frets.
- Optional 'track_to_string' kwarg for releases with a different ordering. Default = identity (low E = 0, high E = 5).
- 9 unit tests using pretty_midi-built fixtures; importorskip when pretty_midi not installed.
Updated manifest placeholder TOML schema with annotation_format and synthetic-source guard documentation. 4 new manifest validator tests.
All 15 new tests pass; existing test_eval_manifest.py / test_parsers_registry.py still green. Ruff + mypy clean.
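The fret derivation described for the Guitar-TECHS parser (MIDI pitch minus open-string pitch, out-of-range frets dropped) can be sketched without pretty_midi. This is a minimal illustration that assumes standard tuning and a 24-fret limit; the function names, the MAX_FRET value, and the plain-list input are hypothetical, not the parser's actual interface.

```python
from typing import Optional, Sequence

# Standard-tuning open-string MIDI pitches, low E first: E2 A2 D3 G3 B3 E4.
OPEN_STRING_MIDI = (40, 45, 50, 55, 59, 64)
MAX_FRET = 24  # assumption: the real parser's upper bound may differ


def fret_for(pitch_midi: int, string_index: int, max_fret: int = MAX_FRET) -> Optional[int]:
    """Return the fret for a pitch on a string, or None if out of range."""
    fret = pitch_midi - OPEN_STRING_MIDI[string_index]
    return fret if 0 <= fret <= max_fret else None


def frets_from_tracks(
    track_pitches: Sequence[Sequence[int]],
    track_to_string: Optional[dict] = None,
) -> list:
    """Map per-track pitches to (string, fret) pairs, dropping invalid frets."""
    if track_to_string is None:
        # Identity mapping: track 0 is low E (string 0), track 5 is high E.
        track_to_string = {i: i for i in range(6)}
    events = []
    for track_index, pitches in enumerate(track_pitches):
        string_index = track_to_string[track_index]
        for pitch in pitches:
            fret = fret_for(pitch, string_index)
            if fret is not None:
                events.append((string_index, fret))
    return events
```

Passing a reversed track_to_string mapping covers the case of a release whose tracks are ordered high E first.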
Phase 0 item 3 per docs/plans/2026-05-13-tab-f1-phase-0-implementation.md.
Six-bucket decomposition matching the apr-28 methodology in tabvision-server/tools/outputs/errors-2026-04-28_185743.md, ported to operate on v1 §8 TabEvent lists:
- correct: string + fret + onset all match within tolerance
- wrong_position_same_pitch: pitch matches, position doesn't
- pitch_off: onset matches but pitch and position differ
- timing_only: pos or pitch matches outside strict tolerance but within extended tolerance
- missed_onset: gold event with no nearby predicted event
- extra_detection: predicted event unmatched by either pass
(The seventh apr-28 bucket, muted_undetectable, needs a muted/X flag the v1 TabEvent contract does not yet carry; deferred.)
Two-pass greedy matcher prioritizes (a) strict-tolerance closest onset, then (b) extended-tolerance pos-or-pitch match for timing_only. share_of_loss() returns per-bucket percentages of recoverable loss. aggregate_decompositions() sums per-track decompositions for the per-tier rollup that composite.py will produce.
16 unit tests covering each bucket in isolation, the mixed scenario, share-of-loss math, aggregation, and edge cases (multiple gold at same time, greedy onset-closest selection, invalid tolerances). Ruff + mypy clean.
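The share-of-loss and aggregation math mentioned above can be sketched over plain bucket counts. This assumes that "recoverable loss" means every non-correct bucket and that decompositions are simple name-to-count mappings; the real share_of_loss() and aggregate_decompositions() may define both differently.

```python
from collections import Counter
from typing import Iterable, Mapping

# Assumption: all five non-correct buckets count as recoverable loss.
LOSS_BUCKETS = (
    "wrong_position_same_pitch",
    "pitch_off",
    "timing_only",
    "missed_onset",
    "extra_detection",
)


def share_of_loss(counts: Mapping[str, int]) -> dict:
    """Per-bucket percentage of the total loss; 'correct' is excluded."""
    total = sum(counts.get(bucket, 0) for bucket in LOSS_BUCKETS)
    if total == 0:
        return {bucket: 0.0 for bucket in LOSS_BUCKETS}
    return {bucket: 100.0 * counts.get(bucket, 0) / total for bucket in LOSS_BUCKETS}


def aggregate_decompositions(decompositions: Iterable[Mapping[str, int]]) -> dict:
    """Sum per-track bucket counts into one rollup."""
    rollup = Counter()
    for decomposition in decompositions:
        rollup.update(decomposition)
    return dict(rollup)
```

Summing counts first and computing shares on the rollup weights each track by its number of events, rather than averaging per-track percentages.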
Phase 0 item 4 per docs/plans/2026-05-13-tab-f1-phase-0-implementation.md.
tabvision.eval.composite.run_composite_eval:
- Reads + validates a multi-source manifest, dispatches each clip through the registered parser, runs a user-supplied predictor over the media, and computes onset / pitch / tab F1 + 95% bootstrap CIs per tier plus the 6-bucket error decomposition.
- Predictor is injected so the harness is testable without the heavy audio backend; CLI wires up tabvision.pipeline.run_pipeline.
- Train-split clips skipped by default (DEFAULT_EVAL_SPLITS = validation + test).
- CompositeReport.tab_f1_acceptance(targets) classifies each tier as pass / gap / fail / missing based on the lower_95_CI >= target gate from strategy doc §5.
tabvision.eval.metrics: added public event_f1() + EventF1Result for onset-only and onset+pitch matching. The private _score_event_f1 in guitarset_audio is left untouched (Phase 0 ground rule: no production behavior changes).
11 integration smoke tests covering perfect predictor (all tiers pass), shifted predictor (wrong_position_same_pitch dominates), train-split skipping, manifest validation failures, parser-format lookup failures, TABVISION_DATA_ROOT substitution via env + function arg, empty gold edge case, and the acceptance helper. Ruff + mypy clean.
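For readers unfamiliar with event-level F1, the difference between onset-only and onset+pitch matching can be shown with a small sketch. The Note type, the 50 ms tolerance, the greedy closest-onset pairing, and the zero result for empty inputs are assumptions; this is not the signature or matching strategy of the real event_f1().

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Note:
    # Hypothetical stand-in for a tab event; only onset and pitch are needed here.
    onset_s: float
    pitch_midi: int


@dataclass(frozen=True)
class F1Result:
    precision: float
    recall: float
    f1: float


def event_f1_sketch(
    gold: Sequence[Note],
    pred: Sequence[Note],
    tolerance_s: float = 0.05,   # assumed tolerance
    match_pitch: bool = False,
) -> F1Result:
    """Greedy one-to-one matching within an onset tolerance, then P/R/F1."""
    used = set()
    true_positives = 0
    for g in gold:
        best = None
        for i, p in enumerate(pred):
            if i in used or abs(p.onset_s - g.onset_s) > tolerance_s:
                continue
            if match_pitch and p.pitch_midi != g.pitch_midi:
                continue
            # Keep the unused candidate whose onset is closest to the gold onset.
            if best is None or abs(p.onset_s - g.onset_s) < abs(pred[best].onset_s - g.onset_s):
                best = i
        if best is not None:
            used.add(best)
            true_positives += 1
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    return F1Result(precision, recall, 2 * precision * recall / denom if denom else 0.0)
```

A prediction with the right onset but the wrong pitch counts toward onset F1 and against pitch F1, which is why the two scores can diverge on the same clip.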
Phase 0 item 5 per docs/plans/2026-05-13-tab-f1-phase-0-implementation.md.
tabvision.eval.composite:
- DEFAULT_TIER_TARGETS = {0.85/0.90/0.87/0.80} from SPEC §1.4.1.
- format_baseline_markdown(report, targets, ...) renders the per-tier
baseline table with pass/gap/fail/missing status, per-source
breakdown, and methodology footer per Phase 0 impl plan §4.1.
- format_decomposition_markdown(report) renders the aggregate +
per-tier 7-bucket (currently 6) error breakdown per §4.2.
- make_run_pipeline_predictor(...) wraps tabvision.pipeline.run_pipeline
with lazy import — composite-eval --help works without the
audio-highres extras installed.
- main() — argparse CLI exposed as 'tabvision-composite-eval'.
Supports --backend, --position-prior (or 'none'), --melodic-prior,
--enable-video, --bootstrap-{n,seed}, --onset-tolerance-s,
--splits, --media-root, --annotation-root, --eval-harness-sha.
Single run can emit both the baseline and decomposition reports
via --decomposition-output, so the separate decompose_tab_errors.py
script listed in the Phase 0 plan is consolidated into this one CLI.
tabvision/scripts/eval/composite_eval.py: 5-line shim that invokes
the module's main().
7 unit tests on the formatters: required sections, pass/gap/fail/missing
classification, methodology fields, decomposition aggregate sums,
default-target coverage. All 20 composite tests + 73 Phase 0 eval tests
pass. Ruff + mypy clean.
Phase 0 item 6a per docs/plans/2026-05-13-tab-f1-phase-0-implementation.md.
tabvision.eval.manifest_builder:
- scan_guitarset(root, validation_player): discovers <root>/annotation/*.jams paired with <root>/audio_mono-mic/*_mic.wav; maps _comp/_solo suffix to clean_acoustic_strummed/single_line tier.
- scan_guitar_techs(root): stub returning [] until the dataset is acquired and its on-disk layout is verified.
- apply_limits(entries, max_clips_per_tier, total_limit): deterministic per-tier cap + total cap, sorted by clip id first so re-runs produce byte-stable output.
- build_manifest(splits=...): full pipeline; supports filtering by split so smoke runs target the validation set directly.
- render_toml(entries, header_comment): TOML output with proper escaping and a generated-by header.
- _refuse_synthetic_in_eval_splits: pre-write guard mirroring the validator's R8 cross-contamination check.
- main() CLI: --guitarset, --guitar-techs, --output, --splits, --max-clips-per-tier, --limit. Returns rc=1 on no clips, rc=2 on validation failure, rc=0 on success.
tabvision/scripts/eval/build_composite_manifest.py: thin CLI shim.
Hygiene pass per PR feedback:
- manifest.toml schema comment now lists guitar_techs_midi alongside guitarset_jams under 'known formats'.
- Error-decomposition framing in composite.py and error_decomposition.py now uses 'six-bucket port of the apr-28 7-bucket harness' instead of '7-bucket' (we only populate 6; muted_undetectable is deferred).
- composite.py and manifest_builder.py both gain if __name__ == '__main__' blocks so 'python -m tabvision.eval.composite' and 'python -m tabvision.eval.manifest_builder' invoke main() cleanly.
20 manifest-builder tests pass (scan, limits, render, summarise, build_manifest, --splits filter, end-to-end CLI). Full Phase 0 test suite still green. Ruff + mypy clean.
Smoke-validated against on-disk GuitarSet: --max-clips-per-tier 2 --splits validation produces a 4-clip manifest that the composite eval CLI processes end-to-end via the real highres backend + guitarset-v1 prior, emitting baseline + decomposition reports with sensible numbers (strummed Tab F1 ~0.75, single-line ~0.29 on this tiny sample).
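The deterministic capping described for apply_limits (sort by clip id first, then per-tier cap, then total cap) can be sketched as below. The Entry type, its field names, and the tier labels used in the usage note are placeholders; the order in which the real function applies the two caps is an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Entry:
    # Hypothetical manifest entry; the real one carries more fields.
    clip_id: str
    tier: str


def apply_limits(
    entries: Iterable[Entry],
    max_clips_per_tier: Optional[int] = None,
    total_limit: Optional[int] = None,
) -> list:
    """Cap entries per tier, then overall, independent of input order."""
    # Sorting first makes the selection a pure function of the entry set.
    ordered = sorted(entries, key=lambda e: e.clip_id)
    kept = []
    per_tier = defaultdict(int)
    for entry in ordered:
        if max_clips_per_tier is not None and per_tier[entry.tier] >= max_clips_per_tier:
            continue
        per_tier[entry.tier] += 1
        kept.append(entry)
    if total_limit is not None:
        kept = kept[:total_limit]
    return kept
```

With two tiers present and max_clips_per_tier=2, this yields four clips regardless of filesystem scan order, which is consistent with the 4-clip smoke manifest mentioned above.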
Closes the Phase 0 acceptance gate for the 2 tiers reachable from on-disk data (clean acoustic single-line + strummed via GuitarSet held-out validation). Clean electric and distorted electric remain 'missing' pending Guitar-TECHS / EGDB acquisition.
Matcher fix (tabvision/tabvision/eval/error_decomposition.py):
- decompose_errors() now uses priority-based selection within each onset tolerance window: same (string, fret) > same pitch_midi > onset-closest. Previously a greedy onset-only matcher mis-paired chord-cluster events whose on-the-wire ordering differed from ground truth, inflating pitch_off on strummed (3387 → 486 with the fix). event_f1's pitch-matching semantics are now mirrored in the decomposition.
- Added test_chord_cluster_priority_pitch_over_onset and test_chord_cluster_priority_falls_back_to_position_match_then_pitch to lock the new behavior.
Reports (docs/EVAL_REPORTS/*):
- composite_baseline_2026-05-13.md: first artifact under SPEC §1.4.1: per-tier Tab F1 + Onset/Pitch F1 + 95% bootstrap CI + pass/gap/fail/missing status. Headline: both covered tiers FAIL by ~25-35 pp (single-line mean 0.5076, strummed 0.6708).
- tab_f1_error_decomposition_2026-05-13.md: companion 6-bucket breakdown. Headline: wrong_position_same_pitch dominates loss on every tier: 77% of single-line, 50% of strummed, 57% aggregate. Confirms the strategy doc §2 diagnostic.
Eval manifest (tabvision/data/eval/composite.toml):
- 60 player-05 validation clips, byte-stable output of the manifest builder. Strummed and single-line tiers fully covered.
LICENSES.md:
- GuitarSet: marked '✅ used for 2026-05-13 baseline'.
- Guitar-TECHS: added as planned acquisition (CC-BY-4.0).
- EGDB: status updated; author email pending.
- GOAT: marked ❌ DROPPED (request-only research-only).
- SynthTab: marked ❌ DROPPED from default pipeline (CC-BY-NC-4.0).
- User clips: marked ⛔ banned per D10.
- DadaGP: marked research/dev only; not in default pipeline.
DECISIONS.md: single 2026-05-13 entry summarising D1-D11 from the design plan, with per-tier targets table and the 2026-05-13 baseline numbers inlined so the decision record stands alone.
104 tests pass; ruff + mypy clean.
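The priority order from the matcher fix (same (string, fret) > same pitch_midi > onset-closest, within the onset tolerance window) can be sketched as a ranked candidate pick. The SketchEvent type, the pick_match helper, and the 50 ms tolerance are hypothetical; the real decompose_errors() is a two-pass matcher and is not reproduced here.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Set


@dataclass(frozen=True)
class SketchEvent:
    # Hypothetical stand-in for the v1 TabEvent contract.
    onset_s: float
    string: int
    fret: int
    pitch_midi: int


def pick_match(
    gold: SketchEvent,
    candidates: Sequence[SketchEvent],
    used: Set[int],
    tolerance_s: float = 0.05,  # assumed tolerance
) -> Optional[int]:
    """Return the index of the best unused candidate, or None if none is in range."""
    best_key = None
    best_index = None
    for i, cand in enumerate(candidates):
        dt = abs(cand.onset_s - gold.onset_s)
        if i in used or dt > tolerance_s:
            continue
        if (cand.string, cand.fret) == (gold.string, gold.fret):
            rank = 0  # exact position match wins
        elif cand.pitch_midi == gold.pitch_midi:
            rank = 1  # same pitch at another position
        else:
            rank = 2  # fall back to onset proximity only
        key = (rank, dt)  # onset distance breaks ties within a rank
        if best_key is None or key < best_key:
            best_key, best_index = key, i
    return best_index
```

In a strummed chord, several predicted notes land within a few milliseconds of each gold note, so picking by onset distance alone can pair a gold note with a neighbouring string's note and register a spurious pitch_off.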
…ording
Three small fixes flagged in review of the Phase 0 baseline:
(a) Portable manifest. tabvision.eval.manifest_builder now accepts
--data-root PATH; render_toml rewrites media/annotation paths
that fall under that root as '$TABVISION_DATA_ROOT/<rest>'. The
composite-eval CLI already expanded that token via env var or
--media-root/--annotation-root, so checked-in manifests are now
portable across developer machines. Re-generated
tabvision/data/eval/composite.toml with the new flag so the
committed manifest no longer carries /home/gilhooleyp/... paths.
+3 unit tests covering the rewrite + the no-data-root path.
(b) Real SHA in the baseline report. The 'Eval-harness SHA' field
in docs/EVAL_REPORTS/composite_baseline_2026-05-13.md now cites
2ec4849 (the commit that landed both the baseline and the
chord-cluster matcher fix), instead of the ad-hoc
'354571b-matcher-fix' label used at run time.
(c) Stale '7-bucket' wording cleared in the planning docs and one
test docstring. The implementation is a six-bucket port; only
references to the original apr-28 7-bucket harness keep the
historical name.
Verification ran in WSL:
- ruff: passes on changed files.
- mypy: clean on the 8 Phase 0 eval source files (parsers/, bootstrap,
error_decomposition, composite, manifest_builder). Broader
tabvision-wide mypy hits older Phase 5 diagnostics not in this PR's
scope.
- 107 tests pass across the focused Phase 0 + existing eval suite.
No production behavior change; the manifest still resolves to the
same 60 player-05 validation clips.
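The portable-path round trip from fix (a) can be sketched as two small helpers: one that rewrites paths under the data root to the $TABVISION_DATA_ROOT token, and one that expands the token again from an explicit argument or the environment. The helper names and the POSIX-path assumption are hypothetical; the actual logic lives in render_toml and the composite-eval CLI.

```python
import os
from pathlib import PurePosixPath
from typing import Optional

TOKEN = "$TABVISION_DATA_ROOT"


def to_portable(path: str, data_root: str) -> str:
    """Rewrite a path under data_root as '$TABVISION_DATA_ROOT/<rest>'."""
    try:
        rest = PurePosixPath(path).relative_to(PurePosixPath(data_root))
    except ValueError:
        # Path is not under the data root: leave it untouched.
        return path
    return f"{TOKEN}/{rest}"


def expand(path: str, data_root: Optional[str] = None) -> str:
    """Expand the token using an explicit root, else the environment variable."""
    root = data_root if data_root is not None else os.environ.get("TABVISION_DATA_ROOT")
    if root is None or not path.startswith(TOKEN + "/"):
        return path
    return root.rstrip("/") + path[len(TOKEN):]
```

A manifest written on one machine with to_portable can then be resolved on another by setting TABVISION_DATA_ROOT or passing the root explicitly.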
Summary
Phase 0 of the Tab F1 per-tier acceptance plan (strategy doc: docs/plans/2026-05-12-tab-f1-to-spec-design.md, Phase 0 impl plan: docs/plans/2026-05-13-tab-f1-phase-0-implementation.md). Establishes the multi-source composite eval harness and produces the first per-tier baseline against on-disk GuitarSet (2 of 4 tiers covered).

Two artifacts are the headline.

docs/EVAL_REPORTS/composite_baseline_2026-05-13.md: per-tier Tab F1 + 95% bootstrap CIs + pass/gap/fail/missing status. Onset F1 (0.94 / 0.92) and Pitch F1 (0.93 / 0.90) clear SPEC on both covered tiers. Audio is at spec; only Tab F1 is short.

docs/EVAL_REPORTS/tab_f1_error_decomposition_2026-05-13.md: six-bucket breakdown. wrong_position_same_pitch dominates every covered tier: 77.5% of single-line loss, 49.7% of strummed, 57.3% aggregate. Confirms the strategy doc §2 diagnostic: audio is at spec; the gap is string/fret assignment, worst on single-line.

Matcher fix worth flagging (commit 9a7e957): the first full run had pitch_off = 53.5% of loss, far above 1 − Pitch F1. Diagnosed as a chord-cluster mispairing in the greedy-by-onset matcher. decompose_errors now uses priority-based matching (same position > same pitch > onset-closest), mirroring event_f1. The re-run dropped pitch_off to 11.8%, which matches the math. Locked in by two new tests (test_chord_cluster_priority_*).

What's in the diff
Infrastructure under tabvision/tabvision/eval/:
- parsers/{registry,guitarset_jams,guitar_techs_midi}: uniform parser interface; auto-registered on import. Composite-eval dispatches by Manifest.clip.annotation_format.
- manifest.py: extended with required annotation_format field + SYNTHETIC_IN_EVAL_SPLIT guard (R8 from strategy doc §7).
- bootstrap.py: bootstrap CI helper. lower_95_CI >= target is the acceptance gate per strategy doc §5.
- error_decomposition.py: six-bucket port of the apr-28 7-bucket harness on §8 TabEvent (the seventh bucket, muted_undetectable, is deferred until TabEvent carries a muted/X flag).
- composite.py: run_composite_eval() + format_baseline_markdown() + format_decomposition_markdown() + tabvision-composite-eval CLI.
- manifest_builder.py: discovers on-disk datasets + emits portable TOML via --data-root (rewrites paths under that root as $TABVISION_DATA_ROOT/<rest>).

Plus tabvision/data/eval/composite.toml: checked-in portable manifest covering the 60 player-05 validation clips.

Documentation:
- LICENSES.md: Guitar-TECHS (planned), GOAT (dropped, request-only), SynthTab (dropped, CC-BY-NC), DadaGP (research-only, not in default pipeline), personal clips (banned per D10), EGDB (license-pending email).
- docs/DECISIONS.md: single 2026-05-13 entry inventorying D1–D11 with the baseline numbers inlined.

Verification
107 focused tests pass:
Lint + types:
    ruff check tabvision/eval/parsers tabvision/eval/bootstrap.py \
        tabvision/eval/error_decomposition.py tabvision/eval/manifest.py \
        tabvision/eval/composite.py tabvision/eval/manifest_builder.py
    mypy tabvision/eval/parsers tabvision/eval/bootstrap.py \
        tabvision/eval/error_decomposition.py tabvision/eval/composite.py \
        tabvision/eval/manifest_builder.py
    # both clean

Out of scope (Phase 0 user actions, don't gate this PR)
Per Phase 0 impl plan §8, three things only you can do; without them, the electric-tier rows stay missing:

Test plan
impl/tab-f1-phase-1 from main and start Phase 1 (pitch ceiling lift: cheap moves, ~2-3 days local CPU).