Phase 0: per-tier composite eval + first GuitarSet baseline by pgil256 · Pull Request #11 · pgil256/tab_vision

pgil256 · 2026-05-19T18:26:40Z

Summary

Phase 0 of the Tab F1 per-tier acceptance plan (strategy doc: docs/plans/2026-05-12-tab-f1-to-spec-design.md, Phase 0 impl plan: docs/plans/2026-05-13-tab-f1-phase-0-implementation.md). Establishes the multi-source composite eval harness and produces the first per-tier baseline against on-disk GuitarSet (2 of 4 tiers covered).

Two artifacts are the headline.

docs/EVAL_REPORTS/composite_baseline_2026-05-13.md — per-tier Tab F1 + 95% bootstrap CIs + pass/gap/fail/missing status:

Tier	Tab F1 mean	Lower-95 CI	Target	Status
clean_acoustic_single_line	0.5076	0.4448	0.85	fail
clean_acoustic_strummed	0.6708	0.6015	0.90	fail
clean_electric	—	—	0.87	missing (pending Guitar-TECHS)
distorted_electric	—	—	0.80	missing (pending EGDB)

Onset F1 (0.94 / 0.92) and Pitch F1 (0.93 / 0.90) clear SPEC on both covered tiers — audio is at spec; only Tab F1 is short.

docs/EVAL_REPORTS/tab_f1_error_decomposition_2026-05-13.md — six-bucket breakdown. wrong_position_same_pitch dominates every covered tier: 77.5% of single-line loss, 49.7% of strummed, 57.3% aggregate. Confirms the strategy doc §2 diagnostic: audio is at spec; the gap is string/fret assignment, worst on single-line.

Matcher-fix worth flagging (commit 9a7e957): first full run had pitch_off = 53.5% of loss, way above 1 − Pitch F1. Diagnosed as a chord-cluster mispairing in the greedy-by-onset matcher. decompose_errors now uses priority-based matching (same pos > same pitch > onset-closest), mirroring event_f1. Re-run dropped pitch_off to 11.8% — matches the math. Locked in by two new tests (test_chord_cluster_priority_*).
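The priority rule described above can be sketched as follows. This is a minimal illustration, not the code in error_decomposition.py: the `Event` fields, the `pick_match` helper, and the 50 ms tolerance are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    # Hypothetical stand-in for TabEvent; field names are assumptions.
    onset_s: float
    string_idx: int
    fret: int
    pitch_midi: int


def pick_match(gold: Event, candidates: list[Event],
               tolerance_s: float = 0.05) -> Optional[Event]:
    """Pick one predicted event for a gold event inside the onset window.

    Priority: same (string, fret) > same pitch > onset-closest.
    Ties within a priority level are broken by onset distance.
    """
    in_window = [c for c in candidates
                 if abs(c.onset_s - gold.onset_s) <= tolerance_s]
    if not in_window:
        return None

    def rank(c: Event) -> tuple[int, float]:
        same_pos = (c.string_idx, c.fret) == (gold.string_idx, gold.fret)
        same_pitch = c.pitch_midi == gold.pitch_midi
        level = 0 if same_pos else 1 if same_pitch else 2
        return (level, abs(c.onset_s - gold.onset_s))

    return min(in_window, key=rank)


# A strummed chord: two predictions land within a few ms of the gold onset.
gold = Event(onset_s=1.000, string_idx=0, fret=0, pitch_midi=40)
preds = [
    Event(onset_s=1.001, string_idx=1, fret=0, pitch_midi=45),  # closest onset
    Event(onset_s=1.004, string_idx=0, fret=0, pitch_midi=40),  # same position
]
best = pick_match(gold, preds)
```

An onset-only greedy matcher would pair the gold note with the pitch-45 prediction and count it as pitch_off; the priority rule pairs it with the same-position prediction instead.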

What's in the diff

Infrastructure under tabvision/tabvision/eval/:

parsers/{registry,guitarset_jams,guitar_techs_midi} — uniform parser interface; auto-registered on import. Composite-eval dispatches by Manifest.clip.annotation_format.
manifest.py — extended with required annotation_format field + SYNTHETIC_IN_EVAL_SPLIT guard (R8 from strategy doc §7).
bootstrap.py — bootstrap CI helper. lower_95_CI >= target is the acceptance gate per strategy doc §5.
error_decomposition.py — six-bucket port of the apr-28 7-bucket harness on §8 TabEvent (the seventh bucket muted_undetectable is deferred until TabEvent carries a muted/X flag).
composite.py — run_composite_eval() + format_baseline_markdown() + format_decomposition_markdown() + tabvision-composite-eval CLI.
manifest_builder.py — discovers on-disk datasets + emits portable TOML via --data-root (rewrites paths under that root as $TABVISION_DATA_ROOT/<rest>).
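To make the lower_95_CI >= target gate concrete, here is a minimal sketch using a percentile bootstrap over per-clip scores. The PR does not say which bootstrap variant bootstrap.py uses, so the percentile method, the function names, and the defaults below are assumptions.

```python
import random
from statistics import mean


def bootstrap_lower_upper(values: list[float], n_bootstrap: int = 1000,
                          confidence: float = 0.95,
                          seed: int = 0) -> tuple[float, float]:
    """Percentile-bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(values)
    # Resample with replacement, take the mean of each resample, then sort.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_bootstrap))
    alpha = (1.0 - confidence) / 2.0
    lo_idx = int(alpha * (n_bootstrap - 1))
    hi_idx = int((1.0 - alpha) * (n_bootstrap - 1))
    return means[lo_idx], means[hi_idx]


def passes_gate(values: list[float], target: float) -> bool:
    """Acceptance gate: the lower CI bound, not the mean, must reach target."""
    lower, _ = bootstrap_lower_upper(values)
    return lower >= target


# Illustrative per-clip scores (not measured data).
scores = [0.91, 0.88, 0.93, 0.79, 0.90, 0.86, 0.95, 0.84]
lower, upper = bootstrap_lower_upper(scores)
```

Gating on the lower bound means a tier with a high mean but wide spread across clips still fails until the estimate tightens.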

Plus tabvision/data/eval/composite.toml — checked-in portable manifest covering the 60 player-05 validation clips.

Documentation:

LICENSES.md — Guitar-TECHS (planned), GOAT (dropped, request-only), SynthTab (dropped, CC-BY-NC), DadaGP (research-only, not in default pipeline), personal clips (banned per D10), EGDB (license-pending email).
docs/DECISIONS.md — single 2026-05-13 entry inventorying D1–D11 with the baseline numbers inlined.

Verification

107 focused tests pass:

cd tabvision && ../tabvision-server/venv/bin/python -m pytest \
  tests/unit/test_bootstrap_ci.py \
  tests/unit/test_parsers_registry.py \
  tests/unit/test_parser_guitar_techs_midi.py \
  tests/unit/test_error_decomposition.py \
  tests/unit/test_eval_manifest.py \
  tests/unit/test_manifest_builder.py \
  tests/unit/test_composite_report_formatting.py \
  tests/unit/test_guitarset_audio_eval.py \
  tests/integration/test_composite_eval_smoke.py
# 107 passed

Lint + types:

ruff check tabvision/eval/parsers tabvision/eval/bootstrap.py \
  tabvision/eval/error_decomposition.py tabvision/eval/manifest.py \
  tabvision/eval/composite.py tabvision/eval/manifest_builder.py
mypy tabvision/eval/parsers tabvision/eval/bootstrap.py \
  tabvision/eval/error_decomposition.py tabvision/eval/composite.py \
  tabvision/eval/manifest_builder.py
# both clean

Note: broader project-wide mypy tabvision is not clean — pre-existing diagnostics in older Phase 5 modules — but the Phase 0 surface added by this PR is.

Out of scope (Phase 0 user actions, don't gate this PR)

Per Phase 0 impl plan §8, three things only you can do; without them, the electric-tier rows stay missing:

Free-tier compute account signups (Lightning Studios / Kaggle / Colab / W&B).
Email the EGDB author per the template in strategy doc §8.2.
Download Guitar-TECHS from Zenodo (~5 GB).

Test plan

Sign off on the per-tier baseline numbers (or push back on what was measured / how).
Sign off on the priority-based matcher logic + the chord-cluster regression tests.
After merge: cut impl/tab-f1-phase-1 from main and start Phase 1 (pitch ceiling lift — cheap moves, ~2-3 days local CPU).

First Phase 0 chunk per docs/plans/2026-05-13-tab-f1-phase-0-implementation.md §1.1. Foundations for the composite-eval workflow; no production behavior changes.

- tabvision.eval.parsers.registry: ParserFn protocol + register_parser / get_parser / list_parsers. Each source-specific annotation format gets a parser that registers itself at import time; composite-eval dispatches by Manifest.clip.annotation_format.
- tabvision.eval.parsers.guitarset_jams: thin wrapper exposing the existing tabvision.eval.guitarset_audio.parse_guitarset_jams under the new uniform interface. No logic duplication.
- tabvision.eval.bootstrap: bootstrap_ci() returning a BootstrapResult (statistic, lower, upper, n_observations, n_bootstrap, confidence). Implements the per-tier acceptance gate from the strategy doc §5 (lower_95_CI >= target, not just mean >= target).
- 21 unit tests, all passing. Existing test_guitarset_audio_eval.py unchanged and still green. Ruff + mypy clean on the new files.
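The self-registering parser pattern described in this commit might look roughly like the sketch below. The function names come from the commit message, but their signatures, the decorator form, and the `demo_format` parser are assumptions for illustration; the real ParserFn is a protocol rather than the plain callable alias used here.

```python
from typing import Any, Callable

# Simplified stand-in for the ParserFn protocol: path in, event list out.
ParserFn = Callable[[str], list[Any]]

_PARSERS: dict[str, ParserFn] = {}


def register_parser(name: str) -> Callable[[ParserFn], ParserFn]:
    """Decorator that records a parser under an annotation-format name."""
    def decorator(fn: ParserFn) -> ParserFn:
        if name in _PARSERS:
            raise ValueError(f"parser already registered for {name!r}")
        _PARSERS[name] = fn
        return fn
    return decorator


def get_parser(name: str) -> ParserFn:
    """Look up a parser; fail loudly with the list of known formats."""
    try:
        return _PARSERS[name]
    except KeyError:
        raise KeyError(
            f"no parser for {name!r}; known formats: {list_parsers()}"
        ) from None


def list_parsers() -> list[str]:
    return sorted(_PARSERS)


# Registration happens as a side effect of defining (importing) the parser.
@register_parser("demo_format")  # hypothetical format name
def parse_demo(path: str) -> list[Any]:
    return [("event-from", path)]


# Dispatch by the clip's annotation_format string.
events = get_parser("demo_format")("clip_001.ann")
```

Because registration is an import-time side effect, adding a new dataset format needs only a new module; the dispatch code stays unchanged.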

…tar-techs parser

Phase 0 items 1-2 per docs/plans/2026-05-13-tab-f1-phase-0-implementation.md.

Manifest (tabvision/tabvision/eval/manifest.py):

- Add 'annotation_format' to REQUIRED_CLIP_FIELDS so composite-eval can route each clip to the correct parser via the registry.
- Add SYNTHETIC_SOURCE_PREFIXES + cross-contamination guard: clips whose source starts with 'synthtab/', 'dadagp/', or 'synthetic/' are rejected in 'validation' and 'test' splits. Permitted in 'train'. Implements R8 from the strategy doc §7.

Guitar-TECHS parser (tabvision/tabvision/eval/parsers/guitar_techs_midi.py):

- Parses 6-track MIDI (one track per string, low E first) into list[TabEvent] via pretty_midi. Per-string fret derived from MIDI pitch minus open-string pitch. Drops out-of-range frets.
- Optional 'track_to_string' kwarg for releases with a different ordering. Default = identity (low E = 0, high E = 5).
- 9 unit tests using pretty_midi-built fixtures; importorskip when pretty_midi not installed.

Updated manifest placeholder TOML schema with annotation_format and synthetic-source guard documentation. 4 new manifest validator tests.

All 15 new tests pass; existing test_eval_manifest.py / test_parsers_registry.py still green. Ruff + mypy clean.
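The fret derivation in the Guitar-TECHS parser can be illustrated without pretty_midi. The sketch below assumes standard tuning and a 24-fret upper bound; the commit does not state the actual range limit, and the tuple-based note format and `notes_to_positions` name are hypothetical.

```python
from typing import Optional

# Open-string MIDI pitches in standard tuning, low E first: E2 A2 D3 G3 B3 E4.
OPEN_STRING_MIDI = (40, 45, 50, 55, 59, 64)
MAX_FRET = 24  # assumption: the real parser's range limit is not stated


def notes_to_positions(
    notes: list[tuple[int, int, float]],
    track_to_string: Optional[dict[int, int]] = None,
) -> list[tuple[float, int, int]]:
    """Map (track_idx, midi_pitch, onset_s) to (onset_s, string_idx, fret).

    Frets outside 0..MAX_FRET are dropped, mirroring the described behavior.
    """
    # Default mapping is identity: track 0 = low E, track 5 = high E.
    mapping = track_to_string or {i: i for i in range(6)}
    positions = []
    for track_idx, pitch, onset_s in notes:
        string_idx = mapping[track_idx]
        fret = pitch - OPEN_STRING_MIDI[string_idx]
        if 0 <= fret <= MAX_FRET:
            positions.append((onset_s, string_idx, fret))
    return positions


notes = [
    (0, 43, 0.0),  # G2 on the low E string: fret 3
    (5, 64, 0.5),  # open high E: fret 0
    (0, 39, 1.0),  # below the open low E: fret -1, dropped
]
positions = notes_to_positions(notes)

# A release that orders tracks high E first would pass a reversed mapping.
reversed_map = {i: 5 - i for i in range(6)}
flipped = notes_to_positions([(0, 64, 0.0)], track_to_string=reversed_map)
```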

Phase 0 item 3 per docs/plans/2026-05-13-tab-f1-phase-0-implementation.md.

Six-bucket decomposition matching the apr-28 methodology in tabvision-server/tools/outputs/errors-2026-04-28_185743.md, ported to operate on v1 §8 TabEvent lists:

- correct: string + fret + onset all match within tolerance
- wrong_position_same_pitch: pitch matches, position doesn't
- pitch_off: onset matches but pitch and position differ
- timing_only: pos or pitch matches outside strict tolerance but within extended tolerance
- missed_onset: gold event with no nearby predicted event
- extra_detection: predicted event unmatched by either pass

(The seventh apr-28 bucket, muted_undetectable, needs a muted/X flag the v1 TabEvent contract does not yet carry; deferred.)

Two-pass greedy matcher prioritizes (a) strict-tolerance closest onset, then (b) extended-tolerance pos-or-pitch match for timing_only. share_of_loss() returns per-bucket percentages of recoverable loss. aggregate_decompositions() sums per-track decompositions for the per-tier rollup that composite.py will produce.

16 unit tests covering each bucket in isolation, the mixed scenario, share-of-loss math, aggregation, and edge cases (multiple gold at same time, greedy onset-closest selection, invalid tolerances). Ruff + mypy clean.
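The share-of-loss arithmetic can be sketched as below. It assumes that "recoverable loss" means the sum of every bucket except correct, which the commit message does not spell out; the dict-based representation and the counts are illustrative, not measured data.

```python
# Every bucket except "correct" is treated as loss (an assumption).
LOSS_BUCKETS = (
    "wrong_position_same_pitch",
    "pitch_off",
    "timing_only",
    "missed_onset",
    "extra_detection",
)


def aggregate(decompositions: list[dict[str, int]]) -> dict[str, int]:
    """Sum per-track bucket counts into one per-tier rollup."""
    totals: dict[str, int] = {}
    for decomposition in decompositions:
        for bucket, count in decomposition.items():
            totals[bucket] = totals.get(bucket, 0) + count
    return totals


def share_of_loss(counts: dict[str, int]) -> dict[str, float]:
    """Per-bucket percentage of the total loss (0.0 everywhere if no loss)."""
    total = sum(counts.get(bucket, 0) for bucket in LOSS_BUCKETS)
    if total == 0:
        return {bucket: 0.0 for bucket in LOSS_BUCKETS}
    return {bucket: 100.0 * counts.get(bucket, 0) / total
            for bucket in LOSS_BUCKETS}


# Two illustrative tracks.
track_a = {"correct": 30, "wrong_position_same_pitch": 20, "pitch_off": 5,
           "missed_onset": 5}
track_b = {"correct": 20, "wrong_position_same_pitch": 10, "pitch_off": 5,
           "extra_detection": 5}
tier_counts = aggregate([track_a, track_b])
shares = share_of_loss(tier_counts)
```

Here the tier loses 50 events in total, 30 of them to wrong_position_same_pitch, so that bucket's share is 60%.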

Phase 0 item 4 per docs/plans/2026-05-13-tab-f1-phase-0-implementation.md.

tabvision.eval.composite.run_composite_eval:

- Reads + validates a multi-source manifest, dispatches each clip through the registered parser, runs a user-supplied predictor over the media, and computes onset / pitch / tab F1 + 95% bootstrap CIs per tier plus the 6-bucket error decomposition.
- Predictor is injected so the harness is testable without the heavy audio backend; CLI wires up tabvision.pipeline.run_pipeline.
- Train-split clips skipped by default (DEFAULT_EVAL_SPLITS = validation + test).
- CompositeReport.tab_f1_acceptance(targets) classifies each tier as pass / gap / fail / missing based on the lower_95_CI >= target gate from strategy doc §5.

tabvision.eval.metrics: added public event_f1() + EventF1Result for onset-only and onset+pitch matching. The private _score_event_f1 in guitarset_audio is left untouched (Phase 0 ground rule: no production behavior changes).

11 integration smoke tests covering perfect predictor (all tiers pass), shifted predictor (wrong_position_same_pitch dominates), train-split skipping, manifest validation failures, parser-format lookup failures, TABVISION_DATA_ROOT substitution via env + function arg, empty gold edge case, and the acceptance helper. Ruff + mypy clean.
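One plausible reading of the four statuses is sketched below. The PR defines the pass condition (lower_95_CI >= target) but not the boundary between gap and fail; treating gap as "mean reaches target but the lower bound does not" is an assumption, as is the `classify_tier` helper. The numbers are the ones from the baseline table in the PR summary.

```python
from typing import Optional


def classify_tier(mean_f1: Optional[float], lower_ci: Optional[float],
                  target: float) -> str:
    """Classify one tier against its Tab F1 target."""
    if mean_f1 is None or lower_ci is None:
        return "missing"   # no clips evaluated for this tier
    if lower_ci >= target:
        return "pass"      # the documented acceptance gate
    if mean_f1 >= target:
        return "gap"       # assumed meaning: mean is there, CI is not
    return "fail"


# (mean, lower-95 CI, target) per tier, from the 2026-05-13 baseline table.
rows = {
    "clean_acoustic_single_line": (0.5076, 0.4448, 0.85),
    "clean_acoustic_strummed": (0.6708, 0.6015, 0.90),
    "clean_electric": (None, None, 0.87),
    "distorted_electric": (None, None, 0.80),
}
statuses = {tier: classify_tier(*row) for tier, row in rows.items()}
```

Under this reading the sketch reproduces the statuses reported in the baseline table: fail for both acoustic tiers and missing for both electric tiers.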

Phase 0 item 5 per docs/plans/2026-05-13-tab-f1-phase-0-implementation.md.

tabvision.eval.composite:

- DEFAULT_TIER_TARGETS = {0.85/0.90/0.87/0.80} from SPEC §1.4.1.
- format_baseline_markdown(report, targets, ...) renders the per-tier baseline table with pass/gap/fail/missing status, per-source breakdown, and methodology footer per Phase 0 impl plan §4.1.
- format_decomposition_markdown(report) renders the aggregate + per-tier 7-bucket (currently 6) error breakdown per §4.2.
- make_run_pipeline_predictor(...) wraps tabvision.pipeline.run_pipeline with lazy import, so composite-eval --help works without the audio-highres extras installed.
- main(): argparse CLI exposed as 'tabvision-composite-eval'. Supports --backend, --position-prior (or 'none'), --melodic-prior, --enable-video, --bootstrap-{n,seed}, --onset-tolerance-s, --splits, --media-root, --annotation-root, --eval-harness-sha. Single run can emit both the baseline and decomposition reports via --decomposition-output, so the separate decompose_tab_errors.py script listed in the Phase 0 plan is consolidated into this one CLI.

tabvision/scripts/eval/composite_eval.py: 5-line shim that invokes the module's main().

7 unit tests on the formatters: required sections, pass/gap/fail/missing classification, methodology fields, decomposition aggregate sums, default-target coverage. All 20 composite tests + 73 Phase 0 eval tests pass. Ruff + mypy clean.

Phase 0 item 6a per docs/plans/2026-05-13-tab-f1-phase-0-implementation.md.

tabvision.eval.manifest_builder:

- scan_guitarset(root, validation_player): discovers <root>/annotation/*.jams paired with <root>/audio_mono-mic/*_mic.wav; maps _comp/_solo suffix to clean_acoustic_strummed/single_line tier.
- scan_guitar_techs(root): stub returning [] until the dataset is acquired and its on-disk layout is verified.
- apply_limits(entries, max_clips_per_tier, total_limit): deterministic per-tier cap + total cap, sorted by clip id first so re-runs produce byte-stable output.
- build_manifest(splits=...): full pipeline; supports filtering by split so smoke runs target the validation set directly.
- render_toml(entries, header_comment): TOML output with proper escaping and a generated-by header.
- _refuse_synthetic_in_eval_splits: pre-write guard mirroring the validator's R8 cross-contamination check.
- main() CLI: --guitarset, --guitar-techs, --output, --splits, --max-clips-per-tier, --limit. Returns rc=1 on no clips, rc=2 on validation failure, rc=0 on success.

tabvision/scripts/eval/build_composite_manifest.py: thin CLI shim.

Hygiene pass per PR feedback:

- manifest.toml schema comment now lists guitar_techs_midi alongside guitarset_jams under 'known formats'.
- Error-decomposition framing in composite.py and error_decomposition.py now uses 'six-bucket port of the apr-28 7-bucket harness' instead of '7-bucket' (we only populate 6; muted_undetectable is deferred).
- composite.py and manifest_builder.py both gain if __name__ == '__main__' blocks so 'python -m tabvision.eval.composite' and 'python -m tabvision.eval.manifest_builder' invoke main() cleanly.

20 manifest-builder tests pass (scan, limits, render, summarise, build_manifest, --splits filter, end-to-end CLI). Full Phase 0 test suite still green. Ruff + mypy clean.
Smoke-validated against on-disk GuitarSet: --max-clips-per-tier 2 --splits validation produces a 4-clip manifest that the composite eval CLI processes end-to-end via the real highres backend + guitarset-v1 prior, emitting baseline + decomposition reports with sensible numbers (strummed Tab F1 ~0.75, single-line ~0.29 on this tiny sample).
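The deterministic capping behind --max-clips-per-tier and --limit can be sketched as follows. The argument names match the commit message, but the dict-based entries with `id` and `tier` keys and the body are assumptions for illustration.

```python
from typing import Optional


def apply_limits(entries: list[dict], max_clips_per_tier: Optional[int] = None,
                 total_limit: Optional[int] = None) -> list[dict]:
    """Cap clips per tier, then overall, in a run-to-run stable order."""
    # Sorting by clip id first makes the result independent of scan order.
    ordered = sorted(entries, key=lambda entry: entry["id"])
    kept: list[dict] = []
    per_tier: dict[str, int] = {}
    for entry in ordered:
        tier = entry["tier"]
        if (max_clips_per_tier is not None
                and per_tier.get(tier, 0) >= max_clips_per_tier):
            continue
        per_tier[tier] = per_tier.get(tier, 0) + 1
        kept.append(entry)
    if total_limit is not None:
        kept = kept[:total_limit]
    return kept


# Hypothetical clip ids; the tier names are the ones used in this PR.
clips = [
    {"id": "05_c_comp", "tier": "clean_acoustic_strummed"},
    {"id": "05_a_solo", "tier": "clean_acoustic_single_line"},
    {"id": "05_b_comp", "tier": "clean_acoustic_strummed"},
    {"id": "05_a_comp", "tier": "clean_acoustic_strummed"},
    {"id": "05_b_solo", "tier": "clean_acoustic_single_line"},
    {"id": "05_c_solo", "tier": "clean_acoustic_single_line"},
]
smoke = apply_limits(clips, max_clips_per_tier=2)
smoke_reordered = apply_limits(list(reversed(clips)), max_clips_per_tier=2)
```

With a cap of 2 per tier this yields a 4-clip selection, and feeding the same clips in a different order selects the same four.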

Closes the Phase 0 acceptance gate for the 2 tiers reachable from on-disk data (clean acoustic single-line + strummed via GuitarSet held-out validation). Clean electric and distorted electric remain 'missing' pending Guitar-TECHS / EGDB acquisition.

Matcher fix (tabvision/tabvision/eval/error_decomposition.py):

- decompose_errors() now uses priority-based selection within each onset tolerance window: same (string, fret) > same pitch_midi > onset-closest. Previously a greedy onset-only matcher mis-paired chord-cluster events whose on-the-wire ordering differed from ground truth, inflating pitch_off on strummed (3387 → 486 with the fix). event_f1's pitch-matching semantics are now mirrored in the decomposition.
- Added test_chord_cluster_priority_pitch_over_onset and test_chord_cluster_priority_falls_back_to_position_match_then_pitch to lock the new behavior.

Reports (docs/EVAL_REPORTS/*):

- composite_baseline_2026-05-13.md: first artifact under SPEC §1.4.1: per-tier Tab F1 + Onset/Pitch F1 + 95% bootstrap CI + pass/gap/fail/missing status. Headline: both covered tiers FAIL by ~25-35 pp (single-line mean 0.5076, strummed 0.6708).
- tab_f1_error_decomposition_2026-05-13.md: companion 6-bucket breakdown. Headline: wrong_position_same_pitch dominates loss on every tier: 77% of single-line, 50% of strummed, 57% aggregate. Confirms the strategy doc §2 diagnostic.

Eval manifest (tabvision/data/eval/composite.toml):

- 60 player-05 validation clips, byte-stable output of the manifest builder. Strummed and single-line tiers fully covered.

LICENSES.md:

- GuitarSet: marked '✅ used for 2026-05-13 baseline'.
- Guitar-TECHS: added as planned acquisition (CC-BY-4.0).
- EGDB: status updated; author email pending.
- GOAT: marked ❌ DROPPED (request-only research-only).
- SynthTab: marked ❌ DROPPED from default pipeline (CC-BY-NC-4.0).
- User clips: marked ⛔ banned per D10.
- DadaGP: marked research/dev only; not in default pipeline.

DECISIONS.md: single 2026-05-13 entry summarising D1-D11 from the design plan, with per-tier targets table and the 2026-05-13 baseline numbers inlined so the decision record stands alone.

104 tests pass; ruff + mypy clean.

…ording

Three small fixes flagged in review of the Phase 0 baseline:

(a) Portable manifest. tabvision.eval.manifest_builder now accepts --data-root PATH; render_toml rewrites media/annotation paths that fall under that root as '$TABVISION_DATA_ROOT/<rest>'. The composite-eval CLI already expanded that token via env var or --media-root/--annotation-root, so checked-in manifests are now portable across developer machines. Re-generated tabvision/data/eval/composite.toml with the new flag so the committed manifest no longer carries /home/gilhooleyp/... paths. +3 unit tests covering the rewrite + the no-data-root path.

(b) Real SHA in the baseline report. The 'Eval-harness SHA' field in docs/EVAL_REPORTS/composite_baseline_2026-05-13.md now cites 2ec4849 (the commit that landed both the baseline and the chord-cluster matcher fix), instead of the ad-hoc '354571b-matcher-fix' label used at run time.

(c) Stale '7-bucket' wording cleared in the planning docs and one test docstring. The implementation is a six-bucket port; only references to the original apr-28 7-bucket harness keep the historical name.

Verification ran in WSL:

- ruff: passes on changed files.
- mypy: clean on the 8 Phase 0 eval source files (parsers/, bootstrap, error_decomposition, composite, manifest_builder). Broader tabvision-wide mypy hits older Phase 5 diagnostics not in this PR's scope.
- 107 tests pass across the focused Phase 0 + existing eval suite.

No production behavior change; the manifest still resolves to the same 60 player-05 validation clips.
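The portable-path round trip (rewrite on manifest build, expand on eval) can be sketched as below. The $TABVISION_DATA_ROOT token and the env-var fallback come from the PR; the `to_portable` and `expand` helpers, the POSIX-only path handling, and the example paths are assumptions for illustration.

```python
import os
from pathlib import PurePosixPath
from typing import Optional

TOKEN = "$TABVISION_DATA_ROOT"


def to_portable(path: str, data_root: Optional[str]) -> str:
    """Rewrite a path under data_root as '$TABVISION_DATA_ROOT/<rest>'."""
    if data_root is None:
        return path  # no --data-root given: leave paths untouched
    try:
        rest = PurePosixPath(path).relative_to(PurePosixPath(data_root))
    except ValueError:
        return path  # path is outside the data root
    return f"{TOKEN}/{rest}"


def expand(path: str, data_root: Optional[str] = None) -> str:
    """Replace the token with an explicit root, else the env var."""
    root = (data_root if data_root is not None
            else os.environ.get("TABVISION_DATA_ROOT"))
    if root is None or not path.startswith(TOKEN + "/"):
        return path
    return str(PurePosixPath(root) / path[len(TOKEN) + 1:])


# Hypothetical local layout on the machine that builds the manifest.
local = "/data/guitarset/audio_mono-mic/clip_mic.wav"
portable = to_portable(local, "/data")

# Another developer resolves the same manifest entry under a different root.
resolved = expand(portable, data_root="/mnt/datasets")
```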

Patrick Gilhooley added 9 commits May 19, 2026 14:25

chore(eval): re-point baseline report SHA to post-rebase 9a7e957

1dc3c87

pgil256 mentioned this pull request May 19, 2026

Plan: Phase 1 implementation (pitch ceiling lift) — blocked on #11 #12

Open

5 tasks

pgil256 wants to merge 9 commits into main from impl/tab-f1-phase-0
