Phase 0: per-tier composite eval + first GuitarSet baseline#11

Open
pgil256 wants to merge 9 commits into main from impl/tab-f1-phase-0

@pgil256 pgil256 commented May 19, 2026

Summary

Phase 0 of the Tab F1 per-tier acceptance plan (strategy doc: docs/plans/2026-05-12-tab-f1-to-spec-design.md, Phase 0 impl plan: docs/plans/2026-05-13-tab-f1-phase-0-implementation.md). Establishes the multi-source composite eval harness and produces the first per-tier baseline against on-disk GuitarSet (2 of 4 tiers covered).

Two artifacts are the headline.

docs/EVAL_REPORTS/composite_baseline_2026-05-13.md — per-tier Tab F1 + 95% bootstrap CIs + pass/gap/fail/missing status:

| Tier | Tab F1 mean | Lower-95 CI | Target | Status |
| --- | --- | --- | --- | --- |
| clean_acoustic_single_line | 0.5076 | 0.4448 | 0.85 | fail |
| clean_acoustic_strummed | 0.6708 | 0.6015 | 0.90 | fail |
| clean_electric | | | 0.87 | missing (pending Guitar-TECHS) |
| distorted_electric | | | 0.80 | missing (pending EGDB) |

Onset F1 (0.94 / 0.92) and Pitch F1 (0.93 / 0.90) clear SPEC on both covered tiers — audio is at spec; only Tab F1 is short.

docs/EVAL_REPORTS/tab_f1_error_decomposition_2026-05-13.md — six-bucket breakdown. wrong_position_same_pitch dominates every covered tier: 77.5% of single-line loss, 49.7% of strummed, 57.3% aggregate. Confirms the strategy doc §2 diagnostic: audio is at spec; the gap is string/fret assignment, worst on single-line.

Matcher-fix worth flagging (commit 9a7e957): first full run had pitch_off = 53.5% of loss, way above 1 − Pitch F1. Diagnosed as a chord-cluster mispairing in the greedy-by-onset matcher. decompose_errors now uses priority-based matching (same pos > same pitch > onset-closest), mirroring event_f1. Re-run dropped pitch_off to 11.8% — matches the math. Locked in by two new tests (test_chord_cluster_priority_*).
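
The priority order can be illustrated with a small sketch. This is not the `decompose_errors` code: the `Event` tuple, the `pick_match` helper, and the 50 ms window are assumptions chosen for the example, and standard tuning with low E as string 0 is assumed for the pitch values.

```python
from typing import NamedTuple, Optional, Sequence


class Event(NamedTuple):
    # Hypothetical stand-in for TabEvent; field names are assumptions.
    onset_s: float
    string: int
    fret: int
    pitch_midi: int


def pick_match(
    gold: Event, candidates: Sequence[Event], tol_s: float = 0.05
) -> Optional[int]:
    """Return the index of the best candidate for `gold`, or None.

    Priority inside the onset window:
    same (string, fret) > same pitch > closest onset.
    """
    best_key = None
    best_idx = None
    for i, cand in enumerate(candidates):
        dt = abs(cand.onset_s - gold.onset_s)
        if dt > tol_s:
            continue  # outside the tolerance window, never a match
        if (cand.string, cand.fret) == (gold.string, gold.fret):
            rank = 0
        elif cand.pitch_midi == gold.pitch_midi:
            rank = 1
        else:
            rank = 2
        key = (rank, dt)  # onset distance only breaks ties within a rank
        if best_key is None or key < best_key:
            best_key, best_idx = key, i
    return best_idx


# A strummed chord: the prediction closest in time is a different note.
gold = Event(1.000, string=2, fret=2, pitch_midi=52)
preds = [
    Event(1.001, string=1, fret=2, pitch_midi=47),  # closest onset, other note
    Event(1.010, string=1, fret=7, pitch_midi=52),  # same pitch, other position
]
chosen = pick_match(gold, preds)  # index 1, not the onset-closest index 0
```

An onset-only greedy matcher would pair `gold` with `preds[0]` and count the result as `pitch_off`; ranking by position and pitch first pairs it with `preds[1]`, which lands in `wrong_position_same_pitch`.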

What's in the diff

Infrastructure under tabvision/tabvision/eval/:

  • parsers/{registry,guitarset_jams,guitar_techs_midi} — uniform parser interface; auto-registered on import. Composite-eval dispatches by Manifest.clip.annotation_format.
  • manifest.py — extended with required annotation_format field + SYNTHETIC_IN_EVAL_SPLIT guard (R8 from strategy doc §7).
  • bootstrap.py — bootstrap CI helper. lower_95_CI >= target is the acceptance gate per strategy doc §5.
  • error_decomposition.py — six-bucket port of the apr-28 7-bucket harness on §8 TabEvent (the seventh bucket muted_undetectable is deferred until TabEvent carries a muted/X flag).
  • composite.py: run_composite_eval() + format_baseline_markdown() + format_decomposition_markdown() + tabvision-composite-eval CLI.
  • manifest_builder.py — discovers on-disk datasets + emits portable TOML via --data-root (rewrites paths under that root as $TABVISION_DATA_ROOT/<rest>).
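
To make the registry-and-dispatch idea concrete, here is a minimal sketch of a format-keyed parser registry. The function names match the ones listed in the commit notes, but the signatures, the decorator form, and the stub parser are assumptions for illustration rather than the actual `tabvision.eval.parsers.registry` API.

```python
from typing import Callable, Dict, List

# Assumed parser shape: annotation path in, list of event dicts out.
ParserFn = Callable[[str], List[dict]]

_PARSERS: Dict[str, ParserFn] = {}


def register_parser(fmt: str) -> Callable[[ParserFn], ParserFn]:
    """Decorator that registers a parser under an annotation_format key."""

    def decorator(fn: ParserFn) -> ParserFn:
        if fmt in _PARSERS:
            raise ValueError(f"parser already registered for {fmt!r}")
        _PARSERS[fmt] = fn
        return fn

    return decorator


def get_parser(fmt: str) -> ParserFn:
    """Look up the parser for a clip's annotation_format."""
    try:
        return _PARSERS[fmt]
    except KeyError:
        known = ", ".join(sorted(_PARSERS)) or "<none>"
        raise KeyError(f"no parser for {fmt!r}; known formats: {known}") from None


def list_parsers() -> List[str]:
    return sorted(_PARSERS)


# Registration happens as a side effect of importing the parser module.
@register_parser("guitarset_jams")
def parse_jams_stub(path: str) -> List[dict]:
    # Placeholder body; a real parser would read the JAMS file at `path`.
    return []


# Dispatch as a composite eval might do it for one manifest clip.
clip = {"annotation_format": "guitarset_jams", "annotation": "clip.jams"}
events = get_parser(clip["annotation_format"])(clip["annotation"])
```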

Plus tabvision/data/eval/composite.toml — checked-in portable manifest covering the 60 player-05 validation clips.

Documentation:

  • LICENSES.md — Guitar-TECHS (planned), GOAT (dropped, request-only), SynthTab (dropped, CC-BY-NC), DadaGP (research-only, not in default pipeline), personal clips (banned per D10), EGDB (license-pending email).
  • docs/DECISIONS.md — single 2026-05-13 entry inventorying D1–D11 with the baseline numbers inlined.

Verification

107 focused tests pass:

cd tabvision && ../tabvision-server/venv/bin/python -m pytest \
  tests/unit/test_bootstrap_ci.py \
  tests/unit/test_parsers_registry.py \
  tests/unit/test_parser_guitar_techs_midi.py \
  tests/unit/test_error_decomposition.py \
  tests/unit/test_eval_manifest.py \
  tests/unit/test_manifest_builder.py \
  tests/unit/test_composite_report_formatting.py \
  tests/unit/test_guitarset_audio_eval.py \
  tests/integration/test_composite_eval_smoke.py
# 107 passed

Lint + types:

ruff check tabvision/eval/parsers tabvision/eval/bootstrap.py \
  tabvision/eval/error_decomposition.py tabvision/eval/manifest.py \
  tabvision/eval/composite.py tabvision/eval/manifest_builder.py
mypy tabvision/eval/parsers tabvision/eval/bootstrap.py \
  tabvision/eval/error_decomposition.py tabvision/eval/composite.py \
  tabvision/eval/manifest_builder.py
# both clean

Note: broader project-wide mypy tabvision is not clean — pre-existing diagnostics in older Phase 5 modules — but the Phase 0 surface added by this PR is.

Out of scope (Phase 0 user actions, don't gate this PR)

Per Phase 0 impl plan §8, three things only you can do; without them, the electric-tier rows stay missing:

  • Free-tier compute account signups (Lightning Studios / Kaggle / Colab / W&B).
  • Email the EGDB author per the template in strategy doc §8.2.
  • Download Guitar-TECHS from Zenodo (~5 GB).

Test plan

  • Sign off on the per-tier baseline numbers (or push back on what was measured / how).
  • Sign off on the priority-based matcher logic + the chord-cluster regression tests.
  • After merge: cut impl/tab-f1-phase-1 from main and start Phase 1 (pitch ceiling lift — cheap moves, ~2-3 days local CPU).

Patrick Gilhooley added 9 commits May 19, 2026 14:25
First Phase 0 chunk per docs/plans/2026-05-13-tab-f1-phase-0-implementation.md
§1.1. Foundations for the composite-eval workflow; no production
behavior changes.

- tabvision.eval.parsers.registry: ParserFn protocol +
  register_parser / get_parser / list_parsers. Each source-specific
  annotation format gets a parser that registers itself at import
  time; composite-eval dispatches by Manifest.clip.annotation_format.
- tabvision.eval.parsers.guitarset_jams: thin wrapper exposing the
  existing tabvision.eval.guitarset_audio.parse_guitarset_jams under
  the new uniform interface. No logic duplication.
- tabvision.eval.bootstrap: bootstrap_ci() returning a BootstrapResult
  (statistic, lower, upper, n_observations, n_bootstrap, confidence).
  Implements the per-tier acceptance gate from the strategy doc §5
  (lower_95_CI >= target, not just mean >= target).
- 21 unit tests, all passing. Existing test_guitarset_audio_eval.py
  unchanged and still green.

Ruff + mypy clean on the new files.
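
For readers unfamiliar with the gate, a percentile bootstrap over per-clip scores could look like the sketch below. The `BootstrapResult` field names come from the commit note above; the percentile method, the defaults, and the use of the mean as the statistic are assumptions, not a description of the real `bootstrap_ci()`.

```python
import random
from dataclasses import dataclass
from statistics import fmean
from typing import Sequence


@dataclass(frozen=True)
class BootstrapResult:
    statistic: float
    lower: float
    upper: float
    n_observations: int
    n_bootstrap: int
    confidence: float


def bootstrap_ci(
    values: Sequence[float],
    n_bootstrap: int = 1000,
    confidence: float = 0.95,
    seed: int = 0,
) -> BootstrapResult:
    """Percentile bootstrap CI for the mean (assumed method)."""
    if not values:
        raise ValueError("need at least one observation")
    rng = random.Random(seed)  # seeded so reports are reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(fmean(rng.choices(values, k=n)) for _ in range(n_bootstrap))
    alpha = (1.0 - confidence) / 2.0
    lo = means[int(alpha * (n_bootstrap - 1))]
    hi = means[int((1.0 - alpha) * (n_bootstrap - 1))]
    return BootstrapResult(fmean(values), lo, hi, n, n_bootstrap, confidence)


# Illustrative per-clip scores (made up for the example, not from the report).
result = bootstrap_ci([0.42, 0.55, 0.61, 0.48, 0.50], n_bootstrap=500, seed=1)
passes_gate = result.lower >= 0.85  # gate is on the lower bound, not the mean
```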
…tar-techs parser

Phase 0 items 1-2 per docs/plans/2026-05-13-tab-f1-phase-0-implementation.md.

Manifest (tabvision/tabvision/eval/manifest.py):
- Add 'annotation_format' to REQUIRED_CLIP_FIELDS so composite-eval
  can route each clip to the correct parser via the registry.
- Add SYNTHETIC_SOURCE_PREFIXES + cross-contamination guard: clips
  whose source starts with 'synthtab/', 'dadagp/', or 'synthetic/'
  are rejected in 'validation' and 'test' splits. Permitted in
  'train'. Implements R8 from the strategy doc §7.
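
A rough sketch of the guard, assuming a plain prefix test on the clip's source string. The prefixes and split names are taken from the note above; the `check_clip` helper and its error message are hypothetical.

```python
SYNTHETIC_SOURCE_PREFIXES = ("synthtab/", "dadagp/", "synthetic/")
EVAL_SPLITS = frozenset({"validation", "test"})


def check_clip(source: str, split: str) -> None:
    """Raise if a synthetic-source clip appears in an evaluation split."""
    # str.startswith accepts a tuple, so one call covers every prefix.
    if split in EVAL_SPLITS and source.startswith(SYNTHETIC_SOURCE_PREFIXES):
        raise ValueError(
            f"SYNTHETIC_IN_EVAL_SPLIT: source {source!r} not allowed in {split!r}"
        )


check_clip("synthtab/acoustic_001", "train")       # permitted
check_clip("guitarset/05_BN1-129-Eb", "validation")  # real data, permitted
```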

Guitar-TECHS parser (tabvision/tabvision/eval/parsers/guitar_techs_midi.py):
- Parses 6-track MIDI (one track per string, low E first) into
  list[TabEvent] via pretty_midi. Per-string fret derived from
  MIDI pitch minus open-string pitch. Drops out-of-range frets.
- Optional 'track_to_string' kwarg for releases with a different
  ordering. Default = identity (low E = 0, high E = 5).
- 9 unit tests using pretty_midi-built fixtures; importorskip when
  pretty_midi not installed.
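
The per-string fret arithmetic can be shown without any MIDI library. The sketch below assumes standard tuning and a 24-fret upper bound, neither of which is stated in the commit note; `fret_for` is a hypothetical helper, not the parser's actual function.

```python
from typing import Mapping, Optional, Tuple

# Assumption: standard tuning, low E first (E2 A2 D3 G3 B3 E4).
OPEN_STRING_MIDI = (40, 45, 50, 55, 59, 64)
MAX_FRET = 24  # assumed upper bound for "in range"


def fret_for(
    track_index: int,
    midi_pitch: int,
    track_to_string: Optional[Mapping[int, int]] = None,
) -> Optional[Tuple[int, int]]:
    """Map (track, MIDI pitch) to (string, fret), or None if out of range."""
    # Default mapping is identity: track 0 is low E, track 5 is high E.
    string = track_index if track_to_string is None else track_to_string[track_index]
    fret = midi_pitch - OPEN_STRING_MIDI[string]
    if not 0 <= fret <= MAX_FRET:
        return None  # dropped, mirroring "drops out-of-range frets"
    return string, fret


g_on_low_e = fret_for(0, 43)      # G2 on the low E string: fret 3
too_low = fret_for(5, 60)         # C4 is below the open high E: dropped
reversed_order = {i: 5 - i for i in range(6)}
open_high_e = fret_for(0, 64, reversed_order)  # track 0 is high E here
```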

Updated manifest placeholder TOML schema with annotation_format and
synthetic-source guard documentation. 4 new manifest validator tests.
All 15 new tests pass; existing test_eval_manifest.py / test_parsers_registry.py
still green. Ruff + mypy clean.
Phase 0 item 3 per docs/plans/2026-05-13-tab-f1-phase-0-implementation.md.

Six-bucket decomposition matching the apr-28 methodology in
tabvision-server/tools/outputs/errors-2026-04-28_185743.md, ported
to operate on v1 §8 TabEvent lists:

- correct: string + fret + onset all match within tolerance
- wrong_position_same_pitch: pitch matches, position doesn't
- pitch_off: onset matches but pitch and position differ
- timing_only: pos or pitch matches outside strict tolerance but
  within extended tolerance
- missed_onset: gold event with no nearby predicted event
- extra_detection: predicted event unmatched by either pass

(The seventh apr-28 bucket, muted_undetectable, needs a muted/X flag
the v1 TabEvent contract does not yet carry; deferred.)

Two-pass greedy matcher prioritizes (a) strict-tolerance closest
onset, then (b) extended-tolerance pos-or-pitch match for timing_only.
share_of_loss() returns per-bucket percentages of recoverable loss.
aggregate_decompositions() sums per-track decompositions for the
per-tier rollup that composite.py will produce.

16 unit tests covering each bucket in isolation, the mixed scenario,
share-of-loss math, aggregation, and edge cases (multiple gold at
same time, greedy onset-closest selection, invalid tolerances).
Ruff + mypy clean.
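
As a worked illustration of the share-of-loss arithmetic, the sketch below treats every bucket except `correct` as loss and reports each as a percentage of the total. That reading of "recoverable loss" is an assumption, and the plain-dict representation is a simplification of whatever the real decomposition type is.

```python
from typing import Dict, Iterable, Mapping

LOSS_BUCKETS = (
    "wrong_position_same_pitch",
    "pitch_off",
    "timing_only",
    "missed_onset",
    "extra_detection",
)


def share_of_loss(counts: Mapping[str, int]) -> Dict[str, float]:
    """Per-bucket percentage of all non-correct events."""
    total = sum(counts.get(b, 0) for b in LOSS_BUCKETS)
    if total == 0:
        return {b: 0.0 for b in LOSS_BUCKETS}
    return {b: 100.0 * counts.get(b, 0) / total for b in LOSS_BUCKETS}


def aggregate(decomps: Iterable[Mapping[str, int]]) -> Dict[str, int]:
    """Sum per-track bucket counts into one per-tier rollup."""
    out: Dict[str, int] = {"correct": 0, **{b: 0 for b in LOSS_BUCKETS}}
    for d in decomps:
        for bucket in out:
            out[bucket] += d.get(bucket, 0)
    return out


# Two made-up tracks, summed and then converted to shares.
track_a = {"correct": 60, "wrong_position_same_pitch": 20, "pitch_off": 5}
track_b = {"correct": 40, "wrong_position_same_pitch": 10, "missed_onset": 5}
tier = aggregate([track_a, track_b])
shares = share_of_loss(tier)
```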
Phase 0 item 4 per docs/plans/2026-05-13-tab-f1-phase-0-implementation.md.

tabvision.eval.composite.run_composite_eval:
- Reads + validates a multi-source manifest, dispatches each clip
  through the registered parser, runs a user-supplied predictor over
  the media, and computes onset / pitch / tab F1 + 95% bootstrap CIs
  per tier plus the 6-bucket error decomposition.
- Predictor is injected so the harness is testable without the heavy
  audio backend; CLI wires up tabvision.pipeline.run_pipeline.
- Train-split clips skipped by default (DEFAULT_EVAL_SPLITS =
  validation + test).
- CompositeReport.tab_f1_acceptance(targets) classifies each tier as
  pass / gap / fail / missing based on the lower_95_CI >= target gate
  from strategy doc §5.

tabvision.eval.metrics: added public event_f1() + EventF1Result for
onset-only and onset+pitch matching. The private _score_event_f1 in
guitarset_audio is left untouched (Phase 0 ground rule: no production
behavior changes).

11 integration smoke tests covering perfect predictor (all tiers pass),
shifted predictor (wrong_position_same_pitch dominates), train-split
skipping, manifest validation failures, parser-format lookup failures,
TABVISION_DATA_ROOT substitution via env + function arg, empty gold
edge case, and the acceptance helper. Ruff + mypy clean.
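
One plausible reading of the four statuses is sketched below. Only the `lower_95_CI >= target` pass condition is stated in this PR; treating `gap` as "mean clears the target but the lower bound does not" is an assumption made for illustration, and `classify_tier` is a hypothetical helper rather than `tab_f1_acceptance` itself.

```python
from typing import Optional


def classify_tier(
    mean: Optional[float], lower_95: Optional[float], target: float
) -> str:
    """Classify one tier as pass / gap / fail / missing."""
    if mean is None or lower_95 is None:
        return "missing"  # no clips evaluated for this tier
    if lower_95 >= target:
        return "pass"     # the stated acceptance gate
    if mean >= target:
        return "gap"      # assumed meaning: mean ok, CI too wide or too low
    return "fail"


# The two covered tiers from the 2026-05-13 baseline, plus an uncovered one.
single_line = classify_tier(0.5076, 0.4448, 0.85)
strummed = classify_tier(0.6708, 0.6015, 0.90)
clean_electric = classify_tier(None, None, 0.87)
```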
Phase 0 item 5 per docs/plans/2026-05-13-tab-f1-phase-0-implementation.md.

tabvision.eval.composite:
- DEFAULT_TIER_TARGETS = {0.85/0.90/0.87/0.80} from SPEC §1.4.1.
- format_baseline_markdown(report, targets, ...) renders the per-tier
  baseline table with pass/gap/fail/missing status, per-source
  breakdown, and methodology footer per Phase 0 impl plan §4.1.
- format_decomposition_markdown(report) renders the aggregate +
  per-tier 7-bucket (currently 6) error breakdown per §4.2.
- make_run_pipeline_predictor(...) wraps tabvision.pipeline.run_pipeline
  with lazy import — composite-eval --help works without the
  audio-highres extras installed.
- main() — argparse CLI exposed as 'tabvision-composite-eval'.
  Supports --backend, --position-prior (or 'none'), --melodic-prior,
  --enable-video, --bootstrap-{n,seed}, --onset-tolerance-s,
  --splits, --media-root, --annotation-root, --eval-harness-sha.
  Single run can emit both the baseline and decomposition reports
  via --decomposition-output, so the separate decompose_tab_errors.py
  script listed in the Phase 0 plan is consolidated into this one CLI.

tabvision/scripts/eval/composite_eval.py: 5-line shim that invokes
the module's main().

7 unit tests on the formatters: required sections, pass/gap/fail/missing
classification, methodology fields, decomposition aggregate sums,
default-target coverage. All 20 composite tests + 73 Phase 0 eval tests
pass. Ruff + mypy clean.
Phase 0 item 6a per docs/plans/2026-05-13-tab-f1-phase-0-implementation.md.

tabvision.eval.manifest_builder:
- scan_guitarset(root, validation_player) — discovers <root>/annotation/*.jams
  paired with <root>/audio_mono-mic/*_mic.wav; maps _comp/_solo suffix
  to clean_acoustic_strummed/single_line tier.
- scan_guitar_techs(root) — stub returning [] until the dataset is
  acquired and its on-disk layout is verified.
- apply_limits(entries, max_clips_per_tier, total_limit) — deterministic
  per-tier cap + total cap, sorted by clip id first so re-runs produce
  byte-stable output.
- build_manifest(splits=...) — full pipeline; supports filtering by
  split so smoke runs target the validation set directly.
- render_toml(entries, header_comment) — TOML output with proper
  escaping and a generated-by header.
- _refuse_synthetic_in_eval_splits — pre-write guard mirroring the
  validator's R8 cross-contamination check.
- main() CLI: --guitarset, --guitar-techs, --output, --splits,
  --max-clips-per-tier, --limit. Returns rc=1 on no clips, rc=2 on
  validation failure, rc=0 on success.
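
The byte-stable capping behaviour can be sketched as sort first, then cap. The dict-shaped entries with `id` and `tier` keys are an assumption for the example, and applying the per-tier cap before the total cap is one possible order; the real `apply_limits` may differ.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def apply_limits(
    entries: List[dict],
    max_clips_per_tier: Optional[int] = None,
    total_limit: Optional[int] = None,
) -> List[dict]:
    """Deterministically cap clips per tier, then overall."""
    # Sorting by clip id removes any dependence on filesystem scan order.
    ordered = sorted(entries, key=lambda e: e["id"])
    per_tier: Dict[str, int] = defaultdict(int)
    kept: List[dict] = []
    for entry in ordered:
        tier = entry["tier"]
        if max_clips_per_tier is not None and per_tier[tier] >= max_clips_per_tier:
            continue
        per_tier[tier] += 1
        kept.append(entry)
    if total_limit is not None:
        kept = kept[:total_limit]
    return kept


clips = [
    {"id": "c", "tier": "strummed"},
    {"id": "a", "tier": "strummed"},
    {"id": "d", "tier": "single_line"},
    {"id": "b", "tier": "strummed"},
]
capped = apply_limits(clips, max_clips_per_tier=2)
same_again = apply_limits(list(reversed(clips)), max_clips_per_tier=2)
```

Because both calls sort before capping, `capped` and `same_again` contain the same clips in the same order regardless of input order.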

tabvision/scripts/eval/build_composite_manifest.py — thin CLI shim.

Hygiene pass per PR feedback:
- manifest.toml schema comment now lists guitar_techs_midi alongside
  guitarset_jams under 'known formats'.
- Error-decomposition framing in composite.py and error_decomposition.py
  now uses 'six-bucket port of the apr-28 7-bucket harness' instead
  of '7-bucket' (we only populate 6 — muted_undetectable is deferred).
- composite.py and manifest_builder.py both gain if __name__ ==
  '__main__' blocks so 'python -m tabvision.eval.composite' and
  'python -m tabvision.eval.manifest_builder' invoke main() cleanly.

20 manifest-builder tests pass (scan, limits, render, summarise,
build_manifest, --splits filter, end-to-end CLI). Full Phase 0 test
suite still green. Ruff + mypy clean.

Smoke-validated against on-disk GuitarSet: --max-clips-per-tier 2
--splits validation produces a 4-clip manifest that the composite
eval CLI processes end-to-end via the real highres backend +
guitarset-v1 prior, emitting baseline + decomposition reports with
sensible numbers (strummed Tab F1 ~0.75, single-line ~0.29 on this
tiny sample).
Closes the Phase 0 acceptance gate for the 2 tiers reachable from
on-disk data (clean acoustic single-line + strummed via GuitarSet
held-out validation). Clean electric and distorted electric remain
'missing' pending Guitar-TECHS / EGDB acquisition.

Matcher fix (tabvision/tabvision/eval/error_decomposition.py):
- decompose_errors() now uses priority-based selection within each
  onset tolerance window: same (string, fret) > same pitch_midi >
  onset-closest. Previously a greedy onset-only matcher mis-paired
  chord-cluster events whose on-the-wire ordering differed from
  ground truth, inflating pitch_off on strummed (3387 → 486 with
  the fix). event_f1's pitch-matching semantics are now mirrored
  in the decomposition.
- Added test_chord_cluster_priority_pitch_over_onset and
  test_chord_cluster_priority_falls_back_to_position_match_then_pitch
  to lock the new behavior.

Reports (docs/EVAL_REPORTS/*):
- composite_baseline_2026-05-13.md — first artifact under
  SPEC §1.4.1: per-tier Tab F1 + Onset/Pitch F1 + 95% bootstrap CI
  + pass/gap/fail/missing status. Headline: both covered tiers
  FAIL by ~25-35 pp (single-line mean 0.5076, strummed 0.6708).
- tab_f1_error_decomposition_2026-05-13.md — companion 6-bucket
  breakdown. Headline: wrong_position_same_pitch dominates loss
  on every tier — 77% of single-line, 50% of strummed, 57% aggregate.
  Confirms the strategy doc §2 diagnostic.

Eval manifest (tabvision/data/eval/composite.toml):
- 60 player-05 validation clips, byte-stable output of the manifest
  builder. Strummed and single-line tiers fully covered.

LICENSES.md:
- GuitarSet: marked '✅ used for 2026-05-13 baseline'.
- Guitar-TECHS: added as planned acquisition (CC-BY-4.0).
- EGDB: status updated; author email pending.
- GOAT: marked ❌ DROPPED (request-only research-only).
- SynthTab: marked ❌ DROPPED from default pipeline (CC-BY-NC-4.0).
- User clips: marked ⛔ banned per D10.
- DadaGP: marked research/dev only; not in default pipeline.

DECISIONS.md: single 2026-05-13 entry summarising D1-D11 from the
design plan, with per-tier targets table and the 2026-05-13 baseline
numbers inlined so the decision record stands alone.

104 tests pass; ruff + mypy clean.
…ording

Three small fixes flagged in review of the Phase 0 baseline:

(a) Portable manifest. tabvision.eval.manifest_builder now accepts
    --data-root PATH; render_toml rewrites media/annotation paths
    that fall under that root as '$TABVISION_DATA_ROOT/<rest>'. The
    composite-eval CLI already expanded that token via env var or
    --media-root/--annotation-root, so checked-in manifests are now
    portable across developer machines. Re-generated
    tabvision/data/eval/composite.toml with the new flag so the
    committed manifest no longer carries /home/gilhooleyp/... paths.
    +3 unit tests covering the rewrite + the no-data-root path.

(b) Real SHA in the baseline report. The 'Eval-harness SHA' field
    in docs/EVAL_REPORTS/composite_baseline_2026-05-13.md now cites
    2ec4849 (the commit that landed both the baseline and the
    chord-cluster matcher fix), instead of the ad-hoc
    '354571b-matcher-fix' label used at run time.

(c) Stale '7-bucket' wording cleared in the planning docs and one
    test docstring. The implementation is a six-bucket port; only
    references to the original apr-28 7-bucket harness keep the
    historical name.
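
The path rewrite in (a) amounts to a prefix swap on write and the reverse on read. The sketch below assumes POSIX-style paths; `portable_path` and `expand_path` are hypothetical names, not functions from `manifest_builder` or the composite-eval CLI.

```python
from pathlib import PurePosixPath
from typing import Optional

TOKEN = "$TABVISION_DATA_ROOT"


def portable_path(path: str, data_root: Optional[str]) -> str:
    """Rewrite `path` as TOKEN/<rest> when it lies under `data_root`."""
    if data_root is None:
        return path  # no --data-root given: leave paths untouched
    try:
        rest = PurePosixPath(path).relative_to(PurePosixPath(data_root))
    except ValueError:
        return path  # not under the root, keep the absolute path
    return f"{TOKEN}/{rest}"


def expand_path(path: str, data_root: str) -> str:
    """Inverse step at eval time: substitute the local root for the token."""
    prefix = TOKEN + "/"
    if path.startswith(prefix):
        return str(PurePosixPath(data_root) / path[len(prefix):])
    return path


# Example paths are made up for the illustration.
written = portable_path("/data/guitarset/audio/clip_mic.wav", "/data")
resolved = expand_path(written, "/mnt/datasets")
```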

Verification ran in WSL:
- ruff: passes on changed files.
- mypy: clean on the 8 Phase 0 eval source files (parsers/, bootstrap,
  error_decomposition, composite, manifest_builder). Broader
  tabvision-wide mypy hits older Phase 5 diagnostics not in this PR's
  scope.
- 107 tests pass across the focused Phase 0 + existing eval suite.

No production behavior change; the manifest still resolves to the
same 60 player-05 validation clips.