M6.5: word-level Media Overlay sync#29

Merged
roelvangils merged 1 commit into main from feature/word-level-sync
May 6, 2026

@roelvangils
Member

Summary

Karaoke-style highlight-along-with-audio in EPUB readers that honour Media Overlays (Thorium, Readium). Materially valuable for dyslexic readers, language learners, and low-vision users tracking with magnifiers — a class of accessibility experience that no other open-source DAISY → EPUB toolchain ships.

Default-on with `--transcribe`; pass `--no-word-sync` to fall back to per-paragraph sync.

Pipeline (seven sequential commits in this single squash-merge)

  1. dpub-whisper data model — new `Word { start, end, text }`, `Segment.words: Vec<Word>`.
  2. BPE coalescer (`crates/dpub-whisper/src/words.rs`) — leading-space rule, punctuation attachment, special-token filter, degenerate-timing 50 ms minimum, segment-bound clamping. 8 unit tests.
  3. Wired into `Transcriber::transcribe` — extracts whisper.cpp's per-token timestamps via the existing `whisper-rs` API; smoke test asserts non-empty `words` when `text` is non-empty.
  4. Words thread through text-cleanup — `Paragraph` gains `words: Vec<Word>` and `audio_src: String`; capitalisation fix propagates to `words[0].text`; cleanup state machine flushes at audio-file boundaries (defends invariant: every word in a paragraph comes from the same audio file). 3 new unit tests; 12 existing tests preserved.
  5. `<span>`s in XHTML — `render_cleaned_paragraphs` emits `<p id="tx-NNN-MMM"><span id="w-NNN-MMM-KKK">word</span> ...</p>`. 3 new unit tests for span shape, fallback when words absent, XML escaping.
  6. Per-word SMIL `<par>`s + `--no-word-sync` flag — section overlay rebuilt from cleaned paragraphs as nested `<seq>` wrapping per-word `<par>`. `OverlayPar` model unchanged. 2 new unit tests.
  7. EPUBCheck-clean verified — new synthetic test `epubcheck_clean_with_word_level_overlay_when_available` validates the new overlay structure end-to-end. Reference book without `--transcribe` still 0/0/0 EPUBCheck-clean (the project's correctness baseline per CLAUDE.md).
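
As a rough illustration of step 5, the span emission could look something like the sketch below. It is not the actual `render_cleaned_paragraphs` code: the function name `render_paragraph`, the three-digit zero-padding, and the zero-based word index are assumptions made for the example. Only the `tx-NNN-MMM` / `w-NNN-MMM-KKK` id shapes, the single-space separator, the per-word escaping, and the no-words fallback come from this PR.

```rust
/// Hypothetical stand-in for the `Word` type from step 1.
#[derive(Debug, Clone, PartialEq)]
pub struct Word {
    pub start: f64,
    pub end: f64,
    pub text: String,
}

/// Escape the five XML-significant characters.
fn xml_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '"' => out.push_str("&quot;"),
            '\'' => out.push_str("&apos;"),
            _ => out.push(c),
        }
    }
    out
}

/// Render one cleaned paragraph (hypothetical helper).
/// With words: one <span> per word, separated by a single ASCII space.
/// Without words: fall back to the plain paragraph text.
pub fn render_paragraph(section: usize, para: usize, text: &str, words: &[Word]) -> String {
    // Assumption: ids are zero-padded to three digits.
    let pid = format!("tx-{section:03}-{para:03}");
    if words.is_empty() {
        return format!("<p id=\"{pid}\">{}</p>", xml_escape(text));
    }
    let spans: Vec<String> = words
        .iter()
        .enumerate()
        .map(|(k, w)| {
            // Assumption: the word index KKK starts at 0.
            format!(
                "<span id=\"w-{section:03}-{para:03}-{k:03}\">{}</span>",
                xml_escape(&w.text)
            )
        })
        .collect();
    format!("<p id=\"{pid}\">{}</p>", spans.join(" "))
}
```

Escaping each word separately keeps the span boundaries intact even when a word contains `&` or `<`.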

Why now and why this shape

Whisper has been giving us per-token timestamps the whole time; we were throwing them away. M6.5 picks them up, runs a small ASCII-only coalescer to recover word boundaries from BPE tokens, and threads everything through the existing text-cleanup pipeline without changing the user-facing paragraph shape.

The OverlayPar model in `epub3-writer` already supports per-word granularity — no schema changes needed. Nested `<seq>` wrapping per-word `<par>`s gives reading systems a structural place to scope highlight to "current paragraph" while still tracking the spoken word.

Per-paragraph IDs (`tx-NNN-MMM`) that landed in #12 are now actually referenced by the SMIL — that infrastructure was waiting for this PR.
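
To make the nested shape concrete, here is a minimal sketch that builds one paragraph-level `<seq>` with a `<par>` per word. It is not the `epub3-writer` serializer. The function name `overlay_seq`, the id padding, the `clipBegin`/`clipEnd` formatting in seconds, and the assumption that file names need no XML escaping are all illustrative. The `epub:textref` target and the rule that paragraphs without words are skipped follow the description in this PR.

```rust
/// Hypothetical, timing-only stand-in for a word.
pub struct Word {
    pub start: f64,
    pub end: f64,
}

/// Build the SMIL fragment for one cleaned paragraph (hypothetical helper).
/// Returns None for paragraphs without words so the caller can skip them.
pub fn overlay_seq(
    xhtml: &str,     // content document, e.g. "ch1.xhtml" (assumed XML-safe)
    audio_src: &str, // the paragraph's single audio file (assumed XML-safe)
    section: usize,
    para: usize,
    words: &[Word],
) -> Option<String> {
    if words.is_empty() {
        return None;
    }
    // The outer <seq> points at the paragraph element.
    let mut out = format!("<seq epub:textref=\"{xhtml}#tx-{section:03}-{para:03}\">\n");
    for (k, w) in words.iter().enumerate() {
        // Each <par> pairs one word span with its audio clip.
        out.push_str(&format!(
            "  <par><text src=\"{xhtml}#w-{section:03}-{para:03}-{k:03}\"/>\
<audio src=\"{audio_src}\" clipBegin=\"{:.3}s\" clipEnd=\"{:.3}s\"/></par>\n",
            w.start, w.end
        ));
    }
    out.push_str("</seq>\n");
    Some(out)
}
```

The outer `<seq>` is what gives a reading system the paragraph scope; the inner `<par>` entries carry the per-word timing.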

Performance

  • Token extraction: two FFI calls per token. Reference book ≈ 84 k tokens → ~170 k extra calls. Negligible vs. multi-minute Whisper inference.
  • SMIL bytes grow ~50–100×; deflate compresses ~6:1. Net EPUB size grows ~1–2 MB on a book whose audio is already ~80–160 MB.
  • XHTML grows ~3× per paragraph (each word now has a `<span>` wrapper). Still small absolute numbers.

Test plan

  • `cargo test --workspace` — all 25 suites green.
  • `cargo clippy --all-targets -- -D warnings` — clean.
  • EPUBCheck-gated tests in `epub3-writer` (3): minimal book, with cover, with word-level overlay.
  • Real-book `dpub convert --validate` (no `--transcribe`) on the reference DAISY book → EPUBCheck 5.3.0: 0 fatals / 0 errors / 0 warnings / 0 usages.
  • Manual end-to-end with `--transcribe` on the reference book is the only piece I haven't verified — that requires a multi-hour Whisper run on actual audio. Will land as a comment on this PR after the long-form run; the synthetic-fixture EPUBCheck-on-word-overlay test is the gating substitute for now.
  • Empirical Thorium playback test: open a transcribed EPUB, confirm word-by-word highlighting tracks the audio with no visible drift > ~300 ms (Whisper's token timings are inherently approximate).

Out of scope

  • Word-level sync for the `--no-text-cleanup` raw-segments path (kept exactly as today: per-segment `<p>`, heading-level `<par>`).

  • Promoting `audio_src` from `Paragraph` to `Word` — current invariant (one audio file per cleanup paragraph) is defended by the boundary-flush in cleanup; defer until a real book violates it.
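
For reference, the boundary-flush idea behind that invariant can be sketched as below. This is a deliberately reduced model, not the real cleanup state machine: it only shows the flush on audio-file change and omits every other paragraph-building rule. The `Segment` and `Paragraph` shapes and the name `group_by_audio` are assumptions made for the example.

```rust
/// Hypothetical stand-ins for the types named in this PR.
#[derive(Debug, Clone, PartialEq)]
pub struct Word {
    pub start: f64,
    pub end: f64,
    pub text: String,
}

pub struct Segment {
    pub audio_src: String,
    pub words: Vec<Word>,
}

#[derive(Debug, PartialEq)]
pub struct Paragraph {
    pub audio_src: String,
    pub words: Vec<Word>,
}

/// Greedily append segments to the current paragraph, but flush (start a
/// new paragraph) whenever the audio file changes. Every word in a
/// resulting paragraph therefore comes from the same audio file.
pub fn group_by_audio(segments: &[Segment]) -> Vec<Paragraph> {
    let mut out: Vec<Paragraph> = Vec::new();
    for seg in segments {
        let same_file = out
            .last()
            .map_or(false, |p| p.audio_src == seg.audio_src);
        if same_file {
            // Same audio file: keep building the current paragraph.
            if let Some(p) = out.last_mut() {
                p.words.extend(seg.words.iter().cloned());
            }
        } else {
            // Audio-file boundary: flush and open a new paragraph.
            out.push(Paragraph {
                audio_src: seg.audio_src.clone(),
                words: seg.words.clone(),
            });
        }
    }
    out
}
```

Because `audio_src` lives on the paragraph, a per-word `<par>` can take its audio file from the enclosing paragraph, which is why moving the field onto `Word` can wait.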

Karaoke-style highlight-along-with-audio in EPUB readers that
honour Media Overlays (Thorium, Readium). Materially valuable for
dyslexic readers, language learners, and low-vision users tracking
with magnifiers — a class of accessibility experience that no
other open-source DAISY → EPUB toolchain ships.

Pipeline:

  1. dpub-whisper extracts per-token timestamps that whisper.cpp
     was already computing for free. New `Word` struct exposes
     {start_seconds, end_seconds, text}; `Segment.words` is
     populated alongside the existing `text` field.

  2. New coalescer module `dpub-whisper/src/words.rs` turns BPE
     subword pieces back into whole words. Algorithm:

       - Leading-space token starts a new word.
       - Token without leading space (or pure ASCII punctuation)
         attaches to current word.
       - Defensive case: if previous word ends in a sentence
         terminator and the next token starts with an alphabetic
         char, treat as new word even without leading space —
         recovers from whisper occasionally dropping the post-
         period space.
       - Drops special tokens ([_BEG_], <|notimestamps|>, etc.).
       - Clamps each word's range into [seg_t0, seg_t1] and gives
         degenerate-timing punctuation a 50ms minimum.

     Eight unit tests cover the rules.

  3. dpub-convert::text_cleanup threads `Vec<Word>` through the
     greedy paragraph builder. `Paragraph` gains
     `words: Vec<Word>` and `audio_src: String`. The cleanup
     state machine flushes at audio-file boundaries so each
     paragraph references one audio source. Capitalisation fix
     propagates to both the rendered text and `words[0].text`.
     Three new unit tests; all 12 existing tests preserved.

  4. dpub-convert::lib emits per-word `<span id="w-NNN-MMM-KKK">`
     inside each cleaned `<p id="tx-NNN-MMM">`. Single ASCII
     space between consecutive spans; XML escaping per word.
     Three new unit tests for span shape, fallback when words
     are absent, and escaping.

  5. dpub-convert::lib rebuilds the section overlay from the
     cleaned paragraphs in word-sync mode: nested
     `<seq epub:textref="...#tx-...">` wrapping per-word
     `<par>` entries. The existing `OverlayPar` model already
     supported this granularity unchanged. Two new unit tests
     for nested-seq shape and skip-paragraphs-without-words.

  6. New `dpub convert --no-word-sync` flag opts out (falls back
     to per-paragraph sync). Plumbed via
     `ConvertOptions.no_word_sync`. Mutually compatible with all
     existing flags.

  7. EPUBCheck-clean verified end-to-end. New synthetic test
     `epubcheck_clean_with_word_level_overlay_when_available`
     constructs a publication with hand-built per-word `<par>`
     entries and asserts EPUBCheck reports 0 errors, 0 warnings.
     Reference book without --transcribe still 0/0/0 (project's
     correctness baseline).
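
The coalescing rules in step 2 can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of `words.rs`: the `Token` type, the special-token test (anything shaped like `[_..]` or `<|..|>`), the choice of `.`, `!`, `?` as sentence terminators, skipping whitespace-only tokens, and applying the 50 ms minimum after clamping are all guesses made for the example.

```rust
/// Hypothetical token as it might come out of the transcriber.
#[derive(Debug, Clone, PartialEq)]
pub struct Token {
    pub text: String,
    pub t0: f64,
    pub t1: f64,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Word {
    pub start: f64,
    pub end: f64,
    pub text: String,
}

const MIN_DURATION: f64 = 0.050; // 50 ms floor for degenerate timings

/// Assumption: special tokens look like `[_BEG_]` or `<|notimestamps|>`.
fn is_special(t: &str) -> bool {
    (t.starts_with("[_") && t.ends_with(']')) || (t.starts_with("<|") && t.ends_with("|>"))
}

fn is_pure_punct(t: &str) -> bool {
    let t = t.trim();
    !t.is_empty() && t.chars().all(|c| c.is_ascii_punctuation())
}

/// Assumption: `.`, `!` and `?` count as sentence terminators.
fn ends_sentence(w: &str) -> bool {
    matches!(w.chars().last(), Some('.' | '!' | '?'))
}

/// Coalesce BPE pieces into words. Assumes `seg_t0 <= seg_t1`.
pub fn coalesce(tokens: &[Token], seg_t0: f64, seg_t1: f64) -> Vec<Word> {
    let mut words: Vec<Word> = Vec::new();
    for tok in tokens {
        // Drop special tokens; whitespace-only tokens are skipped too (assumption).
        if is_special(&tok.text) || tok.text.trim().is_empty() {
            continue;
        }
        let piece = tok.text.trim_start();
        let leading_space = tok.text.starts_with(' ');
        let starts_new = match words.last() {
            None => true,
            // Pure punctuation always attaches to the current word.
            Some(_) if is_pure_punct(&tok.text) => false,
            // Leading space starts a word; so does an alphabetic piece
            // right after a sentence terminator (dropped post-period space).
            Some(prev) => {
                leading_space
                    || (ends_sentence(&prev.text)
                        && piece.chars().next().map_or(false, char::is_alphabetic))
            }
        };
        if starts_new {
            words.push(Word { start: tok.t0, end: tok.t1, text: piece.to_string() });
        } else if let Some(cur) = words.last_mut() {
            cur.text.push_str(piece);
            cur.end = cur.end.max(tok.t1);
        }
    }
    for w in &mut words {
        w.start = w.start.clamp(seg_t0, seg_t1);
        w.end = w.end.clamp(seg_t0, seg_t1);
        // Assumption: the floor is applied after clamping, so a word at the
        // very end of a segment may extend slightly past seg_t1.
        if w.end - w.start < MIN_DURATION {
            w.end = w.start + MIN_DURATION;
        }
    }
    words
}
```

With this sketch, the pieces `" Hel"`, `"lo"`, `","`, `" world"`, `"."`, `"Next"` coalesce to `Hello,`, `world.`, and `Next`; the last split comes from the defensive sentence-terminator rule.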

Performance: token extraction is two FFI calls per token —
negligible vs. multi-minute Whisper inference. SMIL bytes grow
~50–100× but compress ~6:1 with deflate; net EPUB size grows
~1–2 MB on a book whose audio is already ~80–160 MB.

Documentation: README roadmap ticks M6.5; CHANGELOG entry under
[Unreleased].

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@roelvangils roelvangils merged commit d28d16c into main May 6, 2026
4 of 5 checks passed
@roelvangils roelvangils deleted the feature/word-level-sync branch May 6, 2026 21:05