M6.5: word-level Media Overlay sync by roelvangils · Pull Request #29 · 11ways/dpub

roelvangils · 2026-05-06T20:53:08Z

Summary

Karaoke-style highlight-along-with-audio in EPUB readers that honour Media Overlays (Thorium, Readium). Materially valuable for dyslexic readers, language learners, and low-vision users tracking with magnifiers: a class of accessibility experience that no other open-source DAISY → EPUB toolchain ships.

Default-on with `--transcribe`; pass `--no-word-sync` to fall back to per-paragraph sync.

Pipeline (seven sequential commits in this single squash-merge)

dpub-whisper data model: new `Word { start, end, text }`, `Segment.words: Vec<Word>`.
BPE coalescer (`crates/dpub-whisper/src/words.rs`): leading-space rule, punctuation attachment, special-token filter, degenerate-timing 50 ms minimum, segment-bound clamping. 8 unit tests.
Wired into `Transcriber::transcribe`: extracts whisper.cpp's per-token timestamps via the existing `whisper-rs` API; smoke test asserts non-empty `words` when `text` is non-empty.
Words thread through text-cleanup: `Paragraph` gains `words: Vec<Word>` and `audio_src: String`; capitalisation fix propagates to `words[0].text`; cleanup state machine flushes at audio-file boundaries (defends invariant: every word in a paragraph comes from the same audio file). 3 new unit tests; 12 existing tests preserved.
`<span>`s in XHTML: `render_cleaned_paragraphs` emits `<p id="tx-NNN-MMM"><span id="w-NNN-MMM-KKK">word</span> ...</p>`. 3 new unit tests for span shape, fallback when words absent, XML escaping.
Per-word SMIL `<par>`s + `--no-word-sync` flag: section overlay rebuilt from cleaned paragraphs as nested `<seq>` wrapping per-word `<par>`. `OverlayPar` model unchanged. 2 new unit tests.
EPUBCheck-clean verified: new synthetic test `epubcheck_clean_with_word_level_overlay_when_available` validates the new overlay structure end-to-end. Reference book without `--transcribe` still 0/0/0 EPUBCheck-clean (the project's correctness baseline per CLAUDE.md).
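
For readers who want a feel for the coalescer rules listed above, here is a minimal sketch. It is not the code in `words.rs`: the `Token` struct, the `coalesce` signature, the special-token test, and the order in which the clamp and the 50 ms floor are applied are all assumptions made for illustration.

```rust
/// Hypothetical token shape: text plus timestamps in seconds.
#[derive(Debug, Clone, PartialEq)]
pub struct Token {
    pub text: String,
    pub start: f64,
    pub end: f64,
}

/// Mirrors the `Word { start, end, text }` shape named in the PR.
#[derive(Debug, Clone, PartialEq)]
pub struct Word {
    pub start: f64,
    pub end: f64,
    pub text: String,
}

const MIN_WORD_SECONDS: f64 = 0.050;

/// Assumed test for special tokens such as `[_BEG_]` or `<|notimestamps|>`.
fn is_special(text: &str) -> bool {
    let t = text.trim();
    (t.starts_with("[_") && t.ends_with(']')) || (t.starts_with("<|") && t.ends_with("|>"))
}

fn is_ascii_punctuation(text: &str) -> bool {
    let t = text.trim();
    !t.is_empty() && t.chars().all(|c| c.is_ascii_punctuation())
}

/// Merges BPE pieces into words. Assumes `seg_start <= seg_end`.
pub fn coalesce(tokens: &[Token], seg_start: f64, seg_end: f64) -> Vec<Word> {
    let mut words: Vec<Word> = Vec::new();
    for tok in tokens {
        if is_special(&tok.text) || tok.text.trim().is_empty() {
            continue; // special-token filter
        }
        // Leading-space rule; pure punctuation never starts a word.
        let starts_word = tok.text.starts_with(' ') && !is_ascii_punctuation(&tok.text);
        // Defensive rule: "End." followed by "Next" without a space.
        let after_terminator = words
            .last()
            .map_or(false, |w| matches!(w.text.chars().last(), Some('.' | '!' | '?')))
            && tok.text.chars().next().map_or(false, |c| c.is_alphabetic());
        let attach = !words.is_empty() && !(starts_word || after_terminator);
        if attach {
            let w = words.last_mut().expect("checked non-empty above");
            w.text.push_str(tok.text.trim());
            w.end = w.end.max(tok.end);
        } else {
            words.push(Word {
                start: tok.start,
                end: tok.end,
                text: tok.text.trim().to_string(),
            });
        }
    }
    for w in &mut words {
        w.start = w.start.clamp(seg_start, seg_end);
        w.end = w.end.clamp(seg_start, seg_end);
        // 50 ms floor for degenerate timings. Near the segment tail this can
        // push `end` past `seg_end`; the real module may resolve that differently.
        if w.end - w.start < MIN_WORD_SECONDS {
            w.end = w.start + MIN_WORD_SECONDS;
        }
    }
    words
}
```

Under these assumptions the pieces ` Hel`, `lo`, `,`, ` wor`, `ld`, `.` collapse into the two words `Hello,` and `world.`, each spanning the timestamps of its first and last piece.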

Why now and why this shape

Whisper has been giving us per-token timestamps the whole time; we were throwing them away. M6.5 picks them up, runs a small ASCII-only coalescer to recover word boundaries from BPE tokens, and threads everything through the existing text-cleanup pipeline without changing the user-facing paragraph shape.

The OverlayPar model in `epub3-writer` already supports per-word granularity, so no schema changes were needed. Nested `<seq>` wrapping per-word `<par>`s gives reading systems a structural place to scope highlight to "current paragraph" while still tracking the spoken word.
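
To make the nested shape concrete, the sketch below renders one paragraph as a `<seq>` holding one `<par>` per word. The function name, its parameters, the zero-based three-digit padding of the IDs, and the `N.NNNs` clock format are assumptions for illustration; this is not the `epub3-writer` serializer, and file names are assumed to need no XML escaping.

```rust
/// Mirrors the `Word { start, end, text }` shape named in the PR.
pub struct Word {
    pub start: f64,
    pub end: f64,
    pub text: String,
}

/// Renders one paragraph as a nested `<seq>` with one `<par>` per word.
/// `doc` is the XHTML file name, `audio_src` the paragraph's single audio file.
/// The ID scheme `tx-NNN-MMM` / `w-NNN-MMM-KKK` is taken from the PR text;
/// the padding width and zero-based word index are assumed.
pub fn render_paragraph_seq(
    doc: &str,
    audio_src: &str,
    section: usize,
    para: usize,
    words: &[Word],
) -> String {
    let mut out = format!(
        "<seq epub:textref=\"{}#tx-{:03}-{:03}\">\n",
        doc, section, para
    );
    for (k, w) in words.iter().enumerate() {
        out.push_str("  <par>\n");
        out.push_str(&format!(
            "    <text src=\"{}#w-{:03}-{:03}-{:03}\"/>\n",
            doc, section, para, k
        ));
        out.push_str(&format!(
            "    <audio src=\"{}\" clipBegin=\"{:.3}s\" clipEnd=\"{:.3}s\"/>\n",
            audio_src, w.start, w.end
        ));
        out.push_str("  </par>\n");
    }
    out.push_str("</seq>\n");
    out
}
```

The outer `epub:textref` points at the paragraph element, which is what lets a reading system scope a paragraph-level highlight, while each inner `<text>` points at a single word.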

Per-paragraph IDs (`tx-NNN-MMM`) that landed in #12 are now actually referenced by the SMIL — that infrastructure was waiting for this PR.
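
On the XHTML side, the IDs are what the SMIL fragments resolve against. The following sketch shows one plausible way to emit a paragraph with per-word spans, a single ASCII space between spans, per-word XML escaping, and a plain-text fallback when no words are available. The function names and the ID padding are hypothetical; the actual `render_cleaned_paragraphs` may differ.

```rust
/// Escapes the characters that are unsafe in XML text and attribute values.
fn xml_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '"' => out.push_str("&quot;"),
            _ => out.push(c),
        }
    }
    out
}

/// Hypothetical renderer: one `<p id="tx-NNN-MMM">` with one
/// `<span id="w-NNN-MMM-KKK">` per word. Falls back to plain text
/// when `words` is empty.
pub fn render_paragraph_xhtml(
    section: usize,
    para: usize,
    words: &[&str],
    fallback_text: &str,
) -> String {
    let id = format!("tx-{:03}-{:03}", section, para);
    if words.is_empty() {
        return format!("<p id=\"{}\">{}</p>", id, xml_escape(fallback_text));
    }
    let spans: Vec<String> = words
        .iter()
        .enumerate()
        .map(|(k, w)| {
            format!(
                "<span id=\"w-{:03}-{:03}-{:03}\">{}</span>",
                section,
                para,
                k,
                xml_escape(w)
            )
        })
        .collect();
    // Single ASCII space between consecutive spans, as described in the PR.
    format!("<p id=\"{}\">{}</p>", id, spans.join(" "))
}
```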

Performance

Token extraction: two FFI calls per token. Reference book ≈ 84 k tokens → ~170 k extra calls. Negligible vs. multi-minute Whisper inference.
SMIL bytes grow ~50–100×; deflate compresses ~6:1. Net EPUB size grows ~1–2 MB on a book whose audio is already ~80–160 MB.
XHTML grows ~3× per paragraph (each word now has a `<span>` wrapper). Still small absolute numbers.
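
The figures above can be sanity-checked with back-of-envelope arithmetic. The helpers below are illustrative only and their names are hypothetical; the second one assumes the net EPUB growth is dominated by SMIL, which ignores the smaller XHTML contribution.

```rust
/// Extra FFI calls for token extraction at two calls per token (per the PR).
pub fn extra_ffi_calls(tokens: u64) -> u64 {
    2 * tokens
}

/// Uncompressed growth implied by a compressed growth and a deflate ratio
/// expressed as N:1 (the PR quotes roughly 6:1).
pub fn raw_growth_bytes(compressed_growth_bytes: u64, deflate_ratio: u64) -> u64 {
    compressed_growth_bytes * deflate_ratio
}
```

With 84 000 tokens this gives 168 000 calls, which rounds to the ~170 k quoted. At ~6:1 compression, a net growth of 1–2 MB would correspond to roughly 6–12 MB of additional uncompressed SMIL.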

Test plan

`cargo test --workspace` — all 25 suites green.
`cargo clippy --all-targets -- -D warnings` — clean.
EPUBCheck-gated tests in `epub3-writer` (3): minimal book, with cover, with word-level overlay.
Real-book `dpub convert --validate` (no `--transcribe`) on the reference DAISY book → EPUBCheck 5.3.0: 0 fatals / 0 errors / 0 warnings / 0 usages.
Manual end-to-end with `--transcribe` on the reference book is the only piece I haven't verified — that requires a multi-hour Whisper run on actual audio. Will land as a comment on this PR after the long-form run; the synthetic-fixture EPUBCheck-on-word-overlay test is the gating substitute for now.
Empirical Thorium playback test: open a transcribed EPUB, confirm word-by-word highlighting tracks the audio with no visible drift > ~300 ms (Whisper's token timings are inherently approximate).

Out of scope

Word-level sync for the `--no-text-cleanup` raw-segments path (kept exactly as today: per-segment `<p>`, heading-level `<par>`).
Promoting `audio_src` from `Paragraph` to `Word` — current invariant (one audio file per cleanup paragraph) is defended by the boundary-flush in cleanup; defer until a real book violates it.
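
The boundary-flush that defends this invariant can be pictured with the reduced sketch below. It models only the audio-file rule: the real cleanup state machine also applies its own paragraph-breaking logic, and the `Segment` shape and `group_by_audio_source` name here are hypothetical.

```rust
/// Mirrors the `Word { start, end, text }` shape named in the PR.
#[derive(Debug, Clone, PartialEq)]
pub struct Word {
    pub start: f64,
    pub end: f64,
    pub text: String,
}

/// Hypothetical input: a transcribed segment tagged with its audio file.
pub struct Segment {
    pub audio_src: String,
    pub words: Vec<Word>,
}

/// Output paragraph; every word in it comes from `audio_src`.
#[derive(Debug, PartialEq)]
pub struct Paragraph {
    pub audio_src: String,
    pub words: Vec<Word>,
}

/// Appends segments to the open paragraph and starts a new one whenever
/// the audio file changes, so no paragraph ever spans two files.
pub fn group_by_audio_source(segments: &[Segment]) -> Vec<Paragraph> {
    let mut out: Vec<Paragraph> = Vec::new();
    for seg in segments {
        if seg.words.is_empty() {
            continue;
        }
        let same_file = out
            .last()
            .map_or(false, |p| p.audio_src == seg.audio_src);
        if same_file {
            let open = out.last_mut().expect("same_file implies non-empty");
            open.words.extend(seg.words.iter().cloned());
        } else {
            // Audio-file boundary: flush by opening a fresh paragraph.
            out.push(Paragraph {
                audio_src: seg.audio_src.clone(),
                words: seg.words.clone(),
            });
        }
    }
    out
}
```

Because the file name lives on `Paragraph`, a single `audio_src` can serve every per-word `<par>` in that paragraph, which is why moving it onto `Word` can be deferred.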

Karaoke-style highlight-along-with-audio in EPUB readers that honour Media Overlays (Thorium, Readium). Materially valuable for dyslexic readers, language learners, and low-vision users tracking with magnifiers: a class of accessibility experience that no other open-source DAISY → EPUB toolchain ships.

Pipeline:

1. dpub-whisper extracts per-token timestamps that whisper.cpp was already computing for free. New `Word` struct exposes {start_seconds, end_seconds, text}; `Segment.words` is populated alongside the existing `text` field.

2. New coalescer module `dpub-whisper/src/words.rs` turns BPE subword pieces back into whole words. Algorithm:
   - Leading-space token starts a new word.
   - Token without leading space (or pure ASCII punctuation) attaches to current word.
   - Defensive case: if previous word ends in a sentence terminator and the next token starts with an alphabetic char, treat as new word even without leading space. This recovers from whisper occasionally dropping the post-period space.
   - Drops special tokens ([_BEG_], <|notimestamps|>, etc.).
   - Clamps each word's range into [seg_t0, seg_t1] and gives degenerate-timing punctuation a 50 ms minimum.
   Eight unit tests cover the rules.

3. dpub-convert::text_cleanup threads `Vec<Word>` through the greedy paragraph builder. `Paragraph` gains `words: Vec<Word>` and `audio_src: String`. The cleanup state machine flushes at audio-file boundaries so each paragraph references one audio source. Capitalisation fix propagates to both the rendered text and `words[0].text`. Three new unit tests; all 12 existing tests preserved.

4. dpub-convert::lib emits per-word `<span id="w-NNN-MMM-KKK">` inside each cleaned `<p id="tx-NNN-MMM">`. Single ASCII space between consecutive spans; XML escaping per word. Three new unit tests for span shape, fallback when words are absent, and escaping.

5. dpub-convert::lib rebuilds the section overlay from the cleaned paragraphs in word-sync mode: nested `<seq epub:textref="...#tx-...">` wrapping per-word `<par>` entries. The existing `OverlayPar` model already supported this granularity unchanged. Two new unit tests for nested-seq shape and skip-paragraphs-without-words.

6. New `dpub convert --no-word-sync` flag opts out (falls back to per-paragraph sync). Plumbed via `ConvertOptions.no_word_sync`. Mutually compatible with all existing flags.

7. EPUBCheck-clean verified end-to-end. New synthetic test `epubcheck_clean_with_word_level_overlay_when_available` constructs a publication with hand-built per-word `<par>` entries and asserts EPUBCheck reports 0 errors, 0 warnings. Reference book without --transcribe still 0/0/0 (project's correctness baseline).

Performance: token extraction is two FFI calls per token, negligible vs. multi-minute Whisper inference. SMIL bytes grow ~50–100× but compress ~6:1 with deflate; net EPUB size grows ~1–2 MB on a book whose audio is already ~80–160 MB.

Documentation: README roadmap ticks M6.5; CHANGELOG entry under [Unreleased].

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

roelvangils merged commit d28d16c into main May 6, 2026
4 of 5 checks passed

roelvangils deleted the feature/word-level-sync branch May 6, 2026 21:05
