M6.5: word-level Media Overlay sync #29
Merged
Karaoke-style highlight-along-with-audio in EPUB readers that
honour Media Overlays (Thorium, Readium). Materially valuable for
dyslexic readers, language learners, and low-vision users tracking
with magnifiers — a class of accessibility experience that no
other open-source DAISY → EPUB toolchain ships.
Pipeline:
1. dpub-whisper extracts per-token timestamps that whisper.cpp
was already computing for free. New `Word` struct exposes
{start_seconds, end_seconds, text}; `Segment.words` is
populated alongside the existing `text` field.
2. New coalescer module `dpub-whisper/src/words.rs` turns BPE
subword pieces back into whole words. Algorithm:
- Leading-space token starts a new word.
- Token without leading space (or pure ASCII punctuation)
attaches to current word.
- Defensive case: if previous word ends in a sentence
terminator and the next token starts with an alphabetic
char, treat as new word even without leading space —
recovers from whisper occasionally dropping the post-
period space.
- Drops special tokens ([_BEG_], <|notimestamps|>, etc.).
- Clamps each word's range into [seg_t0, seg_t1] and gives
degenerate-timing punctuation a 50ms minimum.
Eight unit tests cover the rules.
3. dpub-convert::text_cleanup threads `Vec<Word>` through the
greedy paragraph builder. `Paragraph` gains
`words: Vec<Word>` and `audio_src: String`. The cleanup
state machine flushes at audio-file boundaries so each
paragraph references one audio source. Capitalisation fix
propagates to both the rendered text and `words[0].text`.
Three new unit tests; all 12 existing tests preserved.
4. dpub-convert::lib emits per-word `<span id="w-NNN-MMM-KKK">`
inside each cleaned `<p id="tx-NNN-MMM">`. Single ASCII
space between consecutive spans; XML escaping per word.
Three new unit tests for span shape, fallback when words
are absent, and escaping.
5. dpub-convert::lib rebuilds the section overlay from the
cleaned paragraphs in word-sync mode: nested
`<seq epub:textref="...#tx-...">` wrapping per-word
`<par>` entries. The existing `OverlayPar` model already
supported this granularity unchanged. Two new unit tests
for nested-seq shape and skip-paragraphs-without-words.
6. New `dpub convert --no-word-sync` flag opts out (falls back
to per-paragraph sync). Plumbed via
`ConvertOptions.no_word_sync`. Compatible with all
existing flags.
7. EPUBCheck-clean verified end-to-end. New synthetic test
`epubcheck_clean_with_word_level_overlay_when_available`
constructs a publication with hand-built per-word `<par>`
entries and asserts EPUBCheck reports 0 errors, 0 warnings.
Reference book without --transcribe still 0/0/0 (project's
correctness baseline).
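Steps 1 and 4 can be sketched together as follows. This is a minimal illustration, not the code in the PR: the `Word` fields come from step 1, but `render_paragraph`, `xml_escape`, the three-digit zero padding, and the 1-based word index are assumptions made for the example.

```rust
/// One spoken word with its timing, using the field names from step 1.
#[derive(Debug, Clone, PartialEq)]
pub struct Word {
    pub start_seconds: f64,
    pub end_seconds: f64,
    pub text: String,
}

/// Escape the five XML special characters (hypothetical helper).
fn xml_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '"' => out.push_str("&quot;"),
            '\'' => out.push_str("&apos;"),
            _ => out.push(c),
        }
    }
    out
}

/// Render one cleaned paragraph (hypothetical signature).
/// `section` and `para` stand in for the NNN and MMM parts of the id.
/// Assumption: ids are zero-padded to three digits and words count from 1.
pub fn render_paragraph(section: usize, para: usize, text: &str, words: &[Word]) -> String {
    let id = format!("tx-{:03}-{:03}", section, para);
    if words.is_empty() {
        // Fallback when no word timings are available: plain escaped text.
        return format!("<p id=\"{}\">{}</p>", id, xml_escape(text));
    }
    let spans: Vec<String> = words
        .iter()
        .enumerate()
        .map(|(k, w)| {
            format!(
                "<span id=\"w-{:03}-{:03}-{:03}\">{}</span>",
                section,
                para,
                k + 1,
                xml_escape(&w.text)
            )
        })
        .collect();
    // Single ASCII space between consecutive spans.
    format!("<p id=\"{}\">{}</p>", id, spans.join(" "))
}
```

Escaping each word separately keeps a stray `&` or `<` in the transcript from breaking the XHTML, and the empty-`words` branch mirrors the fallback covered by the step 4 tests.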
Performance: token extraction is two FFI calls per token —
negligible vs. multi-minute Whisper inference. SMIL bytes grow
~50–100× but compress ~6:1 with deflate; net EPUB size grows
~1–2 MB on a book whose audio is already ~80–160 MB.
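The size estimate above can be checked with a few lines of arithmetic. The function name and the 120 KB per-paragraph SMIL baseline are hypothetical; the baseline is chosen only because it is consistent with the quoted growth factor, compression ratio, and net size.

```rust
/// Estimate the compressed bytes added to the EPUB when the SMIL grows by
/// `growth` times and deflate compresses it by `ratio` to 1.
/// Hypothetical helper, for illustration only.
pub fn added_compressed_bytes(baseline_smil_bytes: f64, growth: f64, ratio: f64) -> f64 {
    let grown = baseline_smil_bytes * growth;
    // Only the increase over the baseline counts as new size.
    (grown - baseline_smil_bytes) / ratio
}
```

With an assumed 120,000-byte baseline, `added_compressed_bytes(120_000.0, 50.0, 6.0)` gives 980,000 bytes and a growth factor of 100 gives 1,980,000 bytes, which matches the ~1–2 MB range.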
Documentation: README roadmap ticks M6.5; CHANGELOG entry under
[Unreleased].
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Karaoke-style highlight-along-with-audio in EPUB readers that honour Media Overlays (Thorium, Readium). Materially valuable for dyslexic readers, language learners, and low-vision users tracking with magnifiers: a class of accessibility experience that no other open-source DAISY → EPUB toolchain ships.
Default-on with `--transcribe`; pass `--no-word-sync` to fall back to per-paragraph sync.
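The flag interaction can be sketched as a single predicate. `no_word_sync` is the field named in this PR; the `transcribe` field and the `word_sync_enabled` function are assumptions made for illustration.

```rust
/// Subset of the conversion options relevant to word sync.
/// Assumption: `transcribe` mirrors the `--transcribe` flag.
#[derive(Debug, Default, Clone)]
pub struct ConvertOptions {
    pub transcribe: bool,
    pub no_word_sync: bool,
}

/// Word-level sync is on whenever transcription runs, unless the user
/// opted out with `--no-word-sync` (hypothetical helper).
pub fn word_sync_enabled(opts: &ConvertOptions) -> bool {
    opts.transcribe && !opts.no_word_sync
}
```

Modelling the opt-out as a negative boolean keeps `ConvertOptions::default()` equal to the pre-existing behaviour when `--transcribe` is not given.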
Pipeline (seven sequential commits in this single squash-merge)
word ...
`. 3 new unit tests for span shape, fallback when words absent, XML escaping.
Why now and why this shape
Whisper has been giving us per-token timestamps the whole time; we were throwing them away. M6.5 picks them up, runs a small ASCII-only coalescer to recover word boundaries from BPE tokens, and threads everything through the existing text-cleanup pipeline without changing the user-facing paragraph shape.
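A minimal sketch of such a coalescer, following the rules listed in the commit message, might look like this. The `Token` type, the helper names, and the exact set of sentence terminators are assumptions; the real module is `dpub-whisper/src/words.rs` and may differ.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Word {
    pub start_seconds: f64,
    pub end_seconds: f64,
    pub text: String,
}

/// Hypothetical view of one Whisper token: raw text plus timing in seconds.
pub struct Token {
    pub text: String,
    pub t0: f64,
    pub t1: f64,
}

/// Special tokens such as `[_BEG_]` or `<|notimestamps|>` carry no speech.
fn is_special(text: &str) -> bool {
    let t = text.trim();
    (t.starts_with("[_") && t.ends_with(']')) || (t.starts_with("<|") && t.ends_with("|>"))
}

fn is_ascii_punct(piece: &str) -> bool {
    !piece.is_empty() && piece.chars().all(|c| c.is_ascii_punctuation())
}

/// Assumption: '.', '!' and '?' count as sentence terminators.
fn ends_sentence(word: &str) -> bool {
    matches!(word.chars().last(), Some('.') | Some('!') | Some('?'))
}

/// Merge subword tokens into words. Assumes `seg_t0 <= seg_t1`.
pub fn coalesce(tokens: &[Token], seg_t0: f64, seg_t1: f64) -> Vec<Word> {
    const MIN_DURATION: f64 = 0.050; // 50 ms floor
    let mut words: Vec<Word> = Vec::new();
    for tok in tokens {
        if is_special(&tok.text) {
            continue;
        }
        let piece = tok.text.trim_start();
        if piece.is_empty() {
            continue;
        }
        let leading_space = tok.text.starts_with(' ');
        let punct = is_ascii_punct(piece);
        // Defensive case: "end." followed by "Next" with the space dropped.
        let dropped_space = !leading_space
            && words.last().map_or(false, |w| ends_sentence(&w.text))
            && piece.chars().next().map_or(false, |c| c.is_alphabetic());
        let starts_new = words.is_empty() || (!punct && (leading_space || dropped_space));
        if starts_new {
            words.push(Word { start_seconds: tok.t0, end_seconds: tok.t1, text: piece.to_string() });
        } else if let Some(w) = words.last_mut() {
            w.text.push_str(piece);
            w.end_seconds = w.end_seconds.max(tok.t1);
        }
    }
    for w in &mut words {
        w.start_seconds = w.start_seconds.clamp(seg_t0, seg_t1);
        w.end_seconds = w.end_seconds.clamp(seg_t0, seg_t1);
        // Simplification: the floor is applied to any degenerate range,
        // not only to punctuation, and may extend past seg_t1.
        if w.end_seconds - w.start_seconds < MIN_DURATION {
            w.end_seconds = w.start_seconds + MIN_DURATION;
        }
    }
    words
}
```

For the token stream `[_BEG_]`, ` Hel`, `lo`, `.`, `World`, ` again`, this sketch yields the words `Hello.`, `World`, and `again`: the period attaches to the preceding word, and `World` is split off by the dropped-space rule.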
The OverlayPar model in `epub3-writer` already supports per-word granularity, so no schema changes are needed. Nested `<seq>` wrapping per-word `<par>`s gives reading systems a structural place to scope highlight to "current paragraph" while still tracking the spoken word.
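The nested shape can be illustrated with a small emitter that writes one `<seq>` per paragraph and one `<par>` per word. This is a sketch under stated assumptions, not the `epub3-writer` API: `paragraph_seq`, the `Paragraph` struct layout, the id padding, and the `{:.3}s` clock format are all hypothetical, and paths are assumed to be XML-safe already.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Word {
    pub start_seconds: f64,
    pub end_seconds: f64,
    pub text: String,
}

/// Hypothetical paragraph shape carrying word timings and one audio source.
pub struct Paragraph {
    pub words: Vec<Word>,
    pub audio_src: String,
}

/// Emit the SMIL fragment for one paragraph, or `None` when the paragraph
/// has no word timings (such paragraphs are skipped in word-sync mode).
/// `doc` is the XHTML file name; `section` and `para` are the NNN and MMM ids.
pub fn paragraph_seq(doc: &str, section: usize, para: usize, p: &Paragraph) -> Option<String> {
    if p.words.is_empty() {
        return None;
    }
    // The outer seq points at the paragraph element.
    let mut out = format!(
        "<seq epub:textref=\"{}#tx-{:03}-{:03}\">\n",
        doc, section, para
    );
    for (k, w) in p.words.iter().enumerate() {
        // Each par pairs one word span with its audio clip.
        out.push_str(&format!(
            "  <par>\n    <text src=\"{}#w-{:03}-{:03}-{:03}\"/>\n    <audio src=\"{}\" clipBegin=\"{:.3}s\" clipEnd=\"{:.3}s\"/>\n  </par>\n",
            doc,
            section,
            para,
            k + 1,
            p.audio_src,
            w.start_seconds,
            w.end_seconds
        ));
    }
    out.push_str("</seq>\n");
    Some(out)
}
```

Because every paragraph references a single audio source, `audio_src` can live on the paragraph rather than on each word.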
Per-paragraph IDs (`tx-NNN-MMM`) that landed in #12 are now actually referenced by the SMIL — that infrastructure was waiting for this PR.
Performance
Test plan
Out of scope
`, heading-level ``).