11ways · roelvangils · May 7, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -9,6 +9,9 @@ All notable changes to this project will be documented in this file. The format
 - **`--auto-cover` for Dutch (and other) books no longer silently misses.** Open Library tags docs with ISO 639-2/B (e.g. `"dut"` for Dutch), while DAISY 2.02 metadata uses ISO 639-1 (`"nl"`); the previous literal `eq_ignore_ascii_case` dropped every plausible match. `dpub-meta` now treats 639-1, 639-2/B and 639-2/T as equivalent (`nl`/`dut`/`nld`, `fr`/`fre`/`fra`, `de`/`ger`/`deu`, etc.). Real-world miss this surfaced: "Het smelt" by Lize Spit. Regression test added.
 - **ISBN search hits are now trusted unconditionally.** When DAISY's `dc:identifier` is ISBN-shaped, the search-by-ISBN already disambiguates the edition, so the language and author filters on the result are noise — and would (incorrectly) reject the cover when Open Library lists a translator under `author_name`. Title+author search remains filtered.
 - **Open Library HTTP timeout raised from 8 s to 30 s.** `covers.openlibrary.org` redirects through archive.org and can take ~20 s on first hit for less-popular editions; 8 s caused spurious "lookup failed" misses.
+- **OPF manifest IDs no longer fail XML Name validation when DAISY filenames start with digits.** DAISY books frequently use `001_*.smil`, `002_*.smil` filenames; the previous code copied those stems into manifest `id` and `idref` attributes, which XML Names reject (must start with a letter or underscore). Stems beginning with a digit are now prefixed with `s-`. EPUBCheck no longer flags `RSC-005` for these books. The reference book ("Ontmoetingen in het donker") was unaffected because its filenames begin with letters.
+- **Empty `<seq>` elements no longer leak into Media Overlay SMIL files** (EPUBCheck `RSC-005` "element seq incomplete"). Empty paragraph wrappers are dropped at the writer level, and the heading-level overlay shell is preserved when ground-truth alignment would have produced an entirely empty word tree.
+- **Words with `clipBegin == clipEnd` no longer ship in SMIL** (EPUBCheck `MED-009`). Zero-duration words from interpolation are filtered out alongside the explicit Unsynced sentinel; their XHTML span is still emitted so the text remains readable.
 - **Whisper model download no longer times out on slow connections.** The HTTP agent used a 60-second total-request timeout, which was insufficient for the 1.5 GB `ggml-medium.bin` download. Now uses per-read timeouts (60 s idle) so downloads can take as long as needed as long as data keeps flowing. Additionally, downloads now retry up to 3 times on transient failures (CDN stalls, connection resets).
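
The manifest-ID entry above describes a small rule: a filename stem that starts with a digit gets an `s-` prefix before it is used as an `id` or `idref`. A minimal sketch of that rule follows. The function name `manifest_id` is hypothetical and not taken from the dpub source, and the sketch handles only the leading-digit case named in the changelog, not other characters that XML Names disallow.

```rust
/// Hypothetical helper (name assumed, not from the dpub codebase):
/// turn a filename stem into something usable as an OPF manifest `id`.
/// Only the leading-digit case from the changelog entry is handled.
fn manifest_id(stem: &str) -> String {
    match stem.chars().next() {
        // XML Names may not start with a digit, so add the `s-` prefix.
        Some(c) if c.is_ascii_digit() => format!("s-{stem}"),
        // Letters, underscores, and the empty string pass through unchanged.
        _ => stem.to_string(),
    }
}
```

Because the same function would be applied to both the manifest `id` and the spine `idref`, the two stay in sync without a lookup table.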
 
 ### Changed
@@ -17,6 +20,10 @@ All notable changes to this project will be documented in this file. The format
 
 ### Added
 
+- **Ground truth text alignment** (`--ground-truth <PATH>`). Pass a plain text or markdown file containing the real book text and dpub will align it word-by-word against Whisper's transcription, replacing Whisper's approximate text with the real prose while keeping the word-level audio sync. Section headings are matched against the DAISY NCC headings via Jaro-Winkler fuzzy matching, so a single file with the whole book works as long as the chapters are in the right order. Markdown vs plain text is auto-detected. Requires `--transcribe` (Whisper still runs to produce timestamps).
+- **`--ground-truth-strategy <drop|no-sync|bracket>`** controls how book content the narrator skipped (colophon, index, acknowledgements) is handled. `no-sync` (default) includes the text in the EPUB without a Media Overlay entry — visible, no karaoke highlight on those passages. `drop` excludes it entirely. `bracket` spans the available time gap proportionally for continuous (if imperfect) sync.
+- **Audiobook-specific boundary trimming.** Audiobook copyright preambles and outros (Whisper-only material) are detected automatically and discarded — they never leak into the first or last real word's timestamp. The detector requires a run of at least 5 consecutive matching words before it commits to the alignment, so a single coincidental match (e.g. the book title appearing in the preamble) can't trigger early alignment.
+- **New crate `dpub-align`** containing the alignment algorithm: word normalisation, Myers diff (via `similar`), Jaro-Winkler fuzzy promotion (≥ 0.85 → Equal), boundary anchor detection, and timestamp transfer with monotonicity enforcement. 33 unit tests.
 - **`--transcribe` auto-detects language from book metadata.** Passing `--transcribe` without a language code now reads `dc:language` from the DAISY NCC metadata and normalises it to ISO 639-1 for Whisper. Explicit `--transcribe nl` still works. Config file supports `"transcribe": true` for auto-detect or `"transcribe": "nl"` for a fixed default.
 - **Shared ISO 639 normaliser** (`dpub-util/lang`). Maps ISO 639-1, 639-2/B and 639-2/T codes to their canonical two-letter form. Used by both `dpub-meta` (cover lookup language filter) and `dpub-cli` (transcription auto-detect).
 - **Persistent config file** (`~/.config/dpub/config.json` on Unix, `%APPDATA%\dpub\config.json` on Windows). Lets users set defaults for `audio`, `bitrate`, `auto_cover`, `no_word_sync`, `rights`, `whisper_model`, `transcribe`, `validate`, `a11y`, `jobs`, and `log_level`. CLI flags always override config values.
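
The boundary-trimming entry above says the detector waits for a run of at least 5 consecutive matching words before committing to an alignment. One way to picture that rule is the sketch below. It assumes the aligner has already produced one boolean per word pair (`true` for a match); that input shape and the name `first_stable_anchor` are illustrative assumptions, not the `dpub-align` API.

```rust
/// Hypothetical sketch: return the index where the first run of at least
/// `min_run` consecutive matches begins, or `None` if no such run exists.
/// `matches` holds one flag per aligned word pair (an assumed input shape).
fn first_stable_anchor(matches: &[bool], min_run: usize) -> Option<usize> {
    let mut run_start = 0;
    let mut run_len = 0;
    for (i, &is_match) in matches.iter().enumerate() {
        if is_match {
            if run_len == 0 {
                run_start = i; // a new run begins here
            }
            run_len += 1;
            if run_len >= min_run {
                return Some(run_start);
            }
        } else {
            run_len = 0; // an isolated match never commits
        }
    }
    None
}
```

With `min_run = 5`, a lone match at index 0 (for example, the book title read aloud in the preamble) is skipped, and the anchor lands on the first sustained run instead.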

diff --git a/Cargo.lock b/Cargo.lock
diff --git a/Cargo.toml b/Cargo.toml
@@ -1,6 +1,7 @@
 [workspace]
 resolver = "3"
 members = [
+    "crates/dpub-align",
     "crates/dpub-audio",
     "crates/dpub-core",
     "crates/dpub-cli",

diff --git a/crates/dpub-align/Cargo.toml b/crates/dpub-align/Cargo.toml
@@ -0,0 +1,22 @@
+[package]
+name = "dpub-align"
+version.workspace = true
+edition.workspace = true
+rust-version.workspace = true
+license.workspace = true
+repository.workspace = true
+description = "Align Whisper word-level timestamps to ground truth book text via Myers diff + fuzzy matching."
+
+[lints]
+workspace = true
+
+[dependencies]
+serde = { workspace = true }
+serde_json = { workspace = true }
+similar = "3"
+strsim = "0.11"
+thiserror = { workspace = true }
+tracing = { workspace = true }
+
+[dev-dependencies]
+dpub-core = { path = "../dpub-core" }
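
The changelog describes the fuzzy step as "Jaro-Winkler fuzzy promotion (≥ 0.85 → Equal)", and the `strsim` dependency above presumably supplies the similarity score. To show what that threshold means without pulling in a crate, here is a std-only sketch of textbook Jaro-Winkler. It is an illustration, not the crate's code: the function names are hypothetical, the prefix bonus is applied unconditionally with the usual 0.1 scale and 4-character cap, and scores may differ slightly from `strsim` in edge cases.

```rust
/// Textbook Jaro similarity over Unicode scalar values.
fn jaro(a: &str, b: &str) -> f64 {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    if a.is_empty() && b.is_empty() {
        return 1.0;
    }
    if a.is_empty() || b.is_empty() {
        return 0.0;
    }
    // Characters count as matching only within this sliding window.
    let window = (a.len().max(b.len()) / 2).saturating_sub(1);
    let mut a_hit = vec![false; a.len()];
    let mut b_hit = vec![false; b.len()];
    let mut matches = 0usize;
    for i in 0..a.len() {
        let lo = i.saturating_sub(window);
        let hi = (i + window + 1).min(b.len());
        for j in lo..hi {
            if !b_hit[j] && a[i] == b[j] {
                a_hit[i] = true;
                b_hit[j] = true;
                matches += 1;
                break;
            }
        }
    }
    if matches == 0 {
        return 0.0;
    }
    // Walk matched characters in order; mismatches are half-transpositions.
    let mut k = 0usize;
    let mut half_transpositions = 0usize;
    for i in 0..a.len() {
        if a_hit[i] {
            while !b_hit[k] {
                k += 1;
            }
            if a[i] != b[k] {
                half_transpositions += 1;
            }
            k += 1;
        }
    }
    let m = matches as f64;
    let t = (half_transpositions / 2) as f64;
    (m / a.len() as f64 + m / b.len() as f64 + (m - t) / m) / 3.0
}

/// Jaro-Winkler: Jaro plus a bonus for a shared prefix of up to 4 chars.
fn jaro_winkler(a: &str, b: &str) -> f64 {
    let j = jaro(a, b);
    let prefix = a.chars().zip(b.chars()).take(4).take_while(|(x, y)| x == y).count();
    j + prefix as f64 * 0.1 * (1.0 - j)
}

/// Hypothetical promotion rule using the 0.85 threshold from the changelog.
fn promotes_to_equal(a: &str, b: &str) -> bool {
    jaro_winkler(a, b) >= 0.85
}
```

Under this rule a small transcription slip such as a swapped letter pair still counts as the same word, while unrelated words score near zero and stay as differences.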
diff --git a/crates/dpub-align/examples/match_sections.rs b/crates/dpub-align/examples/match_sections.rs
@@ -0,0 +1,66 @@
+//! Dry-run helper: parse a DAISY 2.02 publication and a ground-truth
+//! file and report how many sections the heading matcher resolves —
+//! without running Whisper. Useful when validating a new ground-truth
+//! file against a book.
+//!
+//! Usage:
+//! ```text
+//! cargo run --release -p dpub-align --example match_sections -- \
+//!   /path/to/book/ncc.html /path/to/groundtruth.{txt,md,json}
+//! ```
+
+use std::path::Path;
+
+fn main() {
+    let mut args = std::env::args().skip(1);
+    let ncc = args.next().expect("usage: match_sections <ncc.html> <ground-truth>");
+    let gt = args.next().expect("usage: match_sections <ncc.html> <ground-truth>");
+
+    let book = dpub_core::Book::from_ncc(Path::new(&ncc)).expect("parse DAISY");
+    let raw = std::fs::read_to_string(&gt).expect("read ground truth");
+
+    let headings: Vec<(&str, usize)> = book
+        .master
+        .references
+        .iter()
+        .enumerate()
+        .map(|(i, r)| (r.title.as_str(), i))
+        .collect();
+
+    let sections = dpub_align::split_into_sections(&raw, &headings);
+    println!(
+        "Matched {} of {} DAISY sections",
+        sections.len(),
+        headings.len()
+    );
+    println!();
+
+    let matched: std::collections::HashSet<usize> = sections.iter().map(|s| s.ncc_index).collect();
+    println!("First 10 matches:");
+    for s in sections.iter().take(10) {
+        let title = headings[s.ncc_index].0;
+        let preview: String = s.text.chars().take(50).collect::<String>().replace('\n', " ");
+        println!(
+            "  [{:3}] {:30}  → {:5} chars  {:?}",
+            s.ncc_index,
+            title,
+            s.text.chars().count(),
+            preview
+        );
+    }
+    println!();
+
+    let unmatched: Vec<&str> = headings
+        .iter()
+        .enumerate()
+        .filter(|(i, _)| !matched.contains(i))
+        .map(|(_, (t, _))| *t)
+        .collect();
+    println!("Unmatched headings ({} total):", unmatched.len());
+    for t in unmatched.iter().take(20) {
+        println!("  {t}");
+    }
+    if unmatched.len() > 20 {
+        println!("  ... and {} more", unmatched.len() - 20);
+    }
+}