Merged
7 changes: 7 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -9,6 +9,9 @@ All notable changes to this project will be documented in this file. The format
- **`--auto-cover` for Dutch (and other) books no longer silently misses.** Open Library tags docs with ISO 639-2/B (e.g. `"dut"` for Dutch), while DAISY 2.02 metadata uses ISO 639-1 (`"nl"`); the previous literal `eq_ignore_ascii_case` dropped every plausible match. `dpub-meta` now treats 639-1, 639-2/B and 639-2/T as equivalent (`nl`/`dut`/`nld`, `fr`/`fre`/`fra`, `de`/`ger`/`deu`, etc.). Real-world miss this surfaced: "Het smelt" by Lize Spit. Regression test added.
- **ISBN search hits are now trusted unconditionally.** When DAISY's `dc:identifier` is ISBN-shaped, the search-by-ISBN already disambiguates the edition, so the language and author filters on the result are noise — and would (incorrectly) reject the cover when Open Library lists a translator under `author_name`. Title+author search remains filtered.
- **Open Library HTTP timeout raised from 8 s to 30 s.** `covers.openlibrary.org` redirects through archive.org and can take ~20 s on first hit for less-popular editions; 8 s caused spurious "lookup failed" misses.
- **OPF manifest IDs no longer fail XML Name validation when DAISY filenames start with digits.** DAISY books frequently use `001_*.smil`, `002_*.smil` filenames; the previous code copied those stems into manifest `id` and `idref` attributes, which XML Names reject (must start with a letter or underscore). Stems beginning with a digit are now prefixed with `s-`. EPUBCheck no longer flags `RSC-005` for these books. The reference book ("Ontmoetingen in het donker") was unaffected because its filenames begin with letters.
- **Empty `<seq>` elements no longer leak into Media Overlay SMIL files** (EPUBCheck `RSC-005` "element seq incomplete"). Empty paragraph wrappers are dropped at the writer level, and the heading-level overlay shell is preserved when ground-truth alignment would have produced an entirely empty word tree.
- **Words with `clipBegin == clipEnd` no longer ship in SMIL** (EPUBCheck `MED-009`). Zero-duration words from interpolation are filtered out alongside the explicit Unsynced sentinel; their XHTML span is still emitted so the text remains readable.
- **Whisper model download no longer times out on slow connections.** The HTTP agent used a 60-second total-request timeout, which was insufficient for the 1.5 GB `ggml-medium.bin` download. It now uses per-read timeouts (60 s idle), so a download can take as long as it needs provided data keeps flowing. Downloads also retry up to 3 times on transient failures (CDN stalls, connection resets).
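
The digit-prefix rule from the OPF manifest ID entry above can be sketched roughly as follows. This is an illustration only, not the code from this PR: the function name `manifest_id` is hypothetical, and the only behaviour taken from the changelog is that stems beginning with a digit receive an `s-` prefix.

```rust
/// Hypothetical helper: derive an XML-safe manifest `id` from a file stem.
/// Only the leading-digit case described in the changelog is handled here.
fn manifest_id(stem: &str) -> String {
    match stem.chars().next() {
        // XML Names may not start with a digit, so prefix those stems.
        Some(c) if c.is_ascii_digit() => format!("s-{stem}"),
        // Every other stem is passed through unchanged in this sketch.
        _ => stem.to_string(),
    }
}
```

Under these assumptions, `manifest_id("001_intro")` yields `s-001_intro`, while a stem such as `chapter01` is returned as-is. The same function would have to be applied to both `id` and `idref` so the two stay in sync.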

### Changed
@@ -17,6 +20,10 @@ All notable changes to this project will be documented in this file. The format

### Added

- **Ground truth text alignment** (`--ground-truth <PATH>`). Pass a plain text or markdown file containing the real book text and dpub will align it word-by-word against Whisper's transcription, replacing Whisper's approximate text with the real prose while keeping the word-level audio sync. Section headings are matched against the DAISY NCC headings via Jaro-Winkler fuzzy matching, so a single file with the whole book works as long as the chapters are in the right order. Markdown vs plain text is auto-detected. Requires `--transcribe` (Whisper still runs to produce timestamps).
- **`--ground-truth-strategy <drop|no-sync|bracket>`** controls how book content the narrator skipped (colophon, index, acknowledgements) is handled. `no-sync` (default) includes the text in the EPUB without a Media Overlay entry — visible, no karaoke highlight on those passages. `drop` excludes it entirely. `bracket` spans the available time gap proportionally for continuous (if imperfect) sync.
- **Audiobook-specific boundary trimming.** Audiobook copyright preambles and outros (Whisper-only material) are detected automatically and discarded — they never leak into the first or last real word's timestamp. The detector requires a run of at least 5 consecutive matching words before it commits to the alignment, so a single coincidental match (e.g. the book title appearing in the preamble) can't trigger early alignment.
- **New crate `dpub-align`** containing the alignment algorithm: word normalisation, Myers diff (via `similar`), Jaro-Winkler fuzzy promotion (≥ 0.85 → Equal), boundary anchor detection, and timestamp transfer with monotonicity enforcement. 33 unit tests.
- **`--transcribe` auto-detects language from book metadata.** Passing `--transcribe` without a language code now reads `dc:language` from the DAISY NCC metadata and normalises it to ISO 639-1 for Whisper. Explicit `--transcribe nl` still works. Config file supports `"transcribe": true` for auto-detect or `"transcribe": "nl"` for a fixed default.
- **Shared ISO 639 normaliser** (`dpub-util/lang`). Maps ISO 639-1, 639-2/B and 639-2/T codes to their canonical two-letter form. Used by both `dpub-meta` (cover lookup language filter) and `dpub-cli` (transcription auto-detect).
- **Persistent config file** (`~/.config/dpub/config.json` on Unix, `%APPDATA%\dpub\config.json` on Windows). Lets users set defaults for `audio`, `bitrate`, `auto_cover`, `no_word_sync`, `rights`, `whisper_model`, `transcribe`, `validate`, `a11y`, `jobs`, and `log_level`. CLI flags always override config values.
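
To make the shared ISO 639 normaliser entry above concrete, here is a minimal sketch of the idea. The function names `normalise_lang` and `same_language` are hypothetical and are not the `dpub-util/lang` API; the table covers only the three languages named in the changelog, whereas a real table would be far larger.

```rust
/// Hypothetical sketch: map ISO 639-1, 639-2/B and 639-2/T codes to the
/// two-letter 639-1 form. Unknown codes yield `None`.
fn normalise_lang(code: &str) -> Option<&'static str> {
    // Compare case-insensitively and ignore surrounding whitespace.
    let code = code.trim().to_ascii_lowercase();
    match code.as_str() {
        "nl" | "dut" | "nld" => Some("nl"),
        "fr" | "fre" | "fra" => Some("fr"),
        "de" | "ger" | "deu" => Some("de"),
        _ => None,
    }
}

/// Two codes name the same language when both normalise to the same value.
/// Codes this sketch does not know never compare equal.
fn same_language(a: &str, b: &str) -> bool {
    matches!((normalise_lang(a), normalise_lang(b)), (Some(x), Some(y)) if x == y)
}
```

With this shape, a cover-lookup filter could compare Open Library's `"dut"` against DAISY's `"nl"` via `same_language` instead of a literal string comparison, and the transcription path could pass the output of `normalise_lang` to Whisper.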
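
The "run of at least 5 consecutive matching words" rule from the boundary-trimming entry above can be illustrated with a small sketch. It assumes the diff has already been reduced to one boolean per Whisper word (`true` meaning the word matched the ground truth); that representation and the name `first_anchor` are assumptions for illustration, not the `dpub-align` implementation.

```rust
/// Hypothetical sketch: return the index where the first run of at least
/// `min_run` consecutive matches begins, or `None` if no such run exists.
/// `min_run` is assumed to be at least 1.
fn first_anchor(matches: &[bool], min_run: usize) -> Option<usize> {
    let mut run_start = 0;
    let mut run_len = 0;
    for (i, &is_match) in matches.iter().enumerate() {
        if is_match {
            if run_len == 0 {
                // A new run of matches begins at this word.
                run_start = i;
            }
            run_len += 1;
            if run_len >= min_run {
                return Some(run_start);
            }
        } else {
            // A mismatch breaks the run, so isolated hits are ignored.
            run_len = 0;
        }
    }
    None
}
```

A single coincidental match inside the preamble, such as the book title, resets to zero at the next mismatch and never reaches the threshold, which is the behaviour the changelog entry describes.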
34 changes: 34 additions & 0 deletions Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

1 change: 1 addition & 0 deletions Cargo.toml
@@ -1,6 +1,7 @@
[workspace]
resolver = "3"
members = [
"crates/dpub-align",
"crates/dpub-audio",
"crates/dpub-core",
"crates/dpub-cli",
22 changes: 22 additions & 0 deletions crates/dpub-align/Cargo.toml
@@ -0,0 +1,22 @@
[package]
name = "dpub-align"
version.workspace = true
edition.workspace = true
rust-version.workspace = true
license.workspace = true
repository.workspace = true
description = "Align Whisper word-level timestamps to ground truth book text via Myers diff + fuzzy matching."

[lints]
workspace = true

[dependencies]
serde = { workspace = true }
serde_json = { workspace = true }
similar = "3"
strsim = "0.11"
thiserror = { workspace = true }
tracing = { workspace = true }

[dev-dependencies]
dpub-core = { path = "../dpub-core" }
66 changes: 66 additions & 0 deletions crates/dpub-align/examples/match_sections.rs
@@ -0,0 +1,66 @@
//! Dry-run helper: parse a DAISY 2.02 publication and a ground-truth
//! file and report how many sections the heading matcher resolves —
//! without running Whisper. Useful when validating a new ground-truth
//! file against a book.
//!
//! Usage:
//! ```text
//! cargo run --release -p dpub-align --example match_sections -- \
//! /path/to/book/ncc.html /path/to/groundtruth.{txt,md,json}
//! ```

use std::path::Path;

fn main() {
    let mut args = std::env::args().skip(1);
    let ncc = args.next().expect("usage: match_sections <ncc.html> <ground-truth>");
    let gt = args.next().expect("usage: match_sections <ncc.html> <ground-truth>");

    let book = dpub_core::Book::from_ncc(Path::new(&ncc)).expect("parse DAISY");
    let raw = std::fs::read_to_string(&gt).expect("read ground truth");

    let headings: Vec<(&str, usize)> = book
        .master
        .references
        .iter()
        .enumerate()
        .map(|(i, r)| (r.title.as_str(), i))
        .collect();

    let sections = dpub_align::split_into_sections(&raw, &headings);
    println!(
        "Matched {} of {} DAISY sections",
        sections.len(),
        headings.len()
    );
    println!();

    let matched: std::collections::HashSet<usize> = sections.iter().map(|s| s.ncc_index).collect();
    println!("First 10 matches:");
    for s in sections.iter().take(10) {
        let title = headings[s.ncc_index].0;
        let preview: String = s.text.chars().take(50).collect::<String>().replace('\n', " ");
        println!(
            " [{:3}] {:30} → {:5} chars {:?}",
            s.ncc_index,
            title,
            s.text.len(),
            preview
        );
    }
    println!();

    let unmatched: Vec<&str> = headings
        .iter()
        .enumerate()
        .filter(|(i, _)| !matched.contains(i))
        .map(|(_, (t, _))| *t)
        .collect();
    println!("Unmatched headings ({} total):", unmatched.len());
    for t in unmatched.iter().take(20) {
        println!(" {t}");
    }
    if unmatched.len() > 20 {
        println!(" ... and {} more", unmatched.len() - 20);
    }
}