kb-labb/easytranscriber

easytranscriber is an automatic speech recognition (ASR) library for transcription with precise word-level timestamps. While the transcription step itself is well-optimized in most ASR libraries, the surrounding components (data loading, emission extraction, forced alignment) often act as bottlenecks. easytranscriber optimizes these components and supports both ctranslate2 and Hugging Face transformers as inference backends. Notable features include:

  • GPU-accelerated forced alignment using PyTorch's forced alignment API, which is based on a GPU implementation of the Viterbi algorithm (Pratap et al., 2024).
  • Parallel loading and pre-fetching of audio files for efficient data loading and batch processing.
  • Flexible text normalization for improved alignment quality. Users can supply custom regex-based text normalization functions to preprocess ASR outputs before alignment. A mapping from the original text to the normalized text is maintained internally. All of the applied normalizations and transformations are consequently non-destructive and reversible after alignment.
  • 35% to 102% faster inference compared to WhisperX. See the benchmarks for more details.
  • Batch inference support for wav2vec2 models (emission extraction).
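
To illustrate the non-destructive normalization idea from the list above, here is a minimal sketch. It is not easytranscriber's implementation: the function name `normalize_with_mapping` and the normalization rules are hypothetical. Each normalized token remembers the character span of the original token it came from, so the original text can be recovered after alignment.

```python
import re


def normalize_with_mapping(text):
    """Normalize each whitespace-delimited token and remember where it came from.

    Returns a list of (normalized_token, start, end) tuples, where
    text[start:end] is the original, untouched token.
    """
    mapping = []
    for match in re.finditer(r"\S+", text):
        token = match.group().lower()
        # Hypothetical rule: keep only letters, digits and apostrophes.
        token = re.sub(r"[^a-z0-9']", "", token)
        if token:  # tokens that normalize to nothing get no mapping entry
            mapping.append((token, match.start(), match.end()))
    return mapping


original = "It was the BEST of times, it was the worst of times..."
mapping = normalize_with_mapping(original)

# Text that would be passed to the aligner.
normalized = " ".join(token for token, _, _ in mapping)

# Original tokens recovered from the stored spans.
restored = [original[start:end] for _, start, end in mapping]
```

Because the aligner only ever sees `normalized`, any timestamp it assigns to a normalized token can be attached to the corresponding original span, punctuation and casing included.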

Installation

With GPU support

pip install easytranscriber --extra-index-url https://download.pytorch.org/whl/cu128

Tip

Remove --extra-index-url if you want a CPU-only installation.

Using uv

When you install with uv, the appropriate PyTorch version is selected automatically (CPU for macOS, CUDA for Linux/Windows/ARM):

uv pip install easytranscriber

Usage

The example below shows how to transcribe an audio file with easytranscriber. We transcribe the first chapter of an audiobook recording of "A Tale of Two Cities". The recording is sourced from LibriVox.

from pathlib import Path

from easyaligner.text import load_tokenizer
from huggingface_hub import snapshot_download

from easytranscriber.pipelines import pipeline
from easytranscriber.text.normalization import text_normalizer

# Download Tale of Two Cities book 1 chapter 1 LibriVox audiobook recording for testing
snapshot_download(
    "Lauler/easytranscriber_tutorials",
    repo_type="dataset",
    local_dir="data/tutorials",
    allow_patterns="tale-of-two-cities_short-en/*",
    # max_workers=4,
)

tokenizer = load_tokenizer("english") # For sentence tokenization in forced alignment
audio_files = [file.name for file in Path("data/tutorials/tale-of-two-cities_short-en").glob("*")]
pipeline(
    vad_model="pyannote",
    emissions_model="facebook/wav2vec2-base-960h",
    transcription_model="distil-whisper/distil-large-v3.5",
    audio_paths=audio_files,
    audio_dir="data/tutorials/tale-of-two-cities_short-en",
    language="en",
    tokenizer=tokenizer,
    text_normalizer_fn=text_normalizer,
    cache_dir="models",
)

easysearch

easysearch is a built-in lightweight search interface for browsing and querying your transcription outputs. It indexes transcription chunks into a SQLite database with full-text search and serves a web UI with audio playback and synchronized transcript highlighting.

pip install "easytranscriber[search]"
easysearch --alignments-dir output/alignments --audio-dir data/audio

See the search documentation for details on search syntax, indexing, and configuration options.
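
As a rough illustration of indexing transcription chunks into SQLite with full-text search, the sketch below uses the FTS5 extension from Python's standard `sqlite3` module. The table name, column names, and sample rows are hypothetical and do not reflect easysearch's actual schema. It also assumes your Python's SQLite build includes FTS5.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per transcription chunk.
# UNINDEXED columns are stored but not searched.
conn.execute(
    "CREATE VIRTUAL TABLE chunks USING fts5("
    "audio_file UNINDEXED, start_s UNINDEXED, end_s UNINDEXED, text)"
)

rows = [
    ("chapter1.mp3", 0.0, 4.2, "It was the best of times"),
    ("chapter1.mp3", 4.2, 8.9, "it was the worst of times"),
    ("chapter1.mp3", 8.9, 12.5, "it was the age of wisdom"),
]
conn.executemany("INSERT INTO chunks VALUES (?, ?, ?, ?)", rows)


def search(query):
    """Return matching chunks, best match first."""
    cursor = conn.execute(
        "SELECT audio_file, start_s, end_s, text "
        "FROM chunks WHERE chunks MATCH ? ORDER BY rank",
        (query,),
    )
    return cursor.fetchall()


worst_hits = search("worst")
times_hits = search("times")
phrase_hits = search('"age of wisdom"')
```

Storing the start and end times alongside each chunk is what would let a UI jump audio playback to the position of a search hit.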

Benchmarks

We present throughput comparisons between easytranscriber and WhisperX. See the benchmarks directory for code and details.

WhisperX relies on single-threaded data loading and CPU-based forced alignment, creating a bottleneck that is especially pronounced on hardware with slower single-core performance.
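
To make the data-loading point concrete, here is a generic sketch of parallel loading with pre-fetching, using only the Python standard library. It is not easytranscriber's loader: `prefetch` and `fake_load` are hypothetical names, and `fake_load` stands in for real audio decoding. Worker threads load files ahead of the consumer, so a processing step does not have to wait on a single-threaded loader.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

_DONE = object()  # sentinel marking the end of the path iterator


def prefetch(paths, load_fn, num_workers=4, depth=8):
    """Yield load_fn(path) for each path, in order, loading up to `depth` items ahead."""
    paths = iter(paths)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Fill the queue with the first `depth` loading jobs.
        pending = deque(pool.submit(load_fn, p) for p in islice(paths, depth))
        while pending:
            future = pending.popleft()
            # Top up the queue before blocking on the oldest job.
            next_path = next(paths, _DONE)
            if next_path is not _DONE:
                pending.append(pool.submit(load_fn, next_path))
            yield future.result()


def fake_load(path):
    """Hypothetical stand-in for decoding an audio file."""
    return path.upper()


results = list(prefetch(["a.wav", "b.wav", "c.wav"], fake_load, num_workers=2, depth=2))
```

Results are yielded in input order, which keeps downstream batching deterministic even though the loads themselves run concurrently.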

All easytranscriber benchmarks were run using the ctranslate2 backend for transcription.

  • PyTorch version: 2.8.0
  • CUDA: 12.8
  • WhisperX version: 3.7.6
  • Model: KBLab/kb-whisper-large
  • Language: Swedish (sv)

Documentation

The documentation is available at kb-labb.github.io/easytranscriber/.

Tip

Check out the easyaligner library for a user-friendly pipeline for forced alignment of text and audio.

Acknowledgements

easytranscriber draws heavy inspiration from WhisperX (Bain et al., 2023).

The forced alignment component of easytranscriber is based on PyTorch's forced alignment API, which implements a GPU-accelerated version of the Viterbi algorithm as described in Pratap et al., 2024.

Thanks to LibriVox for the public domain audiobooks used as tutorial examples.

Citation

@online{rekathati2026,
  author = {Rekathati, Faton},
  title = {Easytranscriber: {Speech} Recognition with Precise
    Timestamps},
  date = {2026-02-26},
  url = {https://kb-labb.github.io/posts/2026-02-26-easytranscriber/},
  langid = {en}
}
