borgholt/error-align

THIS PROJECT HAS MOVED

This project has moved to the corticph organization. The version on this page is outdated.

You can find the new project HERE.



Text-to-text alignment algorithm for speech recognition error analysis. ErrorAlign helps you dig deeper into your speech recognition projects by accurately aligning each word in a reference transcript with the model-generated transcript. Unlike traditional methods, such as Levenshtein-based alignment, it is not restricted to simple one-to-one alignment, but can map a single reference word to multiple words or subwords in the model output. This enables quick and reliable identification of error patterns in rare words, names, or domain-specific terms that matter most for your application.

Contents: Installation | Quickstart | Work-in-Progress | Citation and Research

Installation:

pip install error-align

Quickstart:

from error_align import error_align

ref = "Some things are worth noting!"
hyp = "Something worth nothing period?"

alignments = error_align(ref, hyp)

Resulting alignments:

Alignment(SUBSTITUTE: "Some" -> "Some"-),
Alignment(SUBSTITUTE: "things" -> -"thing"),
Alignment(DELETE: "are"),
Alignment(MATCH: "worth" == "worth"),
Alignment(SUBSTITUTE: "noting" -> "nothing"),
Alignment(INSERT: "period")
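
For contrast, the conventional word-level Levenshtein alignment mentioned in the introduction can be sketched as below. This is a baseline illustration only, not ErrorAlign's algorithm. The helper names `tokenize` and `levenshtein_align` are hypothetical, and the lowercase, punctuation-stripping tokenizer is a simplifying assumption rather than the library's normalization.

```python
import re


def tokenize(text):
    # Assumed normalization: lowercase and keep only letters, digits, and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


def levenshtein_align(ref_words, hyp_words):
    """Return a list of (operation, ref_word, hyp_word) tuples for one minimum-cost alignment."""
    n, m = len(ref_words), len(hyp_words)

    # cost[i][j] is the edit distance between ref_words[:i] and hyp_words[:j].
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = cost[i - 1][j - 1] + (ref_words[i - 1] != hyp_words[j - 1])
            cost[i][j] = min(diagonal, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Walk back from the bottom-right corner to recover one optimal path.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (ref_words[i - 1] != hyp_words[j - 1]):
            op = "MATCH" if ref_words[i - 1] == hyp_words[j - 1] else "SUBSTITUTE"
            ops.append((op, ref_words[i - 1], hyp_words[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("DELETE", ref_words[i - 1], None))
            i -= 1
        else:
            ops.append(("INSERT", None, hyp_words[j - 1]))
            j -= 1
    ops.reverse()
    return ops


ref = "Some things are worth noting!"
hyp = "Something worth nothing period?"

for op in levenshtein_align(tokenize(ref), tokenize(hyp)):
    print(op)
```

Each tuple pairs at most one reference word with at most one hypothesis word, so this baseline cannot express that "Some" and "things" together correspond to "Something". Several alignments share the minimum cost for this pair, and which one is printed depends on the tie-breaking order in the backtrace.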

Work-in-Progress:

  • Optimization for longform text.
  • Efficient word-level first-pass.
  • C++ version with Python bindings.

Citation and Research:

@article{borgholt2021alignment,
  title={A Text-To-Text Alignment Algorithm for Better Evaluation of Modern Speech Recognition Systems},
  author={Borgholt, Lasse and Havtorn, Jakob and Igel, Christian and Maal{\o}e, Lars and Tan, Zheng-Hua},
  journal={arXiv preprint arXiv:2509.24478},
  year={2025}
}

To reproduce results from the paper:

  • Install with the extra evaluation dependencies (only supported with Python 3.12):
    • pip install error-align[evaluation]
  • Clone this repository:
    • git clone https://github.com/borgholt/error-align.git
  • Navigate to the evaluation directory:
    • cd error-align/evaluation
  • Transcribe a dataset for evaluation. For example:
    • python transcribe_dataset.py --model_name whisper --dataset_name commonvoice --language_code fr
  • Run evaluation script on the output file. For example:
    • python evaluate_dataset.py --transcript_file transcribed_data/whisper_commonvoice_test_fr.parquet

Notes:

  • To reproduce results on the primock57 dataset, first run: python prepare_primock57.py.
  • Use the --help flag to see all available options for transcribe_dataset.py and evaluate_dataset.py.
  • All results reported in the paper are based on the test sets.
