This project evaluates the speed, accuracy, and usability of four prominent Natural Language Processing (NLP) tools for Latin texts.
Two samples are used for testing:
- Early medieval glosses from the Vienna Bede
- Early medieval texts from the Late Latin Charter Treebank
The goal is to provide a reproducible, comparative analysis of Latin NLP tools, covering tokenisation, lemmatisation, and POS-tagging accuracy as well as processing speed.
- data/: Sample Latin texts (raw and preprocessed)
- notebooks/: Jupyter notebooks for experiments
- scripts/: Python scripts for preprocessing and tool execution
- results/: Accuracy/speed metrics and visualizations
- Clone the repo:
```bash
git clone https://github.com/YOUR_USERNAME/latin-nlp-comparison.git
cd latin-nlp-comparison
```
- Create a virtual environment:
```bash
python -m venv env
source env/bin/activate
```
- Install dependencies:
```bash
pip install -r requirements.txt
```
The tools are compared on the following criteria:
- Accuracy
  - Tokenisation
  - Lemmatisation
  - POS-tagging
- Speed: time taken to process the data
- Usability: observational assessment of set-up complexity, required packages, interface, and export options
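As a rough illustration of how the accuracy and speed criteria could be measured, the sketch below computes token-level accuracy against a gold standard and times a single call. It is not the project's actual evaluation code: the helper names `accuracy` and `timed`, the sample lemmas, and the use of `str.split` as a stand-in tokeniser are all assumptions made for the example.

```python
import time
from typing import Callable, Sequence, Tuple


def accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Share of positions where the predicted label equals the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must be aligned token-for-token")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


def timed(func: Callable[[str], object], text: str) -> Tuple[object, float]:
    """Run func(text) once and return its result with the elapsed seconds."""
    start = time.perf_counter()
    result = func(text)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Hypothetical gold and predicted lemmas for "in principio erat verbum"
gold_lemmas = ["in", "principium", "sum", "verbum"]
predicted_lemmas = ["in", "principium", "erat", "verbum"]
lemma_accuracy = accuracy(predicted_lemmas, gold_lemmas)  # 3 of 4 lemmas match

# str.split stands in for a real tokeniser (an assumption for this sketch)
tokens, seconds = timed(str.split, "in principio erat verbum")
```

This simple measure assumes that the tool output and the gold standard are tokenised identically. Where tokenisation differs, the two sequences would need to be aligned before lemmas or POS tags can be compared.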
See the GitHub Wiki for documentation, tool setup guides, and detailed findings.
Supervised by Bernhard Bauer at the University of Graz.