parallelium-scriptures-alignment-dataset/docs/alignment_workflow.md at main · ProMeText/parallelium-scriptures-alignment-dataset

⚙️ Alignment Workflow (Biblical Corpus)

This document outlines the preprocessing and alignment pipeline used to construct the Biblical section of the Parallelium dataset.

The workflow supports heterogeneous source formats and prepares the data for multilingual verse-level alignment, centered—when possible—on the Latin Vulgate.

🔄 Overview

Import ➝ Structure ➝ Filter ➝ Align (to Vulgate) ➝ Export

The dataset was constructed through a multi-stage pipeline, specifically designed to accommodate diverse source formats and the complexities of historical textual variation.

🧩 Pipeline Stages

Collection
- Source formats include XML, TXT, PDF, Word, and TSV
- Texts were selected for having existing verse-based divisions
- No orthographic normalization was performed — original spellings are preserved
Cleaning
- Removal of non-textual artifacts, markup noise, and encoding issues
- TEI/XML files were simplified or minimally processed for structure
Structuring
- Conversion into a unified JSON schema
- Monolingual JSON files prepared per text, then merged into multilingual alignment files
- Metadata added for book names, references, and version IDs
Alignment
- Verse-level alignment centered on the Latin Vulgate when available
- Where the Vulgate lacked a verse, alignment was constructed from available pairs
- Only verse units with at least one valid cross-language pair were retained
Export
- Final outputs include monolingual and multilingual JSON files
- Missing verses are represented with null values for transparency

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

⚙️ Alignment Workflow (Biblical Corpus)

🔄 Overview

🧩 Pipeline Stages

FilesExpand file tree

alignment_workflow.md

Latest commit

History

alignment_workflow.md

File metadata and controls

⚙️ Alignment Workflow (Biblical Corpus)

🔄 Overview

🧩 Pipeline Stages