Back to the Present: Ancient Speech in Our Days with RNNs & Transformers

This project explores the use of RNNs and Transformer-based models to generate text in ancient or classical languages, particularly focusing on texts such as El Quijote (in Spanish) and Tirant lo Blanc (in ancient Valencian). The ultimate goal is to bring historical speech styles back to life using modern neural architectures.

📚 Description

We compare four architectures:

  • Bigram baseline model (a minimal sketch follows this list)
  • RNN implemented from scratch
  • LSTM implemented from scratch
  • Transformer implemented from scratch
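
For context, the bigram baseline can be as small as a learned table of next-token logits. Below is a minimal PyTorch sketch; the class and variable names are illustrative, not the repository's exact code:

import torch.nn as nn
import torch.nn.functional as F

class BigramLM(nn.Module):
    """Predicts the next token from the current token alone,
    using a (vocab_size x vocab_size) table of logits."""

    def __init__(self, vocab_size: int):
        super().__init__()
        self.table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.table(idx)  # (batch, time, vocab_size)
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss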

Each model is trained on ancient literary texts to replicate their structure and style. We report training loss curves and generated text samples, and evaluate each model by the perplexity of its generated outputs.
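
Perplexity here is the exponential of the mean per-token cross-entropy. A short sketch of that calculation (the model interface follows the bigram sketch above):

import torch
import torch.nn.functional as F

@torch.no_grad()
def perplexity(model, idx, targets):
    """exp(mean cross-entropy) of the model's predictions on `targets`."""
    logits, _ = model(idx)
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    return torch.exp(ce).item()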

📁 Project Structure

  • dlkth/: Source code, including models, tokenizer, and training pipeline
  • data/: Text corpora (e.g., el_quijote.txt, valenciano.txt)
  • checkpoints/: Saved weights and training metadata
  • reports/: Evaluation metrics (JSON) and figures (PDF)
  • modal_train.py: Modal-compatible training launcher
  • modal_eval.py: Evaluation script that generates text and computes perplexities

🧪 Training

To train models (on Modal):

make train
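
modal_train.py wires the training loop into a Modal app. A hedged sketch of what such a launcher can look like; the GPU type, image packages, and function names here are assumptions, not the repository's exact configuration:

import modal

app = modal.App("dlkth-train")
image = modal.Image.debian_slim().pip_install("torch", "numpy", "tqdm")
checkpoints = modal.Volume.from_name("checkpoints", create_if_missing=True)

@app.function(gpu="T4", image=image, volumes={"/vol/checkpoints": checkpoints}, timeout=3600)
def train(model_name: str, dataset: str):
    # build the tokenizer, model, and optimizer, run the training loop, then e.g.
    # torch.save(state, f"/vol/checkpoints/{model_name}_{dataset}.pt")
    checkpoints.commit()  # persist new checkpoint files to the Volume

@app.local_entrypoint()
def main():
    train.remote("transformer", "valenciano")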

📊 Evaluation

Generate 100 text samples per model and compute the mean and standard deviation of their perplexity:

make eval

The script:

  • Loads each .pt checkpoint in /vol/checkpoints
  • Matches it with its .json metadata
  • Reconstructs the tokenizer from the original dataset
  • Generates samples and computes perplexity
  • Saves summary report to /vol/reports/perplexity_summary.json
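
A sketch of that loop, following the steps above; load_model, load_tokenizer, and sample_perplexity are hypothetical names standing in for the project's own helpers:

import json
import statistics
from pathlib import Path

def evaluate_all(ckpt_dir="/vol/checkpoints", report_dir="/vol/reports", n_samples=100):
    results = []
    for ckpt in sorted(Path(ckpt_dir).glob("*.pt")):
        meta = json.loads(ckpt.with_suffix(".json").read_text())  # matching metadata
        model = load_model(ckpt, meta)          # hypothetical: rebuild model + load weights
        tok = load_tokenizer(meta["dataset"])   # hypothetical: tokenizer from the corpus
        ppls = [sample_perplexity(model, tok) for _ in range(n_samples)]
        results.append({
            "model": meta["model"],
            "dataset": meta["dataset"],
            "mean_perplexity": statistics.mean(ppls),
            "std_perplexity": statistics.stdev(ppls),
        })
    Path(report_dir, "perplexity_summary.json").write_text(json.dumps(results, indent=2))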

Example result:

[
  {
    "model": "transformer",
    "dataset": "valenciano",
    "mean_perplexity": 2.57,
    "std_perplexity": 0.47,
    "samples": [
      {"text": "En lo temps que lo rey anava...", "perplexity": 2.43}
    ]
  }
]

📈 Loss Curves

Loss curves are plotted for each model/dataset combination. The figure is saved to PDF using LaTeX formatting (NeurIPS-style):

jupyter notebook plot.ipynb

Outputs:

  • reports/loss_plot.pdf
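
A minimal sketch of that export (text.usetex requires a local LaTeX installation; the dummy curves dictionary stands in for the logged losses):

import matplotlib.pyplot as plt

plt.rcParams.update({"text.usetex": True, "font.family": "serif"})

# in the notebook, curves would be read from the saved training metadata
curves = {"transformer / valenciano": [3.2, 2.1, 1.6, 1.3]}

fig, ax = plt.subplots(figsize=(5, 3.5))
for name, losses in curves.items():
    ax.plot(losses, label=name)
ax.set_xlabel("Training step")
ax.set_ylabel("Cross-entropy loss")
ax.legend()
fig.tight_layout()
fig.savefig("reports/loss_plot.pdf")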

💾 Setup

pip install -e .
make download

Requirements

  • torch
  • numpy
  • transformers
  • matplotlib
  • modal
  • tqdm
  • pandas

🗃 Volumes

We use two Modal Volumes:

  • checkpoints: stores model weights and metadata
  • reports: stores evaluation outputs (JSON, PDFs)

To sync locally:

make download
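
In code, the two Volumes are looked up by name and mounted into Modal functions; a short sketch (create_if_missing is Modal's standard flag):

import modal

# named Volumes persist across runs and are shared between apps
checkpoints = modal.Volume.from_name("checkpoints", create_if_missing=True)
reports = modal.Volume.from_name("reports", create_if_missing=True)

# mounted inside a function via the volumes= mapping, e.g.:
# @app.function(volumes={"/vol/checkpoints": checkpoints, "/vol/reports": reports})

make download presumably wraps Modal's volume download (e.g., modal volume get) to copy these files into the local checkpoints/ and reports/ directories.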

👥 Authors

Martín Bravo, Álvaro Mazcuñán Herreros, Adriana Rodríguez Vallejo
KTH Royal Institute of Technology, Stockholm, Sweden
