Lineage classification and marker-driven genotyping — from assemblies or raw reads.
Paula Ruiz-Rodriguez¹ and Mireia Coscolla¹

¹ Institute for Integrative Systems Biology, I2SysBio, University of Valencia-CSIC, Valencia, Spain
pathotypr is a Rust toolkit that classifies microbial genomes into lineages and genotypes them against user-defined marker panels. It works with both assembled genomes (FASTA) and raw sequencing reads (FASTQ), runs on a single laptop, and ships with a native desktop GUI.
Five commands, one binary:
| Command | What it does | Input |
|---|---|---|
| `train` | Build a Random Forest classifier from labeled genomes | FASTA |
| `predict` | Assign lineages using a trained model | FASTA + model |
| `classify` | Call known SNP markers in assemblies | FASTA + markers |
| `split-fastq` | Alignment-free genotyping from reads | FASTQ + markers |
| `match` | Find the closest reference genome | FASTQ + references |
Key features:
- 🦠 Organism-agnostic — bring your own markers for any pathogen
- ⚡ Fast — Rust + SIMD gzip + parallel k-mers (~1–2 s per sample)
- 🖥️ Desktop GUI — native app via Tauri, no server required
- 📊 Excel + TSV output with interactive visualizations in the GUI
Download the latest release for your platform:
| Platform | Download | Notes |
|---|---|---|
| 🍎 macOS (Apple Silicon) | Pathotypr_1.0.0_aarch64.dmg | M1 / M2 / M3 / M4 Macs |
| 🍎 macOS (Intel) | Pathotypr_1.0.0_x64.dmg | Pre-2020 Macs |
| 🐧 Linux (.deb) | Pathotypr_1.0.0_amd64.deb | Debian / Ubuntu |
| 🐧 Linux (.rpm) | Pathotypr-1.0.0-1.x86_64.rpm | Fedora / RHEL |
| 🐧 Linux (AppImage) | Pathotypr_1.0.0_amd64.AppImage | Any distro, no install needed |
| 🪟 Windows (installer) | Pathotypr_1.0.0_x64-setup.exe | Windows 10+ |
| 🪟 Windows (.msi) | Pathotypr_1.0.0_x64_en-US.msi | Windows 10+ (MSI) |
> [!NOTE]
> macOS users: The app is not signed with an Apple Developer certificate. On first launch, right-click the app → Open → click Open in the dialog. See Apple support for details.
>
> Windows users: Windows SmartScreen may show a warning for unrecognized apps. Click More info → Run anyway to proceed.
All releases: Releases page
Install from Bioconda:

```bash
conda create -n pathotypr -c bioconda pathotypr
conda activate pathotypr
pathotypr --help
```

Or build from source:

```bash
git clone https://github.com/PathoGenOmics-Lab/pathotypr.git
cd pathotypr
cargo build --release -p pathotypr-core --bin pathotypr
./target/release/pathotypr --help
```

See docs/gui.md for building the Tauri desktop app from source.
Ready-to-use marker panels and a pre-trained Random Forest model for Mycobacterium tuberculosis complex (MTBC) are available on Zenodo:
| File | Description | Download |
|---|---|---|
| `pathotypr_lineage_markers_v1.0.0.tsv` | 3,707 lineage SNPs (L1–L10, A1–A4) | ⬇ Download |
| `pathotypr_dr_markers_v1.0.0.tsv` | 102,213 DR mutations (WHO catalogue 2021) | ⬇ Download |
| `pathotypr_rf_model_v1.0.0.pathotypr` | Pre-trained RF model (k=31, 100 trees) | ⬇ Download |
```bash
# Train a lineage model
pathotypr train -i labeled_genomes.fasta -o model.pathotypr.zst

# Predict lineages
pathotypr predict -i query.fasta -m model.pathotypr.zst -o predictions.tsv

# Classify markers in assemblies
pathotypr classify -m markers.tsv -r reference.fasta -i sample.fasta -o results

# Genotype from FASTQ reads
pathotypr split-fastq -m markers.tsv -r reference.fasta \
  -i reads_R1.fastq.gz -i reads_R2.fastq.gz --paired -o genotype

# Find best reference match
pathotypr match -i reads_R1.fastq.gz reads_R2.fastq.gz \
  -r references.fasta -o match.tsv
```

Add `--excel` to any command to also generate `.xlsx` files.
Each command has its own detailed documentation:
| Command | Docs | Summary |
|---|---|---|
| `train` | docs/train.md | Random Forest on k-mer feature-hashed vectors |
| `predict` | docs/predict.md | Streaming batch prediction with confidence scores |
| `classify` | docs/classify.md | Marker k-mer matching + GFF annotation + masked FASTA |
| `split-fastq` | docs/split-fastq.md | Alignment-free genotyping with Bloom filter acceleration |
| `match` | docs/match.md | K-mer containment scoring against reference databases |
Run pathotypr <command> --help for all options.
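
To make "k-mer containment scoring" concrete, here is a minimal Python sketch. It is not the pathotypr implementation: the function names, the toy sequences, `k = 5`, and the choice to normalise by the number of distinct query k-mers are all assumptions made for illustration.

```python
def kmers(seq: str, k: int) -> set:
    """Return the set of all k-length substrings of seq (uppercased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def containment(query: str, reference: str, k: int = 5) -> float:
    """Fraction of distinct query k-mers that also occur in the reference."""
    q = kmers(query, k)
    if not q:
        return 0.0  # query shorter than k: nothing to score
    return len(q & kmers(reference, k)) / len(q)


# Toy references (hypothetical sequences, not real genomes).
refs = {
    "ref_A": "ACGTACGTTAGCCGATAGGCTA",
    "ref_B": "TTTTGGGGCCCCAAAATTTTGG",
}
query = "CGTTAGCCGATAG"  # a substring of ref_A

scores = {name: containment(query, seq) for name, seq in refs.items()}
best = max(scores, key=scores.get)  # reference with the highest containment
```

The reference with the highest score is reported as the closest match. A real tool would also need to handle reverse complements and sequencing errors, which this sketch ignores.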
For in-depth descriptions of the algorithms, data structures, and design decisions behind each module, see docs/algorithms/:
| Document | Topic |
|---|---|
| Feature Hashing | The hashing trick: k-mers → fixed-size sparse vectors |
| Random Forest | Sparse CART trees with bootstrap aggregation |
| Training Pipeline | Vectorize → evaluate → train → OOB → export |
| Prediction | Streaming batch prediction with majority voting |
| Marker Genotyping | Diagnostic k-mers + Bloom filter for FASTQ scanning |
| Reference Matching | K-mer containment scoring with streaming batches |
| Assembly Classification | Marker calling on FASTA with GFF annotation |
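
The hashing trick in the first row can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code in `vectorizer.rs`: the CRC32 hash, `k = 4`, and the 16-bucket vector are arbitrary choices for readability, and `hash_kmers` is a hypothetical name.

```python
import zlib
from collections import Counter


def hash_kmers(seq: str, k: int = 4, n_features: int = 16) -> dict:
    """Count k-mers into a sparse vector with a fixed number of buckets.

    Each k-mer is hashed and reduced modulo n_features, so the vector
    size never depends on how many distinct k-mers exist.
    """
    seq = seq.upper()
    vec = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        bucket = zlib.crc32(kmer.encode()) % n_features  # CRC32: illustrative only
        vec[bucket] += 1
    return dict(vec)  # sparse form: only non-zero buckets are stored


v1 = hash_kmers("ACGTACGTAC")
v2 = hash_kmers("ACGTACGTACGTACGTACGT")
```

Both genomes map into the same 16-dimensional space regardless of length, which is what lets a classifier such as a Random Forest consume them as fixed-size inputs. Different k-mers can collide in one bucket; larger vectors reduce that at the cost of memory.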
In training FASTA files, the first token in each header is the class label:

```
>L4 sample_0001
ACTG...
>L2 sample_0002
ACTG...
```
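
As a rough illustration of the header convention, the following Python sketch pulls the label out of each header. The function name is hypothetical, and treating the second token as a sample identifier is an assumption; the text above only defines the first token.

```python
def read_labels(fasta_text: str) -> list:
    """Return (label, sample_id) for each FASTA header line."""
    records = []
    for line in fasta_text.splitlines():
        if not line.startswith(">"):
            continue  # sequence line, not a header
        tokens = line[1:].split()
        if not tokens:
            continue  # bare '>' with no label
        label = tokens[0]  # first token is the class label
        sample = tokens[1] if len(tokens) > 1 else ""  # assumed: rest is an ID
        records.append((label, sample))
    return records


example = ">L4 sample_0001\nACTG\n>L2 sample_0002\nACTG\n"
labels = read_labels(example)
```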
Marker files are tab-separated: `position REF ALT level1 [level2 ...]`

```
#pos ref alt level1 level2
761155 C T L4 L4.9
2155168 G A L2 L2.2
```
Lineage columns are read until the first empty cell. Columns after the empty cell are treated as annotations.
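
The "read until the first empty cell" rule can be sketched in Python as follows. This is a minimal reading of the rule, not pathotypr's parser: `parse_marker_line` is a hypothetical name, the annotation values are made up, and skipping lines that start with `#` is an assumption based on the example header.

```python
def parse_marker_line(line: str):
    """Split one marker row into (pos, ref, alt, lineage_levels, annotations)."""
    cells = line.rstrip("\n").split("\t")
    pos, ref, alt = int(cells[0]), cells[1], cells[2]
    rest = cells[3:]
    levels, annotations = [], []
    for i, cell in enumerate(rest):
        if cell.strip() == "":
            annotations = rest[i + 1:]  # everything after the empty cell
            break
        levels.append(cell)  # still inside the lineage hierarchy
    return pos, ref, alt, levels, annotations


def parse_markers(text: str) -> list:
    """Parse a marker table, skipping blank lines and '#' header lines."""
    return [
        parse_marker_line(line)
        for line in text.splitlines()
        if line.strip() and not line.startswith("#")
    ]


table = (
    "#pos\tref\talt\tlevel1\tlevel2\n"
    "761155\tC\tT\tL4\tL4.9\n"
    "2155168\tG\tA\tL2\t\texample_gene\texample_note\n"
)
markers = parse_markers(table)
```

In the second row the empty cell after `L2` ends the lineage hierarchy, so `example_gene` and `example_note` are returned as annotations rather than as deeper lineage levels.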
See docs/input-formats.md for full format specifications.
The desktop app includes all five workflows with drag-and-drop file selection, interactive result tables, and real-time progress indicators.
```bash
# Development
cargo tauri dev

# Production build
cargo tauri build
```

See docs/gui.md for system dependencies and build instructions.
Benchmarked on real M. tuberculosis genomes (~4.4 Mb, k=21), Mac mini M4, 4 threads:
| Module | Time | Peak RAM | Key property |
|---|---|---|---|
| train (10 genomes) | 0.6 s | 302 MB | Scales with dataset size |
| train (50 genomes) | 55 s | 1.4 GB | |
| predict (5 genomes) | 0.25 s | 198 MB | ~50 ms/genome, constant |
| classify (5 genomes) | 0.10 s | 92 MB | ~20 ms/genome |
| split-fastq (65× PE) | 10.5 s | 26 MB | Constant memory |
| match (20 refs) | 78 s | 4.6 GB | Streaming batches |
- SIMD-accelerated gzip decompression (zlib-ng)
- Streaming I/O — split-fastq holds 26 MB regardless of input size
- 85,000+ genomes/second prediction throughput (synthetic benchmarks)
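
The constant-memory property of streaming I/O can be illustrated with a short Python sketch. It uses the standard-library `gzip` module rather than zlib-ng, the function names are hypothetical, and it assumes simple four-line FASTQ records; it shows the general pattern only, not how `split-fastq` is implemented.

```python
import gzip
import io


def iter_fastq(handle):
    """Yield (name, sequence, quality) one record at a time."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of stream
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header[1:].strip(), seq, qual


def count_bases(handle) -> int:
    """Aggregate over the stream while holding only one record in memory."""
    return sum(len(seq) for _, seq, _ in iter_fastq(handle))


# Tiny in-memory gzip stream standing in for a reads_R1.fastq.gz file.
raw = b"@r1\nACGT\n+\nIIII\n@r2\nGGCCA\n+\nIIIII\n"
with gzip.open(io.BytesIO(gzip.compress(raw)), "rt") as fh:
    total = count_bases(fh)
```

Because records are consumed and discarded one by one, memory use depends on the longest read and on whatever summary is being accumulated, not on the size of the input file.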
See docs/benchmarks.md for detailed charts, scaling plots, and pathotypr vs fastlin comparison.
| Feature | pathotypr | fastlin | TB-Profiler | Mykrobe | SNP-IT | KvarQ |
|---|---|---|---|---|---|---|
| Alignment-free (FASTQ) | ✅ | ✅ | ❌ | ✅ | ❌ | ✅ |
| Assemblies (FASTA) | ✅ | ✅ | ❌ | ❌ | VCF only | ❌ |
| Custom markers | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| ML training | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ |
| DR prediction | ✅ | ❌ | ✅ | ✅ | ❌ | ✅ |
| Desktop GUI | ✅ | ❌ | Web | ✅ | ❌ | ✅ |
| Standalone binary | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| Organism-agnostic | ✅ | TB only | TB only | Limited | TB only | TB only |
| Speed (per sample) | ~1 s | <5 s | 3–10 min | ~3 min | 1–2 min | ~2 min |
```
pathotypr/
├── pathotypr-core/                   # Core library + CLI
│   └── src/
│       ├── main.rs                   # CLI entry point
│       ├── train.rs                  # Random Forest training + OOB + CV
│       ├── predict.rs                # Streaming batch prediction
│       ├── classify/                 # Assembly-based marker classification
│       │   ├── mod.rs                # Orchestration + genome analysis
│       │   ├── markers.rs            # Marker parsing + k-mer generation
│       │   ├── annotation.rs         # GFF parsing + AA translation
│       │   └── masking.rs            # FASTA masking at marker sites
│       ├── classify_split_fastq.rs   # FASTQ genotyping orchestration
│       ├── split_kmer.rs             # Diagnostic k-mer engine + Bloom filter
│       ├── match/                    # Reference matching
│       │   ├── mod.rs                # Scoring + coarse-to-fine matching
│       │   └── index.rs              # Compact inverted index + cache
│       ├── sparse_tree.rs            # Custom CART on sparse vectors
│       ├── vectorizer.rs             # Feature hashing (hashing trick)
│       ├── model.rs                  # Model bundle + label encoder
│       ├── lineage.rs                # Hierarchical lineage classification
│       ├── fasta_io.rs               # FASTA reading (needletail)
│       ├── paired_end.rs             # Paired-end FASTQ detection
│       ├── excel.rs                  # Streaming Excel export
│       ├── errors.rs                 # Error types + cancellation
│       └── common.rs                 # Thread pool + shared utilities
├── src-tauri/                        # Desktop app backend (Tauri)
├── frontend/                         # GUI (HTML/CSS/JS)
├── docs/                             # Detailed documentation
└── logo/                             # Branding assets
```
If you use pathotypr, please cite:
Ruiz-Rodriguez P, Coscollá M. Pathotypr: harmonised MTBC lineage assignment and resistance-associated variant detection for genomic surveillance. bioRxiv (2026). doi: 10.64898/2026.03.24.714002
```bibtex
@article{ruiz-rodriguez_pathotypr_2026,
  title   = {Pathotypr: harmonised {MTBC} lineage assignment and resistance-associated variant detection for genomic surveillance},
  author  = {Ruiz-Rodriguez, Paula and Coscoll{\'a}, Mireia},
  journal = {bioRxiv},
  year    = {2026},
  doi     = {10.64898/2026.03.24.714002},
  url     = {https://www.biorxiv.org/content/10.64898/2026.03.24.714002v1}
}
```

Software & markers DOI: 10.5281/zenodo.19210044
- Paula Ruiz-Rodriguez 💻 🔬 🤔 🔣 🎨 🔧
- Mireia Coscolla 🔍 🤔 🧑‍🏫 🔬 📓
This project follows the all-contributors specification (emoji key).
