pathotypr: Genome classifier using machine learning and SNP markers.

Lineage classification and marker-driven genotyping — from assemblies or raw reads.

Quick Start · Commands · GUI · Docs · Citation

Paula Ruiz-Rodriguez¹ and Mireia Coscolla¹
¹ Institute for Integrative Systems Biology, I²SysBio, University of Valencia-CSIC, Valencia, Spain

What is pathotypr?

pathotypr is a Rust toolkit that classifies microbial genomes into lineages and genotypes them against user-defined marker panels. It works with both assembled genomes (FASTA) and raw sequencing reads (FASTQ), runs on a single laptop, and ships with a native desktop GUI.

[Figure: pathotypr workflow schema]

Five commands, one binary:

Command	What it does	Input
`train`	Build a Random Forest classifier from labeled genomes	FASTA
`predict`	Assign lineages using a trained model	FASTA + model
`classify`	Call known SNP markers in assemblies	FASTA + markers
`split-fastq`	Alignment-free genotyping from reads	FASTQ + markers
`match`	Find the closest reference genome	FASTQ + references

Key features:

🦠 Organism-agnostic — bring your own markers for any pathogen
⚡ Fast — Rust + SIMD gzip + parallel k-mers (~1–2 s per sample)
🖥️ Desktop GUI — native app via Tauri, no server required
📊 Excel + TSV output with interactive visualizations in the GUI

Installation

Desktop GUI (pre-built)

Download the latest release for your platform:

Platform	Download	Notes
🍎 macOS (Apple Silicon)	Pathotypr_1.0.0_aarch64.dmg	M1 / M2 / M3 / M4 Macs
🍎 macOS (Intel)	Pathotypr_1.0.0_x64.dmg	Pre-2020 Macs
🐧 Linux (.deb)	Pathotypr_1.0.0_amd64.deb	Debian / Ubuntu
🐧 Linux (.rpm)	Pathotypr-1.0.0-1.x86_64.rpm	Fedora / RHEL
🐧 Linux (AppImage)	Pathotypr_1.0.0_amd64.AppImage	Any distro, no install needed
🪟 Windows (installer)	Pathotypr_1.0.0_x64-setup.exe	Windows 10+
🪟 Windows (.msi)	Pathotypr_1.0.0_x64_en-US.msi	Windows 10+ (MSI)

Note

macOS users: The app is not signed with an Apple Developer certificate. On first launch, right-click the app → Open → click Open in the dialog. See Apple support for details.

Windows users: Windows SmartScreen may show a warning for unrecognized apps. Click More info → Run anyway to proceed.

All releases: Releases page

CLI (Bioconda)

conda create -n pathotypr -c conda-forge -c bioconda pathotypr
conda activate pathotypr
pathotypr --help

CLI (from source)

git clone https://github.com/PathoGenOmics-Lab/pathotypr.git
cd pathotypr
cargo build --release -p pathotypr-core --bin pathotypr
./target/release/pathotypr --help

GUI (from source)

See docs/gui.md for building the Tauri desktop app from source.

MTBC Marker Files & Pre-trained Model

Ready-to-use marker panels and a pre-trained Random Forest model for Mycobacterium tuberculosis complex (MTBC) are available on Zenodo:

File	Description	Download
`pathotypr_lineage_markers_v1.0.0.tsv`	3,707 lineage SNPs (L1–L10, A1–A4)	⬇ Download
`pathotypr_dr_markers_v1.0.0.tsv`	102,213 drug-resistance (DR) mutations (WHO catalogue 2021)	⬇ Download
`pathotypr_rf_model_v1.0.0.pathotypr`	Pre-trained RF model (k=31, 100 trees)	⬇ Download

DOI: 10.5281/zenodo.19210044

Quick Start

# Train a lineage model
pathotypr train -i labeled_genomes.fasta -o model.pathotypr.zst

# Predict lineages
pathotypr predict -i query.fasta -m model.pathotypr.zst -o predictions.tsv

# Classify markers in assemblies
pathotypr classify -m markers.tsv -r reference.fasta -i sample.fasta -o results

# Genotype from FASTQ reads
pathotypr split-fastq -m markers.tsv -r reference.fasta \
  -i reads_R1.fastq.gz -i reads_R2.fastq.gz --paired -o genotype

# Find best reference match
pathotypr match -i reads_R1.fastq.gz reads_R2.fastq.gz \
  -r references.fasta -o match.tsv

Add `--excel` to any command to also generate `.xlsx` files.
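
When many query files share one model, the `predict` call above can be scripted. The following is a minimal sketch that only assembles the argument list from the flags shown in this section (`-i`, `-m`, `-o`, `--excel`). The helper name `build_predict_command`, the output naming scheme, and the sample file names are hypothetical, and it assumes one FASTA per call as in the example.

```python
from pathlib import Path
import shlex


def build_predict_command(fasta, model, out_dir, excel=False):
    """Return the argv list for one `pathotypr predict` call (hypothetical helper)."""
    fasta = Path(fasta)
    # Output naming is an assumption made for this sketch, not a pathotypr convention.
    out_tsv = Path(out_dir) / f"{fasta.stem}.predictions.tsv"
    cmd = ["pathotypr", "predict", "-i", str(fasta), "-m", str(model), "-o", str(out_tsv)]
    if excel:
        cmd.append("--excel")
    return cmd


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch works without pathotypr installed.
    for fasta in ["sampleA.fasta", "sampleB.fasta"]:
        print(shlex.join(build_predict_command(fasta, "model.pathotypr.zst", "results")))
```

To actually run a command, pass the returned list to `subprocess.run(cmd, check=True)`.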

Commands

Each command has its own detailed documentation:

Command	Docs	Summary
`train`	docs/train.md	Random Forest on k-mer feature-hashed vectors
`predict`	docs/predict.md	Streaming batch prediction with confidence scores
`classify`	docs/classify.md	Marker k-mer matching + GFF annotation + masked FASTA
`split-fastq`	docs/split-fastq.md	Alignment-free genotyping with Bloom filter acceleration
`match`	docs/match.md	K-mer containment scoring against reference databases

Run `pathotypr <command> --help` for all options.
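
To make the "k-mer containment scoring" behind `match` concrete, here is a small Python sketch of the general technique: the fraction of query k-mers that also occur in a reference. It is not pathotypr's Rust implementation. The function names, the use of canonical k-mers, and the scoring direction (reads against reference) are assumptions made for illustration.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_VALID = set("ACGT")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_kmers(seq, k):
    """Set of canonical k-mers: the smaller of each k-mer and its reverse complement."""
    seq = seq.upper()
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= _VALID:  # skip k-mers containing N or other ambiguity codes
            kmers.add(min(kmer, reverse_complement(kmer)))
    return kmers


def containment(query_kmers, reference_kmers):
    """Fraction of query k-mers that are also present in the reference."""
    if not query_kmers:
        return 0.0
    return len(query_kmers & reference_kmers) / len(query_kmers)


def best_reference(reads, references, k=21):
    """Score every reference against the pooled read k-mers; return (best_name, scores)."""
    query = set()
    for read in reads:
        query |= canonical_kmers(read, k)
    scores = {
        name: containment(query, canonical_kmers(seq, k))
        for name, seq in references.items()
    }
    return max(scores, key=scores.get), scores
```

Because k-mers are canonicalised, a read and its reverse complement contribute the same k-mers, so read orientation does not change the score.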

Algorithm Details

For in-depth descriptions of the algorithms, data structures, and design decisions behind each module, see docs/algorithms/:

Document	Topic
Feature Hashing	The hashing trick: k-mers → fixed-size sparse vectors
Random Forest	Sparse CART trees with bootstrap aggregation
Training Pipeline	Vectorize → evaluate → train → OOB → export
Prediction	Streaming batch prediction with majority voting
Marker Genotyping	Diagnostic k-mers + Bloom filter for FASTQ scanning
Reference Matching	K-mer containment scoring with streaming batches
Assembly Classification	Marker calling on FASTA with GFF annotation
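
As a rough illustration of the hashing trick listed under Feature Hashing, the sketch below maps every k-mer of a sequence to one of a fixed number of buckets and counts hits, giving a sparse vector whose size does not depend on how many distinct k-mers exist. The hash function (`zlib.crc32`), the bucket count, and the function name are assumptions chosen for this example; only `k=31` comes from the pre-trained model description above.

```python
import zlib
from collections import Counter


def hash_kmers(seq, k=31, n_features=2**20):
    """Map a sequence to a sparse {bucket_index: count} vector via the hashing trick."""
    seq = seq.upper()
    vector = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if "N" in kmer:
            continue  # ambiguous bases carry no signal
        # crc32 is a stand-in hash for this sketch; it is deterministic across runs,
        # unlike Python's built-in hash() for strings.
        bucket = zlib.crc32(kmer.encode("ascii")) % n_features
        vector[bucket] += 1
    return dict(vector)
```

Distinct k-mers can collide in the same bucket. That is the accepted trade-off of feature hashing: memory stays bounded by `n_features` at the cost of a small amount of noise.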

Input Formats

Training FASTA

The first token in each header is the class label:

>L4 sample_0001
ACTG...
>L2 sample_0002
ACTG...
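
The header convention above can be checked before training with a few lines of Python. This is a minimal sketch, not pathotypr's own parser (which uses needletail in Rust); the function name `read_labeled_fasta` is hypothetical.

```python
def read_labeled_fasta(lines):
    """Yield (label, sample_id, sequence); the first header token is the class label."""
    label, sample_id, chunks = None, None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if label is not None:
                yield label, sample_id, "".join(chunks)
            tokens = line[1:].split()
            if not tokens:
                raise ValueError("FASTA header has no class label")
            label = tokens[0]
            sample_id = tokens[1] if len(tokens) > 1 else ""
            chunks = []
        elif label is None:
            raise ValueError("sequence data found before the first header")
        else:
            chunks.append(line)
    if label is not None:
        yield label, sample_id, "".join(chunks)
```

For example, `collections.Counter(label for label, _, _ in read_labeled_fasta(open("labeled_genomes.fasta")))` gives the number of genomes per class.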

Marker TSV

Tab-separated: position REF ALT level1 [level2 ...]

#pos    ref    alt    level1    level2
761155  C      T      L4        L4.9
2155168 G      A      L2        L2.2

Lineage columns are read until the first empty cell. Columns after the empty cell are treated as annotations.
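
The column rule can be sketched in Python as follows. This illustrates the rule as stated here and is not the tool's parser; the `Marker` type and function names are hypothetical, and skipping lines that start with `#` is an assumption based on the header line in the example.

```python
from typing import NamedTuple


class Marker(NamedTuple):
    position: int
    ref: str
    alt: str
    levels: tuple       # lineage columns, read up to the first empty cell
    annotations: tuple  # everything after that empty cell


def parse_marker_line(line):
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 4:
        raise ValueError(f"expected at least 4 tab-separated columns, got {len(fields)}")
    position, ref, alt, *rest = fields
    cells = [cell.strip() for cell in rest]
    if "" in cells:
        cut = cells.index("")  # the first empty cell ends the lineage columns
        levels, annotations = cells[:cut], cells[cut + 1:]
    else:
        levels, annotations = cells, []
    return Marker(int(position), ref.strip(), alt.strip(), tuple(levels), tuple(annotations))


def parse_markers(lines):
    """Parse marker lines, skipping blank lines and '#' header/comment lines."""
    return [
        parse_marker_line(line)
        for line in lines
        if line.strip() and not line.startswith("#")
    ]
```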

See docs/input-formats.md for full format specifications.

GUI

The desktop app includes all five workflows with drag-and-drop file selection, interactive result tables, and real-time progress indicators.

# Development
cargo tauri dev

# Production build
cargo tauri build

See docs/gui.md for system dependencies and build instructions.

Performance

[Figure: pathotypr performance benchmarks]

Benchmarked on real M. tuberculosis genomes (~4.4 Mb, k=21), Mac mini M4, 4 threads:

Module	Time	Peak RAM	Key property
train (10 genomes)	0.6 s	302 MB	Scales with dataset size
train (50 genomes)	55 s	1.4 GB
predict (5 genomes)	0.25 s	198 MB	~50 ms/genome, constant
classify (5 genomes)	0.10 s	92 MB	~20 ms/genome
split-fastq (65× PE)	10.5 s	26 MB	Constant memory
match (20 refs)	78 s	4.6 GB	Streaming batches

SIMD-accelerated gzip decompression (zlib-ng)
Streaming I/O — split-fastq holds 26 MB regardless of input size
85,000+ genomes/second prediction throughput (synthetic benchmarks)
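
The constant-memory behaviour of streaming I/O can be illustrated with a short Python sketch that reads a FASTQ file one record at a time, so memory use depends on the longest read rather than on file size. pathotypr itself does this in Rust with SIMD-accelerated decompression; the function names below are hypothetical, and the sketch assumes standard four-line FASTQ records.

```python
import gzip


def stream_fastq(path):
    """Yield (name, sequence, quality) one record at a time from a plain or gzipped FASTQ."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        while True:
            header = handle.readline()
            if not header:
                break  # end of file
            if not header.strip():
                continue  # tolerate blank lines between records
            seq = handle.readline().rstrip("\n")
            handle.readline()  # '+' separator line
            qual = handle.readline().rstrip("\n")
            if not header.startswith("@") or len(seq) != len(qual):
                raise ValueError(f"malformed FASTQ record near {header!r}")
            yield header[1:].strip(), seq, qual


def count_bases(path):
    """Count reads and bases while holding only one record in memory at a time."""
    reads = bases = 0
    for _, seq, _ in stream_fastq(path):
        reads += 1
        bases += len(seq)
    return reads, bases
```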

See docs/benchmarks.md for detailed charts, scaling plots, and a comparison of pathotypr with fastlin.

Comparison

	pathotypr	fastlin	TB-Profiler	Mykrobe	SNP-IT	KvarQ
Alignment-free (FASTQ)	✅	✅	❌	✅	❌	✅
Assemblies (FASTA)	✅	✅	❌	❌	VCF only	❌
Custom markers	✅	❌	❌	❌	❌	❌
ML training	✅	❌	❌	❌	✅	❌
DR prediction	✅	❌	✅	✅	❌	✅
Desktop GUI	✅	❌	Web	✅	❌	✅
Standalone binary	✅	✅	❌	❌	❌	❌
Organism-agnostic	✅	TB only	TB only	Limited	TB only	TB only
Speed (per sample)	~1 s	<5 s	3–10 min	~3 min	1–2 min	~2 min

Project Structure

pathotypr/
├── pathotypr-core/           # Core library + CLI
│   └── src/
│       ├── main.rs           # CLI entry point
│       ├── train.rs          # Random Forest training + OOB + CV
│       ├── predict.rs        # Streaming batch prediction
│       ├── classify/         # Assembly-based marker classification
│       │   ├── mod.rs        #   Orchestration + genome analysis
│       │   ├── markers.rs    #   Marker parsing + k-mer generation
│       │   ├── annotation.rs #   GFF parsing + AA translation
│       │   └── masking.rs    #   FASTA masking at marker sites
│       ├── classify_split_fastq.rs  # FASTQ genotyping orchestration
│       ├── split_kmer.rs     # Diagnostic k-mer engine + Bloom filter
│       ├── match/            # Reference matching
│       │   ├── mod.rs        #   Scoring + coarse-to-fine matching
│       │   └── index.rs      #   Compact inverted index + cache
│       ├── sparse_tree.rs    # Custom CART on sparse vectors
│       ├── vectorizer.rs     # Feature hashing (hashing trick)
│       ├── model.rs          # Model bundle + label encoder
│       ├── lineage.rs        # Hierarchical lineage classification
│       ├── fasta_io.rs       # FASTA reading (needletail)
│       ├── paired_end.rs     # Paired-end FASTQ detection
│       ├── excel.rs          # Streaming Excel export
│       ├── errors.rs         # Error types + cancellation
│       └── common.rs         # Thread pool + shared utilities
├── src-tauri/                # Desktop app backend (Tauri)
├── frontend/                 # GUI (HTML/CSS/JS)
├── docs/                     # Detailed documentation
└── logo/                     # Branding assets

Citation

If you use pathotypr, please cite:

Ruiz-Rodriguez P, Coscollá M. Pathotypr: harmonised MTBC lineage assignment and resistance-associated variant detection for genomic surveillance. bioRxiv (2026). doi: 10.64898/2026.03.24.714002

@article{ruiz-rodriguez_pathotypr_2026,
  title     = {Pathotypr: harmonised {MTBC} lineage assignment and resistance-associated variant detection for genomic surveillance},
  author    = {Ruiz-Rodriguez, Paula and Coscoll{\'a}, Mireia},
  journal   = {bioRxiv},
  year      = {2026},
  doi       = {10.64898/2026.03.24.714002},
  url       = {https://www.biorxiv.org/content/10.64898/2026.03.24.714002v1}
}

Software & markers DOI: 10.5281/zenodo.19210044

License

GNU Affero General Public License v3.0

✨ Contributors

pathotypr is developed with ❤️ by:

Paula Ruiz-Rodriguez
💻 🔬 🤔 🔣 🎨 🔧

Mireia Coscolla
🔍 🤔 🧑‍🏫 🔬 📓

This project follows the all-contributors specification (emoji key).
