Prediction (Streaming Batch)

Overview

The predict command classifies new genome sequences using a trained model. It uses streaming batch processing to handle arbitrarily large inputs with constant peak memory.

Pipeline

Input FASTA ──→ Stream in batches of 512 ──→ Vectorize (parallel) ──→ Predict (parallel) ──→ Write TSV
     │                                                                                         │
     └── needletail parser                                                               flush & free
         (handles .gz transparently)                                                     batch memory

Step 1: Model Loading

.pathotypr.zst file ──→ zstd::decode_all ──→ bincode::deserialize ──→ ModelBundle

Validation checks on load:

Trees array is non-empty
Label encoder has at least one class
K-mer size is within 1–31
Format version matches the current version (v3); a mismatch produces a warning, not an error
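
The checks above can be sketched as a single function. This is a minimal illustration, not the project's code: the `ModelBundle` struct below is a hypothetical stand-in with assumed field names (`format_version`, `kmer_size`, `n_trees`, `classes`), and `validate` is an assumed name.

```rust
/// Hypothetical stand-in for the deserialized bundle; the real `ModelBundle`
/// has its own fields and names.
struct ModelBundle {
    format_version: u32,
    kmer_size: usize,
    n_trees: usize,       // length of the trees array
    classes: Vec<String>, // classes known to the label encoder
}

const CURRENT_FORMAT_VERSION: u32 = 3;

/// Returns `Ok(warnings)` when the bundle is usable and `Err(reason)` otherwise.
fn validate(bundle: &ModelBundle) -> Result<Vec<String>, String> {
    if bundle.n_trees == 0 {
        return Err("model contains no trees".to_string());
    }
    if bundle.classes.is_empty() {
        return Err("label encoder has no classes".to_string());
    }
    if !(1..=31).contains(&bundle.kmer_size) {
        return Err(format!("k-mer size {} is outside 1-31", bundle.kmer_size));
    }
    // A version mismatch is reported but does not abort loading.
    let mut warnings = Vec::new();
    if bundle.format_version != CURRENT_FORMAT_VERSION {
        warnings.push(format!(
            "model format v{} differs from current v{}",
            bundle.format_version, CURRENT_FORMAT_VERSION
        ));
    }
    Ok(warnings)
}
```

Returning warnings separately from errors keeps the "warn but continue" behavior for version mismatches explicit at the call site.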

Step 2: Streaming Batches

Instead of loading all sequences into memory, the command processes the input in a loop:

Open FASTA with needletail::parse_fastx_file() (handles plain and gzipped)
Read up to 512 records into a batch
Process the batch (vectorize + predict)
Write results to TSV
Drop the batch — memory is freed before the next batch
Repeat until EOF

Why 512? A batch of this size is small enough to keep peak memory bounded (~50 MB per batch for bacterial genomes) and large enough to give rayon's thread pool enough work per batch.
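
The read/process/drop loop can be sketched with the standard library alone. This is a simplified illustration under assumptions: `process_in_batches` is a hypothetical helper, and a generic iterator stands in for the needletail reader.

```rust
/// Pulls records from `records` in batches of at most `batch_size`, hands each
/// batch to `process`, and drops it before reading the next one.
/// Returns the total number of records seen.
fn process_in_batches<T, I, F>(records: I, batch_size: usize, mut process: F) -> usize
where
    I: IntoIterator<Item = T>,
    F: FnMut(&[T]),
{
    assert!(batch_size > 0, "batch_size must be positive");
    let mut iter = records.into_iter();
    let mut total = 0;
    loop {
        // Read up to `batch_size` records; an empty batch means end of input.
        let batch: Vec<T> = iter.by_ref().take(batch_size).collect();
        if batch.is_empty() {
            break;
        }
        total += batch.len();
        process(&batch);
        // `batch` goes out of scope here, so its memory is released
        // before the next iteration allocates a new one.
    }
    total
}
```

With 1,200 records and a batch size of 512, `process` is called three times, with 512, 512, and 176 records.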

Step 3: Vectorization

Each batch of sequences is vectorized in parallel using the same FeatureHasher from training:

let x_sparse = bundle.vectorizer.transform_sparse(&sequences, kmer_size);

This produces one sparse vector per sequence, identical to the training representation.
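
To illustrate the idea of feature hashing over k-mers, here is a minimal sketch. It is not the project's `FeatureHasher`: the hash function (`DefaultHasher`), the bucket count parameter `n_features`, and the function name `hash_kmers` are all assumptions made for the example.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

/// Hashes every k-mer of `seq` into one of `n_features` buckets and returns
/// the non-zero buckets as (index, count) pairs sorted by index.
fn hash_kmers(seq: &[u8], k: usize, n_features: usize) -> Vec<(usize, u32)> {
    if k == 0 || n_features == 0 || seq.len() < k {
        return Vec::new();
    }
    // A BTreeMap keeps indices ordered, so the output is already sorted.
    let mut counts: BTreeMap<usize, u32> = BTreeMap::new();
    for kmer in seq.windows(k) {
        let mut hasher = DefaultHasher::new();
        kmer.hash(&mut hasher);
        let index = (hasher.finish() % n_features as u64) as usize;
        *counts.entry(index).or_insert(0) += 1;
    }
    counts.into_iter().collect()
}
```

Keeping the pairs sorted by index is what makes a later binary-search lookup on the sparse row possible.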

Step 4: Ensemble Voting

For each sparse vector, all 100 trees vote independently:

Tree 1: predict_one(sparse_row) → class 2
Tree 2: predict_one(sparse_row) → class 2
Tree 3: predict_one(sparse_row) → class 0
...
Tree 100: predict_one(sparse_row) → class 2

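Each predict_one() call traverses the tree from root to leaf, visiting O(depth) nodes. At each split node, the feature value is looked up by binary search on the sparse row, which costs O(log nnz) for a row with nnz non-zero entries.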
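
A root-to-leaf traversal with binary-search lookups might look like the sketch below. The `Node` layout, the `<=` split rule, and the convention that absent features count as 0 are assumptions for illustration; the project's tree format may differ.

```rust
/// Hypothetical node layout; children are indices into the node array.
enum Node {
    Split { feature: usize, threshold: f32, left: usize, right: usize },
    Leaf { class: usize },
}

/// Looks up `feature` in a sparse row sorted by index; absent features are 0.
fn sparse_get(row: &[(usize, f32)], feature: usize) -> f32 {
    match row.binary_search_by_key(&feature, |&(idx, _)| idx) {
        Ok(pos) => row[pos].1,
        Err(_) => 0.0,
    }
}

/// Walks from the root (node 0) to a leaf and returns the leaf's class.
fn predict_one(nodes: &[Node], row: &[(usize, f32)]) -> usize {
    let mut current = 0;
    loop {
        match &nodes[current] {
            Node::Leaf { class } => return *class,
            Node::Split { feature, threshold, left, right } => {
                current = if sparse_get(row, *feature) <= *threshold {
                    *left
                } else {
                    *right
                };
            }
        }
    }
}
```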

Metrics

Metric	Formula	Meaning
Predicted class	argmax(votes)	Class with most votes
Confidence	winner_votes / 100	Proportion of trees agreeing
Margin	(winner - runner_up) / 100	Separation from second-best class
Other votes	Top 3 non-winning classes with their vote share	Alternative classifications
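
The metrics in the table can be computed from a per-class vote count. The sketch below is illustrative: `VoteSummary` and `summarize_votes` are hypothetical names, the denominator is the total number of votes (100 for a 100-tree ensemble), and breaking ties in favor of the lower class index is an assumption.

```rust
/// Summary of one sample's ensemble vote.
struct VoteSummary {
    predicted: usize,
    confidence: f64,
    margin: f64,
    /// Up to three non-winning classes with their vote share, highest first.
    others: Vec<(usize, f64)>,
}

/// `votes[c]` is the number of trees that voted for class `c`.
fn summarize_votes(votes: &[u32]) -> Option<VoteSummary> {
    let total: u32 = votes.iter().sum();
    if total == 0 {
        return None;
    }
    // Rank classes by vote count, highest first; ties go to the lower index.
    let mut ranked: Vec<(usize, u32)> = votes.iter().copied().enumerate().collect();
    ranked.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(&b.0)));

    let (predicted, winner) = ranked[0];
    let runner_up = ranked.get(1).map_or(0, |&(_, v)| v);
    let n = f64::from(total);
    let others = ranked[1..]
        .iter()
        .filter(|&&(_, v)| v > 0)
        .take(3)
        .map(|&(c, v)| (c, f64::from(v) / n))
        .collect();

    Some(VoteSummary {
        predicted,
        confidence: f64::from(winner) / n,
        margin: f64::from(winner - runner_up) / n,
        others,
    })
}
```

For example, vote counts of 3, 1, 96, and 0 across four classes give class 2 as the prediction, a confidence of 96/100, and a margin of (96 - 3)/100.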

Output Format

TSV

Header                Predicted_Lineage  Confidence  Confidence_Margin  Other_Votes
sample_0001           L4                 0.9600      0.9300             L4.9:0.03,L2:0.01
sample_0002           L2                 1.0000      1.0000
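
A row in this layout could be produced along the following lines. The function name `format_row`, the four-decimal and two-decimal precisions (inferred from the example rows), and the empty last field when there are no other votes are assumptions, not confirmed details of the writer.

```rust
/// Formats one tab-separated output row.
/// `others` holds (class name, vote share) pairs for the non-winning classes.
fn format_row(
    header: &str,
    lineage: &str,
    confidence: f64,
    margin: f64,
    others: &[(&str, f64)],
) -> String {
    let other_votes: Vec<String> = others
        .iter()
        .map(|(name, share)| format!("{name}:{share:.2}"))
        .collect();
    format!(
        "{header}\t{lineage}\t{confidence:.4}\t{margin:.4}\t{}",
        other_votes.join(",")
    )
}
```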

Excel (optional, `--excel`)

Same columns with conditional formatting:

Confidence: green ≥ 0.9, yellow ≥ 0.5 and < 0.9, red < 0.5
Margin: green ≥ 0.5, yellow ≥ 0.1 and < 0.5, red < 0.1
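
The thresholds translate into a small classification function. This sketch only shows the banding logic; the `Band` enum and function names are hypothetical, treating the yellow band as excluding the green threshold is an assumption, and how the project applies the colors to the spreadsheet is not covered here.

```rust
#[derive(Debug, PartialEq)]
enum Band {
    Green,
    Yellow,
    Red,
}

/// Values at or above `green_min` are green, at or above `yellow_min` yellow,
/// and everything below that is red.
fn band(value: f64, green_min: f64, yellow_min: f64) -> Band {
    if value >= green_min {
        Band::Green
    } else if value >= yellow_min {
        Band::Yellow
    } else {
        Band::Red
    }
}

fn confidence_band(confidence: f64) -> Band {
    band(confidence, 0.9, 0.5)
}

fn margin_band(margin: f64) -> Band {
    band(margin, 0.5, 0.1)
}
```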

Performance

Throughput: ~85,000 genomes/second, measured on a synthetic set of 4,000 genomes
Prediction time per sample: ~0.03 ms (constant regardless of dataset size)
Memory: Constant peak — batch memory is freed between iterations
I/O: SIMD-accelerated gzip decompression via zlib-ng

Error Handling

Empty sequences are skipped with a warning
Non-UTF8 headers/sequences produce parsing errors
Cancellation is checked between batches (GUI integration)
On error, partial output files are cleaned up
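
The last two points, checking for cancellation between batches and removing partial output on failure, can be combined as in the sketch below. It assumes a shared `AtomicBool` as the cancellation flag and uses a hypothetical `write_batches` function; the project's actual mechanism may differ.

```rust
use std::fs::{self, File};
use std::io::{self, BufWriter, Write};
use std::path::Path;
use std::sync::atomic::{AtomicBool, Ordering};

/// Writes every line of every batch to `path`. If the run is cancelled or a
/// write fails, the partially written file is removed before returning.
fn write_batches(path: &Path, batches: &[Vec<String>], cancel: &AtomicBool) -> io::Result<()> {
    let result = (|| -> io::Result<()> {
        let mut out = BufWriter::new(File::create(path)?);
        for batch in batches {
            // Cancellation is only checked between batches.
            if cancel.load(Ordering::Relaxed) {
                return Err(io::Error::new(io::ErrorKind::Interrupted, "cancelled"));
            }
            for line in batch {
                writeln!(out, "{line}")?;
            }
        }
        out.flush()
    })();
    // The writer is closed by this point, so the file can be deleted safely.
    if result.is_err() {
        // Best-effort cleanup; a failure to delete is ignored.
        let _ = fs::remove_file(path);
    }
    result
}
```

Running the fallible work inside a closure gives a single place to handle cleanup, regardless of which step failed.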
