The predict command classifies new genome sequences using a trained model. It uses streaming batch processing to handle arbitrarily large inputs with constant peak memory.
Input FASTA ──→ Stream in batches of 512 ──→ Vectorize (parallel) ──→ Predict (parallel) ──→ Write TSV
- Input: needletail parser (handles .gz transparently)
- Output: flush and free batch memory
.pathotypr.zst file ──→ zstd::decode_all ──→ bincode::deserialize ──→ ModelBundle
Validation checks on load:
- Trees array is non-empty
- Label encoder has at least one class
- K-mer size is within 1–31
- Format version is compared against the current one (v3); a mismatch produces a warning, not an error
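The checks above could look roughly like the following. This is a minimal sketch, not the actual implementation: the `ModelBundle` fields, `CURRENT_FORMAT_VERSION`, and `validate_bundle` are hypothetical stand-ins, and the decompression and deserialization steps are left out so the example needs only the standard library.

```rust
/// Hypothetical stand-in for the real bundle; only the fields needed
/// to illustrate the load-time checks are included.
#[derive(Debug)]
pub struct ModelBundle {
    pub trees: Vec<Vec<u32>>, // placeholder for the serialized trees
    pub classes: Vec<String>, // label encoder classes
    pub kmer_size: usize,
    pub format_version: u32,
}

/// Assumed name for the current format version (v3 in the text above).
pub const CURRENT_FORMAT_VERSION: u32 = 3;

/// Returns `Ok(warnings)` if the bundle is usable, `Err(message)` otherwise.
pub fn validate_bundle(bundle: &ModelBundle) -> Result<Vec<String>, String> {
    if bundle.trees.is_empty() {
        return Err("model contains no trees".to_string());
    }
    if bundle.classes.is_empty() {
        return Err("label encoder has no classes".to_string());
    }
    if !(1..=31).contains(&bundle.kmer_size) {
        return Err(format!("k-mer size {} is outside 1..=31", bundle.kmer_size));
    }

    // A version mismatch is reported but does not abort loading.
    let mut warnings = Vec::new();
    if bundle.format_version != CURRENT_FORMAT_VERSION {
        warnings.push(format!(
            "model format v{} differs from current v{}",
            bundle.format_version, CURRENT_FORMAT_VERSION
        ));
    }
    Ok(warnings)
}
```

Returning warnings separately from errors keeps the "warn but continue" behaviour for version mismatches explicit at the call site.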
Instead of loading all sequences into memory:
- Open FASTA with `needletail::parse_fastx_file()` (handles plain and gzipped)
- Read up to 512 records into a batch
- Process the batch (vectorize + predict)
- Write results to TSV
- Drop the batch — memory is freed before the next batch
- Repeat until EOF
Why 512? Small enough to keep peak memory low (~50 MB per batch for bacterial genomes), large enough for efficient rayon parallelism.
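The batching loop can be sketched with the standard library alone. This is an illustration under assumptions rather than the tool's code: `for_each_batch` is a hypothetical helper, and a plain iterator stands in for the needletail reader.

```rust
/// Pulls records from `records` in batches of at most `batch_size`,
/// hands each batch to `process`, and drops it before reading the next.
/// Returns the total number of records processed.
pub fn for_each_batch<I, T, F>(records: I, batch_size: usize, mut process: F) -> usize
where
    I: IntoIterator<Item = T>,
    F: FnMut(&[T]),
{
    assert!(batch_size > 0, "batch_size must be positive");
    let mut iter = records.into_iter();
    let mut total = 0;
    loop {
        // `by_ref` takes one chunk without consuming the whole iterator.
        let batch: Vec<T> = iter.by_ref().take(batch_size).collect();
        if batch.is_empty() {
            break; // end of input
        }
        total += batch.len();
        process(&batch);
        // `batch` goes out of scope here, so its memory is released
        // before the next chunk is read.
    }
    total
}
```

Because only one `Vec` of at most `batch_size` records is alive at a time, peak memory depends on the batch size and record length, not on the number of records in the input.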
Each batch of sequences is vectorized in parallel using the same FeatureHasher from training:
`let x_sparse = bundle.vectorizer.transform_sparse(&sequences, kmer_size);`

This produces one sparse vector per sequence, identical to the training representation.
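To make the idea of hashed k-mer features concrete, here is a small single-threaded sketch. It is not the project's `FeatureHasher`: the function name `hash_kmers`, the use of `DefaultHasher`, and the `(index, count)` row layout are all assumptions chosen for illustration, and the real vectorizer may hash, weight, and parallelize differently.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

/// Hashes every k-mer of `seq` into one of `n_features` buckets and returns
/// the non-zero buckets as (index, count) pairs sorted by index.
pub fn hash_kmers(seq: &[u8], k: usize, n_features: u32) -> Vec<(u32, f32)> {
    if k == 0 || n_features == 0 || seq.len() < k {
        return Vec::new();
    }
    // BTreeMap keeps bucket indices ordered, so the output row is sorted.
    let mut counts: BTreeMap<u32, f32> = BTreeMap::new();
    for kmer in seq.windows(k) {
        let mut hasher = DefaultHasher::new();
        kmer.hash(&mut hasher);
        let idx = (hasher.finish() % n_features as u64) as u32;
        *counts.entry(idx).or_insert(0.0) += 1.0;
    }
    counts.into_iter().collect()
}
```

Keeping the row sorted by index is what later allows a feature lookup by binary search. Prediction must use the same hash function and bucket count as training, otherwise the feature indices no longer line up with the trees' split features.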
For each sparse vector, all 100 trees vote independently:
Tree 1: predict_one(sparse_row) → class 2
Tree 2: predict_one(sparse_row) → class 2
Tree 3: predict_one(sparse_row) → class 0
...
Tree 100: predict_one(sparse_row) → class 2
Each predict_one() traverses root-to-leaf in O(depth) node visits, using a binary search on the sparse row to look up the feature value at each split.
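A root-to-leaf traversal with a binary-search feature lookup might look like this. The `Node` layout, the `<=` split rule, and treating absent features as 0 are assumptions made for the sketch; the actual tree representation is not shown in this document.

```rust
/// One node of a hypothetical flattened decision tree; children are
/// referenced by their index in the node slice.
#[derive(Clone, Copy)]
pub enum Node {
    Split { feature: u32, threshold: f32, left: usize, right: usize },
    Leaf { class: usize },
}

/// Looks up `feature` in a sparse row sorted by index.
/// Features that are not stored are treated as 0.
pub fn feature_value(row: &[(u32, f32)], feature: u32) -> f32 {
    match row.binary_search_by_key(&feature, |&(idx, _)| idx) {
        Ok(pos) => row[pos].1,
        Err(_) => 0.0,
    }
}

/// Walks from the root (node 0) to a leaf and returns its class.
pub fn predict_one(nodes: &[Node], row: &[(u32, f32)]) -> usize {
    let mut i = 0;
    loop {
        match nodes[i] {
            Node::Leaf { class } => return class,
            Node::Split { feature, threshold, left, right } => {
                i = if feature_value(row, feature) <= threshold { left } else { right };
            }
        }
    }
}
```

With this layout, one prediction costs one binary search per visited split, so the work grows with tree depth and the logarithm of the number of non-zero features, not with the size of the feature space.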
| Metric | Formula | Meaning |
|---|---|---|
| Predicted class | argmax(votes) | Class with most votes |
| Confidence | winner_votes / 100 | Proportion of trees agreeing |
| Margin | (winner - runner_up) / 100 | Separation from second-best class |
| Other votes | Up to 3 non-winning classes with their vote fractions | Alternative classifications |
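The metrics in the table can be derived from the per-tree votes as sketched below. `VoteSummary` and `summarize_votes` are hypothetical names, and breaking ties in favour of the lower class index is an assumption. The sketch divides by the number of votes it receives, which equals 100 for the forest described here.

```rust
/// Hypothetical container for the per-sample metrics.
pub struct VoteSummary {
    pub predicted: usize,
    pub confidence: f64,
    pub margin: f64,
    /// Up to three non-winning classes with their vote fractions.
    pub others: Vec<(usize, f64)>,
}

/// Aggregates one class vote per tree into the reported metrics.
/// Returns `None` for empty input or a vote outside `0..n_classes`.
pub fn summarize_votes(tree_votes: &[usize], n_classes: usize) -> Option<VoteSummary> {
    if tree_votes.is_empty() || n_classes == 0 {
        return None;
    }
    let mut counts = vec![0usize; n_classes];
    for &class in tree_votes {
        *counts.get_mut(class)? += 1;
    }

    // Rank classes by vote count (descending); ties go to the lower index.
    let mut ranked: Vec<(usize, usize)> = counts.into_iter().enumerate().collect();
    ranked.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(&b.0)));

    let total = tree_votes.len() as f64;
    let (winner, winner_votes) = ranked[0];
    let runner_up_votes = ranked.get(1).map_or(0, |r| r.1);
    let others = ranked
        .iter()
        .skip(1)
        .filter(|r| r.1 > 0)
        .take(3)
        .map(|r| (r.0, r.1 as f64 / total))
        .collect();

    Some(VoteSummary {
        predicted: winner,
        confidence: winner_votes as f64 / total,
        margin: (winner_votes - runner_up_votes) as f64 / total,
        others,
    })
}
```

For example, 96 votes for the winner, 3 for the runner-up, and 1 for a third class give a confidence of 0.96 and a margin of 0.93.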
Header Predicted_Lineage Confidence Confidence_Margin Other_Votes
sample_0001 L4 0.9600 0.9300 L4.9:0.03,L2:0.01
sample_0002 L2 1.0000 1.0000
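A row in this layout could be produced as follows. This is a sketch only: `format_row` is a hypothetical helper, and the decimal precision (four places for confidence and margin, two for other votes) and the empty last field for unanimous votes are inferred from the example rows rather than taken from the writer's code.

```rust
/// Formats one tab-separated output row.
/// `others` holds (lineage, vote fraction) pairs for non-winning classes.
pub fn format_row(
    header: &str,
    lineage: &str,
    confidence: f64,
    margin: f64,
    others: &[(&str, f64)],
) -> String {
    let other_votes: Vec<String> = others
        .iter()
        .map(|(name, share)| format!("{}:{:.2}", name, share))
        .collect();
    // The last column stays empty when no other class received votes.
    format!(
        "{}\t{}\t{:.4}\t{:.4}\t{}",
        header,
        lineage,
        confidence,
        margin,
        other_votes.join(",")
    )
}
```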
Same columns with conditional formatting:
- Confidence: green ≥ 0.9, yellow 0.5–0.9, red < 0.5
- Margin: green ≥ 0.5, yellow 0.1–0.5, red < 0.1
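The colour bands translate into a simple threshold function. The names below are hypothetical, and the handling of exact boundary values (0.9 and 0.5 count as the higher band for confidence, 0.5 and 0.1 for margin) follows from reading the thresholds as "green at or above, red strictly below".

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Colour {
    Green,
    Yellow,
    Red,
}

/// Green at or above `green_min`, yellow at or above `yellow_min`, red below.
pub fn band(value: f64, green_min: f64, yellow_min: f64) -> Colour {
    if value >= green_min {
        Colour::Green
    } else if value >= yellow_min {
        Colour::Yellow
    } else {
        Colour::Red
    }
}

/// Confidence: green >= 0.9, yellow in [0.5, 0.9), red < 0.5.
pub fn confidence_colour(confidence: f64) -> Colour {
    band(confidence, 0.9, 0.5)
}

/// Margin: green >= 0.5, yellow in [0.1, 0.5), red < 0.1.
pub fn margin_colour(margin: f64) -> Colour {
    band(margin, 0.5, 0.1)
}
```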
- Throughput: ~85,000 genomes/second on a synthetic benchmark of 4,000 genomes
- Prediction time per sample: ~0.03 ms (constant regardless of dataset size)
- Memory: Constant peak — batch memory is freed between iterations
- I/O: SIMD-accelerated gzip decompression via zlib-ng
- Empty sequences are skipped with a warning
- Non-UTF8 headers/sequences produce parsing errors
- Cancellation is checked between batches (GUI integration)
- On error, partial output files are cleaned up
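The last two points, cancellation between batches and cleanup of partial output, can be combined as in the sketch below. The function names, the use of an `AtomicBool` as the cancellation flag, and reporting cancellation as an `io::Error` are assumptions for illustration; the actual GUI integration may signal cancellation differently.

```rust
use std::fs::{self, File};
use std::io::{self, BufWriter, Write};
use std::path::Path;
use std::sync::atomic::{AtomicBool, Ordering};

/// Writes each batch of lines to `path`, checking `cancel` between batches.
fn write_all(path: &Path, batches: &[Vec<String>], cancel: &AtomicBool) -> io::Result<usize> {
    let mut out = BufWriter::new(File::create(path)?);
    let mut written = 0;
    for batch in batches {
        // Cancellation is only observed at batch boundaries.
        if cancel.load(Ordering::Relaxed) {
            return Err(io::Error::new(io::ErrorKind::Interrupted, "cancelled"));
        }
        for line in batch {
            writeln!(out, "{}", line)?;
            written += 1;
        }
    }
    out.flush()?;
    Ok(written)
}

/// Runs `write_all` and removes the partial output file if it fails.
pub fn write_with_cleanup(
    path: &Path,
    batches: &[Vec<String>],
    cancel: &AtomicBool,
) -> io::Result<usize> {
    let result = write_all(path, batches, cancel);
    if result.is_err() {
        // Best-effort cleanup; the original error is what gets reported.
        let _ = fs::remove_file(path);
    }
    result
}
```

Splitting the work into an inner function that may fail and an outer one that cleans up ensures the file handle is closed before the file is removed, which matters on platforms that refuse to delete open files.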