```
Labeled FASTA ──→ Feature Hashing ──→ Accuracy Estimation ──→ Train Final Model ──→ OOB ──→ Save
                  (k-mer → sparse)    (CV or single split)    (all data, 100 trees)         (bincode+zstd)
                                                                      │
                                                                      ↓
                                                              Feature Importance
                                                        (reverse map k-mers + coords)
```
Labeled multi-FASTA where the first token in each header is the class label:
```
>L4 sample_0001
ACTGACTG...
>L2 sample_0002
ACTGACTG...
```
Each sequence is transformed into a sparse feature vector via feature hashing:
- Extract all k-mers (default k=21) using needletail's 2-bit encoding
- Hash each k-mer into one of 2^20 buckets via bitmask
- Count occurrences per bucket
- Output: sorted sparse vector `Vec<(bucket_idx, count)>` (sketch below)
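A minimal sketch of this step, using a generic 64-bit hasher in place of needletail's 2-bit encoding (function and constant names here are illustrative, not pathotypr's actual API):

```rust
use std::collections::HashMap;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const NUM_BUCKETS: usize = 1 << 20; // 2^20 buckets
const K: usize = 21;                // default k-mer size

/// Hash every k-mer of `seq` into a bucket and count occurrences.
fn hash_features(seq: &[u8]) -> Vec<(u32, u32)> {
    let mut counts: HashMap<u32, u32> = HashMap::new();
    for kmer in seq.windows(K) {
        let mut h = DefaultHasher::new();
        kmer.hash(&mut h);
        // Bitmask keeps the low 20 bits -> bucket index in [0, 2^20).
        let bucket = (h.finish() as usize & (NUM_BUCKETS - 1)) as u32;
        *counts.entry(bucket).or_insert(0) += 1;
    }
    // Sorted sparse vector: Vec<(bucket_idx, count)>.
    let mut sparse: Vec<(u32, u32)> = counts.into_iter().collect();
    sparse.sort_unstable_by_key(|&(bucket, _)| bucket);
    sparse
}
```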
Class labels (strings) are mapped to integers via LabelEncoder:
- `fit()`: builds a bidirectional mapping (label ↔ integer)
- `transform()`: converts label strings to integer indices
- Minimum of 2 distinct classes required
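A minimal `LabelEncoder` sketch under those constraints (field and method names are illustrative):

```rust
use std::collections::HashMap;

struct LabelEncoder {
    label_to_idx: HashMap<String, u32>, // label -> integer
    idx_to_label: Vec<String>,          // integer -> label
}

impl LabelEncoder {
    /// Build the bidirectional mapping from the training labels.
    fn fit(labels: &[String]) -> Result<Self, String> {
        let mut label_to_idx = HashMap::new();
        let mut idx_to_label = Vec::new();
        for label in labels {
            if !label_to_idx.contains_key(label) {
                label_to_idx.insert(label.clone(), idx_to_label.len() as u32);
                idx_to_label.push(label.clone());
            }
        }
        if idx_to_label.len() < 2 {
            return Err("need at least 2 distinct classes".into());
        }
        Ok(Self { label_to_idx, idx_to_label })
    }

    /// Convert label strings to integer indices.
    fn transform(&self, labels: &[String]) -> Vec<u32> {
        labels.iter().map(|l| self.label_to_idx[l]).collect()
    }
}
```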
Two modes, controlled by `--cv-folds`:

Single train/test split (sketched below):
- Shuffle indices with seed=42
- Hold out a `test_split` fraction (default 20%)
- Train a temporary 100-tree ensemble on the training portion
- Evaluate on the held-out test set
- Report accuracy percentage
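A sketch of the seeded split, assuming the `rand` crate (0.8-style API); the helper name is illustrative:

```rust
use rand::rngs::StdRng;
use rand::seq::SliceRandom;
use rand::SeedableRng;

/// Shuffle 0..n with a fixed seed and hold out the last `test_split` fraction.
fn train_test_split(n: usize, test_split: f64) -> (Vec<usize>, Vec<usize>) {
    let mut idx: Vec<usize> = (0..n).collect();
    idx.shuffle(&mut StdRng::seed_from_u64(42)); // seed=42 -> reproducible
    let n_test = ((n as f64) * test_split).round() as usize; // default 0.20
    let test = idx.split_off(n - n_test);
    (idx, test) // (train indices, test indices)
}
```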
Stratified k-fold cross-validation:
- Group sample indices by class
- Shuffle within each class (seed=42)
- Assign to folds round-robin → preserves class proportions (sketched below)
- For each fold:
  - Train a 100-tree ensemble on the remaining k−1 folds
  - Evaluate on the held-out fold
- Report: mean accuracy ± standard deviation (Bessel's correction: divide by n−1)
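A sketch of the stratified fold assignment (again assuming the `rand` crate; a `BTreeMap` keeps the class iteration order deterministic):

```rust
use rand::rngs::StdRng;
use rand::seq::SliceRandom;
use rand::SeedableRng;
use std::collections::BTreeMap;

/// Assign each sample to one of `k` folds, preserving class proportions.
fn stratified_folds(labels: &[u32], k: usize) -> Vec<usize> {
    let mut rng = StdRng::seed_from_u64(42);
    // Group sample indices by class.
    let mut by_class: BTreeMap<u32, Vec<usize>> = BTreeMap::new();
    for (i, &y) in labels.iter().enumerate() {
        by_class.entry(y).or_default().push(i);
    }
    // Shuffle within each class, then deal indices round-robin to folds.
    let mut fold_of = vec![0usize; labels.len()];
    for indices in by_class.values_mut() {
        indices.shuffle(&mut rng);
        for (j, &i) in indices.iter().enumerate() {
            fold_of[i] = j % k;
        }
    }
    fold_of
}
```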
Key: The accuracy estimation step trains throwaway ensembles. The final model is always trained on all data.
- Uses all samples (no held-out set)
- Trains 100 trees in parallel via rayon (see random-forest.md)
- Returns both trees and their bootstrap seeds (for OOB computation)
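A sketch of the parallel training loop; `TrainingData`, `SparseDecisionTree`, and `train_tree` are stand-ins for the real types described in random-forest.md:

```rust
use rayon::prelude::*;

// Stand-in types so the sketch compiles.
struct TrainingData;
struct SparseDecisionTree;
fn train_tree(_data: &TrainingData, _seed: u64) -> SparseDecisionTree {
    SparseDecisionTree
}

/// Train `n_trees` trees in parallel, keeping each tree's bootstrap seed.
fn train_forest(data: &TrainingData, n_trees: usize, base_seed: u64)
    -> Vec<(SparseDecisionTree, u64)>
{
    (0..n_trees)
        .into_par_iter()
        .map(|i| {
            // Each tree bootstraps from its own seed; the seed is returned
            // so the bootstrap can be regenerated later for OOB scoring.
            let seed = base_seed.wrapping_add(i as u64);
            (train_tree(data, seed), seed)
        })
        .collect()
}
```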
Always computed on the final model (see random-forest.md):
- Regenerates each tree's bootstrap from its seed
- For each sample, collects votes only from trees where it was OOB (~37% of trees)
- Reports majority-vote accuracy across all samples
OOB provides a nearly unbiased accuracy estimate at zero additional cost.
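A sketch of the OOB computation under those assumptions (bootstrap = n draws with replacement, regenerated from the tree's stored seed; `predict` and the types are illustrative stand-ins):

```rust
use rand::rngs::StdRng;
use rand::{Rng, SeedableRng};

type SparseVec = Vec<(u32, u32)>;
struct SparseDecisionTree;
fn predict(_tree: &SparseDecisionTree, _x: &SparseVec) -> u32 { 0 } // stand-in

/// Majority-vote OOB accuracy (%) over all samples.
fn oob_accuracy(
    trees: &[(SparseDecisionTree, u64)],
    samples: &[SparseVec],
    labels: &[u32],
    n_classes: usize,
) -> f64 {
    let n = samples.len();
    let mut votes = vec![vec![0u32; n_classes]; n];
    for (tree, seed) in trees {
        // Regenerate this tree's bootstrap: n draws with replacement.
        let mut rng = StdRng::seed_from_u64(*seed);
        let mut in_bag = vec![false; n];
        for _ in 0..n {
            in_bag[rng.gen_range(0..n)] = true;
        }
        // Each tree leaves ~37% of samples out-of-bag; only those get its vote.
        for i in 0..n {
            if !in_bag[i] {
                votes[i][predict(tree, &samples[i]) as usize] += 1;
            }
        }
    }
    let correct = (0..n)
        .filter(|&i| {
            let (best, _) = votes[i].iter().enumerate()
                .max_by_key(|&(_, v)| *v).unwrap();
            best as u32 == labels[i]
        })
        .count();
    100.0 * correct as f64 / n as f64
}
```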
The serialized model bundle:

```rust
ModelBundle {
    config: ModelConfig {
        pathotypr_version, // e.g., "0.2.0"
        kmer_size,         // e.g., 21
        n_trees,           // 100
        format_version,    // 3 (current)
    },
    vectorizer: FeatureHasher { num_buckets },
    label_encoder: LabelEncoder { maps },
    trees: Vec<SparseDecisionTree>,
}
```
ModelBundle ──→ bincode::serialize_into ──→ zstd::Encoder (level 3) ──→ file
Streaming serialization: bincode writes directly into the zstd compressor, which writes to a buffered file. Never holds the full uncompressed + compressed model in memory simultaneously.
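A minimal sketch of this save path, assuming bincode 1.x and the `zstd` crate, written generically over any `serde::Serialize` value (so it applies to the `ModelBundle` above):

```rust
use std::fs::File;
use std::io::BufWriter;

/// Serialize `value` straight through a zstd encoder into `path`.
fn save_compressed<T: serde::Serialize>(
    value: &T,
    path: &str,
) -> Result<(), Box<dyn std::error::Error>> {
    let file = BufWriter::new(File::create(path)?);
    let mut encoder = zstd::Encoder::new(file, 3)?; // compression level 3
    // bincode streams directly into the compressor: the uncompressed
    // bytes never accumulate in memory.
    bincode::serialize_into(&mut encoder, value)?;
    encoder.finish()?; // flush the zstd frame
    Ok(())
}
```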
Typical sizes: 5–50 MB compressed for real bacterial genomes; <3 KB for synthetic benchmarks.
Top 500 features ranked by split count:
| rank | bucket | split_count | importance_pct | kmers |
|---|---|---|---|---|
| 1 | 42 | 87 | 2.31 | ACTGACTGCTAGCTGATCGATC,GATCGATCGATCGATCGATCG |
| 2 | 1087 | 73 | 1.94 | ... |
importance_pct = split_count / total_splits_across_all_features × 100
Maps each discriminant k-mer back to its physical location in the training sequences:
| rank | bucket | split_count | importance_pct | kmer | sequence | lineage | position |
|---|---|---|---|---|---|---|---|
| 1 | 42 | 87 | 2.31 | ACTG... | seq_header | L4 | 1234567 |
This enables researchers to identify which genomic regions drive classification — linking ML features to biology.
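A sketch of the reverse mapping: rescan the training sequences, hash every k-mer with the same bucket function used at training time (restated here so the sketch is self-contained), and record hits on the buckets of interest. Names and the tuple layout are illustrative:

```rust
use std::collections::HashSet;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Same bitmasked hash as the featurization sketch above (low 20 bits).
fn hash_kmer(kmer: &[u8]) -> u32 {
    let mut h = DefaultHasher::new();
    kmer.hash(&mut h);
    (h.finish() & 0xFFFFF) as u32
}

/// Collect every (bucket, header, lineage, position) hit for the wanted buckets.
fn locate_buckets(
    seqs: &[(String, String, Vec<u8>)], // (header, lineage, sequence)
    wanted: &HashSet<u32>,
    k: usize,
) -> Vec<(u32, String, String, usize)> {
    let mut hits = Vec::new();
    for (header, lineage, seq) in seqs {
        for (pos, kmer) in seq.windows(k).enumerate() {
            let bucket = hash_kmer(kmer);
            if wanted.contains(&bucket) {
                hits.push((bucket, header.clone(), lineage.clone(), pos));
            }
        }
    }
    hits
}
```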
| Dataset | Training Time | Model Size |
|---|---|---|
| 100 bacterial genomes | ~10 s | ~10 MB |
| 500 bacterial genomes | ~30 s | ~20 MB |
| 4000 synthetic sequences | ~2 s | ~2.5 KB |
Times were measured on an Apple M4 (4 cores). Training scales approximately linearly with dataset size, because tree construction dominates the runtime.