Fast, memory-efficient extraction of variable sites from FASTA alignments.
Paula Ruiz-Rodriguez¹ and Mireia Coscolla¹

¹ Institute for Integrative Systems Biology, I2SysBio, University of Valencia-CSIC, Valencia, Spain
SNPick extracts variable (SNP) sites from whole-genome FASTA alignments. It produces reduced alignments ready for phylogenetic inference with ascertainment bias correction (ASC) in IQ-TREE and RAxML, and optionally generates VCF files.
Why not snp-sites? snp-sites works well for small datasets but struggles with large alignments: it loads the full alignment matrix into memory and scales poorly. SNPick memory-maps the input without copying it, so in the comparison below it processes 1000 genomes of 4.4 Mbp in about 3 s using about 140 MB of RAM.
| | SNPick | snp-sites |
|---|---|---|
| Architecture | Zero-copy mmap, parallel scan | Full matrix in memory |
| 250 seqs × 4.4 Mbp | 0.9 s, 105 MB | 9.5 s, 520 MB |
| 1000 seqs × 4.4 Mbp | ~3 s, ~140 MB | >26 min (killed), 3+ GB |
| ASC fconst output | ✅ Built-in | ❌ Not supported |
| VCF output | ✅ Optional | ✅ Default |
| Gap handling | ✅ Optional (`-g`) | ✅ Default |
| IUPAC ambiguous | ✅ Tracked as ambiguous | |
# Install
conda install -c bioconda snpick
# Extract variable sites
snpick -f alignment.fasta -o snps.fasta
# With VCF output
snpick -f alignment.fasta -o snps.fasta --vcf
# Include gaps as informative
snpick -f alignment.fasta -o snps.fasta -g

SNPick identifies positions with more than one observed nucleotide across all sequences. Constant and ambiguous-only positions are excluded from the output.

It reports constant-site counts (fconst) directly, formatted for IQ-TREE's +ASC models:
[snpick] ASC fconst: 744123,1382922,1382180,743556
Use in IQ-TREE:
iqtree2 -s snps.fasta -m GTR+ASC -fconst 744123,1382922,1382180,743556

SNPick can optionally write VCF v4.2 output with per-sample genotypes. The reference allele is taken from the first sequence, and ambiguous bases are reported as missing (.).
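To make the VCF conventions above concrete, here is a minimal Python sketch of how one variable column could be turned into a VCF-style data line. It is an illustration, not SNPick's code: the function name `vcf_record`, the chromosome label, the haploid `GT` encoding, and the `.` placeholders for ID, QUAL, FILTER, and INFO are all assumptions, and it assumes the first sequence carries a standard base at the site.

```python
STANDARD = set("ACGT")

def vcf_record(chrom, pos, column):
    """Build one VCF-style data line for a variable alignment column.

    `column` holds one base per sample, in alignment order.
    Hypothetical helper: field choices are assumptions, not SNPick's output.
    """
    bases = [b.upper() for b in column]
    ref = bases[0]  # reference allele comes from the first sequence
    alts = []
    for base in bases[1:]:
        if base in STANDARD and base != ref and base not in alts:
            alts.append(base)
    alleles = [ref] + alts
    # Standard bases map to their allele index; anything else is missing (.)
    genotypes = [str(alleles.index(b)) if b in STANDARD else "." for b in bases]
    fields = [chrom, str(pos), ".", ref, ",".join(alts), ".", ".", ".", "GT"]
    return "\t".join(fields + genotypes)

print(vcf_record("1", 10, "AGA"))
```

With this encoding, a sample carrying `N` at the site gets `.` as its genotype, while every standard base is written as an index into the REF and ALT alleles.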
- Ambiguous bases (N, R, Y, etc.): not counted as alleles; a position is variable only if it has ≥2 standard bases (A, C, G, T)
- Gaps (`-`): ignored by default, included as a 5th character with `-g`
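The two rules above can be sketched as a small Python predicate. This is an illustration under stated assumptions rather than SNPick's implementation: `is_variable` and its `include_gaps` parameter are hypothetical names, and the sketch assumes that with `-g` a gap plus a single standard base is enough to make a position variable.

```python
STANDARD = set("ACGT")

def is_variable(column, include_gaps=False):
    """Return True if an alignment column contains at least two alleles.

    Ambiguous IUPAC codes never count as alleles. Gaps count only when
    include_gaps is True (mirroring the -g flag; assumed semantics).
    """
    allowed = (STANDARD | {"-"}) if include_gaps else STANDARD
    alleles = {base.upper() for base in column} & allowed
    return len(alleles) >= 2

print(is_variable("AAG"))                     # two standard bases
print(is_variable("AAN"))                     # N is not an allele
print(is_variable("A-A"))                     # gap ignored by default
print(is_variable("A-A", include_gaps=True))  # gap as a 5th character
```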
Automatic multi-threaded scanning via Rayon when the dataset is large enough. Falls back to single-threaded for small inputs to avoid overhead.
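The idea of scanning in parallel only above a size cutoff can be sketched in Python as follows. SNPick itself uses Rayon in Rust; this sketch only shows why the scan splits cleanly, namely that per-chunk bitmasks can be merged with a bitwise OR in any order. The names `scan_chunk`, `merge`, `scan`, and the `PARALLEL_THRESHOLD` value are hypothetical, and Python threads are used for illustration only (they would not give the real speedup).

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

BIT = {"A": 1, "C": 2, "G": 4, "T": 8}
PARALLEL_THRESHOLD = 4  # hypothetical cutoff, not SNPick's actual value

def scan_chunk(seqs):
    """OR together the base flags seen at each position of a chunk."""
    mask = [0] * len(seqs[0])
    for seq in seqs:
        for i, base in enumerate(seq):
            mask[i] |= BIT.get(base.upper(), 0)
    return mask

def merge(a, b):
    """Combine two per-position masks; OR is associative and commutative."""
    return [x | y for x, y in zip(a, b)]

def scan(seqs, workers=2):
    """Scan sequentially for small inputs, in chunks otherwise."""
    if len(seqs) < PARALLEL_THRESHOLD:
        return scan_chunk(seqs)
    size = -(-len(seqs) // workers)  # ceiling division
    chunks = [seqs[i:i + size] for i in range(0, len(seqs), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return reduce(merge, pool.map(scan_chunk, chunks))

seqs = ["ACGT", "ACGA", "ACGT", "ACCT", "ACGT"]
print(scan(seqs) == scan_chunk(seqs))
```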
conda install -c bioconda snpick
# or
mamba install -c bioconda snpick

git clone https://github.com/PathoGenOmics-Lab/snpick.git
cd snpick
cargo build --release
# Binary at target/release/snpick

wget https://github.com/PathoGenOmics-Lab/snpick/releases/latest/download/snpick
chmod +x snpick

snpick [OPTIONS] --fasta <FASTA> --output <OUTPUT>
| Argument | Required | Description |
|---|---|---|
| `-f, --fasta <FILE>` | ✅ | Input FASTA alignment |
| `-o, --output <FILE>` | ✅ | Output FASTA (variable sites only) |
| `-g, --include-gaps` | | Treat gaps (`-`) as a 5th character |
| `--vcf` | | Generate VCF file (derived from output name) |
| `--vcf-output <FILE>` | | Custom VCF output path |
Input (alignment.fasta):
>sequence1
ATGCTAGCTAGCTAGCTA
>sequence2
ATGCTAGCTGGCTAGCTA
>sequence3
ATGCTAGCTAGCTAGCTA
Command:
snpick -f alignment.fasta -o snps.fasta

Output (snps.fasta):
>sequence1
A
>sequence2
G
>sequence3
A
stderr:
[snpick] Mapped 63 bytes. 3 sequences × 18 positions.
[snpick] 1 variable, 17 constant (A:4 C:4 G:4 T:5), 0 ambiguous-only, 18 total.
[snpick] ASC fconst: 4,4,4,5
[snpick] Done in 0.00s. 1 vars from 3 seqs × 18 pos.
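The numbers in this example can be reproduced by hand with a short Python sketch. It is not SNPick's code: `read_fasta` and `summarize` are hypothetical helpers, and the sketch assumes that a column counts as constant for a base when that base is the only standard nucleotide observed in it.

```python
ALIGNMENT = """\
>sequence1
ATGCTAGCTAGCTAGCTA
>sequence2
ATGCTAGCTGGCTAGCTA
>sequence3
ATGCTAGCTAGCTAGCTA
"""

def read_fasta(text):
    """Parse FASTA text into parallel lists of names and sequences."""
    names, seqs = [], []
    for line in text.strip().splitlines():
        if line.startswith(">"):
            names.append(line[1:].strip())
            seqs.append("")
        else:
            seqs[-1] += line.strip().upper()
    return names, seqs

def summarize(seqs):
    """Return 0-based variable positions and constant-site counts per base."""
    variable = []
    fconst = {"A": 0, "C": 0, "G": 0, "T": 0}
    for pos, column in enumerate(zip(*seqs)):
        alleles = set(column) & set("ACGT")
        if len(alleles) >= 2:
            variable.append(pos)
        elif len(alleles) == 1:
            fconst[alleles.pop()] += 1
    return variable, fconst

names, seqs = read_fasta(ALIGNMENT)
variable, fconst = summarize(seqs)
snps = ["".join(seq[i] for i in variable) for seq in seqs]
fconst_arg = ",".join(str(fconst[base]) for base in "ACGT")

print(variable)    # [9]  (position 10 when counted from 1)
print(snps)        # ['A', 'G', 'A']
print(fconst_arg)  # 4,4,4,5
```

The single variable column is position 10 (A, G, A), and the remaining 17 columns split into 4 A, 4 C, 4 G, and 5 T, matching the `ASC fconst: 4,4,4,5` line above.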
Simulated M. tuberculosis-like genomes (4.4 Mbp, ~65% GC, 3.6% variable sites).
SNPick maintains O(L) memory regardless of sequence count, while snp-sites requires O(N×L).
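As a rough illustration of the difference, take the largest benchmark above (N = 1000 sequences, L = 4.4 Mbp) and assume, purely for the sake of the estimate, one byte per stored base and one byte of per-position state:

$$
M_{N \times L} \approx N \cdot L \cdot 1\,\text{B} = 1000 \times 4.4 \times 10^{6}\,\text{B} = 4.4 \times 10^{9}\,\text{B} \approx 4.4\,\text{GB}
$$

$$
M_{L} \approx L \cdot 1\,\text{B} = 4.4 \times 10^{6}\,\text{B} \approx 4.4\,\text{MB}
$$

Measured memory use includes additional overhead on both sides, so these figures indicate scaling only, not the exact values reported in the comparison table.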
Input FASTA ──mmap──▶ Index records ──▶ Pass 1: bitmask scan ──▶ Analyze
                            │                (parallel)             │
                            │                                       ▼
                            └──────────▶ Pass 2: extract sites ──▶ FASTA + VCF
                                         (sparse random access)
- Single memory-mapped file shared across both passes — zero copies
- Pass 1: OR-based bitmask over all sequences (parallel with Rayon)
- Pass 2: only reads variable positions (sparse access via mmap)
- Lookup tables: 256-byte arrays for O(1) nucleotide classification and case conversion
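The design points above can be sketched end to end in Python. This is a simplified illustration, not SNPick's Rust implementation: the function names are hypothetical, the scan is single-threaded, and the sketch assumes single-line sequences of equal length with Unix newlines.

```python
import mmap
import os
import tempfile

# 256-entry lookup table: byte value -> bit flag (0 for anything but A/C/G/T).
# Upper and lower case share a flag, which doubles as case folding.
BIT = bytearray(256)
for i, base in enumerate("ACGT"):
    BIT[ord(base)] = BIT[ord(base.lower())] = 1 << i

def index_records(buf):
    """Return (start, end) byte offsets of each sequence line."""
    records, pos = [], 0
    while True:
        header = buf.find(b">", pos)
        if header == -1:
            return records
        start = buf.find(b"\n", header) + 1
        end = buf.find(b"\n", start)
        if end == -1:
            end = len(buf)
        records.append((start, end))
        pos = end

def scan(buf, records):
    """Pass 1: OR the flag of every base into a per-position mask."""
    length = records[0][1] - records[0][0]
    mask = bytearray(length)
    for start, _ in records:
        for i in range(length):
            mask[i] |= BIT[buf[start + i]]
    return mask

def extract(buf, records, sites):
    """Pass 2: read only the variable positions of each record."""
    return [bytes(buf[start + i] for i in sites).decode().upper()
            for start, _ in records]

# Demo on a small temporary alignment
with tempfile.NamedTemporaryFile(delete=False, suffix=".fasta") as fh:
    fh.write(b">s1\nATGCTA\n>s2\nATGGTA\n>s3\natgcta\n")
    path = fh.name

with open(path, "rb") as fh, mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as buf:
    records = index_records(buf)
    mask = scan(buf, records)
    # A position is variable when more than one bit is set in its mask
    sites = [i for i, m in enumerate(mask) if m & (m - 1)]
    snps = extract(buf, records, sites)
os.remove(path)

print(sites, snps)
```

The test `m & (m - 1)` is non-zero exactly when a mask has two or more bits set, so no per-base counters are needed during the first pass, and both passes read from the same mapping.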
If you use SNPick in your research, please cite:
@software{snpick,
author = {Ruiz-Rodriguez, Paula and Coscolla, Mireia},
title = {SNPick: Fast extraction of variable sites from FASTA alignments},
url = {https://github.com/PathoGenOmics-Lab/snpick},
doi = {10.5281/zenodo.14191809},
license = {GPL-3.0}
}

- Paula Ruiz-Rodriguez 💻 🔬 🤔 🔣 🎨 🔧
- Mireia Coscolla 🔍 🤔 🧑‍🏫 🔬 📓

This project follows the all-contributors specification (emoji key).


