Under construction — not used in production.
Nextflow DSL2 pipeline for DNA sequencing analysis. Processes per-strain FASTQ files through alignment, variant calling, CNV, and coverage analysis (processSingleExperiment), then merges results across strains for downstream database loading (mergeExperiments).
```bash
# Per-strain analysis (default)
nextflow run main.nf -profile processSingleExperiment

# Multi-strain merge
nextflow run main.nf -entry mergeExperiments -profile mergeExperiments

# Tests
nextflow run main.nf -entry runTests -profile tests
```

Docker is enabled by default in all profiles.
The `processSingleExperiment` workflow runs per strain: it takes paired-end or single-end FASTQs listed in an nf-core samplesheet and produces a consensus FASTA, VCFs, an indel TSV, coverage tracks, ploidy estimates, and gene CNV estimates.
```
┌──────────────────────────────────────────────────────────┐
│                    QC & Preprocessing                    │
│                                                          │
│   FASTQ ──► FastQC ──► FastQC check ──► Trimmomatic      │
└────────────────────────────┬─────────────────────────────┘
                             │
┌────────────────────────────▼─────────────────────────────┐
│                        Alignment                         │
│                                                          │
│   BWA-MEM ──► Picard (dedup) ──► GATK (indel realign)    │
└──────────────────┬──────────────────────┬────────────────┘
                   │                      │
┌──────────────────▼──────┐  ┌────────────▼────────────────┐
│     Variant Calling     │  │       CNV / Coverage        │
│                         │  │                             │
│  FreeBayes              │  │  genomecov                  │
│  filterAndSplitVcf      │  │   └──► coverage bigwig      │
│   ├── SNPs VCF          │  │                             │
│   ├── Indels VCF        │  │  htseqCount ──► TPM         │
│   └── Consensus VCF     │  │   └──► ploidy & gene CNVs   │
│  makeIndelTSV           │  │                             │
│  makeCoverageBed        │  │  bedtoolsWindowed           │
│  consensus FASTA        │  │   └──► norm. cov. bigwig    │
│   (coverage-masked)     │  │                             │
└──────────────┬──────────┘  └─────────────────────────────┘
               │
┌──────────────▼───────────────────────────────────────────┐
│                      Density Tracks                      │
│                                                          │
│   SNP density bigwigs                                    │
│   Het SNP density bigwigs (ploidy > 1 only)              │
└──────────────────────────────────────────────────────────┘

Alignment stats (samtools + bedtools genomecov) ──► merged TSV
```
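The `htseqCount ──► TPM` step in the diagram uses TPM (transcripts per million) normalisation. The sketch below shows the standard TPM formula in Python, purely as an illustration. The function name, gene IDs, counts, and lengths are hypothetical, and the way the pipeline converts TPM values into ploidy and gene CNV estimates is not shown here.

```python
def tpm(counts, lengths_bp):
    """Standard TPM: length-normalise each count, then scale so values sum to 1e6.

    counts and lengths_bp are dicts keyed by gene ID (hypothetical IDs below).
    """
    # Reads per base for each gene
    rates = {gene: counts[gene] / lengths_bp[gene] for gene in counts}
    total = sum(rates.values())
    if total == 0:
        # No reads anywhere: avoid dividing by zero
        return {gene: 0.0 for gene in counts}
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}


# Made-up example data, not pipeline output
counts = {"geneA": 100, "geneB": 200, "geneC": 100}
lengths_bp = {"geneA": 1000, "geneB": 1000, "geneC": 500}
values = tpm(counts, lengths_bp)
for gene, value in values.items():
    print(gene, round(value, 1))
```

Because TPM values sum to one million within a sample, a gene with twice the per-base read density of another gets twice the TPM, regardless of gene length.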
| File | Description |
|---|---|
| `*_consensus.fa.gz` | Per-strain consensus FASTA, low-coverage positions masked |
| `result.vcf.gz` | Full FreeBayes VCF (complex variants) |
| `indels.tsv` | Indel table (homozygous only; het indels excluded) |
| `coverage.bed.gz` | Per-position coverage BED |
| `*_Ploidy.txt` | Estimated ploidy per strain |
| `*_geneCNVs.txt` | Gene-level copy number estimates |
| `*.bw` | Coverage, normalised coverage, SNP density, het SNP density bigwigs |
| `alignment_stats.tsv` | Merged samtools + bedtools coverage stats across all samples |
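To illustrate what "low-coverage positions masked" means for the consensus FASTA, here is a minimal Python sketch. It assumes masking replaces a base with `N` when its depth falls below a threshold such as `minCoverage`. The mask character, the inclusive threshold, and the function name are assumptions for illustration, not taken from the pipeline's code.

```python
def mask_low_coverage(sequence, depths, min_coverage, mask_char="N"):
    """Replace bases whose depth is below min_coverage with mask_char.

    Assumption: a position is kept when depth >= min_coverage.
    """
    if len(sequence) != len(depths):
        raise ValueError("sequence and depths must have the same length")
    return "".join(
        base if depth >= min_coverage else mask_char
        for base, depth in zip(sequence, depths)
    )


# Made-up six-base example with per-position depths
masked = mask_low_coverage("ACGTAC", [12, 3, 10, 0, 25, 9], min_coverage=10)
print(masked)
```

Masking rather than deleting keeps the consensus in the same coordinate system as the reference, so downstream tools can still address positions by reference offset.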
| Parameter | Description |
|---|---|
| `samplesheet` | nf-core CSV (`sample`, `fastq_1`, `fastq_2`) |
| `genomeFastaFile` | Reference genome FASTA |
| `gtfFile` | Gene annotation GTF |
| `footprintFile` | Gene footprints file for CNV |
| `geneSourceIdOrthologFile` | Gene source ID / ortholog mapping TSV |
| `chrsForCalcFile` | Chromosomes to include in ploidy calculation |
| `minCoverage` | Minimum depth for variant calling and consensus masking |
| `ploidy` | Expected ploidy (het SNP tracks skipped when 1) |
| `winLen` | Window size (bp) for density and windowed coverage tracks |
| `bwaThreads` | Threads for BWA-MEM |
| `outputDir` | Output directory |
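As a rough illustration of the `samplesheet` format, the Python sketch below renders a CSV with the three columns listed above. The strain names and FASTQ paths are hypothetical. Leaving `fastq_2` empty for single-end samples follows the usual nf-core convention, and whether this pipeline handles it the same way is an assumption.

```python
import csv
import io

# Hypothetical strains and FASTQ paths, for illustration only
rows = [
    {"sample": "strainA", "fastq_1": "reads/strainA_R1.fastq.gz", "fastq_2": "reads/strainA_R2.fastq.gz"},
    {"sample": "strainB", "fastq_1": "reads/strainB.fastq.gz", "fastq_2": ""},
]


def render_samplesheet(rows):
    """Return the samplesheet as CSV text with the columns sample, fastq_1, fastq_2."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["sample", "fastq_1", "fastq_2"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def is_paired(row):
    """Assumption: a sample is paired-end when fastq_2 is non-empty."""
    return bool(row["fastq_2"].strip())


samplesheet_text = render_samplesheet(rows)
print(samplesheet_text)
```

Writing the returned text to a file and pointing the `samplesheet` parameter at it would be the natural next step.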
The `mergeExperiments` workflow takes the per-strain outputs from one or more `processSingleExperiment` runs and merges them across strains. It annotates variants against SQLite coding-sequence and indel databases, then runs SnpEff for functional annotation. Outputs are intended for loading into a GUS/VEuPathDB database.
- Combine indels: collect per-strain `indels.tsv` files and build a genomic indel SQLite database (`makeGenomicIndelDb`)
- Merge VCFs: `bcftools merge` across all per-strain VCFs; single-strain inputs skip the merge step
- Merge coverage: concatenate per-strain coverage BED files into a single TSV (`mergeCoverageBeds`)
- Build coding data: derive coding-sequence and coding-indel SQLite databases from consensus FASTAs + GTF + reference genome (`makeCodingData`)
- Annotate variants: `processSeqVars` runs `bin/processSequenceVariations.jl` against the SQLite DBs; produces annotated VCF and variation/allele/product DAT files
- SnpEff: functional annotation of the merged VCF
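The "Merge VCFs" step above skips the merge for single-strain inputs. The Python sketch below shows one way that decision could look. It only builds a `bcftools merge` argument list and runs nothing. The function name, the output file name, and the choice to return `None` when there is a single VCF are assumptions for illustration, not the pipeline's actual logic.

```python
def plan_vcf_merge(vcf_paths, merged_path="merged.vcf.gz"):
    """Return a bcftools merge command as a list, or None when no merge is needed.

    Assumption: with a single input VCF the file is used as-is.
    """
    if not vcf_paths:
        raise ValueError("at least one VCF is required")
    if len(vcf_paths) == 1:
        return None  # single strain: nothing to merge
    # -O z writes bgzip-compressed VCF; -o sets the output file
    return ["bcftools", "merge", "-O", "z", "-o", merged_path, *sorted(vcf_paths)]


# Hypothetical per-strain VCF paths
single = plan_vcf_merge(["strainA/result.vcf.gz"])
multi = plan_vcf_merge(["strainB/result.vcf.gz", "strainA/result.vcf.gz"])
print(single)
print(" ".join(multi))
```

`bcftools merge` expects bgzip-compressed, indexed inputs, so a real invocation would need the per-strain VCFs indexed first.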
| Parameter | Description |
|---|---|
| `relativeConsensusFilePattern` | Glob for per-strain `*_consensus.fa.gz` files |
| `vcfFiles` | Glob for per-strain `result.vcf.gz` files |
| `indelsFiles` | Glob for per-strain `indels.tsv` files |
| `coverageFiles` | Glob for per-strain `*.coverage.bed.gz` files |

| Parameter | Description |
|---|---|
| `genomeFastaFile` | Reference genome FASTA |
| `gtfFile` | Gene annotation GTF |
| `vcfCacheFile` | VCF cache from a previous run (avoid re-annotating known variants) |
| `undoneStrains` | Strains to exclude from annotation |
| `reference_strain` | Reference strain name |
| `outputDir` | Output directory |

| Image | Used by |
|---|---|
| `veupathdb/shortreadaligner:1.0.0` | BWA, samtools, Picard, GATK3, FreeBayes, bcftools, bedtools, Julia 1.10, Perl/BioPerl, SnpEff |
| `veupathdb/dnaseqanalysis:1.0.0` | Trimmomatic, htseq-count |

```bash
nextflow run main.nf -entry runTests -profile tests
```

Tests live in `testing/t/` and use Perl's Test2::V0 framework, run via `prove`.