Replace per-sample FreeBayes + bcftools merge with joint multi-sample FreeBayes #7

Draft
jbrestel wants to merge 21 commits into merge-experiments-refactor from multi-sample-freebayes
@jbrestel
Member

Summary

  • Replaces the per-sample FreeBayes → bcftools merge pattern with a joint multi-sample FreeBayes call across all samples, eliminating post-hoc merge artifacts
  • The genome is split into fixed-size chunks (chunkSize = 1000000) via makeRegionBed; FreeBayes runs once per chunk across all sample BAMs, and the chunks are processed in parallel
  • Zero-coverage splits are derived from the union of all sample BAMs per region, ensuring no sample has a reference block that another cannot support
  • The joint multi-sample gVCF is the single source of truth for consensus FASTAs, indels TSV, and per-sample CNV VCFs
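
The fixed-size chunking described above can be sketched as follows. This is a minimal illustration, not the code in makeRegionBed: the function name make_region_windows, the contig_lengths mapping, and the example contig sizes are all hypothetical, and the windows are assumed to use BED-style 0-based, half-open coordinates.

```python
from typing import Dict, Iterator, Tuple


def make_region_windows(
    contig_lengths: Dict[str, int], chunk_size: int = 1_000_000
) -> Iterator[Tuple[str, int, int]]:
    """Yield (contig, start, end) windows of at most chunk_size bases.

    Coordinates are 0-based and half-open, as in BED. The last window of
    each contig is truncated at the contig end.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for contig, length in contig_lengths.items():
        for start in range(0, length, chunk_size):
            yield contig, start, min(start + chunk_size, length)


# Hypothetical contig sizes, for illustration only.
windows = list(make_region_windows({"chr1": 2_500_000, "chr2": 800_000}))
for contig, start, end in windows:
    print(f"{contig}\t{start}\t{end}")
```

Each emitted line would then correspond to one scatter unit, so a contig shorter than the chunk size yields a single window.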

New processes (modules/snp.nf)

  • makeRegionBed — splits reference into fixed-size BED windows
  • makeMultiSampleZeroCoverageBed — computes union zero-coverage BED across all sample BAMs for a region
  • concatMultiSampleVcf — gathers per-region split gVCF chunks into a single joint multi-sample VCF; emits both gVCF and variants-only VCF
  • extractSampleVcf — extracts per-sample VCF (+ SNPs + indels subsets) from the joint VCF for CNV downstream steps
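
As a rough sketch of what a union zero-coverage computation might look like, the snippet below merges per-sample zero-coverage intervals for one region. It is an assumption-laden illustration rather than the logic of makeMultiSampleZeroCoverageBed: "union" is taken to mean positions where at least one sample has zero coverage, and the companion all_zero helper (positions where every sample has zero coverage) is included only for contrast. All function names and the example intervals are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 0-based, half-open (start, end)


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Sort intervals and merge any that overlap or touch."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def union_zero(per_sample: List[List[Interval]]) -> List[Interval]:
    """Positions where at least one sample has zero coverage."""
    return merge_intervals([iv for ivs in per_sample for iv in ivs])


def intersect_two(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Intersect two sorted, merged interval lists with a two-pointer sweep."""
    out: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        start, end = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def all_zero(per_sample: List[List[Interval]]) -> List[Interval]:
    """Positions where every sample has zero coverage."""
    if not per_sample:
        return []
    result = merge_intervals(per_sample[0])
    for ivs in per_sample[1:]:
        result = intersect_two(result, merge_intervals(ivs))
    return result


# Hypothetical zero-coverage intervals for two samples in one region.
sample_a = [(100, 200), (500, 600)]
sample_b = [(150, 250)]
print(union_zero([sample_a, sample_b]))
print(all_zero([sample_a, sample_b]))
```

Under this reading, the boundaries of the union intervals are the points at which a shared reference block would need to be cut.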

Modified processes

  • freebayesMultiSample — now accepts a regionLine input and --region flag; emits regionKey for scatter-gather joining
  • splitGvcfAtZeroCoverage — removed per-sample BAM input; now takes a pre-computed zero_cov.bed from makeMultiSampleZeroCoverageBed
  • makeConsensusFromGvcf — removed sampleName input; Python script iterates all samples in the joint gVCF, outputs one {sampleName}_consensus.fa.gz per sample
  • makeIndelTSV — now takes joint multi-sample gVCF; loops over all samples via bcftools query -l
  • makeSnpDensity (cnv.nf) — input tuple slimmed from 9 to 7 elements (no gVCF paths)
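
To illustrate the idea of joining scattered outputs on a region key and gathering them in genomic order, here is a small plain-Python sketch. It does not reproduce the Nextflow channel logic in the PR. The key format "contig:start-end", the file names, and the helper names are assumptions made for the example; the actual regionKey format is not shown in this PR description.

```python
from typing import Dict, List, Tuple


def parse_region_key(key: str) -> Tuple[str, int, int]:
    """Parse a hypothetical 'contig:start-end' key into its parts."""
    contig, span = key.rsplit(":", 1)
    start, end = span.split("-")
    return contig, int(start), int(end)


def join_and_order(
    vcf_chunks: List[Tuple[str, str]],
    zero_cov_beds: List[Tuple[str, str]],
    contig_order: List[str],
) -> List[Tuple[str, str, str]]:
    """Pair each per-region VCF with its BED by key, then sort by position.

    Keys present in only one input are dropped, so the result contains
    matched pairs only.
    """
    bed_by_key: Dict[str, str] = dict(zero_cov_beds)
    rank = {contig: i for i, contig in enumerate(contig_order)}
    joined = [
        (key, vcf, bed_by_key[key]) for key, vcf in vcf_chunks if key in bed_by_key
    ]

    def sort_key(item: Tuple[str, str, str]) -> Tuple[int, int]:
        contig, start, _ = parse_region_key(item[0])
        return rank[contig], start

    return sorted(joined, key=sort_key)


# Hypothetical per-region outputs arriving out of order.
chunks = [
    ("chr2:0-800000", "chr2_0.g.vcf.gz"),
    ("chr1:1000000-2000000", "chr1_1.g.vcf.gz"),
    ("chr1:0-1000000", "chr1_0.g.vcf.gz"),
]
beds = [
    ("chr1:0-1000000", "chr1_0.zero_cov.bed"),
    ("chr1:1000000-2000000", "chr1_1.zero_cov.bed"),
    ("chr2:0-800000", "chr2_0.zero_cov.bed"),
]
ordered = join_and_order(chunks, beds, contig_order=["chr1", "chr2"])
for key, vcf, bed in ordered:
    print(key, vcf, bed)
```

Sorting by contig rank and numeric start, rather than by the key string, avoids lexicographic surprises such as a key starting at 10000000 sorting before one starting at 2000000.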

Removed processes

freebayes, mergeVcfs, makeMergedVariantIndex, mergeGvcfs, bcftoolsMpileupGvcf

Python script changes (bin/)

  • splitGvcfAtZeroCoverage.py — replaced --bedgraph + DP recomputation with --zero-cov-bed (pre-computed union BED)
  • makeConsensusFastaFromGvcf.py — updated for multi-sample: build_consensus takes sample_idx, main() loops over all samples with VCF reopen per sample, writes to --output-dir

Test coverage

  • testing/t/test_splitGvcfAtZeroCoverage.py — 5 pytest unit tests
  • testing/t/test_makeConsensusFastaFromGvcf.py — 5 pytest unit tests
  • Stub run validated: all 8 new processes appear in DAG, no dead processes, no DSL errors

Test plan

  • Run python -m pytest testing/t/test_splitGvcfAtZeroCoverage.py testing/t/test_makeConsensusFastaFromGvcf.py -v
  • Run stub: nextflow run main.nf -entry processSingleExperiment -profile processSingleExperiment -stub
  • Verify all 8 expected processes appear; none of the 5 removed processes appear
  • Run against real diploid test data and confirm joint gVCF, per-sample consensus FASTAs, and indels TSV are produced correctly

🤖 Generated with Claude Code

jbrestel and others added 19 commits April 30, 2026 15:34
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ge; support multi-sample gVCF

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… gVCF

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds extractSampleVcf to modules/snp.nf to extract per-sample VCFs
(all variants, SNPs, indels) from a multi-sample VCF. Drops the gVCF
tuple fields from makeSnpDensity input in modules/cnv.nf and renames
the VCF param to vcfGz for consistency.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…Bayes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@jbrestel jbrestel changed the base branch from main to merge-experiments-refactor April 30, 2026 23:24
jbrestel and others added 2 commits April 30, 2026 19:38
…is container

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ndaries

When splitGvcfAtZeroCoverage splits a <*> reference block, the original FORMAT
fields were copied verbatim into every sub-interval. Samples with zero coverage
in a sub-interval therefore showed the parent block's DP instead of 0, causing
makeConsensusFastaFromGvcf to emit reference bases instead of Ns.

- makeMultiSampleZeroCoverageBed now emits per-sample zero-coverage BEDs
  (<samplename>.persample.bed) in addition to union_zero.bed and all_zero.bed
- splitGvcfAtZeroCoverage receives the per-sample BEDs and passes them to the
  Python script; for each emitted sub-interval, DP and MIN_DP are set to 0 for
  any sample whose BED fully covers that sub-interval
- makeConsensusFastaFromGvcf now treats negative cyvcf2 DP sentinels (the
  in-memory representation of VCF '.') as zero, matching explicit DP=0

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
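
The fix described in this commit message can be sketched in plain Python as below. This is an illustrative model only, not the code in splitGvcfAtZeroCoverage.py or makeConsensusFastaFromGvcf.py: the function names, the dictionary-based representation of a reference block, and the example depths are hypothetical, and no cyvcf2 calls are used. It assumes each sample's zero-coverage intervals are already merged and use 0-based, half-open coordinates.

```python
from typing import Dict, List, Optional, Tuple

Interval = Tuple[int, int]  # 0-based, half-open (start, end)


def normalize_dp(value: Optional[int]) -> int:
    """Treat a missing depth (None or a negative sentinel) as zero."""
    return 0 if value is None or value < 0 else value


def split_points(
    block_start: int, block_end: int, intervals: List[Interval]
) -> List[Interval]:
    """Cut [block_start, block_end) at every interval boundary inside it."""
    cuts = {block_start, block_end}
    for start, end in intervals:
        for pos in (start, end):
            if block_start < pos < block_end:
                cuts.add(pos)
    ordered = sorted(cuts)
    return list(zip(ordered, ordered[1:]))


def fully_covered(intervals: List[Interval], start: int, end: int) -> bool:
    """True if one (already merged) interval contains [start, end)."""
    return any(s <= start and end <= e for s, e in intervals)


def split_block(
    block_start: int,
    block_end: int,
    block_dp: Dict[str, Optional[int]],
    per_sample_zero: Dict[str, List[Interval]],
) -> List[Tuple[int, int, Dict[str, int]]]:
    """Split a reference block and zero DP per sample where coverage is zero."""
    all_intervals = [iv for ivs in per_sample_zero.values() for iv in ivs]
    pieces = []
    for start, end in split_points(block_start, block_end, all_intervals):
        dp = {}
        for sample, depth in block_dp.items():
            if fully_covered(per_sample_zero.get(sample, []), start, end):
                dp[sample] = 0  # do not inherit the parent block's DP
            else:
                dp[sample] = normalize_dp(depth)
        pieces.append((start, end, dp))
    return pieces


# Hypothetical block with two samples and their zero-coverage intervals.
pieces = split_block(
    0,
    1000,
    {"A": 30, "B": 25},
    {"A": [(200, 400)], "B": [(300, 600)]},
)
for start, end, dp in pieces:
    print(start, end, dp)
```

In this model, sample A keeps the parent depth outside 200-400 and sample B keeps it outside 300-600, so only the sub-interval 300-400 has zero depth for both samples.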
@jbrestel jbrestel marked this pull request as draft May 1, 2026 18:42