Replace per-sample FreeBayes + bcftools merge with joint multi-sample FreeBayes #7
Draft
jbrestel wants to merge 21 commits into
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ge; support multi-sample gVCF Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…one FASTA per sample
… gVCF Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds extractSampleVcf to modules/snp.nf to extract per-sample VCFs (all variants, SNPs, indels) from a multi-sample VCF. Drops the gVCF tuple fields from makeSnpDensity input in modules/cnv.nf and renames the VCF param to vcfGz for consistency.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…Bayes Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…, mergeGvcfs, bcftoolsMpileupGvcf
…is container Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ndaries

When splitGvcfAtZeroCoverage splits a <*> reference block, the original FORMAT fields were copied verbatim into every sub-interval. Samples with zero coverage in a sub-interval therefore showed the parent block's DP instead of 0, causing makeConsensusFastaFromGvcf to emit reference bases instead of Ns.

- makeMultiSampleZeroCoverageBed now emits per-sample zero-coverage BEDs (<samplename>.persample.bed) in addition to union_zero.bed and all_zero.bed
- splitGvcfAtZeroCoverage receives the per-sample BEDs and passes them to the Python script; for each emitted sub-interval, DP and MIN_DP are set to 0 for any sample whose BED fully covers that sub-interval
- makeConsensusFastaFromGvcf now treats negative cyvcf2 DP sentinels (the in-memory representation of VCF '.') as zero, matching explicit DP=0

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
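The per-sample zeroing rule described in this commit can be illustrated with a minimal, standard-library-only sketch. It is not the code in `bin/splitGvcfAtZeroCoverage.py`: the function names (`fully_covered`, `zero_uncovered_samples`, `effective_depth`), the dictionary layout, and the assumption that each per-sample BED has already been merged into non-overlapping intervals are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional, Tuple

# BED-style interval: 0-based, half-open [start, end)
Interval = Tuple[int, int]


def fully_covered(sub: Interval, bed: List[Interval]) -> bool:
    """Return True if one BED interval contains `sub` entirely.

    Assumption: `bed` is merged (no overlapping or adjacent intervals),
    so full coverage implies containment in a single interval.
    """
    start, end = sub
    return any(b_start <= start and end <= b_end for b_start, b_end in bed)


def zero_uncovered_samples(
    sub: Interval,
    parent_fields: Dict[str, Dict[str, int]],
    per_sample_bed: Dict[str, List[Interval]],
) -> Dict[str, Dict[str, int]]:
    """Copy the parent block's FORMAT values into a sub-interval, but set
    DP and MIN_DP to 0 for samples whose zero-coverage BED fully covers it."""
    result = {}
    for sample, fields in parent_fields.items():
        new_fields = dict(fields)  # copy so the parent block is not mutated
        if fully_covered(sub, per_sample_bed.get(sample, [])):
            new_fields["DP"] = 0
            new_fields["MIN_DP"] = 0
        result[sample] = new_fields
    return result


def effective_depth(dp: Optional[int]) -> int:
    """Treat a missing depth (None or a negative sentinel) as zero coverage."""
    return 0 if dp is None or dp < 0 else dp


parent = {"A": {"DP": 30, "MIN_DP": 25}, "B": {"DP": 30, "MIN_DP": 25}}
beds = {"A": [(50, 250)], "B": [(150, 300)]}
split = zero_uncovered_samples((100, 200), parent, beds)
print(split["A"], split["B"])
```

In this example sample A's zero-coverage interval contains the whole sub-interval, so its depth fields are zeroed, while sample B's interval only overlaps part of it and its values are left unchanged.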
Summary
- Replaces the per-sample FreeBayes + `bcftools merge` pattern with a joint multi-sample FreeBayes call across all samples, eliminating post-hoc merge artifacts
- Splits the reference into fixed-size windows (`chunkSize = 1000000`) via `makeRegionBed`; FreeBayes runs per-chunk across all sample BAMs in parallel

New processes (`modules/snp.nf`)

- `makeRegionBed`: splits reference into fixed-size BED windows
- `makeMultiSampleZeroCoverageBed`: computes union zero-coverage BED across all sample BAMs for a region
- `concatMultiSampleVcf`: gathers per-region split gVCF chunks into a single joint multi-sample VCF; emits both gVCF and variants-only VCF
- `extractSampleVcf`: extracts per-sample VCF (+ SNPs + indels subsets) from the joint VCF for CNV downstream steps

Modified processes

- `freebayesMultiSample`: now accepts a `regionLine` input and `--region` flag; emits `regionKey` for scatter-gather joining
- `splitGvcfAtZeroCoverage`: removed per-sample BAM input; now takes a pre-computed `zero_cov.bed` from `makeMultiSampleZeroCoverageBed`
- `makeConsensusFromGvcf`: removed `sampleName` input; Python script iterates all samples in the joint gVCF, outputs one `{sampleName}_consensus.fa.gz` per sample
- `makeIndelTSV`: now takes joint multi-sample gVCF; loops over all samples via `bcftools query -l`
- `makeSnpDensity` (`cnv.nf`): input tuple slimmed from 9 to 7 elements (no gVCF paths)

Removed processes

- `freebayes`, `mergeVcfs`, `makeMergedVariantIndex`, `mergeGvcfs`, `bcftoolsMpileupGvcf`

Python script changes (`bin/`)

- `splitGvcfAtZeroCoverage.py`: replaced `--bedgraph` + DP recomputation with `--zero-cov-bed` (pre-computed union BED)
- `makeConsensusFastaFromGvcf.py`: updated for multi-sample: `build_consensus` takes `sample_idx`, `main()` loops over all samples with VCF reopen per sample, writes to `--output-dir`

Test coverage

- `testing/t/test_splitGvcfAtZeroCoverage.py`: 5 pytest unit tests
- `testing/t/test_makeConsensusFastaFromGvcf.py`: 5 pytest unit tests

Test plan

- `python -m pytest testing/t/test_splitGvcfAtZeroCoverage.py testing/t/test_makeConsensusFastaFromGvcf.py -v`
- `nextflow run main.nf -entry processSingleExperiment -profile processSingleExperiment -stub`

🤖 Generated with Claude Code
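For reviewers who want a concrete picture of the fixed-size windowing that `makeRegionBed` is described as performing, the following is a rough Python sketch. It is an assumption about the general technique, not the process's actual script: the function names `region_windows` and `to_bed_lines`, the in-memory contig-length mapping, and the example contigs are hypothetical; only the 1,000,000 bp chunk size comes from the summary above.

```python
from typing import Dict, Iterator, List, Tuple


def region_windows(
    contig_lengths: Dict[str, int], chunk_size: int = 1_000_000
) -> Iterator[Tuple[str, int, int]]:
    """Yield (contig, start, end) windows in BED coordinates
    (0-based, half-open). The last window of each contig is truncated
    at the contig end rather than padded."""
    for contig, length in contig_lengths.items():
        for start in range(0, length, chunk_size):
            yield contig, start, min(start + chunk_size, length)


def to_bed_lines(windows: List[Tuple[str, int, int]]) -> List[str]:
    """Format windows as tab-separated BED lines, one region per line."""
    return [f"{contig}\t{start}\t{end}" for contig, start, end in windows]


# Hypothetical contig lengths, for illustration only
windows = list(region_windows({"chr1": 2_500_000, "chr2": 400_000}))
for line in to_bed_lines(windows):
    print(line)
```

Each emitted line can then serve as one scatter unit: one joint FreeBayes invocation per window, with the per-window outputs concatenated afterwards in window order.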