vcfdist v3.0.0#37

Open
TimD1 wants to merge 131 commits into master from dev

TimD1 commented Apr 6, 2025

Main Goals for v3.0.0

  1. Avoid memory/runtime explosion in dense variant regions
  2. Support large SVs (ideally raise limit from 10kb to 100kb)
  3. Per-variant phasing analysis
  4. Genotyping error summary

Required Deliverables

Code Deliverables

  • shift from per-supercluster to per-variant phasing evaluation
  • remove edit distance calculation entirely
  • remove VCF normalization option entirely
  • shift to graph-based alignment
  • add genotype and allele count error summary
  • re-evaluate superclusters with FNs
  • shift to WFA alignment
    • initial attempt ended up not being much faster, since WFA occurs for each submatrix
    • merge all truth graph nodes during WFA
    • merge all query reference nodes during WFA
  • limit max supercluster size
  • more efficient clustering: mix of exact (bi-wfa) and heuristic-based

Runtime/Accuracy Investigation

  • maximum supercluster size
  • bi-wfa clustering iterations
  • adding clustering heuristics
  • different number of retries with FNs

Analysis Deliverables

  • regenerate accuracy plots from previous paper, showing v3 is better
  • regenerate runtime/memory plots from previous paper

Documentation Deliverables

  • update documentation/wiki
  • write blogpost summary of improvements
  • write BlueSky/LinkedIn post

Minor TODOs

  • update memory calculation?
  • update thread limiting logic (thread2 unused)

Limitations and Assumptions

  • Phase groups require all of their variants to be adjacent. An unphased variant in the middle of a phase group splits that group in two (for example, the test VCF in ./run does not phase heterozygous alternate variants).
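
The adjacency limitation above can be illustrated with a minimal sketch. It is not vcfdist's actual code: the `Variant` struct, its fields, and `split_phase_groups()` are hypothetical names chosen for illustration, and the sketch assumes each variant carries a phased flag and a phase set ID.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical, simplified variant record (not vcfdist's real data structure).
struct Variant {
    int pos;        // position on the contig
    bool phased;    // true for "0|1"-style genotypes, false for "0/1"
    int phase_set;  // phase set ID; ignored when phased == false
};

// Collect indices of consecutive phased variants that share a phase set.
// An unphased variant closes the current group and belongs to no group.
std::vector<std::vector<std::size_t>> split_phase_groups(
        const std::vector<Variant>& vars) {
    std::vector<std::vector<std::size_t>> groups;
    bool open = false;  // is a group currently being extended?
    for (std::size_t i = 0; i < vars.size(); ++i) {
        if (!vars[i].phased) {
            open = false;  // the unphased variant breaks adjacency
            continue;
        }
        // Start a new group after a break or when the phase set changes.
        if (!open ||
            vars[groups.back().back()].phase_set != vars[i].phase_set) {
            groups.emplace_back();
            open = true;
        }
        groups.back().push_back(i);
    }
    return groups;
}
```

With five variants where only the middle one is unphased, this yields two groups (indices 0 to 1 and 3 to 4), even if all four phased variants share the same phase set ID.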

Nice to Have

Related Projects

  • add complex variant normalization as bcftools extension
  • add nf-core module for vcfdist
  • add multiqc plugin for vcfdist
  • sandbox.bio tutorial for vcfdist

Potential Improvements

  • FN retrying should fix FN SVs "swallowing" TP SNPs, but what about TP SVs "swallowing" FN/FP SNPs?
  • biWFA clustering requires variants to be included. What about graph-based alignment here?

Efficiency Optimizations

  • wf_swg_align() DP matrix only needs offsets for N most recent rows
  • shift from Dijkstra to A*?
  • no traceback for retries only calculating distance
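
The first optimization above (keeping offsets for only the N most recent rows) could be sketched with a ring buffer. This is an illustration under stated assumptions, not the layout used inside `wf_swg_align()`: the class name `RollingRows` and its interface are hypothetical, and it assumes each row depends only on a bounded number of preceding rows.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical ring buffer that keeps only the N most recent rows of offsets.
class RollingRows {
public:
    RollingRows(std::size_t n_rows, std::size_t width, int fill)
        : n_(n_rows), fill_(fill), newest_(0),
          rows_(n_rows, std::vector<int>(width, fill)) {
        assert(n_rows > 0);
    }

    // Access row `score`; it must be one of the N most recent rows.
    std::vector<int>& row(std::size_t score) {
        assert(score <= newest_ && newest_ - score < n_);
        return rows_[score % n_];
    }

    // Begin the next row, reusing (and resetting) the oldest row's storage.
    std::vector<int>& advance() {
        ++newest_;
        std::vector<int>& r = rows_[newest_ % n_];
        std::fill(r.begin(), r.end(), fill_);
        return r;
    }

    std::size_t newest() const { return newest_; }

private:
    std::size_t n_;       // number of rows retained
    int fill_;            // value marking an unset offset
    std::size_t newest_;  // index of the most recent row
    std::vector<std::vector<int>> rows_;
};
```

Memory then scales with N times the row width rather than with the total number of rows. The trade-off is that a full traceback is no longer possible from the buffer alone, which fits the distance-only retries mentioned in the same list.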

Punting to v3.1

  • expand support to SV tags with precise breakpoints
  • evaluate unphased variants
  • retain all INFO/FORMAT fields in original VCFs
    • make summary.vcf output fields optional
  • add more thorough integration tests
    • bacterial genome (haploid)
    • VCF against itself (change normalization?)
    • empty truth or query VCF
    • mix of upper/lower case
  • feedback on desired outputs (fp.vcf, fn.vcf, tp.vcf?)
    • output dropped/unevaluated variants to VCF?

Runtime Analysis

Helium, 16 threads, 64GB RAM, max variant 1000

PAV

| SNP F1 | INDEL F1 | SV F1 | cluster | prec-recall | max ED | biWFA | max retries |
|---|---|---|---|---|---|---|---|
| 0.973031 | 0.963212 | 0.956453 | 1922.912 | 22400.579 | 1000 | 4 | 5 |
| 0.972793 | 0.962972 | 0.948846 | 1908.611 | 9519.489 | 100 | 4 | 5 |
| 0.972688 | 0.962791 | 0.948865 | 593.576 | 8812.550 | 100 | 1 | 5 |
| 0.972690 | 0.962811 | 0.948748 | 586.249 | 5794.202 | 100 | 1 | 3 |
| 0.972654 | 0.962774 | 0.947229 | 584.666 | 2887.851 | 100 | 1 | 1 |
| * | * | * | 584.441 | 260.692 | 100 | 1 | 0 |

HPRC

| SNP F1 | INDEL F1 | SV F1 | cluster | prec-recall | max ED | biWFA | max retries |
|---|---|---|---|---|---|---|---|
| 0.998519 | 0.984859 | 0.995799 | 2055.939 | 2105.901 | 1000 | 4 | 5 |
| 0.998497 | 0.984855 | 0.993183 | 2036.129 | 1226.166 | 100 | 4 | 5 |
| 0.998498 | 0.984847 | 0.993182 | 578.310 | 1116.628 | 100 | 1 | 5 |
| 0.998500 | 0.984848 | 0.993230 | 581.234 | 552.833 | 100 | 1 | 3 |
| 0.998496 | 0.984838 | 0.992879 | 584.549 | 317.261 | 100 | 1 | 1 |
| * | * | * | 583.830 | 106.741 | 100 | 1 | 0 |

GIAB-TR

| SNP F1 | INDEL F1 | SV F1 | cluster | prec-recall | max ED | biWFA | max retries |
|---|---|---|---|---|---|---|---|
| 0.982936 | 0.982728 | 0.969777 | 1998.548 | 20667.432 | 1000 | 4 | 5 |
| 0.982458 | 0.982524 | 0.964042 | 2004.119 | 11258.847 | 100 | 4 | 5 |
| 0.979404 | 0.981241 | 0.962295 | 571.643 | 11466.438 | 100 | 1 | 5 |
| 0.979451 | 0.981298 | 0.962706 | 565.778 | 7439.718 | 100 | 1 | 3 |
| 0.979339 | 0.981249 | 0.961703 | 570.376 | 3711.576 | 100 | 1 | 1 |
| * | * | * | 569.330 | 313.688 | 100 | 1 | 0 |
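
The trade-off in these tables can be quantified with a small helper. This is a sketch only: the `Run` struct and both function names are hypothetical, and the post does not state the units of the cluster and prec-recall columns, so only ratios and differences are computed.

```cpp
#include <cassert>

// One table row; field names are hypothetical labels for the columns above.
struct Run {
    double sv_f1;        // "SV F1" column
    double prec_recall;  // "prec-recall" column (units not stated in the post)
};

// How many times smaller the prec-recall cost of `fast` is relative to `base`.
double speedup(const Run& base, const Run& fast) {
    assert(fast.prec_recall > 0.0);
    return base.prec_recall / fast.prec_recall;
}

// Absolute SV F1 given up by switching from `base` to `fast`.
double sv_f1_loss(const Run& base, const Run& fast) {
    return base.sv_f1 - fast.sv_f1;
}
```

For PAV, comparing the first row (max ED 1000, biWFA 4, 5 retries) with the single-retry row (max ED 100, biWFA 1) gives `speedup({0.956453, 22400.579}, {0.947229, 2887.851})` of roughly 7.8 and an SV F1 loss of about 0.009.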

TimD1 and others added 30 commits March 22, 2024 14:13
 - no real changes to analysis, just minor visual updates
 - simplified, no more plotting SNP, INDEL, or LARGE
 - stats written to json file, added script for calc
 - changed to at most 2 threads per supercluster
 - started high-level reorganization
 - this occurs after clustering but before superclustering
 - this will allow graph generation, with genotype information
 - introduced add_callset_vars()
 - added variantData->nc, moved ctg_variants to callset_vars
 - updated Graph definition/structure with additional info
 - always select INS first if multiple variants occur at same position
 - main `calc_prec_recall_aln()` function is complete
 - added +1 to last query graph node length
 - added full alignment graph printing for debugging
 - truth string generation checks if variant occurs on hap
 - truth sequence is now a path through a graph
 - this should enable much simpler logic for parsing sync groups, etc.
 - incorrect, but everything completes
 - still need to uncomment and update initial phaseset printing
 - FP counting is still wrong: if calc_gt is 0|0, it's a FP
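
One possible reading of the commit note "always select INS first if multiple variants occur at same position" is a tie-breaking sort order. The sketch below illustrates that reading only. The `VarType` enum, `Var` struct, and `sort_ins_first()` are hypothetical and are not taken from the vcfdist source.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical variant categories and record, for illustration only.
enum class VarType { SNP, INS, DEL };

struct Var {
    int pos;       // position on the contig
    VarType type;  // variant category
};

// Sort by position; at equal positions, insertions come before other types.
// stable_sort keeps the input order among the remaining ties.
void sort_ins_first(std::vector<Var>& vars) {
    std::stable_sort(vars.begin(), vars.end(),
                     [](const Var& a, const Var& b) {
                         if (a.pos != b.pos) return a.pos < b.pos;
                         return a.type == VarType::INS &&
                                b.type != VarType::INS;
                     });
}
```

Given a SNP, an insertion, and a deletion that all start at the same position, the insertion is moved to the front while the SNP and deletion keep their original relative order.
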
TimD1 self-assigned this Jun 22, 2025