feat: het indel product calling — ref-length FASTA + dual-track products#6

Draft
jbrestel wants to merge 8 commits into merge-experiments-refactor from het-indel

@jbrestel
Member

Summary

  • Consensus FASTA is now permanently reference-length: all indel positions (hom and het) emit reference bases. N is reserved exclusively for low/no-coverage positions.
  • Single genomic_indels table with zygosity, ref_allele, alt_allele columns replaces the prior hom-only approach. findValues.pl now parses GT fields and emits the full 7-column TSV.
  • Het and hom indels now produce amino acid products: processSequenceVariations.jl applies indels to CDS sequences at product-calling time. Hom indels emit one product (ALT-applied); het indels emit two products (REF and ALT), consistent with how het SNPs are handled today.
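
The hom/het product rule in the last bullet can be sketched roughly as below. This is a Python illustration only, not the Julia code in processSequenceVariations.jl: the function signatures, the 0-based `offset`, and the "hom"/"het" zygosity strings are assumptions, and translation of each CDS to amino acids is left out.

```python
from typing import List


def apply_indel_to_cds(cds: str, offset: int, ref_allele: str, alt_allele: str) -> str:
    """Replace ref_allele at a 0-based CDS offset with alt_allele (hypothetical signature)."""
    end = offset + len(ref_allele)
    # Guard against applying an indel at the wrong coordinate.
    if cds[offset:end] != ref_allele:
        raise ValueError("ref_allele does not match the CDS at this offset")
    return cds[:offset] + alt_allele + cds[end:]


def cds_variants_for_indel(cds: str, offset: int, ref_allele: str,
                           alt_allele: str, zygosity: str) -> List[str]:
    """Return the CDS sequences that would each be translated into a product."""
    alt_cds = apply_indel_to_cds(cds, offset, ref_allele, alt_allele)
    if zygosity == "hom":
        return [alt_cds]          # one product: ALT applied
    if zygosity == "het":
        return [cds, alt_cds]     # two products: REF and ALT
    raise ValueError("zygosity must be 'hom' or 'het' in this sketch")


# A 2-base deletion written VCF-style (REF "AAA", ALT "A") at offset 3.
print(cds_variants_for_indel("ATGAAATGA", 3, "AAA", "A", "het"))
```

Keeping the REF sequence alongside the ALT-applied one for het calls mirrors the description above; how the real code orders or labels the two products is not shown in this PR description.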

Files Changed

| File | Change |
| --- | --- |
| bin/findValues.pl | GT parsing → 7-column TSV (zygosity, ref_allele, alt_allele) |
| modules/mergeExperiments.nf | genomic_indels table extended to 7 columns |
| bin/makeConsensusFastaFromGvcf.py | Hom + het indel branches emit ref bases |
| bin/makeCodingData.jl | Remove fasta_offset; extend codingIndels.db schema |
| bin/processSequenceVariations.jl | Remove get_indel_shift; add lookup_cds_indel, apply_indel_to_cds; indel product path |
| testing/t/findValues.t | New Perl tests |
| testing/t/test_makeConsensusFastaFromGvcf.py | New Python tests (4/4 passing on host) |
| testing/t/makeCodingData.jl | Updated tests for new schema |
| testing/t/processSequenceVariations.jl | New Julia tests (require Docker) |
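
For readers unfamiliar with the GT parsing mentioned for bin/findValues.pl, here is a minimal Python sketch of how a VCF GT string might map to a zygosity value. The function name, the "hom"/"het" labels, and the choice to return None for missing or hom-ref calls are assumptions for illustration; the Perl script's actual rules are not shown in this PR description.

```python
import re
from typing import Optional


def zygosity_from_gt(gt: str) -> Optional[str]:
    """Classify a VCF GT value such as '0/1' or '1|1' (hypothetical helper)."""
    # GT alleles are separated by '/' (unphased) or '|' (phased).
    alleles = re.split(r"[/|]", gt)
    if "." in alleles:
        return None               # missing call
    called = set(alleles)
    if called == {"0"}:
        return None               # homozygous reference: no variant to report
    # One distinct allele means homozygous; more than one means heterozygous.
    return "hom" if len(called) == 1 else "het"


for gt in ("0/1", "1/1", "1|2", "0/0", "./."):
    print(gt, zygosity_from_gt(gt))
```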

Test Plan

  • Python tests pass on host: python3 testing/t/test_makeConsensusFastaFromGvcf.py (4/4)
  • Julia tests in Docker: julia testing/t/makeCodingData.jl and julia testing/t/processSequenceVariations.jl
  • Perl tests in Docker: PERL5LIB=... prove testing/t/findValues.t
  • End-to-end: run mergeExperiments workflow with a sample that has het indels and verify product.dat contains both REF and ALT products

🤖 Generated with Claude Code

jbrestel and others added 8 commits April 17, 2026 13:21
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rence length

- Hom and het indels now emit ref_seq[pos:pos+len(REF)] instead of ALT allele or N masks
- Ensures consensus FASTA always matches reference length; indel info tracked in DB
- Also fixed SNP detection to check that allele length matches REF length
- Added test_makeConsensusFastaFromGvcf.py with 4 test cases covering hom/het indels and low coverage
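
The reference-length rule in the commit notes above can be illustrated with a short Python sketch. It is not the code in makeConsensusFastaFromGvcf.py: the function name, the 0-based `pos`, the `covered` flag, and the simplified SNP branch (which emits ALT and ignores het SNP handling) are all assumptions.

```python
def consensus_bases(ref_seq: str, pos: int, ref: str, alt: str, covered: bool = True) -> str:
    """Return the bases written to the consensus for one site (hypothetical helper)."""
    if not covered:
        # N is reserved for low/no-coverage positions.
        return "N" * len(ref)
    if len(alt) == len(ref):
        # Same-length allele: treated as a SNP here (simplified, hom case only).
        return alt
    # Indel, hom or het: emit reference bases so the output length never changes.
    return ref_seq[pos:pos + len(ref)]


ref_seq = "ACGTACGT"
print(consensus_bases(ref_seq, 2, "GTA", "G"))      # deletion
print(consensus_bases(ref_seq, 2, "G", "GTT"))      # insertion
print(consensus_bases(ref_seq, 1, "C", "T"))        # SNP
print(consensus_bases(ref_seq, 4, "A", "A", False)) # low coverage
```

In every branch the returned string has the same length as REF, which is what keeps the consensus aligned to reference coordinates.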

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…hema with zygosity/alleles

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… het indels now emit products

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@jbrestel jbrestel marked this pull request as draft May 1, 2026 18:42
@jbrestel jbrestel changed the base branch from main to merge-experiments-refactor May 1, 2026 18:43