
feat: Add VCF export support for SNP call datasets (GH-1054) #1197

Open
khushthecoder wants to merge 7 commits into malariagen:master from khushthecoder:GH1054-add-vcf-export

@khushthecoder
Contributor

Fix: #1054 — SNP calls to VCF export

Summary

Adds support for exporting SNP call datasets to Variant Call Format (VCF), enabling interoperability with common genomics tools and workflows such as bcftools, GATK, and R pipelines.


Changes

  • Introduce a VcfConverter mixin following the existing PlinkConverter pattern
  • Implement snp_calls_to_vcf() to export SNP calls directly to .vcf format
  • Stream data in chunks (10K variants per chunk) to avoid loading the full genotype matrix into memory
  • Explicitly decode byte-backed allele values to prevent b'A'-style artifacts in the output
  • Use snp_calls() as the data source so that multiallelic sites are preserved (standard for VCF)
  • Integrate the exporter into AnophelesDataResource via the existing cooperative MRO
  • Add vcf_params.py following the plink_params.py pattern for type-annotated parameters
  • Add tests following the test_plink_converter pattern to verify output structure, header correctness, sample ID matching, variant positions, and overwrite behavior
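The chunked streaming and byte decoding described above can be sketched roughly as follows. This is a minimal illustration, not the code in to_vcf.py: the helper names `decode_allele`, `iter_chunks`, and `write_records` are hypothetical, plain Python sequences stand in for the dataset arrays, and it assumes missing genotype calls are encoded as negative integers.

```python
def decode_allele(value):
    """Return an allele as str, decoding bytes such as b'A' to 'A'."""
    if isinstance(value, bytes):
        return value.decode("ascii")
    return str(value)


def iter_chunks(n_variants, chunk_size):
    """Yield (start, stop) index pairs covering n_variants in order."""
    for start in range(0, n_variants, chunk_size):
        yield start, min(start + chunk_size, n_variants)


def write_records(out, chrom, positions, alleles, genotypes, chunk_size=10_000):
    """Write one VCF data line per variant, one chunk of variants at a time.

    alleles[i] is (ref, alt1, alt2, ...); genotypes[i] holds one tuple of
    allele indices per sample. Returns the number of records written.
    """
    n_written = 0
    for start, stop in iter_chunks(len(positions), chunk_size):
        # With lazily loaded arrays, only this slice would need to be in memory.
        pos_chunk = positions[start:stop]
        allele_chunk = alleles[start:stop]
        gt_chunk = genotypes[start:stop]
        for pos, variant_alleles, calls in zip(pos_chunk, allele_chunk, gt_chunk):
            ref, *alts = (decode_allele(a) for a in variant_alleles)
            # Assumption: a negative allele index marks a missing call.
            gts = ["/".join("." if a < 0 else str(a) for a in call) for call in calls]
            fixed = [chrom, str(pos), ".", ref, ",".join(alts) or ".", ".", ".", ".", "GT"]
            out.write("\t".join(fixed + gts) + "\n")
            n_written += 1
    return n_written
```

Keeping every ALT allele in a comma-separated field is what lets multiallelic sites survive the export, and decoding before formatting avoids writing the Python repr of a bytes object into the file.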

Files Changed

| File | Status | Description |
|------|--------|-------------|
| malariagen_data/anoph/to_vcf.py | New | VcfConverter mixin with snp_calls_to_vcf() |
| malariagen_data/anoph/vcf_params.py | New | Parameter type aliases for VCF functions |
| malariagen_data/anopheles.py | Modified | Import + MRO integration of VcfConverter |
| tests/anoph/test_vcf_converter.py | New | Tests for VCF export |
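For readers unfamiliar with the cooperative MRO pattern mentioned for anopheles.py, here is a generic sketch. All class and method names below are hypothetical stand-ins, not the real classes in malariagen_data; the point is only that each mixin forwards keyword arguments through `super().__init__()`, so adding one more converter to the base list does not disturb the others.

```python
class DataBase:
    """Hypothetical shared base that consumes its own keyword arguments."""

    def __init__(self, *, name, **kwargs):
        super().__init__(**kwargs)
        self.name = name


class PlinkMixin(DataBase):
    """Hypothetical stand-in for an existing converter mixin."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def to_plink(self):
        return f"{self.name}.bed"


class VcfMixin(DataBase):
    """Hypothetical stand-in for a newly added converter mixin."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def to_vcf(self):
        return f"{self.name}.vcf"


class Resource(VcfMixin, PlinkMixin):
    """Combined resource: methods from every mixin are available."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
```

With this layout, `Resource(name="demo")` exposes both `to_vcf()` and `to_plink()`, and `DataBase.__init__` runs exactly once.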

Notes

  • No new external dependencies are introduced — VCF records are written directly as formatted text
  • The implementation explicitly decodes byte-backed allele values, addressing a concern raised in the review of PR #1071 ("Add VCF export support for SNP call datasets")
  • The overwrite parameter controls whether existing output files are regenerated or reused
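The overwrite behavior in the last note can be illustrated with a small sketch. The function name `export_if_needed`, the `write_fn` callback, and the default of `overwrite=False` are assumptions made for illustration; the actual signature of snp_calls_to_vcf() may differ.

```python
import os


def export_if_needed(output_path, write_fn, overwrite=False):
    """Return output_path, calling write_fn(path) only when needed.

    An existing file is reused as-is unless overwrite is True, in which
    case it is regenerated.
    """
    if os.path.exists(output_path) and not overwrite:
        # Reuse the existing file without touching it.
        return output_path
    write_fn(output_path)
    return output_path
```

Returning the path in both branches lets callers treat a reused file and a freshly written one the same way.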

Adds a VcfConverter mixin following the existing PlinkConverter pattern:

- New VcfConverter class in malariagen_data/anoph/to_vcf.py with
  snp_calls_to_vcf() method for exporting SNP calls to VCF format
- Streams data in chunks to keep memory usage low on large datasets
- Explicitly decodes byte-backed allele values to prevent artifacts
- Uses snp_calls() as data source to preserve multiallelic sites
- Integrated into AnophelesDataResource via MRO
- Tests follow the test_plink_converter pattern

Closes malariagen#1054

variant_contig stores integer indices (dtype u1), not string names.
The check 'contig_chunk.dtype.kind == "S"' was never true, so CHROM
values were written as '0', '1', '2' instead of actual contig names
like '2R', '2L', '3R'.

Fix: Use ds.attrs['contigs'] to map integer indices to contig names,
following the established pattern in snp_frq.py and aim_data.py.

Also:
- Add ##contig header lines for VCF spec compliance
- Update test to verify CHROM values match actual contig names
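
The index-to-name mapping and the ##contig header lines described in this commit can be sketched as below. The helper names are hypothetical, and a plain tuple stands in for ds.attrs['contigs']; the ##contig lines carry only an ID field here, without lengths.

```python
def contig_names_for(indices, contigs):
    """Map integer contig indices (e.g. 0, 1, 2) to names (e.g. '2R', '2L', '3R')."""
    return [contigs[int(i)] for i in indices]


def contig_header_lines(contigs):
    """Build one ##contig meta-information line per contig name."""
    return [f"##contig=<ID={name}>" for name in contigs]


# Stand-in for ds.attrs["contigs"]; the order defines what each index means.
contigs = ("2R", "2L", "3R")
chrom_column = contig_names_for([0, 0, 1, 2], contigs)
header = contig_header_lines(contigs)
```

Looking names up by index, rather than checking the dtype of variant_contig, means the CHROM column always follows whatever contig order the dataset declares.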
@khushthecoder
Contributor Author

@jonbrenas, all checks are now passing, including the recent pre-commit / ruff-format linting steps. The exporter decodes byte-backed allele values without native dependencies and streams data in chunks to keep memory overhead low, as exercised by the new tests.

Could you please take a look when you have a moment? Let me know if you need any final tweaks.

Development

Successfully merging this pull request may close these issues.

Adding other file formats option