perf: vectorize cohort_heterozygosity() for 10-50x speedup by kunal-10-cloud · Pull Request #1212 · malariagen/malariagen-data-python

kunal-10-cloud · 2026-03-23T18:20:57Z

Summary

This PR optimizes cohort_heterozygosity() by loading SNP data once per cohort instead of once per sample. For a cohort of N samples this reduces the number of SNP data loads from O(N) to O(1) and enables vectorized heterozygosity computation across all samples.

Estimated speedup: 10-50× for 100+ sample cohorts
Backward compatible: Yes — external API unchanged

Fixes: #1211

Changes Made

1. New Method: `_cohort_count_het_vectorized()`

File: malariagen_data/anoph/heterozygosity.py (lines 398-487)

Vectorized heterozygosity computation for multiple samples in a cohort:

Loads SNP data once for all cohort samples via a single snp_calls() call
Uses GenotypeDaskArray.is_het() for vectorized computation, producing a boolean array of shape (variants, samples)
Applies per-sample windowing in a manual loop (moving_statistic is 1D-only)
Returns Dict[sample_id → (windows, counts)] for easy per-sample access

Key optimization: Eliminates O(N) disk I/O operations, replacing with O(1)
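
The load-once, window-per-sample pattern described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: it uses plain NumPy in place of snp_calls() and GenotypeDaskArray, the function name count_het_per_sample is hypothetical, and it assumes diploid calls with negative values for missing alleles and non-overlapping windows that drop an incomplete trailing window.

```python
import numpy as np


def count_het_per_sample(gt, sample_ids, window_size):
    """Count heterozygous calls per window for every sample (illustrative only).

    gt: integer array of shape (n_variants, n_samples, 2) holding diploid
        calls; negative values are assumed to mark missing alleles.
    Returns a dict mapping sample_id -> (window_starts, counts).
    """
    gt = np.asarray(gt)
    a0, a1 = gt[..., 0], gt[..., 1]
    # One vectorized pass over all samples: boolean (n_variants, n_samples).
    is_het = (a0 >= 0) & (a1 >= 0) & (a0 != a1)

    n_windows = is_het.shape[0] // window_size
    usable = n_windows * window_size  # drop the incomplete trailing window
    window_starts = np.arange(n_windows) * window_size

    results = {}
    for j, sample_id in enumerate(sample_ids):
        # Windowing stays per-sample, mirroring the 1D-only windowing step.
        counts = is_het[:usable, j].reshape(n_windows, window_size).sum(axis=1)
        results[sample_id] = (window_starts, counts)
    return results
```

The genotype array is read once and the heterozygosity test runs over every sample in a single operation; only the cheap windowing step remains in a Python loop.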

2. Refactored: `cohort_heterozygosity()`

File: malariagen_data/anoph/heterozygosity.py (lines 898-927)

Replaced sequential per-sample loop with single call to _cohort_count_het_vectorized()
Maintains identical output format and numerical precision
Added clear comments documenting vectorization benefit
Backward compatible: No external API changes
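
As a rough sketch of how the refactored method might consume the helper's output, the snippet below reduces a dict of per-sample (windows, counts) pairs to one mean value per sample. The function name summarize_cohort is hypothetical, and dividing counts by the window size to obtain a per-window heterozygosity is an assumption made for illustration, not something confirmed by this PR.

```python
import numpy as np


def summarize_cohort(per_sample_counts, window_size):
    """Reduce per-sample windowed het counts to one mean per sample.

    per_sample_counts: dict mapping sample_id -> (windows, counts).
    Assumption: per-window heterozygosity is counts / window_size.
    """
    means = {}
    for sample_id, (_windows, counts) in per_sample_counts.items():
        het = np.asarray(counts) / window_size
        means[sample_id] = float(het.mean())
    return means
```

Because the helper already returns results keyed by sample, this aggregation step needs no further data loading.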

3. Regression Test: `test_cohort_count_het_vectorized_regression()`

File: tests/anoph/test_heterozygosity.py (lines 265-328)

Validates that the vectorized method produces the same results as the sequential approach:

Compares outputs element-by-element across all samples
Verifies numerical agreement within floating-point tolerance (rtol=1e-10)
Tests across all 4 data resource types (ag3_sim, af1_sim, adir1_sim, amin1_sim)
Uses small cohort (3 samples) to keep test execution fast
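
The comparison described above could look something like the following. This is a sketch under assumptions rather than the test added in this PR: the helper name assert_results_match is hypothetical, and both inputs are assumed to be dicts mapping sample_id to a (windows, counts) pair.

```python
import numpy as np


def assert_results_match(cohort_results, sequential_results, rtol=1e-10):
    """Check that two per-sample result dicts agree for every sample."""
    # Both approaches must cover exactly the same samples.
    assert set(cohort_results) == set(sequential_results)
    for sample_id, (seq_windows, seq_counts) in sequential_results.items():
        coh_windows, coh_counts = cohort_results[sample_id]
        # Element-by-element comparison within a relative tolerance.
        np.testing.assert_allclose(coh_windows, seq_windows, rtol=rtol)
        np.testing.assert_allclose(coh_counts, seq_counts, rtol=rtol)
```

In a real test, sequential_results would be built by calling the existing per-sample method in a loop, and cohort_results by a single call to the new helper.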

Testing

Test Results

All 28 heterozygosity tests pass (existing + new)
4× test_cohort_heterozygosity — PASS
4× test_cohort_count_het_vectorized_regression — PASS
20× other heterozygosity tests — PASS

Code Quality

Pre-commit checks: PASS
trim trailing whitespace
fix end of files
ruff linting
ruff formatting
Linting: All checks passed

Checklist

Additional Notes

For Reviewers

The _cohort_count_het_vectorized() method is self-contained and could be reused in other multi-sample methods
The same vectorization pattern can be applied to other cohort-level computations
Regression test uses small cohort (3 samples) by design — large cohorts are covered by existing integration tests

Future Enhancements

Optional parallelization of per-sample windowing using concurrent.futures
Generalization to other multi-sample iterators in the codebase
Caching layer for frequently accessed cohort parameters
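
To make the first enhancement concrete, here is one possible shape for parallel per-sample windowing with concurrent.futures. It is a sketch only: the function names are hypothetical, the input is assumed to be an in-memory boolean array of shape (variants, samples), and whether threads give any real speedup for this workload would need to be measured.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def window_counts(is_het_1d, window_size):
    """Sum het calls in non-overlapping windows, dropping a partial last window."""
    n_windows = is_het_1d.shape[0] // window_size
    usable = is_het_1d[: n_windows * window_size]
    return usable.reshape(n_windows, window_size).sum(axis=1)


def window_all_samples(is_het, sample_ids, window_size, max_workers=4):
    """Apply window_counts to each sample column using a thread pool."""
    columns = [is_het[:, j] for j in range(is_het.shape[1])]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with sample_ids.
        counts = pool.map(lambda col: window_counts(col, window_size), columns)
        return dict(zip(sample_ids, counts))
```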

- Add _cohort_count_het_vectorized() method that loads SNP data once per cohort instead of repeatedly per sample, reducing disk I/O from O(N) to O(1)
- Use GenotypeDaskArray.is_het() for vectorized heterozygosity computation across all samples in a single operation
- Refactor cohort_heterozygosity() to use vectorized method while maintaining identical output format and numerical precision
- Add regression test verifying vectorized method produces identical results as sequential per-sample approach (within floating-point tolerance)
- All 28 existing tests pass; 4 new test cases confirm numerical correctness

Copilot

Pull request overview

This PR targets a major performance improvement for cohort_heterozygosity() by avoiding repeated SNP loading per sample and enabling cohort-wide vectorized heterozygosity computation.

Changes:

Added a new private vectorized helper _cohort_count_het_vectorized() to compute windowed heterozygosity counts for multiple samples from a single snp_calls() load.
Refactored cohort_heterozygosity() to use the new vectorized helper and then aggregate per-sample means.
Added a regression test comparing vectorized vs sequential heterozygosity outputs.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| `malariagen_data/anoph/heterozygosity.py` | Introduces vectorized cohort heterozygosity counting and updates `cohort_heterozygosity()` to use it. |
| `tests/anoph/test_heterozygosity.py` | Adds a regression test validating vectorized results match the sequential implementation. |

- Store raw dask array before subsetting to avoid AttributeError on .data
- Access gt_data directly instead of wrapping then slicing GenotypeDaskArray
- All 28 tests pass (4 cohort_heterozygosity + 4 regression + 20 others)
- Maintains memory optimization: per-sample computation avoids materializing full array
- Addresses final Copilot code review suggestion

jonbrenas · 2026-03-24T08:51:51Z

Thanks @kunal-10-cloud, I don't think adding vectorized to the name of the function is needed. It might also be a function that could be useful as a public function.

…as public API
- Updated docstring to emphasize this is a public reusable method
- Clarified vectorized approach in comments for performance context
- Renamed test variables: vectorized_results → cohort_results
- Renamed: vectorized_het → cohort_het for consistency with public API
- Updated inline comments to reference cohort_count_het() explicitly
- All 28 tests pass, pre-commit checks pass

kunal-10-cloud · 2026-03-24T12:57:27Z

Hi @jonbrenas, I have made the relevant changes and updated the branch as well. Please take a look.

Copilot AI review requested due to automatic review settings March 23, 2026 18:20

Copilot started reviewing on behalf of kunal-10-cloud March 23, 2026 18:21

Copilot AI reviewed Mar 23, 2026

kunal-10-cloud mentioned this pull request Mar 23, 2026

Performance Optimization for cohort_heterozygosity() #1211

Open

kunal-10-cloud added 2 commits March 24, 2026 18:14

Merge branch 'master' into optimize/cohort-heterozygosity-vectorized

c768147

kunal-10-cloud mentioned this pull request Mar 24, 2026

Migrate gene_cnv() to AnophelesCnvData #1214

Open

5 tasks

Merge branch 'master' into optimize/cohort-heterozygosity-vectorized

e16dd22

kunal-10-cloud wants to merge 5 commits into malariagen:master from kunal-10-cloud:optimize/cohort-heterozygosity-vectorized

3 participants
