Skip to content

perf: vectorize cohort_heterozygosity() for 10-50x speedup#1212

Open
kunal-10-cloud wants to merge 5 commits intomalariagen:masterfrom
kunal-10-cloud:optimize/cohort-heterozygosity-vectorized
Open

perf: vectorize cohort_heterozygosity() for 10-50x speedup#1212
kunal-10-cloud wants to merge 5 commits intomalariagen:masterfrom
kunal-10-cloud:optimize/cohort-heterozygosity-vectorized

Conversation

@kunal-10-cloud
Copy link
Contributor

@kunal-10-cloud kunal-10-cloud commented Mar 23, 2026

Summary

This PR optimizes cohort_heterozygosity() by loading SNP data once per cohort instead of N times per sample, reducing disk I/O from O(N) → O(1) and enabling vectorized heterozygosity computation across all samples.

Estimated speedup: 10-50× for 100+ sample cohorts
Backward compatible: Yes — external API unchanged

Fixes: #1211

Changes Made

1. New Method: _cohort_count_het_vectorized()

File: malariagen_data/anoph/heterozygosity.py (lines 398-487)

Vectorized heterozygosity computation for multiple samples in a cohort:

  • Loads SNP data once for all cohort samples via single snp_calls() call
  • Uses GenotypeDaskArray.is_het() for vectorized computation → (variants, samples) shape
  • Applies per-sample windowing in manual loop (moving_statistic is 1D-only)
  • Returns Dict[sample_id → (windows, counts)] for easy per-sample access

Key optimization: Eliminates O(N) disk I/O operations, replacing with O(1)

2. Refactored: cohort_heterozygosity()

File: malariagen_data/anoph/heterozygosity.py (lines 898-927)

  • Replaced sequential per-sample loop with single call to _cohort_count_het_vectorized()
  • Maintains identical output format and numerical precision
  • Added clear comments documenting vectorization benefit
  • Backward compatible: No external API changes

3. Regression Test: test_cohort_count_het_vectorized_regression()

File: tests/anoph/test_heterozygosity.py (lines 265-328)

Validates vectorized method produces identical results as sequential approach:

  • Compares outputs element-by-element across all samples
  • Verifies numerical identity within floating-point tolerance (rtol=1e-10)
  • Tests across all 4 data resource types (ag3_sim, af1_sim, adir1_sim, amin1_sim)
  • Uses small cohort (3 samples) to keep test execution fast

Testing

Test Results

  • All 28 heterozygosity tests pass (existing + new)
  • test_cohort_heterozygosity — PASS
  • test_cohort_count_het_vectorized_regression — PASS
  • 20× other heterozygosity tests — PASS

Code Quality

  • Pre-commit checks: PASS

  • trim trailing whitespace

  • fix end of files

  • ruff linting

  • ruff formatting

  • Linting: All checks passed

Checklist

  • Changes follow project style guidelines
  • Pre-commit checks pass
  • Ruff linting checks pass
  • Ruff formatting checks pass
  • All tests pass (28/28)
  • Regression test verifies numerical correctness
  • New method has comprehensive docstring
  • Comments explain optimization benefit
  • Backward compatible
  • No breaking changes

Additional Notes

For Reviewers

  1. The _cohort_count_het_vectorized() method is self-contained and could be reused in other multi-sample methods
  2. The same vectorization pattern can be applied to other cohort-level computations
  3. Regression test uses small cohort (3 samples) by design — large cohorts are covered by existing integration tests

Future Enhancements

  • Optional parallelization of per-sample windowing using concurrent.futures
  • Generalization to other multi-sample iterators in the codebase
  • Caching layer for frequently accessed cohort parameters

- Add _cohort_count_het_vectorized() method that loads SNP data once per cohort
  instead of repeatedly per sample, reducing disk I/O from O(N) to O(1)
- Use GenotypeDaskArray.is_het() for vectorized heterozygosity computation
  across all samples in a single operation
- Refactor cohort_heterozygosity() to use vectorized method while maintaining
  identical output format and numerical precision
- Add regression test verifying vectorized method produces identical results
  as sequential per-sample approach (within floating-point tolerance)
- All 28 existing tests pass; 4 new test cases confirm numerical correctness
Copilot AI review requested due to automatic review settings March 23, 2026 18:20
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR targets a major performance improvement for cohort_heterozygosity() by avoiding repeated SNP loading per sample and enabling cohort-wide vectorized heterozygosity computation.

Changes:

  • Added a new private vectorized helper _cohort_count_het_vectorized() to compute windowed heterozygosity counts for multiple samples from a single snp_calls() load.
  • Refactored cohort_heterozygosity() to use the new vectorized helper and then aggregate per-sample means.
  • Added a regression test comparing vectorized vs sequential heterozygosity outputs.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
malariagen_data/anoph/heterozygosity.py Introduces vectorized cohort heterozygosity counting and updates cohort_heterozygosity() to use it.
tests/anoph/test_heterozygosity.py Adds a regression test validating vectorized results match the sequential implementation.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

- Store raw dask array before subsetting to avoid AttributeError on .data
- Access gt_data directly instead of wrapping then slicing GenotypeDaskArray
- All 28 tests pass (4 cohort_heterozygosity + 4 regression + 20 others)
- Maintains memory optimization: per-sample computation avoids materializing full array
- Addresses final Copilot code review suggestion
@jonbrenas
Copy link
Collaborator

Thanks @kunal-10-cloud, I don't think adding vectorized to the name of the function is needed. It might also be a function that could be useful as a public function.

…as public API

- Updated docstring to emphasize this is a public reusable method
- Clarified vectorized approach in comments for performance context
- Renamed test variables: vectorized_results → cohort_results
- Renamed: vectorized_het → cohort_het for consistency with public API
- Updated inline comments to reference cohort_count_het() explicitly
- All 28 tests pass, pre-commit checks pass
@kunal-10-cloud
Copy link
Contributor Author

Hi @jonbrenas i have made the relevant changes and updated the branch as well please look into it once

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Performance Optimization for cohort_heterozygosity()

3 participants