perf: vectorize cohort_heterozygosity() for 10-50x speedup#1212
Open
kunal-10-cloud wants to merge 5 commits intomalariagen:masterfrom
Open
perf: vectorize cohort_heterozygosity() for 10-50x speedup#1212kunal-10-cloud wants to merge 5 commits intomalariagen:masterfrom
kunal-10-cloud wants to merge 5 commits intomalariagen:masterfrom
Conversation
- Add _cohort_count_het_vectorized() method that loads SNP data once per cohort instead of repeatedly per sample, reducing disk I/O from O(N) to O(1) - Use GenotypeDaskArray.is_het() for vectorized heterozygosity computation across all samples in a single operation - Refactor cohort_heterozygosity() to use vectorized method while maintaining identical output format and numerical precision - Add regression test verifying vectorized method produces identical results as sequential per-sample approach (within floating-point tolerance) - All 28 existing tests pass; 4 new test cases confirm numerical correctness
Contributor
There was a problem hiding this comment.
Pull request overview
This PR targets a major performance improvement for cohort_heterozygosity() by avoiding repeated SNP loading per sample and enabling cohort-wide vectorized heterozygosity computation.
Changes:
- Added a new private vectorized helper
_cohort_count_het_vectorized()to compute windowed heterozygosity counts for multiple samples from a singlesnp_calls()load. - Refactored
cohort_heterozygosity()to use the new vectorized helper and then aggregate per-sample means. - Added a regression test comparing vectorized vs sequential heterozygosity outputs.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
malariagen_data/anoph/heterozygosity.py |
Introduces vectorized cohort heterozygosity counting and updates cohort_heterozygosity() to use it. |
tests/anoph/test_heterozygosity.py |
Adds a regression test validating vectorized results match the sequential implementation. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Store raw dask array before subsetting to avoid AttributeError on .data - Access gt_data directly instead of wrapping then slicing GenotypeDaskArray - All 28 tests pass (4 cohort_heterozygosity + 4 regression + 20 others) - Maintains memory optimization: per-sample computation avoids materializing full array - Addresses final Copilot code review suggestion
Collaborator
|
Thanks @kunal-10-cloud, I don't think adding |
…as public API - Updated docstring to emphasize this is a public reusable method - Clarified vectorized approach in comments for performance context - Renamed test variables: vectorized_results → cohort_results - Renamed: vectorized_het → cohort_het for consistency with public API - Updated inline comments to reference cohort_count_het() explicitly - All 28 tests pass, pre-commit checks pass
Contributor
Author
|
Hi @jonbrenas i have made the relevant changes and updated the branch as well please look into it once |
5 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR optimizes
cohort_heterozygosity()by loading SNP data once per cohort instead of N times per sample, reducing disk I/O from O(N) → O(1) and enabling vectorized heterozygosity computation across all samples.Estimated speedup: 10-50× for 100+ sample cohorts
Backward compatible: Yes — external API unchanged
Fixes: #1211
Changes Made
1. New Method:
_cohort_count_het_vectorized()File:
malariagen_data/anoph/heterozygosity.py(lines 398-487)Vectorized heterozygosity computation for multiple samples in a cohort:
snp_calls()callGenotypeDaskArray.is_het()for vectorized computation →(variants, samples)shapeDict[sample_id → (windows, counts)]for easy per-sample accessKey optimization: Eliminates O(N) disk I/O operations, replacing with O(1)
2. Refactored:
cohort_heterozygosity()File:
malariagen_data/anoph/heterozygosity.py(lines 898-927)_cohort_count_het_vectorized()3. Regression Test:
test_cohort_count_het_vectorized_regression()File:
tests/anoph/test_heterozygosity.py(lines 265-328)Validates vectorized method produces identical results as sequential approach:
Testing
Test Results
test_cohort_heterozygosity— PASStest_cohort_count_het_vectorized_regression— PASSCode Quality
Pre-commit checks: PASS
trim trailing whitespace
fix end of files
ruff linting
ruff formatting
Linting: All checks passed
Checklist
Additional Notes
For Reviewers
_cohort_count_het_vectorized()method is self-contained and could be reused in other multi-sample methodsFuture Enhancements
concurrent.futures