Problem Statement
The `cohort_heterozygosity()` method in `malariagen_data/anoph/heterozygosity.py` computes heterozygosity for each sample in a cohort sequentially using a for-loop. For large cohorts (100+ samples), this approach is unnecessarily slow because:
- Repeated Data Loading: Each `sample_count_het()` call independently loads the same SNP data for the region from disk via `snp_calls()`. For a 100-sample cohort, the SNP data is loaded 100 times unnecessarily.
- Sequential Processing: The inner loop processes samples one at a time, preventing any vectorization benefit.
Current Implementation (Lines 798-809)

```python
het_values = []
for sample_id in df_cohort_samples["sample_id"]:
    df_het = self.sample_count_het(
        sample=sample_id,
        region=region_prepped,
        window_size=window_size,
        site_mask=site_mask,
        chunks=chunks,
        inline_array=inline_array,
    )
    het_values.append(df_het["heterozygosity"].mean())
```
Root Cause
- `snp_calls()` is called O(N) times, where N = number of samples in the cohort
- Each call involves expensive disk I/O to load variant position and genotype data
- Data loading, not the heterozygosity computation itself, dominates execution time for large cohorts
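The O(N) versus O(1) I/O pattern can be made concrete with a small sketch. The `load_snp_data()` stand-in below is hypothetical (it only counts invocations, in place of the real `snp_calls()` disk read), but it shows how the current loop multiplies load cost by cohort size while a single batched load does not:

```python
import numpy as np

load_calls = 0

def load_snp_data():
    """Hypothetical stand-in for snp_calls(): counts how often data is loaded."""
    global load_calls
    load_calls += 1
    # Fake genotype array with shape (variants, samples, ploidy).
    return np.zeros((100, 5, 2), dtype=int)

n_samples = 5

# Sequential pattern: one load per sample -> O(N) I/O.
for s in range(n_samples):
    _ = load_snp_data()[:, s]
assert load_calls == n_samples

# Batched pattern: a single load serves every sample -> O(1) I/O.
load_calls = 0
gt = load_snp_data()
for s in range(n_samples):
    _ = gt[:, s]
assert load_calls == 1
```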
Proposed Solution
Implement a vectorized approach that:
- Loads SNP data once per cohort, using a single `snp_calls()` call instead of N separate calls
- Computes heterozygosity across all samples in one batch using `GenotypeDaskArray.is_het()`, which produces shape (variants, samples) instead of per-sample (variants,)
- Applies per-sample windowing in a loop (unavoidable, since `allel.moving_statistic()` is 1D-only)
- Returns a results dict mapping sample_id → (windows, counts) tuples
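The steps above can be sketched in plain NumPy. This is a minimal illustration, not the proposed implementation: the real code would call `GenotypeDaskArray.is_het()` and `allel.moving_statistic()`, whereas here the is-het test is written out directly and windowing is simplified to non-overlapping blocks. The function name `cohort_windowed_het_counts` is hypothetical:

```python
import numpy as np

def cohort_windowed_het_counts(gt, pos, window_size):
    """Batched windowed het counts (illustrative sketch).

    gt: int array, shape (variants, samples, 2), genotype calls, -1 = missing.
    pos: int array, shape (variants,), variant positions.
    Returns (windows, counts): window start/stop positions, shape
    (n_windows, 2), and per-window het counts, shape (n_windows, samples).
    """
    # Single batched is-het computation across ALL samples at once:
    # True where the two alleles differ and both are called.
    called = np.all(gt >= 0, axis=2)          # (variants, samples)
    is_het = (gt[:, :, 0] != gt[:, :, 1]) & called

    # Non-overlapping windows over variants; the reshape applies the
    # 1D windowing to every sample column simultaneously.
    n_windows = len(pos) // window_size
    n = n_windows * window_size
    counts = is_het[:n].reshape(n_windows, window_size, -1).sum(axis=1)

    pw = pos[:n].reshape(n_windows, window_size)
    windows = np.column_stack([pw[:, 0], pw[:, -1]])
    return windows, counts
```

Only the final windowing over positions remains per-window; the expensive per-sample pass of the original loop is gone.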
Key Optimization Points
- Data Loading: O(N) → O(1) disk I/O operations
- Heterozygosity Computation: Single batch operation across all samples
- Per-Sample Aggregation: Manual loop for windowing (required by API constraints)
Testing Requirements
- Regression test: Verify the vectorized method produces results identical to the sequential approach
- Test all cohort sizes: 1, 10, 100+ samples
- Test all sample_set, site_mask, window_size combinations
- Numerical tolerance: Within floating-point precision (rtol=1e-10)
- Integration test: Confirm existing tests still pass
- Edge cases: Single-sample cohorts, missing sites, window_size > region size
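The regression-test idea can be sketched as a direct sequential-versus-vectorized comparison. This is a self-contained NumPy illustration, assuming synthetic genotypes rather than the real `sample_count_het()` / `snp_calls()` pipeline; the tolerance mirrors the `rtol=1e-10` requirement above:

```python
import numpy as np

rng = np.random.default_rng(42)
n_variants, n_samples = 1000, 10
# Synthetic diploid genotypes: shape (variants, samples, 2), alleles 0/1.
gt = rng.integers(0, 2, size=(n_variants, n_samples, 2))

# Sequential reference: one pass per sample, mirroring the original loop.
seq = np.array(
    [np.mean(gt[:, s, 0] != gt[:, s, 1]) for s in range(n_samples)]
)

# Vectorized: one batched comparison across all samples at once.
vec = (gt[:, :, 0] != gt[:, :, 1]).mean(axis=0)

# Both paths must agree to within floating-point precision.
assert np.allclose(seq, vec, rtol=1e-10)
```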