Skip to content

Performance Optimization for cohort_heterozygosity() #1211

@kunal-10-cloud

Description

@kunal-10-cloud

Problem Statement

The cohort_heterozygosity() method in malariagen_data/anoph/heterozygosity.py computes heterozygosity for each sample in a cohort sequentially using a for-loop. For large cohorts (100+ samples), this approach is unnecessarily slow because:

  1. Repeated Data Loading: Each sample_count_het() call independently loads the same SNP data for the region from disk via snp_calls(). For a 100-sample cohort, SNP data is loaded 100 times unnecessarily.

  2. Sequential Processing: The inner loop processes samples one at a time, preventing any vectorization benefit.

Current Implementation (Lines 798-809)

het_values = []
for sample_id in df_cohort_samples["sample_id"]:
    df_het = self.sample_count_het(
        sample=sample_id,
        region=region_prepped,
        window_size=window_size,
        site_mask=site_mask,
        chunks=chunks,
        inline_array=inline_array,
    )
    het_values.append(df_het["heterozygosity"].mean())

Root Cause

  • snp_calls() is called O(N) times where N = number of samples in cohort
  • Each call involves expensive disk I/O to load variant position and genotype data
  • Data loading dominates execution time for large cohorts, not the actual heterozygosity computation

Proposed Solution

Implement a vectorized approach that:

  1. Load SNP data once per cohort using a single snp_calls() call instead of N separate calls
  2. Compute heterozygosity across all samples vectorized using GenotypeDaskArray.is_het() which produces shape (variants, samples) instead of per-sample (variants,)
  3. Apply per-sample windowing in a loop (unavoidable since allel.moving_statistic() is 1D-only)
  4. Return results dict mapping sample_id → (windows, counts) tuples

Key Optimization Points

  • Data Loading: O(N) → O(1) disk I/O operations
  • Heterozygosity Computation: Single batch operation across all samples
  • Per-Sample Aggregation: Manual loop for windowing (required by API constraints)

Testing Requirements

  • Regression test: Verify vectorized method produces identical results as sequential approach
    • Test all cohort sizes: 1, 10, 100+ samples
    • Test all sample_set, site_mask, window_size combinations
    • Numerical tolerance: Within floating-point precision (rtol=1e-10)
  • Integration test: Confirm existing tests still pass
  • Edge cases: Single-sample cohorts, missing sites, window_size > region size

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions