
Optimize sparse matrix aggregation and conversion handling #4073

Open
mumichae wants to merge 8 commits into scverse:main from mumichae:dask_aggregate_memory_leak

Conversation

@mumichae
Contributor

@mumichae mumichae commented Apr 17, 2026

Adjusted the dask call to use dask.delayed, which has massively improved the memory footprint on my end. I also threw in a few small adjustments to make use of sparsity and return sparse matrices, since even aggregated data can be sparse.
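
For readers unfamiliar with the pattern, here is a minimal sketch (not the PR's actual code) of a group-wise sum over a dask array expressed with dask.delayed, one task per block. The function name aggregate_sum and the assumption that X is chunked only along the cell axis are illustrative, not taken from scanpy.

import dask
import dask.array as da
import numpy as np
import scipy.sparse as sp

def aggregate_sum(indicator: sp.csr_matrix, X: da.Array) -> np.ndarray:
    # indicator: (n_groups, n_cells); X: (n_cells, n_genes),
    # assumed to be chunked along axis 0 only.
    offsets = np.concatenate([[0], np.cumsum(X.chunks[0])])
    partials = [
        # Each task multiplies the indicator columns for this block's
        # cells with the block itself, so only one block needs to be
        # materialized per task.
        dask.delayed(lambda ind, blk: ind @ blk)(
            indicator[:, offsets[i] : offsets[i + 1]], block
        )
        for i, block in enumerate(X.to_delayed().ravel())
    ]
    # Sum the per-block partial results lazily, then compute once.
    return dask.delayed(sum)(partials).compute()

Peak memory is then bounded by the block size plus the (n_groups, n_genes) partials, which is one plausible reading of the reported improvement.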

@codecov

codecov Bot commented Apr 17, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 78.62%. Comparing base (2fa6ac0) to head (d9e766c).
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4073      +/-   ##
==========================================
+ Coverage   78.60%   78.62%   +0.01%     
==========================================
  Files         118      118              
  Lines       12756    12766      +10     
==========================================
+ Hits        10027    10037      +10     
  Misses       2729     2729              
Flag                 Coverage Δ
hatch-test.low-vers  77.93% <84.78%> (+0.02%) ⬆️
hatch-test.pre       78.50% <100.00%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines       Coverage Δ
src/scanpy/get/_aggregated.py  93.58% <100.00%> (+0.28%) ⬆️

@ilan-gold
Contributor

Adjusted the dask call to use dask.delayed, which has massively improved the memory footprint on my end. I also threw in a few small adjustments to make use of sparsity and return sparse matrices, since even aggregated data can be sparse.

Are you sure dask.delayed is responsible for the reduced memory footprint? It seems like the latter (the sparsity changes) would be more responsible for that. It would be quite surprising to me if dask.delayed helped with this. Do you have a sense of why?

Comment on lines 106 to +111
out = np.zeros((self.indicator_matrix.shape[0], data.shape[1]), dtype=dtype)
(agg_sum_csr if isinstance(data, CSRBase) else agg_sum_csc)(
self.indicator_matrix, data, out
)
if keep_sparse and isinstance(data, CSBase):
return type(data)(out) # convert to sparse type of input
Contributor

@ilan-gold ilan-gold Apr 20, 2026


Interesting, I would have assumed sum aggregations would produce something for all features - do you really have features with literally zero values across some categories? Maybe with pseudobulk, I could see that.

Maybe we should try out a kernel that builds a coo-like data structure as new category-feature combinations are seen, which is then converted to CSR, instead of doing this sparsification. Or the sparsification happens category by category (i.e., allocate a buffer up front per category, then sparsify in numba; sketched below). Or a mix of the two - allocate enough memory for a max-sized coo matrix, fill it as far as needed, then drop the unused allocation. At the end, concatenate all the results and call tocsr before returning.

I would expect sparsifying to be a relatively expensive operation that doesn't buy much in a lot of situations (where categories are large enough that every feature is non-zero when summed).
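
As an illustration of the per-category idea above, here is a rough NumPy/SciPy sketch; the real kernel would presumably be numba-compiled, and groupwise_sum_sparse is a hypothetical name, not scanpy API. Each category is summed into a dense buffer, sparsified to COO triplets, and converted to CSR once at the end.

import numpy as np
import scipy.sparse as sp

def groupwise_sum_sparse(indicator: sp.csr_matrix, data: sp.csr_matrix) -> sp.csr_matrix:
    # indicator: (n_groups, n_cells) membership matrix; data: (n_cells, n_genes).
    n_groups, n_genes = indicator.shape[0], data.shape[1]
    rows, cols, vals = [], [], []
    buf = np.zeros(n_genes, dtype=data.dtype)
    for g in range(n_groups):
        buf[:] = 0
        # Cells belonging to category g, read off the indicator's CSR structure.
        cells = indicator.indices[indicator.indptr[g] : indicator.indptr[g + 1]]
        for cell in cells:
            s, e = data.indptr[cell], data.indptr[cell + 1]
            buf[data.indices[s:e]] += data.data[s:e]
        # Sparsify this category's row: keep only the non-zero features.
        nz = np.flatnonzero(buf)
        rows.append(np.full(nz.size, g))
        cols.append(nz)
        vals.append(buf[nz])
    return sp.coo_matrix(
        (np.concatenate(vals), (np.concatenate(rows), np.concatenate(cols))),
        shape=(n_groups, n_genes),
    ).tocsr()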

Contributor

@ilan-gold ilan-gold left a comment


Ok, I see why we need delayed now - there is no 3-D CSR matrix, so our old trick of adding a dimension (i.e., return res[None, :] if unchunked_axis == 1 else res) and then summing over the first dimension doesn't work. It's possible we can work around that. Let's see - otherwise this PR makes a lot of sense. Thanks @mumichae!
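
For context, a quick illustration of why that trick breaks for sparse chunks: scipy sparse matrices are strictly two-dimensional, so the extra axis the dense path relies on cannot be created.

import numpy as np
import scipy.sparse as sp

res = np.eye(3)
print(res[None, :].shape)   # (1, 3, 3) - a dense array can gain an axis

res_sparse = sp.csr_matrix(res)
try:
    res_sparse[None, :]     # no 3-D CSR: scipy sparse stays 2-D
except (IndexError, TypeError, ValueError) as exc:
    print(f"sparse raises: {exc!r}")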



Development

Successfully merging this pull request may close these issues.

sc.get.aggregate memory leak for Dask array
