
Optimize sparse matrix aggregation and conversion handling #4073

Open
mumichae wants to merge 8 commits into scverse:main from mumichae:dask_aggregate_memory_leak

Conversation

@mumichae
Contributor

@mumichae mumichae commented Apr 17, 2026

Adjusted the dask call to use dask.delayed, which has massively improved the memory footprint on my end. I also threw in a few small adjustments to make use of sparsity and return sparse matrices, since even aggregated data can be sparse.
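
For readers unfamiliar with the pattern, here is a minimal sketch (not the PR's actual code) of a group-wise sum over a dask array expressed with dask.delayed, one task per block. The function name aggregate_sum and the assumption that X is chunked only along the cell axis are illustrative, not taken from scanpy.

import dask
import dask.array as da
import numpy as np
import scipy.sparse as sp

def aggregate_sum(indicator: sp.csr_matrix, X: da.Array) -> np.ndarray:
    # indicator: (n_groups, n_cells); X: (n_cells, n_genes),
    # assumed to be chunked along axis 0 only.
    offsets = np.concatenate([[0], np.cumsum(X.chunks[0])])
    partials = [
        # Each task multiplies the indicator columns for this block's
        # cells with the block itself, so only one block needs to be
        # materialized per task.
        dask.delayed(lambda ind, blk: ind @ blk)(
            indicator[:, offsets[i] : offsets[i + 1]], block
        )
        for i, block in enumerate(X.to_delayed().ravel())
    ]
    # Sum the per-block partial results lazily, then compute once.
    return dask.delayed(sum)(partials).compute()

Peak memory is then bounded by the block size plus the (n_groups, n_genes) partials, which is one plausible reading of the reported improvement.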

@codecov

codecov Bot commented Apr 17, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 78.62%. Comparing base (2fa6ac0) to head (d9e766c).
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4073      +/-   ##
==========================================
+ Coverage   78.60%   78.62%   +0.01%     
==========================================
  Files         118      118              
  Lines       12756    12766      +10     
==========================================
+ Hits        10027    10037      +10     
  Misses       2729     2729              
Flag                 Coverage Δ
hatch-test.low-vers  77.93% <84.78%> (+0.02%) ⬆️
hatch-test.pre       78.50% <100.00%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines       Coverage Δ
src/scanpy/get/_aggregated.py  93.58% <100.00%> (+0.28%) ⬆️

@ilan-gold
Contributor

Adjusted the dask call to use dask.delayed, which has massively improved the memory footprint on my end. I also threw in a few small adjustments to make use of sparsity and return sparse matrices, since even aggregated data can be sparse.

Are you sure dask.delayed is responsible for the reduced memory footprint? It seems like the latter (the sparsity changes) would be more responsible for that. It would be quite surprising to me if dask.delayed helped with this. Do you have a sense of why?

Comment on lines 106 to +111
out = np.zeros((self.indicator_matrix.shape[0], data.shape[1]), dtype=dtype)
(agg_sum_csr if isinstance(data, CSRBase) else agg_sum_csc)(
self.indicator_matrix, data, out
)
if keep_sparse and isinstance(data, CSBase):
return type(data)(out) # convert to sparse type of input
Contributor

@ilan-gold ilan-gold Apr 20, 2026


Interesting, I would have assumed sum aggregations would produce something for all features - do you really have features with literally zero values across some categories? Maybe with pseudobulk, I could see that.

Maybe we should try out a kernel that builds a coo-like data structure as new category-feature combinations are seen, which is then converted to CSR, instead of doing this sparsification. Or the sparsification happens category by category (i.e., allocate a buffer up front per category, then sparsify in numba; sketched below). Or a mix of the two - allocate enough memory for a max-sized coo matrix, fill it as far as needed, then drop the unused allocation. At the end, concatenate all the results and call tocsr before returning.

I would expect sparsifying to be a relatively expensive operation that doesn't buy much in a lot of situations (where categories are large enough that every feature is non-zero when summed).
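
As an illustration of the per-category idea above, here is a rough NumPy/SciPy sketch; the real kernel would presumably be numba-compiled, and groupwise_sum_sparse is a hypothetical name, not scanpy API. Each category is summed into a dense buffer, sparsified to COO triplets, and converted to CSR once at the end.

import numpy as np
import scipy.sparse as sp

def groupwise_sum_sparse(indicator: sp.csr_matrix, data: sp.csr_matrix) -> sp.csr_matrix:
    # indicator: (n_groups, n_cells) membership matrix; data: (n_cells, n_genes).
    n_groups, n_genes = indicator.shape[0], data.shape[1]
    rows, cols, vals = [], [], []
    buf = np.zeros(n_genes, dtype=data.dtype)
    for g in range(n_groups):
        buf[:] = 0
        # Cells belonging to category g, read off the indicator's CSR structure.
        cells = indicator.indices[indicator.indptr[g] : indicator.indptr[g + 1]]
        for cell in cells:
            s, e = data.indptr[cell], data.indptr[cell + 1]
            buf[data.indices[s:e]] += data.data[s:e]
        # Sparsify this category's row: keep only the non-zero features.
        nz = np.flatnonzero(buf)
        rows.append(np.full(nz.size, g))
        cols.append(nz)
        vals.append(buf[nz])
    return sp.coo_matrix(
        (np.concatenate(vals), (np.concatenate(rows), np.concatenate(cols))),
        shape=(n_groups, n_genes),
    ).tocsr()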

Contributor

@ilan-gold ilan-gold left a comment


Ok, I see why we need delayed now - there is no 3-D CSR matrix, so our old trick of adding a dimension (i.e., return res[None, :] if unchunked_axis == 1 else res) and then summing over the first dimension doesn't work. It's possible we can work around that. Let's see - otherwise this PR makes a lot of sense. Thanks @mumichae!
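
For context, a quick illustration of why that trick breaks for sparse chunks: scipy sparse matrices are strictly two-dimensional, so the extra axis the dense path relies on cannot be created.

import numpy as np
import scipy.sparse as sp

res = np.eye(3)
print(res[None, :].shape)   # (1, 3, 3) - a dense array can gain an axis

res_sparse = sp.csr_matrix(res)
try:
    res_sparse[None, :]     # no 3-D CSR: scipy sparse stays 2-D
except (IndexError, TypeError, ValueError) as exc:
    print(f"sparse raises: {exc!r}")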



Development

Successfully merging this pull request may close these issues.

sc.get.aggregate memory leak for Dask array
