Optimize sparse matrix aggregation and conversion handling #4073
mumichae wants to merge 8 commits into scverse:main
Conversation
Codecov Report ✅ All modified and coverable lines are covered by tests.

@@ Coverage Diff @@
##             main    #4073      +/-   ##
==========================================
+ Coverage   78.60%   78.62%   +0.01%
==========================================
  Files         118      118
  Lines       12756    12766      +10
==========================================
+ Hits        10027    10037      +10
  Misses       2729     2729
…ic to keep chunks consistent
Are you sure the
```python
out = np.zeros((self.indicator_matrix.shape[0], data.shape[1]), dtype=dtype)
(agg_sum_csr if isinstance(data, CSRBase) else agg_sum_csc)(
    self.indicator_matrix, data, out
)
if keep_sparse and isinstance(data, CSBase):
    return type(data)(out)  # convert to sparse type of input
```
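For illustration, here is a minimal stand-alone sketch of what this snippet does, using plain `scipy.sparse` in place of the internal `CSRBase`/`CSBase` types and with a matrix product standing in for the numba kernels; the `indicator` and `data` matrices are made-up toy inputs:

```python
import numpy as np
from scipy import sparse

# Toy inputs: an indicator matrix mapping 3 observations to 2 categories,
# and a sparse CSR data matrix with 2 features.
indicator = sparse.csr_matrix(np.array([[1, 1, 0], [0, 0, 1]], dtype=np.float64))
data = sparse.csr_matrix(np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]))

# Dense accumulation buffer, as in the snippet above; the numba kernels
# would fill it in place, here a plain matmul stands in for them.
out = np.zeros((indicator.shape[0], data.shape[1]))
out[:] = (indicator @ data).toarray()

# Convert back to the sparse type of the input, mirroring `type(data)(out)`.
result = type(data)(out)
```

The key point is that the aggregation accumulates into a dense buffer and only sparsifies the result afterwards, which is what the review comment below questions.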
Interesting, I would have assumed aggregations of sum would produce something for all features - do you really have features with a summed value of literally 0 across some categories? Maybe with pseudobulk, I could see that.
Maybe we should try out a kernel that instead builds a COO-like data structure, adding entries as new category-feature combinations are seen, and then converts it to CSR, rather than doing this sparsification. Or the sparsification could happen category by category (i.e., allocate a buffer up front per category, then sparsify in numba). Or a mix of the two: allocate enough memory for a max-sized COO matrix, fill it as far as you need, then drop the unused allocation. At the end, concatenate all the results and call tocsr before returning.
I would think sparsifying is a relatively expensive operation that doesn't help much in a lot of situations (where categories aren't so small that a feature sums to zero).
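A rough sketch of the COO-style variant suggested here, with pure Python standing in for numba and a hypothetical `agg_sum_coo` name; it relies on scipy summing duplicate (row, col) entries when converting COO to CSR, so no dense buffer is ever allocated:

```python
import numpy as np
from scipy import sparse

def agg_sum_coo(cats, data, n_cats):
    """Sum rows of `data` per category by emitting COO triplets.

    Only (category, feature) pairs actually seen produce entries, so the
    result stays sparse without a dense intermediate. Hypothetical sketch.
    """
    data = sparse.csr_matrix(data)
    rows, cols, vals = [], [], []
    for i in range(data.shape[0]):
        start, end = data.indptr[i], data.indptr[i + 1]
        rows.extend([cats[i]] * (end - start))  # map row i to its category
        cols.extend(data.indices[start:end])
        vals.extend(data.data[start:end])
    # Duplicate (row, col) pairs are summed during the COO -> CSR conversion.
    return sparse.coo_matrix(
        (vals, (rows, cols)), shape=(n_cats, data.shape[1])
    ).tocsr()

cats = np.array([0, 0, 1])
X = sparse.csr_matrix(np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]))
agg = agg_sum_coo(cats, X, n_cats=2)
```

This is only a reference for the shape of the idea; a real kernel would avoid the Python lists and loop.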
Ok, I see why we need delayed now - there is no 3D CSR matrix, so our old trick of adding a dimension (i.e., return res[None, :] if unchunked_axis == 1 else res and then summing over the first dimension) doesn't work. It's possible we can work around that. Let's see; otherwise this PR makes a lot of sense. Thanks @mumichae !
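For context, the "adding a dimension" trick only works for dense blocks: each block returns its partial result with a leading singleton axis, and the chunks are then summed over that axis. A minimal numpy-only sketch with made-up values (scipy sparse matrices are strictly 2D, so `res[None, :]` is not possible for CSR blocks, which is the limitation described above):

```python
import numpy as np

# Partial aggregation results from two chunks (toy values).
res_a = np.array([[1.0, 0.0], [0.0, 2.0]])
res_b = np.array([[2.0, 0.0], [0.0, 1.0]])

# Each block returns res[None, :], i.e. shape (1, n_cats, n_features);
# concatenating along the new leading axis and summing over it combines
# the per-chunk partial results into the final aggregation.
stacked = np.concatenate([res_a[None, :], res_b[None, :]], axis=0)
final = stacked.sum(axis=0)
```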
Adjusted the dask call to use dask.delayed, which has massively improved the memory footprint on my end. I also threw in some little adjustments for making use of sparsity and returning sparse matrices, since even aggregated data can be sparse.
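A minimal sketch of the `dask.delayed` pattern described here, with toy data and a plain sparse matmul standing in for the actual aggregation kernel; each chunk is aggregated lazily and the sparse partial results are summed at the end, so nothing is materialized until `.compute()`:

```python
import numpy as np
from scipy import sparse
import dask

def agg_chunk(indicator, chunk):
    # Partial aggregation of one chunk; stays sparse throughout.
    return indicator @ chunk

indicator = sparse.csr_matrix(np.array([[1, 1, 0], [0, 0, 1]], dtype=float))
chunks = [
    sparse.csr_matrix(np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])),
    sparse.csr_matrix(np.array([[0.0, 1.0], [1.0, 0.0], [0.0, 0.0]])),
]

# Build the graph lazily: sparse partial results cannot live as blocks of
# a regular dask.array, but dask.delayed handles them as opaque objects.
partials = [dask.delayed(agg_chunk)(indicator, c) for c in chunks]
total = dask.delayed(sum)(partials).compute()
```

Because only one chunk's partial result needs to be held at a time during the reduction, this avoids the dense intermediate that blew up memory before.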
sc.get.aggregate memory leak for Dask array #4074