Rank refactor 1 #4118

Draft

zboldyga wants to merge 4 commits into scverse:ig/illico from zboldyga:rank-refactor-1

Rank refactor 1#4118
zboldyga wants to merge 4 commits into
scverse:ig/illicofrom
zboldyga:rank-refactor-1

Conversation

@zboldyga
Contributor

  • Closes #
  • Tests included or not required because:

@zboldyga
Contributor Author

zboldyga commented May 12, 2026

@ilan-gold here's a proof of concept for the stats speedup.

The stats are the biggest performance improvement remaining on the scanpy side of the illico integration -- here's the total scanpy illico time before vs. after the patch.

Note that this is only relevant to vs_rest mode (each group compared against all other cells); using an individual group as the reference was already fine.

vs_rest

| dataset | n_cells | n_groups | n_genes | original (s) | latest (s) | speedup |
|---|---|---|---|---|---|---|
| adamson16 | 62,673 | 89 | 20,616 | 65.38 | 30.31 | 2.16× |
| wessels23 | 30,636 | 184 | 16,775 | 48.56 | 15.33 | 3.17× |
| sunshine23 | 38,431 | 194 | 23,570 | 69.19 | 21.60 | 3.20× |
| norman19 | 111,545 | 238 | 22,608 | 224.67 | 47.98 | 4.68× |
| nadig25hepg2 | 133,757 | 1,819 | 9,623 | 1,838.24 | 72.96 | 25.20× |
| replogle22k562 | 308,646 | 1,972 | 8,563 | 4,025.46 | 137.68 | 29.24× |

I used `aggregate` as you mentioned.

That said, two additional points:

  1. You'll notice that I preserved the prior stats-computation logic for the t-test in `_compute_rest_stats_for_t_test`, which means we now have different computations for the t-test and Wilcoxon. Two numerical-stability / significant-digit issues prevent unifying them without changing the outputs:
     - `aggregate`'s variance calculation is numerically unstable, e.g. it can suffer from catastrophic cancellation of significant digits
     - `X[~self.groups_masks_obs[g]]` is inefficient, but the faster alternative used in the sped-up Wilcoxon variant is numerically unstable for the variance computation the t-test needs.

There's a better algorithm for calculating variance that doesn't have these issues: [Welford's online algorithm](https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm). I implemented it (not committed), and in my initial tests it was roughly the same speed as the current `aggregate`-based approach (possibly ~20% faster, but it's too early to say). I'd need to think more carefully about where it fits in scanpy, e.g. it might be best as a util in `get` alongside `aggregate`, or as a replacement for `aggregate`. That's a separate issue in its own right, and it may need to be addressed before this basic stats-speedup work can be finished: with it in place, I can simplify this code further and we avoid the numerical-stability issues.

(Note that this instability already exists in `aggregate`.)

  2. The `_RankGenes` class relies on side effects, which makes the code complex (functions mutate the class's `self` attributes, so it's unclear what each function requires or returns). I think it would help to refactor this toward a more purely functional approach, which also seems in line with other scanpy code I reviewed. It should be straightforward, but I wanted to check with you whether this is something you'd accept, and when and off which branch to implement it?
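To make the variance concern in point 1 concrete, here's a minimal sketch (not the PR's actual code) of the catastrophic cancellation in the two-term variance formula, alongside a plain-Python Welford implementation:

```python
import numpy as np


def welford_mean_var(x):
    """Single-pass, numerically stable mean and population variance
    (Welford's online algorithm)."""
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for n, xi in enumerate(x, start=1):
        delta = xi - mean
        mean += delta / n
        m2 += delta * (xi - mean)
    return mean, m2 / len(x)


# Cancellation demo: the two-term formula E[x^2] - E[x]^2 loses
# essentially all significant digits when the mean dwarfs the spread.
rng = np.random.default_rng(0)
x = 1e8 + rng.normal(0.0, 1e-4, size=10_000)
naive_var = np.mean(x**2) - np.mean(x) ** 2  # garbage, can even be negative
_, stable_var = welford_mean_var(x)
```

With values near `1e8` and a true variance near `1e-8`, the naive result is off by many orders of magnitude while Welford stays close to a careful two-pass computation.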

Thoughts on these two points and the current PR?

@ilan-gold
Contributor

So, to your points:

  1. I would wholesale replace the variance calculation in `aggregate` if possible. There's a comment in the dask implementation about how strangely unstable the variance calculation can be, so presumably that's less a dask thing than a "data big enough to require dask" thing, i.e. more summation ops triggering the instability -- in which case the current calculation can be considered "wrong" and replaced. The online algorithm looks compatible with a single-pass approach that computes the mean at the same time, i.e. only one data access, reusing each value. I would just replace our current sparse numba kernel for the mean/var calculation with one that does this. That seems possible, right? Does this appear in dense data as well? Presumably.
  2. Yeah, this is a separate issue that predates me. I specifically didn't open that can of worms in my other PR, but yes, I think we would absolutely take a refactor of this.
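A rough sketch of what such a single-pass kernel could look like: Welford over the stored CSR values, then a batched merge of the implicit zeros per column using the Chan et al. combine rule. This is pure Python for clarity (a real kernel would be numba-jitted), and `csr_mean_var` is a hypothetical name, not scanpy's actual API:

```python
import numpy as np
from scipy import sparse


def csr_mean_var(X):
    """Per-column mean and population variance of a CSR matrix in one
    pass over the stored values. Welford handles the nonzeros; the
    implicit zeros are merged afterwards as a single batch per column."""
    n_rows, n_cols = X.shape
    mean = np.zeros(n_cols)
    m2 = np.zeros(n_cols)
    nnz = np.zeros(n_cols, dtype=np.int64)
    for r in range(n_rows):
        for k in range(X.indptr[r], X.indptr[r + 1]):
            c = X.indices[k]
            v = X.data[k]
            nnz[c] += 1
            delta = v - mean[c]
            mean[c] += delta / nnz[c]
            m2[c] += delta * (v - mean[c])
    # Merge the (n_rows - nnz) zeros per column as one batch with
    # mean 0 and M2 0 (Chan et al.'s pairwise combine).
    n_zero = n_rows - nnz
    m2 += mean**2 * nnz * n_zero / n_rows
    mean = mean * nnz / n_rows
    return mean, m2 / n_rows
```

The outer loop touches each stored value exactly once, so mean and variance come out of a single data access, matching the "only one data access and reuse the value" idea above.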

