Rank refactor 1 #4118

Draft

zboldyga wants to merge 4 commits into scverse:ig/illico from zboldyga:rank-refactor-1

Rank refactor 1#4118
zboldyga wants to merge 4 commits into
scverse:ig/illicofrom
zboldyga:rank-refactor-1

Conversation

@zboldyga
Contributor

  • Closes #
  • Tests included or not required because:

@zboldyga
Contributor Author

zboldyga commented May 12, 2026

@ilan-gold here's a proof of concept for the stats speedup.

The stats are the biggest performance improvement remaining on the scanpy side of the illico integration -- here's the total scanpy illico time before vs. after the patch.

Note that this is only relevant to vs_rest mode (each group compared against all other cells); using an individual group as the reference was already fine.

vs_rest

| dataset | n_cells | n_groups | n_genes | original (s) | latest (s) | speedup |
|---|---|---|---|---|---|---|
| adamson16 | 62,673 | 89 | 20,616 | 65.38 | 30.31 | 2.16× |
| wessels23 | 30,636 | 184 | 16,775 | 48.56 | 15.33 | 3.17× |
| sunshine23 | 38,431 | 194 | 23,570 | 69.19 | 21.60 | 3.20× |
| norman19 | 111,545 | 238 | 22,608 | 224.67 | 47.98 | 4.68× |
| nadig25hepg2 | 133,757 | 1,819 | 9,623 | 1,838.24 | 72.96 | 25.20× |
| replogle22k562 | 308,646 | 1,972 | 8,563 | 4,025.46 | 137.68 | 29.24× |

I used `aggregate` as you mentioned.

That said, two additional points:

  1. You'll notice that I preserved the prior stats-computation logic for the t-test in `_compute_rest_stats_for_t_test`, which means we now have different computations for the t-test and Wilcoxon. Two numerical-stability / significant-digit issues prevent unifying them without changing the outputs:
     - `aggregate`'s variance calculation is numerically unstable, e.g. it can suffer from catastrophic cancellation of significant digits
     - `X[~self.groups_masks_obs[g]]` is inefficient, but the faster alternative used in the sped-up Wilcoxon variant is numerically unstable for the variance computation the t-test needs.

There's a better algorithm for calculating variance that doesn't have these issues: [Welford's online algorithm](https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm). I implemented it (not committed), and in my initial tests it was roughly the same speed as the current `aggregate`-based approach (possibly ~20% faster, but it's too early to say). I'd need to think more carefully about where it fits in scanpy, e.g. it might be best as a util in `get` alongside `aggregate`, or as a replacement for `aggregate`. That's a separate issue in its own right, and it may need to be addressed before this basic stats-speedup work can be finished: with it in place, I can simplify this code further and we avoid the numerical-stability issues.

(Note that this instability already exists in `aggregate`.)

  2. The `_RankGenes` class relies on side effects, which makes the code complex (functions mutate the class's `self` attributes, so it's unclear what each function requires or returns). I think it would help to refactor this toward a more purely functional approach, which also seems in line with other scanpy code I reviewed. It should be straightforward, but I wanted to check with you whether this is something you'd accept, and when and off which branch to implement it?
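To make the variance concern in point 1 concrete, here's a minimal sketch (not the PR's actual code) of the catastrophic cancellation in the two-term variance formula, alongside a plain-Python Welford implementation:

```python
import numpy as np


def welford_mean_var(x):
    """Single-pass, numerically stable mean and population variance
    (Welford's online algorithm)."""
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for n, xi in enumerate(x, start=1):
        delta = xi - mean
        mean += delta / n
        m2 += delta * (xi - mean)
    return mean, m2 / len(x)


# Cancellation demo: the two-term formula E[x^2] - E[x]^2 loses
# essentially all significant digits when the mean dwarfs the spread.
rng = np.random.default_rng(0)
x = 1e8 + rng.normal(0.0, 1e-4, size=10_000)
naive_var = np.mean(x**2) - np.mean(x) ** 2  # garbage, can even be negative
_, stable_var = welford_mean_var(x)
```

With values near `1e8` and a true variance near `1e-8`, the naive result is off by many orders of magnitude while Welford stays close to a careful two-pass computation.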

Thoughts on these two points and the current PR?

@ilan-gold
Contributor

So, to your points:

  1. I would wholesale replace the variance calculation in `aggregate` if possible. There's a comment in the dask implementation about how strangely unstable the variance calculation can be, so presumably that's less a dask thing than a "data big enough to require dask" thing, i.e. more summation ops triggering the instability -- in which case the current calculation can be considered "wrong" and replaced. The online algorithm looks compatible with a single-pass approach that computes the mean at the same time, i.e. only one data access, reusing each value. I would just replace our current sparse numba kernel for the mean/var calculation with one that does this. That seems possible, right? Does this appear in dense data as well? Presumably.
  2. Yeah, this is a separate issue that predates me. I specifically didn't open that can of worms in my other PR, but yes, I think we would absolutely take a refactor of this.
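A rough sketch of what such a single-pass kernel could look like: Welford over the stored CSR values, then a batched merge of the implicit zeros per column using the Chan et al. combine rule. This is pure Python for clarity (a real kernel would be numba-jitted), and `csr_mean_var` is a hypothetical name, not scanpy's actual API:

```python
import numpy as np
from scipy import sparse


def csr_mean_var(X):
    """Per-column mean and population variance of a CSR matrix in one
    pass over the stored values. Welford handles the nonzeros; the
    implicit zeros are merged afterwards as a single batch per column."""
    n_rows, n_cols = X.shape
    mean = np.zeros(n_cols)
    m2 = np.zeros(n_cols)
    nnz = np.zeros(n_cols, dtype=np.int64)
    for r in range(n_rows):
        for k in range(X.indptr[r], X.indptr[r + 1]):
            c = X.indices[k]
            v = X.data[k]
            nnz[c] += 1
            delta = v - mean[c]
            mean[c] += delta / nnz[c]
            m2[c] += delta * (v - mean[c])
    # Merge the (n_rows - nnz) zeros per column as one batch with
    # mean 0 and M2 0 (Chan et al.'s pairwise combine).
    n_zero = n_rows - nnz
    m2 += mean**2 * nnz * n_zero / n_rows
    mean = mean * nnz / n_rows
    return mean, m2 / n_rows
```

The outer loop touches each stored value exactly once, so mean and variance come out of a single data access, matching the "only one data access and reuse the value" idea above.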

