Rank refactor 1 #4118
zboldyga
commented
May 11, 2026
- Closes #
- Tests included or not required because:
- Release notes not necessary because:
@ilan-gold here's a proof of concept for the stats speedup. The stats are the biggest remaining performance improvement on the scanpy side of the illico integration. Here's the total scanpy illico time before vs. after the patch. Note that this is only relevant to vs_rest mode (each group compared against all other cells); using an individual group as the reference was already fine.
[Image: vs_rest]
I used aggregate as you mentioned. That said, two additional points:
There's a better algorithm for calculating variance that doesn't have these issues: https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm . I implemented it (not committed), and in my initial tests it was roughly the same speed as the current approach using 'aggregate' (possibly 20% faster, but it's too early to say). I would need to think more carefully about where this fits in scanpy, e.g. it might be best as a util in get alongside aggregate, or as a replacement for aggregate. So that is a separate issue in itself, though perhaps it needs to be addressed before we can finish this basic stats speedup work: with that in place, I can simplify this code a bit more and we avoid numerical stability issues. (Note that this issue already exists in 'aggregate'.)
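For reference, here is a minimal sketch of the Welford update from the linked article, written with plain NumPy. It is not the uncommitted implementation mentioned above and not scanpy code; the function name `welford_mean_var` and the toy data are made up for illustration, and a real version would likely work on chunks or sparse input rather than looping over rows in Python.

```python
import numpy as np


def welford_mean_var(X, ddof=1):
    """Per-column mean and variance of a 2-D array, updated one row at a time.

    Hypothetical helper for illustration only; not part of scanpy.
    """
    X = np.asarray(X, dtype=np.float64)
    n_rows, n_cols = X.shape
    mean = np.zeros(n_cols)
    m2 = np.zeros(n_cols)  # running sum of squared deviations from the mean
    for count, row in enumerate(X, start=1):
        delta = row - mean          # deviation from the old mean
        mean += delta / count       # update the running mean
        m2 += delta * (row - mean)  # uses both the old and the new mean
    return mean, m2 / (n_rows - ddof)


# Toy data with a large offset, where a sum-of-squares formula tends to lose precision.
rng = np.random.default_rng(0)
X = rng.normal(loc=1e8, scale=1.0, size=(1000, 5))
mean, var = welford_mean_var(X)
```

The point of the update is that it never forms the difference of two large sums, which is where the precision loss in a sum-of-squares approach comes from.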
Thoughts on these 2 points and the current PR?
So to your points