feat: DonorData.C supports annbatch collection #116
Conversation
```toml
]
optional-dependencies.ml = [
    "pytorch-lightning",
    "cellink[annbatch]",
```
Added this for convenience only. We can remove it if you want.
ilan-gold left a comment
I don't really follow how this PR relates to its description. Why do we need a prefetch queue? Why does this PR make (3) simpler?
```
source
    Cell-level data to stream into the collection. May be:

    - an :class:`anndata.AnnData` (will first be written to a temp h5ad),
```
Why do in-memory AnnData objects need to be written to disk first?
Sorry for the oversight. It's AI slop. Let me clarify the pipeline with @LArnoldt to make sure we even need this function.
@LArnoldt, how would you imagine the pipeline? For annbatch preshuffling you need to create a collection, but would you want this at your module level?
But not every .dd.h5 file would be in memory, no? I guess there could be both options: from disk and from memory?
```
sharded zarr collection, and to open such a collection as a configured
``annbatch.Loader``.

The donor-side AnnData (``dd.G``) is intentionally not handled here -- its
```
Is there an on-disk format for cellink @LArnoldt? I don't see anything in https://cellink-docs.readthedocs.io/en/latest/api/io.html. Or is it just in-memory?
```python
# Lazy re-exports for the optional `annbatch` extra. We don't want importing
# `cellink.io` to fail when annbatch isn't installed -- only the relevant
# attribute access should error, with a clear hint pointing at the extra.
_annbatch_exports = {
    "write_annbatch_collection": "_annbatch",
    "open_annbatch_loader": "_annbatch",
}
```
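For context, this mapping is presumably consumed by a PEP 562 module-level `__getattr__` along these lines (a sketch, assuming the package layout implied by this diff; the error wording is illustrative):

```python
# Hedged sketch: resolve a lazy re-export on first attribute access by
# importing the submodule named in _annbatch_exports.
import importlib


def __getattr__(name: str):
    if name in _annbatch_exports:
        submodule = _annbatch_exports[name]
        try:
            # Relative import within the cellink.io package (assumed layout).
            mod = importlib.import_module(f".{submodule}", __name__)
        except ImportError as e:
            raise ImportError(
                f"{name!r} requires the optional extra: "
                "pip install 'cellink[annbatch]'"
            ) from e
        return getattr(mod, name)
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
```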
I think you can just use https://docs.python.org/3/library/importlib.html#importlib.abc.MetaPathFinder.find_spec to see if annbatch is installed, and then choose to import from _annbatch.py or not?
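A minimal sketch of that suggestion, assuming the convenience wrapper `importlib.util.find_spec` and the `_annbatch.py` module from this PR (the fallback stub is illustrative, not part of the PR):

```python
# Check whether annbatch is importable without importing it, then decide
# what to expose at import time.
import importlib.util

if importlib.util.find_spec("annbatch") is not None:
    # Assumed relative import within the cellink.io package.
    from ._annbatch import open_annbatch_loader, write_annbatch_collection
else:
    def _require_annbatch(*_args, **_kwargs):
        raise ImportError(
            "this feature requires the optional extra: "
            "pip install 'cellink[annbatch]'"
        )

    write_annbatch_collection = open_annbatch_loader = _require_annbatch
```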
Yeah, honestly I don't know why the LLMs kept suggesting this to me in all cases, and I went down a rabbit hole for it. Since it was already in the codebase I went with it, even though I know it was likely an LLM suggestion. But I also see this in other places, for example in Tim's PR and in spatialdata, which are probably also LLM suggestions lol. I didn't see any big packages using `__getattr__` for lazy imports, I think. For example it isn't in xarray.
I am not really sure about having two different styles in the codebase itself tbh. Also my hunch would be to change the two functions to match xarray, I guess? But maybe that's overkill given that annbatch is lightweight? idk, Python import times are annoying sometimes.
```python
try:
    import zarr
    from annbatch import DatasetCollection
except ImportError as e:
    raise ImportError(_INSTALL_HINT) from e
```
I get wanting to stay consistent, but I also think it should be changed in Tim's PR FWIW. Not saying this has to happen here or now, but I prefer find_spec.
| f"obs_keys {missing!r} not found in on-disk obs (have: " | ||
| f"{list(obs.columns)})" | ||
| ) | ||
| return AnnData(X=X, obs=obs[obs_keys]) |
@ilan-gold sorry, should've given more context. We discussed this with @LArnoldt. After this PR, dd.C can be a collection (which is nice because we need the shuffling), so we can create a Loader and bind it however we want from the donor data itself:

```python
import numpy as np
from annbatch import Loader


def attach_donor_features(
    dd,  # DonorData: dd.C is the collection, dd.G the backed AnnData
    donor_id_key: str = "donor_id",
    layer: str | None = None,
    key: str = "donor_features",
):
    loader = Loader(dd.C)
    for batch in loader:
        donors = batch["obs"][donor_id_key].to_numpy()
        unique, donor_idx = np.unique(donors, return_inverse=True)
        x = dd.G[unique].X  # fetch donor-level features from the backed AnnData
        batch[key] = np.asarray(x)
        batch["unique_donors"] = unique
        batch["donor_idx"] = donor_idx
        yield batch
```

Why prefetch in the future? Depends on the bottleneck really, but @LArnoldt expects the bottleneck to be the fetching from the backed AnnData.
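For illustration, a hypothetical consumption pattern (the exact batch schema annbatch yields is an assumption, not part of this PR):

```python
# donor_idx maps each cell to its row in the unique-donor feature matrix,
# so donor features can be broadcast back to cell level.
for batch in attach_donor_features(dd):
    X_donor = batch["donor_features"]        # one row per unique donor
    x_per_cell = X_donor[batch["donor_idx"]] # aligned to cells in the batch
```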
What you have here looks like a candidate again for synced RNGs and a custom loader class. For this use-case,

```python
donors = batch["obs"][donor_id_key].to_numpy()
unique, _ = np.unique(donors, return_inverse=True)
yield unique
```

inside the sampler. You can put the …
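A rough sketch of that direction (class and method names are assumptions, not actual annbatch or cellink API):

```python
# Hedged sketch: a donor-level sampler with a seeded RNG; the loader would
# be constructed with the same seed so both sides stay in sync.
import numpy as np


class DonorSampler:
    def __init__(self, donor_ids, seed: int = 0):
        self.unique = np.unique(np.asarray(donor_ids))
        self.rng = np.random.default_rng(seed)

    def __iter__(self):
        # Shuffle the unique donors with the synced RNG and yield one by one.
        for i in self.rng.permutation(len(self.unique)):
            yield self.unique[i]
```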
|
I see, we can talk about how we do the custom loader in the next PR, but do you agree that it's enough to start with only a backed AnnData?
(3) is the highest priority. After this PR, (3) only needs a wrapper function with a prefetch queue for LIVI; then the performance of that should be measured. This PR makes (1) easier, but that direction would need groupby'ed data etc., which will make the input format a bit stricter.
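For reference, a minimal sketch of such a prefetch-queue wrapper (thread-based; the queue depth and all names here are assumptions):

```python
# A background thread fills a bounded queue so batch fetching overlaps with
# downstream work (e.g. model steps); a sentinel marks the end of the stream.
import queue
import threading


def prefetch(batches, maxsize: int = 4):
    q: queue.Queue = queue.Queue(maxsize=maxsize)
    sentinel = object()

    def producer():
        for batch in batches:
            q.put(batch)
        q.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while (item := q.get()) is not sentinel:
        yield item
```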