feat: DonorData.C supports annbatch collection #116

Open
selmanozleyen wants to merge 2 commits into main from feat/annbatch-on-dd-c

Conversation

@selmanozleyen commented Apr 28, 2026

(1) Single-cell only with a MIL dataloader: you essentially have a one-patient batch, where you load all cells belonging to one patient; in MIL we call this a bag. This can be a flexible number of cells, e.g. 1,000 to 2,000 cells.
(2) Single-cell combined with genotypes in a MIL dataloader: you load the single-cell part described above plus genotypes, so there are several cells per patient but only one genotype per patient. The genotype feature axis, again, can be huge.

There is a third use case:

(3) Single-cell combined with genotypes, but not in a MIL dataloader: single-cell data is shuffled across cells, not donors, and in each batch you load the single-cell part. Additionally, you load the genotypes of all donors present in that batch, so this can be several.

(3) is the highest priority. After this PR, (3) only needs a wrapper function with a prefetch queue for LIVI; its performance should then be measured. This PR also makes (1) easier, but that direction would need grouped data etc., which would make the input format a bit stricter.

Comment thread: pyproject.toml
]
optional-dependencies.ml = [
"pytorch-lightning",
"cellink[annbatch]",
Author

Added this for convenience only; we can remove it if you want.

@ilan-gold left a comment

I don't really follow how this PR relates to its description. Why do we need a prefetch queue? Why does this PR make (3) simpler?

source
Cell-level data to stream into the collection. May be:

- an :class:`anndata.AnnData` (will first be written to a temp h5ad),

Why do in-memory AnnData objects need to be written to disk first?

Author

Sorry for the oversight, it's AI slop. Let me clarify the pipeline with @LArnoldt to make sure we even need this function.

Author

@LArnoldt, how would you imagine the pipeline? For annbatch preshuffling you need to create a collection, but would you want this at your module level?

Collaborator

But not every .dd.h5 file would be in memory, no? I guess there could be both options: from disk and from memory?

sharded zarr collection, and to open such a collection as a configured
``annbatch.Loader``.

The donor-side AnnData (``dd.G``) is intentionally not handled here -- its

Is there an on-disk format for cellink, @LArnoldt? I don't see anything in https://cellink-docs.readthedocs.io/en/latest/api/io.html, or is it just in-memory?


Comment on lines +9 to +15
# Lazy re-exports for the optional `annbatch` extra. We don't want importing
# `cellink.io` to fail when annbatch isn't installed -- only the relevant
# attribute access should error, with a clear hint pointing at the extra.
_annbatch_exports = {
"write_annbatch_collection": "_annbatch",
"open_annbatch_loader": "_annbatch",
}

I think you can just use https://docs.python.org/3/library/importlib.html#importlib.abc.MetaPathFinder.find_spec to see if annbatch is installed, and then choose to import from _annbatch.py or not?
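For illustration, a minimal sketch of that find_spec route, assuming the `_annbatch` module layout from this PR and reusing its `_INSTALL_HINT` message (the stub wiring is hypothetical, not the PR's code):

import importlib.util

if importlib.util.find_spec("annbatch") is not None:
    # annbatch is installed: re-export the real implementations eagerly.
    from ._annbatch import open_annbatch_loader, write_annbatch_collection
else:
    # annbatch is missing: keep `import cellink.io` working and raise
    # only when one of the optional functions is actually called.
    def _missing(*_args, **_kwargs):
        raise ImportError(_INSTALL_HINT)

    write_annbatch_collection = open_annbatch_loader = _missing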

Author

Yeah, honestly I don't know why the LLMs kept suggesting this to me in all cases, and I went down a rabbit hole for it. Since it was already in the codebase I went with it, even though I knew it was likely an LLM suggestion. But I also see it in other places, for example in Tim's PR and in spatialdata, which are probably also LLM suggestions lol. I don't think I've seen any big packages using __getattr__ for lazy imports; it isn't in xarray, for example.

I am not really sure about having two different styles in the codebase itself, tbh. Also, my hunch would be to do it for the two functions to match xarray, I guess? But maybe that's overkill given that annbatch is lightweight? idk, Python import times are annoying sometimes.
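For reference, the `__getattr__` pattern being discussed is PEP 562; a minimal sketch of how the `_annbatch_exports` dict above would typically be wired up (illustrative, not necessarily this PR's exact code):

import importlib

def __getattr__(name: str):
    # PEP 562 module-level __getattr__: resolve lazy exports on first
    # attribute access instead of at import time.
    if name in _annbatch_exports:
        module = importlib.import_module(f".{_annbatch_exports[name]}", __name__)
        return getattr(module, name)
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")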

    try:
        import zarr
        from annbatch import DatasetCollection
    except ImportError as e:
        raise ImportError(_INSTALL_HINT) from e


I get wanting to stay consistent, but I also think it should be changed in Tim's PR, FWIW. Not saying this has to happen here or now, but I prefer find_spec.

f"obs_keys {missing!r} not found in on-disk obs (have: "
f"{list(obs.columns)})"
)
return AnnData(X=X, obs=obs[obs_keys])

Why no var?

@selmanozleyen commented Apr 29, 2026

@ilan-gold sorry, I should've given more context. We discussed this with @LArnoldt.

After this PR, dd.C can be a collection (which is nice because we need the shuffling), so we can create a Loader and bind it however we want from the donor data itself:

import numpy as np
from annbatch import Loader

def attach_donor_features(
    dd,  # DonorData: dd.C is the collection, dd.G the backed AnnData
    donor_id_key: str = "donor_id",
    layer: str | None = None,
    key: str = "donor_features",
):
    loader = Loader(dd.C)
    for batch in loader:
        donors = batch["obs"][donor_id_key].to_numpy()
        unique, donor_idx = np.unique(donors, return_inverse=True)
        g = dd.G[unique]
        x = g.X if layer is None else g.layers[layer]
        batch[key] = np.asarray(x)
        batch["unique_donors"] = unique
        batch["donor_idx"] = donor_idx
        yield batch
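A hypothetical consumer of the generator above; the alignment just uses the inverse index that np.unique already returns:

for batch in attach_donor_features(dd):
    donor_x = batch["donor_features"]       # one row per unique donor in the batch
    per_cell = donor_x[batch["donor_idx"]]  # expand back to one row per cell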

Why prefetch in the future? It depends on the bottleneck, really, but @LArnoldt expects it to come from fetching the backed dd.G. If it really does, we can exploit the fact that we know the unique donor obs beforehand with a small code change. But it depends on the initial results, I think.

@ilan-gold

Why prefetch in the future?

What you have here looks like a candidate again for synced RNGs and a custom loader class. For this use case, C and G use the same random seed/generator that yields C integer indices. G's (custom) sampler holds the C->G mapping internally and handles what you have in the loop:

        donors = batch["obs"][donor_id_key].to_numpy()
        unique = np.unique(donors)
        yield unique

inside the sampler. You can put the G loader and C loader on separate threads, which would then allow interleaving their I/O. I'm definitely open to prefetching, but I think it should go inside annbatch. For example, I think we are effectively doing it right now anyway with preload_to_gpu=True, because GPU ops appear to happen instantly to the Python user, AFAIK, until you request the data transferred back to the CPU, i.e., when computing losses for logging.
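A rough sketch of that synced-RNG idea, not annbatch API; all names here (make_index_stream, DonorSampler, cell_to_donor) are hypothetical:

import numpy as np

def make_index_stream(seed: int, n_cells: int, batch_size: int):
    # Both sides rebuild this stream from the same seed, so the C loader
    # and the G-side sampler see identical batch index sequences.
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_cells)
    for start in range(0, n_cells, batch_size):
        yield perm[start : start + batch_size]

class DonorSampler:
    # Holds the C->G mapping (cell index -> donor id) internally and
    # yields only the unique donors needed for each cell batch.
    def __init__(self, cell_to_donor: np.ndarray, seed: int, batch_size: int):
        self.cell_to_donor = cell_to_donor
        self.seed = seed
        self.batch_size = batch_size

    def __iter__(self):
        stream = make_index_stream(self.seed, len(self.cell_to_donor), self.batch_size)
        for cell_idx in stream:
            yield np.unique(self.cell_to_donor[cell_idx])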

@selmanozleyen (Author)

I see. We can talk about how we do the custom loader in the next PR, but do you agree that it's enough to start with only a backed AnnData dd.G and a Loader dd.C?
