feat: DonorData.C supports annbatch collection #116

Open
selmanozleyen wants to merge 2 commits into main from feat/annbatch-on-dd-c

Conversation

@selmanozleyen commented Apr 28, 2026

(1) Single-cell only with a MIL dataloader: you essentially have a one-patient batch, where you load all cells belonging to one patient; in MIL we call this a bag. This can be a flexible number of cells, e.g. 1,000 to 2,000 cells.
(2) Single-cell combined with genotypes in a MIL dataloader: you load the single-cell part described above plus genotypes, so there are several cells per patient but only one genotype per patient. The genotype feature axis, again, can be huge.

There is a third use case:

(3) Single-cell combined with genotypes, but not in a MIL dataloader: single-cell data is shuffled across cells, not donors, and in each batch you load the single-cell part. Additionally, you load the genotypes of all donors present in that batch, so this can be several.

(3) is the highest priority. After this PR, (3) only needs a wrapper function with a prefetch queue for LIVI; its performance should then be measured. This PR also makes (1) easier, but that direction would need grouped data etc., which would make the input format a bit stricter.

Comment thread: pyproject.toml
]
optional-dependencies.ml = [
"pytorch-lightning",
"cellink[annbatch]",
Author

Added this for convenience only; we can remove it if you want.

@ilan-gold left a comment

I don't really follow how this PR relates to its description. Why do we need a prefetch queue? Why does this PR make (3) simpler?

source
Cell-level data to stream into the collection. May be:

- an :class:`anndata.AnnData` (will first be written to a temp h5ad),

Why do in-memory AnnData objects need to be written to disk first?

Author

Sorry for the oversight, it's AI slop. Let me clarify the pipeline with @LArnoldt to make sure we even need this function.

Author

@LArnoldt, how would you imagine the pipeline? For annbatch preshuffling you need to create a collection, but would you want this at your module level?

Collaborator

But not every .dd.h5 file would be in memory, no? I guess there could be both options: from disk and from memory?

sharded zarr collection, and to open such a collection as a configured
``annbatch.Loader``.

The donor-side AnnData (``dd.G``) is intentionally not handled here -- its

Is there an on-disk format for cellink, @LArnoldt? I don't see anything in https://cellink-docs.readthedocs.io/en/latest/api/io.html, or is it just in-memory?


Comment on lines +9 to +15
# Lazy re-exports for the optional `annbatch` extra. We don't want importing
# `cellink.io` to fail when annbatch isn't installed -- only the relevant
# attribute access should error, with a clear hint pointing at the extra.
_annbatch_exports = {
"write_annbatch_collection": "_annbatch",
"open_annbatch_loader": "_annbatch",
}

I think you can just use https://docs.python.org/3/library/importlib.html#importlib.abc.MetaPathFinder.find_spec to see if annbatch is installed, and then choose to import from _annbatch.py or not?
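For illustration, a minimal sketch of that find_spec route, assuming the `_annbatch` module layout from this PR and reusing its `_INSTALL_HINT` message (the stub wiring is hypothetical, not the PR's code):

import importlib.util

if importlib.util.find_spec("annbatch") is not None:
    # annbatch is installed: re-export the real implementations eagerly.
    from ._annbatch import open_annbatch_loader, write_annbatch_collection
else:
    # annbatch is missing: keep `import cellink.io` working and raise
    # only when one of the optional functions is actually called.
    def _missing(*_args, **_kwargs):
        raise ImportError(_INSTALL_HINT)

    write_annbatch_collection = open_annbatch_loader = _missing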

Author

Yeah, honestly I don't know why the LLMs kept suggesting this to me in all cases, and I went down a rabbit hole for it. Since it was already in the codebase I went with it, even though I knew it was likely an LLM suggestion. But I also see it in other places, for example in Tim's PR and in spatialdata, which are probably also LLM suggestions lol. I don't think I've seen any big packages using __getattr__ for lazy imports; it isn't in xarray, for example.

I am not really sure about having two different styles in the codebase itself, tbh. Also, my hunch would be to do it for the two functions to match xarray, I guess? But maybe that's overkill given that annbatch is lightweight? idk, Python import times are annoying sometimes.
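For reference, the `__getattr__` pattern being discussed is PEP 562; a minimal sketch of how the `_annbatch_exports` dict above would typically be wired up (illustrative, not necessarily this PR's exact code):

import importlib

def __getattr__(name: str):
    # PEP 562 module-level __getattr__: resolve lazy exports on first
    # attribute access instead of at import time.
    if name in _annbatch_exports:
        module = importlib.import_module(f".{_annbatch_exports[name]}", __name__)
        return getattr(module, name)
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")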

    try:
        import zarr
        from annbatch import DatasetCollection
    except ImportError as e:
        raise ImportError(_INSTALL_HINT) from e


I get wanting to stay consistent, but I also think it should be changed in Tim's PR, FWIW. Not saying this has to happen here or now, but I prefer find_spec.

f"obs_keys {missing!r} not found in on-disk obs (have: "
f"{list(obs.columns)})"
)
return AnnData(X=X, obs=obs[obs_keys])

Why no var?

@selmanozleyen commented Apr 29, 2026

@ilan-gold sorry, I should've given more context. We discussed this with @LArnoldt.

After this PR, dd.C can be a collection (which is nice because we need the shuffling), so we can create a Loader and bind it however we want from the donor data itself:

import numpy as np
from annbatch import Loader

def attach_donor_features(
    dd,  # DonorData: dd.C is the collection, dd.G the backed AnnData
    donor_id_key: str = "donor_id",
    layer: str | None = None,
    key: str = "donor_features",
):
    loader = Loader(dd.C)
    for batch in loader:
        donors = batch["obs"][donor_id_key].to_numpy()
        unique, donor_idx = np.unique(donors, return_inverse=True)
        g = dd.G[unique]
        x = g.X if layer is None else g.layers[layer]
        batch[key] = np.asarray(x)
        batch["unique_donors"] = unique
        batch["donor_idx"] = donor_idx
        yield batch
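A hypothetical consumer of the generator above; the alignment just uses the inverse index that np.unique already returns:

for batch in attach_donor_features(dd):
    donor_x = batch["donor_features"]       # one row per unique donor in the batch
    per_cell = donor_x[batch["donor_idx"]]  # expand back to one row per cell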

Why prefetch in the future? It depends on the bottleneck, really, but @LArnoldt expects it to come from fetching the backed dd.G. If it really does, we can exploit the fact that we know the unique donor obs beforehand with a small code change. But it depends on the initial results, I think.

@ilan-gold

Why prefetch in the future?

What you have here looks like a candidate again for synced RNGs and a custom loader class. For this use case, C and G use the same random seed/generator that yields C integer indices. G's (custom) sampler holds the C->G mapping internally and handles what you have in the loop:

        donors = batch["obs"][donor_id_key].to_numpy()
        unique = np.unique(donors)
        yield unique

inside the sampler. You can put the G loader and C loader on separate threads, which would then allow interleaving their I/O. I'm definitely open to prefetching, but I think it should go inside annbatch. For example, I think we are effectively doing it right now anyway with preload_to_gpu=True, because GPU ops appear to happen instantly to the Python user, AFAIK, until you request the data transferred back to the CPU, i.e., when computing losses for logging.
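A rough sketch of that synced-RNG idea, not annbatch API; all names here (make_index_stream, DonorSampler, cell_to_donor) are hypothetical:

import numpy as np

def make_index_stream(seed: int, n_cells: int, batch_size: int):
    # Both sides rebuild this stream from the same seed, so the C loader
    # and the G-side sampler see identical batch index sequences.
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_cells)
    for start in range(0, n_cells, batch_size):
        yield perm[start : start + batch_size]

class DonorSampler:
    # Holds the C->G mapping (cell index -> donor id) internally and
    # yields only the unique donors needed for each cell batch.
    def __init__(self, cell_to_donor: np.ndarray, seed: int, batch_size: int):
        self.cell_to_donor = cell_to_donor
        self.seed = seed
        self.batch_size = batch_size

    def __iter__(self):
        stream = make_index_stream(self.seed, len(self.cell_to_donor), self.batch_size)
        for cell_idx in stream:
            yield np.unique(self.cell_to_donor[cell_idx])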

@selmanozleyen (Author)

I see. We can talk about how we do the custom loader in the next PR, but do you agree that it's enough to start with only a backed AnnData dd.G and a Loader dd.C?
