High-speed streamed assembly of nD tensors from zarr sources to GPU.
Damacy reads sharded NGFF zarr v3 stores directly on the GPU: per-shard chunk indexing, parallel host I/O, in-flight GPU-side decompression (zstd, blosc1-zstd), and assembly of each batch as a DLPack-ready device tensor.

    import random

    import damacy
    import torch

    cfg = damacy.Config(
        batch_size=8,
        # Resource caps are fixed at construction; nothing grows after.
        max_gpu_memory_bytes=1 << 30,  # primary GPU budget
        dtype="bf16",  # source dtype is cast on assemble
        # One pipeline binds to one GPU. Omit `device=` to capture the
        # current CUDA context (handy single-GPU; PyTorch sets one up
        # implicitly). For multi-GPU pass `device=local_rank`; see
        # https://nclack.github.io/damacy/distributed/
    )

    # A Sample names an absolute uri and a per-axis half-open AABB into
    # the stored array (np.s_[...] also accepted). Build them however
    # suits: your own sampler, a torch Dataset, a curriculum, a fixed
    # tile grid, ...
    volumes = {  # absolute uri → full ZYX shape
        "/data/cells/brain-001.zarr": (512, 4096, 4096),
        "/data/cells/brain-002.zarr": (768, 4096, 4096),
        "/data/cells/kidney-007.zarr": (256, 2048, 2048),
    }

    def random_crop(size=(64, 256, 256)):
        uri, full = random.choice(list(volumes.items()))
        origin = [random.randint(0, f - s) for f, s in zip(full, size)]
        return damacy.Sample(uri=uri, aabb=[(o, o + s) for o, s in zip(origin, size)])

    samples = [random_crop() for _ in range(64)]

    with damacy.Pipeline(cfg) as p:
        p.push(samples)  # producer side
        for batch in p.batches(len(samples) // cfg.batch_size):
            with batch as t:  # consumer side
                x = torch.from_dlpack(t)  # zero-copy + stream-fenced
                ...  # train step

`torch.from_dlpack` (or any DLPack v1 consumer, such as cupy or jax) handles the
stream handoff: damacy hands over `BatchInfo.ready_stream`, the consumer
records a `cuStreamWaitEvent` against it, and the resulting tensor is
fenced for downstream kernels. Damacy's internal streams are non-blocking
with respect to the legacy default stream, so don't read
`BatchInfo.device_ptr` directly without a matching `cuStreamWaitEvent` on
`ready_stream`.
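
The sampling comment in the example above mentions a fixed tile grid as one way to build samples. A minimal sketch of that idea follows, in plain Python with no damacy calls. `tile_grid` is a hypothetical helper, not part of damacy; each box it yields is a list of per-axis half-open `(start, stop)` pairs that could be passed as `aabb=` to `damacy.Sample`. Tiles at the array edge are clipped, so they can be smaller than the rest. Whether a batch may mix crop shapes is not stated here, so the sketch is safest when the tile shape divides the volume evenly.

```python
from itertools import product


def tile_grid(full_shape, tile_shape):
    """Yield per-axis half-open (start, stop) boxes that cover full_shape."""
    starts_per_axis = [range(0, f, t) for f, t in zip(full_shape, tile_shape)]
    for origin in product(*starts_per_axis):
        # Clip the last tile on each axis so it never runs past the array edge.
        yield [(o, min(o + t, f)) for o, t, f in zip(origin, tile_shape, full_shape)]


# Shape taken from the kidney-007 volume in the example above.
boxes = list(tile_grid((256, 2048, 2048), (64, 256, 256)))
```

Because the grid is deterministic, the same list of boxes can be rebuilt on every run, which is handy for validation passes.
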
push accepts any iterable, including infinite generators — samples
are pulled lazily as pop frees space. For unbounded training, hand
the pipeline a generator and let it drain:

    def crops():
        while True:
            yield random_crop()  # from the example above

    with damacy.Pipeline(cfg) as p:
        p.push(crops())  # pulled on demand
        for step in range(N_STEPS):
            with p.pop() as t:
                x = torch.from_dlpack(t)
                ...  # train step

For training loops that prefetch the next batch on a background thread, see Async prefetch: zero-copy with deferred release, plus the dedicated-copy-stream variant.
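
Since the generator is only advanced when the pipeline asks for more samples, it can carry its own state. The sketch below is plain Python with no damacy calls: a seeded, infinite generator of crop boxes, with `itertools.islice` standing in for a consumer that pulls a few items at a time. `crop_boxes` and its `seed` parameter are hypothetical names chosen for illustration.

```python
import random
from itertools import islice


def crop_boxes(full_shape, size, seed=0):
    """Infinite generator of random half-open crop boxes inside full_shape."""
    rng = random.Random(seed)  # private RNG, so the sequence is reproducible
    while True:
        origin = [rng.randint(0, f - s) for f, s in zip(full_shape, size)]
        yield [(o, o + s) for o, s in zip(origin, size)]


gen = crop_boxes((512, 4096, 4096), (64, 256, 256), seed=1234)

# Nothing is computed until a consumer asks; here islice pulls eight boxes.
first = list(islice(gen, 8))
```

Using a private `random.Random` instance rather than the module-level functions keeps the sample sequence independent of any other code that draws random numbers.
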
Damacy reads zarr v3 (sharded and non-sharded). What's recognized today:

| | supported | notes |
|---|---|---|
| Array versions | v3 | v2 stores are not read |
| Layout (sharded) | `sharding_indexed` | with `index_location` either `"start"` or `"end"` (default) |
| Layout (non-sharded) | yes | each chunk is a separate file at `c/<i>/<j>/...` |
| Inner / chunk codec | `bytes` (passthrough), `zstd`, `blosc` (`cname=zstd`) | `blosc` with `cname=lz4`/`lz4hc` is recognized at parse time and rejected at planning |
| Sharding index codec | `bytes` + `crc32c` | the shard index itself; the data codec is separate |
| Missing chunks | yes, read as `fill_value` | per zarr v3 spec; sharded "empty" entries (offset == nbytes == 2^64−1) and missing chunk files both route here |
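
To make the shard-index and missing-chunk rows concrete, here is a minimal sketch of decoding a shard index in plain Python. It follows the zarr v3 sharding layout (one `(offset, nbytes)` pair of unsigned 64-bit integers per inner chunk, followed by a 4-byte `crc32c` checksum) and assumes the `bytes` index codec is little-endian. It is not damacy's implementation; `split_index` and `decode_shard_index` are hypothetical names, and the checksum is skipped rather than verified.

```python
import struct

EMPTY = 2**64 - 1  # offset == nbytes == 2^64-1 marks a missing inner chunk


def split_index(shard, chunks_per_shard, index_location="end"):
    """Return the raw index bytes from the start or end of a shard."""
    size = chunks_per_shard * 16 + 4  # 16 bytes per entry + 4-byte crc32c
    if len(shard) < size:
        raise ValueError("shard is smaller than its index")
    return shard[-size:] if index_location == "end" else shard[:size]


def decode_shard_index(index_bytes):
    """Decode index bytes into (offset, nbytes) tuples, or None if empty."""
    body = index_bytes[:-4]  # drop the trailing crc32c (not verified here)
    if len(body) % 16:
        raise ValueError("index body is not a whole number of entries")
    return [
        None if (offset == EMPTY and nbytes == EMPTY) else (offset, nbytes)
        for offset, nbytes in struct.iter_unpack("<QQ", body)
    ]


# Build a toy shard: two stored chunks and one empty slot in between.
payload = b"A" * 10 + b"B" * 6
index = (
    struct.pack("<QQ", 0, 10)
    + struct.pack("<QQ", EMPTY, EMPTY)
    + struct.pack("<QQ", 10, 6)
)
shard = payload + index + b"\x00" * 4  # placeholder checksum bytes
entries = decode_shard_index(split_index(shard, 3))
```

A `None` entry is the case the table describes as reading back `fill_value`.
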
Not yet handled — arrays declaring any of these will fail to parse:
- Non-trivial transposes (`transpose` codec)
- Compression codecs other than the list above (`gzip`, `lz4` raw, `crc32c` as a data codec, future v3 codecs)
- Complex / fixed-bytes / variable-length dtypes
If you have data that uses one of the unsupported codecs and you'd like it added, please open an issue with a sample zarr.json.
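
Before opening an issue, it can help to check a `zarr.json` against the support table yourself. The sketch below is a rough approximation of that table in plain Python, not damacy's parser. `preflight` is a hypothetical helper; it assumes codecs are written in the object form (`{"name": ..., "configuration": ...}`) and does not check dtypes or the `transpose` order.

```python
def _check_data_codec(codec):
    """Return a problem string for an unsupported data codec, else None."""
    name = codec.get("name")
    conf = codec.get("configuration", {})
    if name in ("bytes", "zstd"):
        return None
    if name == "blosc":
        cname = conf.get("cname")
        return None if cname == "zstd" else f"blosc cname={cname!r} is not supported"
    return f"data codec {name!r} is not supported"


def preflight(meta):
    """List reasons an array's zarr.json falls outside the support table."""
    problems = []
    if meta.get("zarr_format") != 3:
        problems.append("only zarr v3 arrays are read")
    for codec in meta.get("codecs", []):
        if codec.get("name") != "sharding_indexed":
            problems.append(_check_data_codec(codec))
            continue
        conf = codec.get("configuration", {})
        # Inner codecs compress the chunks; index codecs cover the shard index.
        problems.extend(_check_data_codec(c) for c in conf.get("codecs", []))
        index_names = [c.get("name") for c in conf.get("index_codecs", [])]
        if index_names != ["bytes", "crc32c"]:
            problems.append(f"index codecs {index_names} are not bytes + crc32c")
        if conf.get("index_location", "end") not in ("start", "end"):
            problems.append("index_location must be 'start' or 'end'")
    return [p for p in problems if p is not None]


example = {
    "zarr_format": 3,
    "node_type": "array",
    "codecs": [{
        "name": "sharding_indexed",
        "configuration": {
            "chunk_shape": [32, 128, 128],
            "codecs": [{"name": "bytes"}, {"name": "gzip", "configuration": {"level": 5}}],
            "index_codecs": [{"name": "bytes"}, {"name": "crc32c"}],
        },
    }],
}
issues = preflight(example)
```

In practice `meta` would come from `json.load` on the array's `zarr.json`; the returned list is empty when nothing in the codec chain falls outside the table.
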
Damacy links only the essentials. Optional features dlopen their backends lazily, so a damacy binary loads on any host with a working CUDA driver — the feature simply turns off when its library isn't present.
| Library | Required at runtime | What you lose if missing | How damacy finds it |
|---|---|---|---|
| `libcuda.so.1` | always | nothing; damacy cannot run without it | NVIDIA driver install (`/run/opengl-driver/lib`, `/usr/lib/x86_64-linux-gnu`, …) |
| `libnuma.so.1` | optional | NUMA pinning of pinned-host slabs + io_queue / scheduler threads (single-socket hosts: no effect) | dlopen via dynamic loader (`LD_LIBRARY_PATH`, `ld.so.cache`) |
| `libcufile.so.0` | optional | `damacy_config.enable_gds = DAMACY_GDS_ON`: direct read of compressed chunks into device memory via NVIDIA GPUDirect Storage | dlopen via dynamic loader; ships with the CUDA toolkit and with nvidia-fs. Requires a build with `-DDAMACY_ENABLE_GDS=ON` (default OFF) |
| `libmount.so.1`, `libudev.so.1` | required if and only if using GDS | cuFile dlopens these at driver init even in compat mode | dynamic loader |
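
To see which of these libraries the dynamic loader can find on a given host, a quick probe with `ctypes` is usually enough. This is a sketch using the same loader search path (`LD_LIBRARY_PATH`, `ld.so.cache`), not damacy's own lookup code; `probe` is a hypothetical helper.

```python
import ctypes

LIBRARIES = [
    "libcuda.so.1",
    "libnuma.so.1",
    "libcufile.so.0",
    "libmount.so.1",
    "libudev.so.1",
]


def probe(names):
    """Map each library name to True if the dynamic loader can open it."""
    found = {}
    for name in names:
        try:
            ctypes.CDLL(name)  # goes through the normal loader search path
            found[name] = True
        except OSError:
            found[name] = False
    return found


availability = probe(LIBRARIES)
for name, ok in availability.items():
    print(f"{name}: {'found' if ok else 'missing'}")
```

A missing `libcuda.so.1` is the only hard failure in the table; the others only disable the corresponding feature.
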
GDS notes:
- Build with `cmake -DDAMACY_ENABLE_GDS=ON` to link the cuFile backend. The default-OFF build still accepts `enable_gds = DAMACY_GDS_ON` but `damacy_create` returns `DAMACY_INVAL` (no silent fallback).
- `enable_gds = DAMACY_GDS_AUTO` (default, the value from designated-init) defers to env `DAMACY_GDS_ENABLE=1`; explicit `DAMACY_GDS_ON` / `DAMACY_GDS_OFF` override the env.
- On hosts without nvidia-fs, point `CUFILE_ENV_PATH_JSON` at a JSON with `{"properties":{"allow_compat_mode":true}}` to enable cuFile compat mode; reads go through cuFile's host-bounce buffer instead of DMA. Useful for correctness testing on consumer GPUs.
- If libcufile can't be loaded or `cuFileDriverOpen` fails, `damacy_create` returns `DAMACY_INVAL`.
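
The compat-mode setup from the notes above can be scripted. The sketch below writes the JSON shown there to a temporary file and sets the two environment variables the notes name. `enable_cufile_compat_mode` is a hypothetical helper, and it assumes the variables must be set before the pipeline is created, since that is when the cuFile driver would be opened.

```python
import json
import os
import tempfile


def enable_cufile_compat_mode(directory=None):
    """Write a minimal cuFile config and point the environment at it."""
    directory = directory or tempfile.mkdtemp(prefix="cufile-")
    path = os.path.join(directory, "cufile.json")
    with open(path, "w") as f:
        json.dump({"properties": {"allow_compat_mode": True}}, f)
    # cuFile reads its config path from this variable.
    os.environ["CUFILE_ENV_PATH_JSON"] = path
    # With enable_gds left at DAMACY_GDS_AUTO, this opts in to GDS.
    os.environ["DAMACY_GDS_ENABLE"] = "1"
    return path


config_path = enable_cufile_compat_mode()
```

This only helps on a build configured with `-DDAMACY_ENABLE_GDS=ON`.
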
Full API reference and guides: https://nclack.github.io/damacy/
- Multi-GPU & distributed training — device binding model + torchrun / DDP examples.
Performance dashboards (auto-published from bench.yml):
- Throughput — bigger is better
- Timings — smaller is better