Skip to content

ENH: Add bulk getValues/setValues API to AbstractDataStore#1564

Open
joeykleingers wants to merge 2 commits intoBlueQuartzSoftware:developfrom
joeykleingers:worktree-BulkChunkIO
Open

ENH: Add bulk getValues/setValues API to AbstractDataStore#1564
joeykleingers wants to merge 2 commits intoBlueQuartzSoftware:developfrom
joeykleingers:worktree-BulkChunkIO

Conversation

@joeykleingers
Copy link
Contributor

@joeykleingers joeykleingers commented Mar 17, 2026

Summary

  • Adds getValues(startIndex, span) and setValues(startIndex, span) virtual methods to AbstractDataStore<T> for bulk read/write of contiguous element ranges
  • DataStore<T> overrides use std::memcpy for maximum in-memory throughput
  • Rewrites fill(), copy(), copyFrom(), setTuple(), and fillTuple() to use the bulk API internally so all existing callers benefit automatically

Companion PR: https://github.com/BlueQuartzSoftware/FileStore/pull/4

Motivation

Per-element OOC access through ZarrStore::operator[] costs ~50-100ns each even when the chunk is cached (virtual dispatch -> mutex -> N-D position calc -> 6-slot FIFO cache search -> element access -> mutex unlock). Filter-level optimizations have minimized the number of accesses, but the per-access cost is fixed by the storage layer.

The bulk API allows ZarrStore (in the FileStore plugin) to override with chunk-aware implementations that use memcpy per chunk-row instead of per element, with three optimization levels:

  1. Per-row memcpy — one memcpy per contiguous chunk-row (~50x reduction for 50^3 chunks)
  2. Chunk-sticky — holds chunk shared_ptr across rows, avoids redundant cache lookups (~2500x fewer lookups per chunk)
  3. Multi-row extension — merges rows when chunks cover full fast dimensions (up to entire chunk in one memcpy)

Implementation Details

  • Default getValues/setValues on AbstractDataStore loop over getValue/setValue for backward compatibility with any subclass that doesn't override
  • Uses std::make_unique<T[]> for intermediate buffers to avoid the std::vector<bool> bit-packing issue

Test Plan

  • DataStore Bulk getValues/setValues — full array, partial range, boundary, roundtrip, empty span, out-of-range, single element (7 sections)
  • DataStore Bulk fill / copy / copyFrom / setTuple / fillTuple
  • DataStore Cross-API Roundtrip — setValue->getValues, setValues->getValue, partial writes
  • DataStore Bulk getValues with multi-component — multi-dim tuples + float32
  • All pre-existing DataStore/DataArray tests pass
  • Both in-core and OOC builds pass (19/19 DataStore tests on each)

Add virtual getValues() and setValues() methods to AbstractDataStore
for reading/writing contiguous ranges of elements in a single call.

DataStore overrides use std::memcpy for maximum throughput. ZarrStore
overrides (in FileStore plugin) use chunk-aware bulk access with three
optimizations: per-row memcpy, chunk-sticky shared_ptr reuse, and
multi-row extension when chunks cover full fast dimensions.

Existing methods fill(), copy(), copyFrom(), setTuple(), and
fillTuple() are rewritten to use the bulk API internally, so all
callers benefit automatically with no code changes.

Uses std::make_unique<T[]> for intermediate buffers to avoid the
std::vector<bool> bit-packing issue.
- Make getValues/setValues pure virtual on AbstractDataStore
- Revert fill/copy defaults to simple per-element loops
- Add NVI copyFromImpl virtual for optimized data transfer
- DataStore: override fill, copy, copyFromImpl with direct access
- DataStore: use std::copy instead of std::memcpy
- EmptyDataStore: add getValues/setValues stubs (throw)
- Revert fillTuple to simple setValue loop (small component count)
@joeykleingers joeykleingers requested a review from JDuffeyBQ March 19, 2026 15:12
Copy link
Collaborator

@JDuffeyBQ JDuffeyBQ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good. One thing is that it looks a bit odd at first to see a getValues() call with no return value. Maybe we could change the function names to make that a bit more obvious. Something like getValuesIntoBuffer(). Not a major issue but I think most of our other functions named with "get" return something so it would help differentiate them.

@imikejackson
Copy link
Contributor

Changes look good. One thing is that it looks a bit odd at first to see a getValues() call with no return value. Maybe we could change the function names to make that a bit more obvious. Something like getValuesIntoBuffer(). Not a major issue but I think most of our other functions named with "get" return something so it would help differentiate them.

@joeykleingers @JDuffeyBQ
How about 'copyInto()' or 'fillBuffer()'

@imikejackson
Copy link
Contributor

copyIntoBuffer
copyFromBuffer

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants