Skip to content

PERF: SegmentFeature Filters, IdentifySample, FillBadData OoC Optimization#1559

Draft
joeykleingers wants to merge 25 commits intoBlueQuartzSoftware:developfrom
joeykleingers:worktree-OptimizeGroupD
Draft

PERF: SegmentFeature Filters, IdentifySample, FillBadData OoC Optimization#1559
joeykleingers wants to merge 25 commits intoBlueQuartzSoftware:developfrom
joeykleingers:worktree-OptimizeGroupD

Conversation

@joeykleingers
Copy link
Contributor

@joeykleingers joeykleingers commented Mar 5, 2026

Summary

  • Optimizes 5 Group D filters (ScalarSegmentFeatures, EBSDSegmentFeatures, CAxisSegmentFeatures, FillBadData, IdentifySample) for out-of-core (ZarrStore) performance
  • Adds chunk-sequential Connected Component Labeling (CCL) with Union-Find to SegmentFeatures base class, replacing DFS flood-fill for OOC paths
  • Adds rolling slice buffer pre-loading to eliminate per-element OOC overhead during CCL neighbor comparisons
  • Runtime dispatch selects BFS/DFS for in-core and CCL for OOC based on storage type
  • Splits FillBadData and IdentifySample into separate BFS/CCL algorithm files
  • Adds UnionFind.hpp utility class with path-halving compression and union-by-rank
  • Restructures all 5 filter test suites with tier-based exemplar testing pattern

Algorithm Decision

Problem

The original DFS flood-fill in SegmentFeatures uses random data access — each stack pop reads an arbitrary voxel then checks 6 scattered neighbors. For in-core DataStore this is fast, but for ZarrStore-backed arrays every random access can trigger a chunk load/evict cycle, causing 3,000x–8,000x slowdowns on 200^3 datasets.

Solution: Chunk-Sequential CCL with Union-Find (Segment Features)

A multi-phase scanline algorithm that processes the grid in strict Z-Y-X order:

Phase 1 — Forward Labeling: Iterate every voxel in scanline order. For each valid voxel, check only backward neighbors (-X, -Y, -Z for face connectivity, or all 13 backward neighbors for FaceEdgeVertex). Assign provisional labels and union equivalences in a Union-Find structure. A 2-slice rolling buffer (current + previous Z-slice) keeps all neighbor reads in RAM.

Phase 1b — Periodic Boundary Merge (when isPeriodic=true): The forward pass cannot detect periodic connections because the wrapped neighbor has a higher linear index and hasn't been processed yet. This phase reads back provisional labels from the data store and unites labels of similar voxels on opposite boundary faces. Face connectivity checks the 3 axis-aligned face pairs; FaceEdgeVertex checks all 26-neighbor pairs that wrap across any boundary.

Phase 2 — Resolution and Relabeling: Flatten the Union-Find, then two sequential passes remap provisional labels to final feature IDs (assigned in seed-discovery order for compatibility with the original DFS numbering).

Rolling Slice Buffer Pre-Loading

Even after converting to CCL, the areNeighborsSimilar() virtual method still reads input arrays (quaternions, phases, mask, scalar data) through the OOC operator[] path. Each access costs ~50–100 ns due to virtual dispatch → mutex lock → N-D position calculation → chunk cache search → element access → mutex unlock. For EBSD with face connectivity, each voxel triggers ~36 OOC reads (8 quat + 2 phase + 2 validity per backward neighbor × 3 neighbors), totaling 300M+ OOC accesses for a 200^3 dataset.

The prepareForSlice() virtual hook in executeCCL() allows subclasses to pre-load input data for the current Z-slice into local std::vector buffers at the start of each slice iteration. All neighbor comparisons then read from these buffers with zero OOC overhead. The buffers use a rolling 2-slot scheme (slot = iz % 2) so only current + previous Z-slice are in memory. Phase 1b disables buffering via prepareForSlice(-1) since it accesses arbitrary Z-slices, but Phase 1b is O(dim²) so the overhead is negligible.

Memory cost for a 2000³ EBSD dataset: 128 MB (quats) + 32 MB (phases) + 8 MB (mask) = 168 MB for the rolling buffers — acceptable for a workstation processing a 128 GB quats array.

Why This Algorithm

  1. Eliminates random access entirely — every read is from the rolling buffer (RAM) or a sequential chunk load
  2. Memory scales with slice size, not volume — O(dimX × dimY) for the buffer + O(num_features) for Union-Find
  3. Preserves feature numbering order — Phase 2 remaps to match original DFS seed-discovery order
  4. Shared base classexecuteCCL() in SegmentFeatures serves all 3 subclasses via isValidVoxel() and areNeighborsSimilar() virtual methods

Dispatch

Each filter checks IsOutOfCore(*featureIdsArray) || ForceOocAlgorithm():

  • TrueexecuteCCL() (chunk-sequential CCL)
  • Falseexecute() (original DFS flood-fill)

FillBadData Algorithm

Split into FillBadDataBFS.cpp (in-core) and FillBadDataCCL.cpp (OOC) with dispatch in the main algorithm class:

  1. CCL with negative labels for bad-data regions
  2. Flatten Union-Find and accumulate region sizes
  3. Classify regions by size (small → fill, large → preserve as voids)
  4. Iterative morphological dilation to fill marked regions

IdentifySample Algorithm

Split into IdentifySampleBFS.cpp (in-core) and IdentifySampleCCL.cpp (OOC) with shared logic in IdentifySampleCommon.hpp:

  1. CCL to label all connected components in the mask array
  2. Identify the largest component by voxel count (this is the "sample")
  3. Set all non-sample mask voxels to false
  4. Optional hole-filling pass using iterative dilation within the sample bounding box
  5. Slice-by-slice mode processes each 2D slice independently on the specified plane (XY/XZ/YZ)

DispatchAlgorithm<IdentifySampleBFS, IdentifySampleCCL> selects the path at runtime based on mask array residency.

Bug Fixes

  • Restore periodic boundary condition neighbor wrapping in SegmentFeatures — the m_IsPeriodic flag was set but never read. Add periodic wrapping to both DFS (getFaceNeighbors/getAllNeighbors) and CCL (Phase 1b boundary merge) paths.
  • Fix phantom feature from hardcoded seed=0 in DFS execute() — use getSeed() to skip masked voxels, preventing inflated feature counts and ghost Active entries.

Test Restructuring

  • Add SegmentFeaturesTestUtils.hpp with shared builder and verification functions
  • Restructure all 5 filter test suites to the tier-based exemplar pattern:
    • Tier 1 (200^3): exemplar comparison with GENERATE(false, true) for both algorithm paths
    • Tier 2 (15^3): exemplar comparison, every parameter combo with GENERATE(false, true) for both algorithm paths
    • Error, Randomize, High Tolerance, and FaceEdgeVertex connectivity tests
  • Publish 3 exemplar archives to Data_Archive: segment_features_exemplars.tar.gz, fill_bad_data_exemplars.tar.gz, identify_sample_exemplars.tar.gz
  • All exemplar datasets were programmatically validated (feature counts, array shapes, Active array integrity, mask consistency) and manually validated in DREAM3D-NX
  • Orientation test data uses clean Z-layer quaternion scheme (0°/30°/60° X-axis rotations in EBSDlib vector-scalar order) with a merge pair override for visual verification

OOC Performance (200^3 dataset, arm64 Release, 1 Z-slice per chunk)

Filter OOC Before OOC After Speedup
ScalarSegmentFeatures 708.3s ~2.8s 253x
EBSDSegmentFeatures 972.6s ~8.3s 117x
CAxisSegmentFeatures 824.1s ~5.9s 140x
FillBadData 8.6s ~5.6s 1.5x
IdentifySample 825.0s ~3.9s 212x

All timings measure the filter's execute() method only, excluding test setup (data generation, exemplar loading, result comparison). This reflects the actual cost a user would see when running the filter in a pipeline.

FillBadData baseline was already partially OOC-friendly, explaining the modest improvement.

Test Plan

  • 29/29 Group D tests pass (in-core build)
  • 29/29 Group D tests pass (OOC build)
  • All Tier 1 and Tier 2 tests run both in-core and OOC algorithm paths via GENERATE(false, true)
  • FaceEdgeVertex (26-neighbor) connectivity tested for all 3 segment features filters
  • Periodic BC tested in both DFS and CCL paths (Scalar + EBSD)
  • Exemplar datasets programmatically and manually validated in DREAM3D-NX

@joeykleingers joeykleingers force-pushed the worktree-OptimizeGroupD branch 3 times, most recently from 93b4565 to 935f5de Compare March 5, 2026 14:58
@imikejackson imikejackson changed the title ENH: OOC optimization for Group D (CCL/Segmentation) ENH: OoC optimization for Segmentation Filters Mar 5, 2026
@joeykleingers joeykleingers force-pushed the worktree-OptimizeGroupD branch 2 times, most recently from 0007b93 to ca4ef6a Compare March 5, 2026 17:53
@imikejackson imikejackson changed the title ENH: OoC optimization for Segmentation Filters ENH: Segmentation Filters OoC Optimization Mar 9, 2026
@joeykleingers joeykleingers marked this pull request as draft March 10, 2026 01:16
@joeykleingers joeykleingers force-pushed the worktree-OptimizeGroupD branch from ca4ef6a to 1f9a9d5 Compare March 16, 2026 14:06
Consolidate OOC filter optimizations from identify-sample-optimizations worktree:

- Add AlgorithmDispatch.hpp and UnionFind.hpp utilities
- SegmentFeatures: Add executeCCL() with 2-slice rolling buffer + Union-Find
- ScalarSegmentFeatures: CCL dispatch + CompareFunctor::compare()
- EBSDSegmentFeatures: CCL dispatch + isValidVoxel/areNeighborsSimilar
- CAxisSegmentFeatures: CCL dispatch + isValidVoxel/areNeighborsSimilar
- Tests: PreferencesSentinel, ForceOocAlgorithmGuard, 200^3 benchmarks
Update IdentifySample and FillBadData to use the AlgorithmDispatch
BFS/CCL split pattern instead of monolithic inlined algorithms.
Add BFS/CCL split files, update tests with ForceOocAlgorithmGuard,
PreferencesSentinel, and 200x200x200 benchmark test cases.
Benchmark tests should let the dispatch happen naturally based on
storage type, not force the OOC algorithm path. ForceOocAlgorithmGuard
remains in correctness tests to exercise both code paths.
- Fix incorrect global index in IdentifySampleCommon hole-fill for
  XZ/YZ planes (was using flat local index instead of stride-based
  global index)
- Move dp1/dp2 arrays to static constexpr class members in
  IdentifySampleSliceBySliceFunctor
- Fix stale temp file reads in FillBadDataCCL phase 4 by tracking
  write position (rewind doesn't truncate)
- Fix int32 truncation of uint64 region size in FillBadDataCCL
  phase 3 small-defect comparison
- Propagate tmpfile() failure as Result<> error instead of silent
  message-only failure
- Remove const from FillBadDataCCL::operator()() to match non-const
  member usage
… bugs

- Fix k_ prefix on benchmark test constants (kDimX → k_DimX, etc.) in 5 test files
- Add const to FillBadDataInputValues* across FillBadData/BFS/CCL headers and sources
- Add ftell error check in FillBadDataCCL phase 4
- Replace // comment with /// @copydoc Doxygen on CCL virtual method overrides
- Hoist GetAllChildDataPaths() out of iterative fill loop in FillBadDataBFS (pre-existing)
- Fix float32 boundary checks to int64 in FillBadDataBFS iterative fill (pre-existing)


Replace /// @copydoc shorthand with proper multi-line /** */ Doxygen
blocks including @brief, @param, and @return for isValidVoxel() and
areNeighborsSimilar() overrides.
- ScalarSegmentFeatures.hpp: Replace // comment with proper /** */ blocks
  for isValidVoxel() and areNeighborsSimilar() overrides
- IdentifySampleCommon.hpp: Add @class/@struct tags, add @brief/@param/@return
  to VectorUnionFind public methods and IdentifySampleSliceBySliceFunctor
- SegmentFeatures.hpp: Fill in empty @param/@return descriptions on executeCCL()
…asses

Add @brief, @param, and @return documentation to public constructors
and operator()() methods on FillBadData, FillBadDataBFS, FillBadDataCCL,
IdentifySampleBFS, and IdentifySampleCCL per doxygen-comments skill.
- Use simplnx type aliases (int64, usize) consistently in header
  instead of mixing int64_t/size_t with int64/usize
- Merge maxPhase and numFeatures loops into single pass to avoid
  redundant full-volume scan on OOC data
- Check fwrite return value in Phase 4 temp file writes
- Replace ftell-based writeEnd tracking with pairsWritten counter
  to avoid long overflow on 32-bit platforms
- Add cancellation check per-chunk in Phase 4 Pass 1 inner loop

Signed-off-by: Joey Kleingers <joey.kleingers@bluequartzsoft.com>
IdentifySampleCommon: The checked vector was not being reset between
slices in IdentifySampleSliceBySliceFunctor, causing incorrect results
on slices after the first.

ScalarSegmentFeatures: The Active array was only setting index 0 to 0
after resizeTuples, leaving other entries uninitialized. Now fills the
entire array with 1 (active) then sets index 0 to 0 (reserved).
Add SegmentFeaturesTestUtils.hpp with reusable builder and verification
functions for segment features filter tests. Includes geometry builders,
scalar/orientation test data generators, spherical mask builder, and
shared verification helpers used across ScalarSegmentFeatures,
EBSDSegmentFeatures, and CAxisSegmentFeatures tests.
The periodic BC flag (m_IsPeriodic) was being set by segment features
filters but never read by any neighbor-finding code. Boundary voxels
were simply skipped instead of wrapping to the opposite face.

DFS path: getFaceNeighbors() and getAllNeighbors() now accept an
isPeriodic parameter and return wrapped neighbor indices at boundaries.

CCL path: Added Phase 1b periodic boundary merge after the forward
pass. The forward scan cannot detect periodic connections because the
wrapped neighbor has a higher linear index and hasn't been processed
yet. Phase 1b iterates over opposite boundary face pairs and unites
labels of similar voxels in the union-find. Face connectivity checks
the 3 axis-aligned face pairs; FaceEdgeVertex connectivity checks all
26-neighbor pairs that wrap across any boundary.

Verified: 15^3 block-patterned data with blockSize=5 now produces 8
features with periodic BC (was incorrectly producing 27).
The DFS execute() path hardcoded seed=0 as the first voxel to process.
When voxel 0 was masked, the loop still incremented gnum, creating a
phantom feature with Active=1 but zero voxels. This caused FeatureIds
to start at 2 instead of 1 and inflated the feature count by 1.

Fix: initialize seed via getSeed(gnum, 0) which skips masked voxels.
When all voxels are masked, getSeed returns -1 immediately, the loop
never executes, and m_FoundFeatures=0 triggers the -87000 error.
Replace Fibonacci hemisphere (CAxis) and composed-rotation (EBSD)
orientation generators with a simple Z-layer scheme shared by both:
  z=0: 0° X-rotation,  z=1: 30° X-rotation,  z=2: 60° X-rotation

Adjacent layers differ by 30 degrees (well above 5 degree tolerance),
producing 3 distinct features as horizontal color bands that are
trivial to verify visually in DREAM3D-NX.

One merge-pair override: block (1,1,1) gets 0 degrees instead of 30,
merging into the z=0 layer through its face neighbor. This verifies
the filter correctly merges blocks within tolerance.

Also fixes quaternion storage order: EBSDlib uses vector-scalar
(x,y,z,w) layout, not scalar-vector (w,x,y,z).
Restructure test suites for ScalarSegmentFeatures, EBSDSegmentFeatures,
CAxisSegmentFeatures, FillBadData, and IdentifySample to use the
tier-based exemplar testing pattern:

- Tier 1 (200x200x200): exemplar comparison with forceOocAlgo GENERATE
  to prove in-core == OOC == golden reference
- Tier 2 (15x15x15): small correctness tests covering all parameter
  combos with exemplar comparison
- Error tests: verify -87000 when all voxels masked
- Randomize tests: verify feature count and ID permutation properties

Input data is generated on-the-fly by shared builder functions in
SegmentFeaturesTestUtils.hpp. Exemplar archives contain only validated
output arrays (no input data).

Archives published to Data_Archive GitHub release:
  segment_features_exemplars.tar.gz (shared by Scalar/EBSD/CAxis)
  fill_bad_data_exemplars.tar.gz
  identify_sample_exemplars.tar.gz
@joeykleingers joeykleingers force-pushed the worktree-OptimizeGroupD branch from 1f9a9d5 to 1b79d2b Compare March 16, 2026 14:07
IdentifySample had BFS and CCL algorithm classes but never called them.
The operator() still used the old inline functors, bypassing the OOC
dispatch entirely. Replace with DispatchAlgorithm<BFS, CCL> and remove
the dead functor code.

OOC timing: 825s → 3.9s (212x speedup on 200^3 dataset).
Add shared RunFaceEdgeVertexConnectivityTest that verifies vertex and
edge neighbor merging on a 3x3x3 grid with diagonal voxel pairs. Tests
all three segment features filters (Scalar, EBSD, CAxis) with both
Face (expects 5 features) and FaceEdgeVertex (expects 3 features)
neighbor schemes.

Change EBSD large test to use isPeriodic=true without mask (sphere mask
eliminates boundary voxels, defeating periodic). This ensures the CCL
Phase 1b periodic boundary merge is tested on the OOC path.

Fix orientation data Z-layer index from min(bz, 2) to bz % 3 for a
repeating stripe pattern that makes periodic merging visually verifiable
on large datasets.
Regenerated segment_features_exemplars.tar.gz with EBSD large periodic
(no mask) and bz%3 repeating stripe pattern for all orientation data.
Every parameter combo in the small correctness tests now runs through
both the in-core (DFS/BFS) and OOC (CCL) algorithm paths via
GENERATE(false, true) with ForceOocAlgorithmGuard. This ensures the
OOC algorithm is tested for all parameter combinations, not just the
single combo exercised by the Tier 1 large test.
@joeykleingers joeykleingers changed the title ENH: Segmentation Filters OoC Optimization PERF: Out-of-core optimizations for Group D filters Mar 16, 2026
@joeykleingers joeykleingers marked this pull request as ready for review March 16, 2026 17:56
Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
@imikejackson imikejackson changed the title PERF: Out-of-core optimizations for Group D filters PERF: SegmentFeature Filters, IdentifySample, FillBadData OoC Optimization Mar 16, 2026
Add a prepareForSlice() virtual hook to SegmentFeatures::executeCCL()
that allows subclasses to pre-load input data into local vectors before
processing each Z-slice. This eliminates per-element OOC overhead
(mutex + virtual dispatch + cache search) during neighbor comparisons.

Implement slice buffering for all three segment features filters:
- EBSDSegmentFeatures: buffers Quats, Phases, Mask
- CAxisSegmentFeatures: buffers Quats, Phases, Mask
- ScalarSegmentFeatures: buffers scalar data (as float64), Mask

The buffers use a rolling 2-slot scheme (slot = iz % 2) so only
current + previous Z-slice are in memory at any time. Phase 1b
(periodic boundary merge) disables buffering since it accesses
arbitrary Z-slices, but this is O(dim^2) so the overhead is negligible.
Combine the separate label-discovery and relabeling passes into one
chunk-sequential pass. Each voxel is read once, its final feature ID
is discovered or looked up, and written back immediately. This halves
the OOC accesses in Phase 2 from 3N to 2N.
…ementations

Add detailed inline comments explaining algorithms, data access patterns,
and design decisions for all 8 algorithm files in the Group D optimization:

- SegmentFeatures.cpp: DFS flood-fill (execute) and CCL scanline
  (executeCCL) with rolling buffer, periodic merge, and single-pass
  resolution/relabeling
- EBSDSegmentFeatures.cpp: Misorientation-based segmentation with
  slice buffer pre-loading for OOC quaternion/phase/mask access
- CAxisSegmentFeatures.cpp: C-axis misalignment segmentation with
  slice buffer pre-loading
- ScalarSegmentFeatures.cpp: Tolerance-based scalar segmentation with
  type-dispatched float64 slice buffering
- FillBadDataBFS.cpp: BFS flood-fill for region classification and
  iterative morphological dilation
- FillBadDataCCL.cpp: Four-phase CCL with on-disk deferred fill
- IdentifySampleBFS.cpp: BFS largest-component detection and hole fill
- IdentifySampleCCL.cpp: CCL with replay-based label re-derivation
  to avoid O(volume) label storage
@joeykleingers joeykleingers marked this pull request as draft March 17, 2026 08:49
Pre-read each 2D slice into a local buffer before BFS to eliminate
chunk thrashing in IdentifySampleSliceBySliceFunctor. For XZ and YZ
planes, stride2 jumps by dimX*dimY per Z-step, hitting a different
chunk on every BFS neighbor access. The sequential pre-read loads
data in z-then-y order (staying within each chunk), and all BFS
operations run on the in-memory buffer. Write-back uses the same
sequential access pattern.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant