PERF: SegmentFeature Filters, IdentifySample, FillBadData OoC Optimization#1559
Draft
joeykleingers wants to merge 25 commits intoBlueQuartzSoftware:developfrom
Draft
PERF: SegmentFeature Filters, IdentifySample, FillBadData OoC Optimization#1559joeykleingers wants to merge 25 commits intoBlueQuartzSoftware:developfrom
joeykleingers wants to merge 25 commits intoBlueQuartzSoftware:developfrom
Conversation
93b4565 to
935f5de
Compare
0007b93 to
ca4ef6a
Compare
ca4ef6a to
1f9a9d5
Compare
Consolidate OOC filter optimizations from identify-sample-optimizations worktree: - Add AlgorithmDispatch.hpp and UnionFind.hpp utilities - SegmentFeatures: Add executeCCL() with 2-slice rolling buffer + Union-Find - ScalarSegmentFeatures: CCL dispatch + CompareFunctor::compare() - EBSDSegmentFeatures: CCL dispatch + isValidVoxel/areNeighborsSimilar - CAxisSegmentFeatures: CCL dispatch + isValidVoxel/areNeighborsSimilar - Tests: PreferencesSentinel, ForceOocAlgorithmGuard, 200^3 benchmarks
Update IdentifySample and FillBadData to use the AlgorithmDispatch BFS/CCL split pattern instead of monolithic inlined algorithms. Add BFS/CCL split files, update tests with ForceOocAlgorithmGuard, PreferencesSentinel, and 200x200x200 benchmark test cases.
Benchmark tests should let the dispatch happen naturally based on storage type, not force the OOC algorithm path. ForceOocAlgorithmGuard remains in correctness tests to exercise both code paths.
- Fix incorrect global index in IdentifySampleCommon hole-fill for XZ/YZ planes (was using flat local index instead of stride-based global index) - Move dp1/dp2 arrays to static constexpr class members in IdentifySampleSliceBySliceFunctor - Fix stale temp file reads in FillBadDataCCL phase 4 by tracking write position (rewind doesn't truncate) - Fix int32 truncation of uint64 region size in FillBadDataCCL phase 3 small-defect comparison - Propagate tmpfile() failure as Result<> error instead of silent message-only failure - Remove const from FillBadDataCCL::operator()() to match non-const member usage
… bugs - Fix k_ prefix on benchmark test constants (kDimX → k_DimX, etc.) in 5 test files - Add const to FillBadDataInputValues* across FillBadData/BFS/CCL headers and sources - Add ftell error check in FillBadDataCCL phase 4 - Replace // comment with /// @copydoc Doxygen on CCL virtual method overrides - Hoist GetAllChildDataPaths() out of iterative fill loop in FillBadDataBFS (pre-existing) - Fix float32 boundary checks to int64 in FillBadDataBFS iterative fill (pre-existing)
- ScalarSegmentFeatures.hpp: Replace // comment with proper /** */ blocks for isValidVoxel() and areNeighborsSimilar() overrides - IdentifySampleCommon.hpp: Add @class/@struct tags, add @brief/@param/@return to VectorUnionFind public methods and IdentifySampleSliceBySliceFunctor - SegmentFeatures.hpp: Fill in empty @param/@return descriptions on executeCCL()
- Use simplnx type aliases (int64, usize) consistently in header instead of mixing int64_t/size_t with int64/usize - Merge maxPhase and numFeatures loops into single pass to avoid redundant full-volume scan on OOC data - Check fwrite return value in Phase 4 temp file writes - Replace ftell-based writeEnd tracking with pairsWritten counter to avoid long overflow on 32-bit platforms - Add cancellation check per-chunk in Phase 4 Pass 1 inner loop Signed-off-by: Joey Kleingers <joey.kleingers@bluequartzsoft.com>
IdentifySampleCommon: The checked vector was not being reset between slices in IdentifySampleSliceBySliceFunctor, causing incorrect results on slices after the first. ScalarSegmentFeatures: The Active array was only setting index 0 to 0 after resizeTuples, leaving other entries uninitialized. Now fills the entire array with 1 (active) then sets index 0 to 0 (reserved).
Add SegmentFeaturesTestUtils.hpp with reusable builder and verification functions for segment features filter tests. Includes geometry builders, scalar/orientation test data generators, spherical mask builder, and shared verification helpers used across ScalarSegmentFeatures, EBSDSegmentFeatures, and CAxisSegmentFeatures tests.
The periodic BC flag (m_IsPeriodic) was being set by segment features filters but never read by any neighbor-finding code. Boundary voxels were simply skipped instead of wrapping to the opposite face. DFS path: getFaceNeighbors() and getAllNeighbors() now accept an isPeriodic parameter and return wrapped neighbor indices at boundaries. CCL path: Added Phase 1b periodic boundary merge after the forward pass. The forward scan cannot detect periodic connections because the wrapped neighbor has a higher linear index and hasn't been processed yet. Phase 1b iterates over opposite boundary face pairs and unites labels of similar voxels in the union-find. Face connectivity checks the 3 axis-aligned face pairs; FaceEdgeVertex connectivity checks all 26-neighbor pairs that wrap across any boundary. Verified: 15^3 block-patterned data with blockSize=5 now produces 8 features with periodic BC (was incorrectly producing 27).
The DFS execute() path hardcoded seed=0 as the first voxel to process. When voxel 0 was masked, the loop still incremented gnum, creating a phantom feature with Active=1 but zero voxels. This caused FeatureIds to start at 2 instead of 1 and inflated the feature count by 1. Fix: initialize seed via getSeed(gnum, 0) which skips masked voxels. When all voxels are masked, getSeed returns -1 immediately, the loop never executes, and m_FoundFeatures=0 triggers the -87000 error.
Replace Fibonacci hemisphere (CAxis) and composed-rotation (EBSD) orientation generators with a simple Z-layer scheme shared by both: z=0: 0° X-rotation, z=1: 30° X-rotation, z=2: 60° X-rotation Adjacent layers differ by 30 degrees (well above 5 degree tolerance), producing 3 distinct features as horizontal color bands that are trivial to verify visually in DREAM3D-NX. One merge-pair override: block (1,1,1) gets 0 degrees instead of 30, merging into the z=0 layer through its face neighbor. This verifies the filter correctly merges blocks within tolerance. Also fixes quaternion storage order: EBSDlib uses vector-scalar (x,y,z,w) layout, not scalar-vector (w,x,y,z).
Restructure test suites for ScalarSegmentFeatures, EBSDSegmentFeatures, CAxisSegmentFeatures, FillBadData, and IdentifySample to use the tier-based exemplar testing pattern: - Tier 1 (200x200x200): exemplar comparison with forceOocAlgo GENERATE to prove in-core == OOC == golden reference - Tier 2 (15x15x15): small correctness tests covering all parameter combos with exemplar comparison - Error tests: verify -87000 when all voxels masked - Randomize tests: verify feature count and ID permutation properties Input data is generated on-the-fly by shared builder functions in SegmentFeaturesTestUtils.hpp. Exemplar archives contain only validated output arrays (no input data). Archives published to Data_Archive GitHub release: segment_features_exemplars.tar.gz (shared by Scalar/EBSD/CAxis) fill_bad_data_exemplars.tar.gz identify_sample_exemplars.tar.gz
1f9a9d5 to
1b79d2b
Compare
IdentifySample had BFS and CCL algorithm classes but never called them. The operator() still used the old inline functors, bypassing the OOC dispatch entirely. Replace with DispatchAlgorithm<BFS, CCL> and remove the dead functor code. OOC timing: 825s → 3.9s (212x speedup on 200^3 dataset).
Add shared RunFaceEdgeVertexConnectivityTest that verifies vertex and edge neighbor merging on a 3x3x3 grid with diagonal voxel pairs. Tests all three segment features filters (Scalar, EBSD, CAxis) with both Face (expects 5 features) and FaceEdgeVertex (expects 3 features) neighbor schemes. Change EBSD large test to use isPeriodic=true without mask (sphere mask eliminates boundary voxels, defeating periodic). This ensures the CCL Phase 1b periodic boundary merge is tested on the OOC path. Fix orientation data Z-layer index from min(bz, 2) to bz % 3 for a repeating stripe pattern that makes periodic merging visually verifiable on large datasets.
Regenerated segment_features_exemplars.tar.gz with EBSD large periodic (no mask) and bz%3 repeating stripe pattern for all orientation data.
Every parameter combo in the small correctness tests now runs through both the in-core (DFS/BFS) and OOC (CCL) algorithm paths via GENERATE(false, true) with ForceOocAlgorithmGuard. This ensures the OOC algorithm is tested for all parameter combinations, not just the single combo exercised by the Tier 1 large test.
Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
Add a prepareForSlice() virtual hook to SegmentFeatures::executeCCL() that allows subclasses to pre-load input data into local vectors before processing each Z-slice. This eliminates per-element OOC overhead (mutex + virtual dispatch + cache search) during neighbor comparisons. Implement slice buffering for all three segment features filters: - EBSDSegmentFeatures: buffers Quats, Phases, Mask - CAxisSegmentFeatures: buffers Quats, Phases, Mask - ScalarSegmentFeatures: buffers scalar data (as float64), Mask The buffers use a rolling 2-slot scheme (slot = iz % 2) so only current + previous Z-slice are in memory at any time. Phase 1b (periodic boundary merge) disables buffering since it accesses arbitrary Z-slices, but this is O(dim^2) so the overhead is negligible.
Combine the separate label-discovery and relabeling passes into one chunk-sequential pass. Each voxel is read once, its final feature ID is discovered or looked up, and written back immediately. This halves the OOC accesses in Phase 2 from 3N to 2N.
…ementations Add detailed inline comments explaining algorithms, data access patterns, and design decisions for all 8 algorithm files in the Group D optimization: - SegmentFeatures.cpp: DFS flood-fill (execute) and CCL scanline (executeCCL) with rolling buffer, periodic merge, and single-pass resolution/relabeling - EBSDSegmentFeatures.cpp: Misorientation-based segmentation with slice buffer pre-loading for OOC quaternion/phase/mask access - CAxisSegmentFeatures.cpp: C-axis misalignment segmentation with slice buffer pre-loading - ScalarSegmentFeatures.cpp: Tolerance-based scalar segmentation with type-dispatched float64 slice buffering - FillBadDataBFS.cpp: BFS flood-fill for region classification and iterative morphological dilation - FillBadDataCCL.cpp: Four-phase CCL with on-disk deferred fill - IdentifySampleBFS.cpp: BFS largest-component detection and hole fill - IdentifySampleCCL.cpp: CCL with replay-based label re-derivation to avoid O(volume) label storage
Pre-read each 2D slice into a local buffer before BFS to eliminate chunk thrashing in IdentifySampleSliceBySliceFunctor. For XZ and YZ planes, stride2 jumps by dimX*dimY per Z-step, hitting a different chunk on every BFS neighbor access. The sequential pre-read loads data in z-then-y order (staying within each chunk), and all BFS operations run on the in-memory buffer. Write-back uses the same sequential access pattern.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
SegmentFeaturesbase class, replacing DFS flood-fill for OOC pathsUnionFind.hpputility class with path-halving compression and union-by-rankAlgorithm Decision
Problem
The original DFS flood-fill in
SegmentFeaturesuses random data access — each stack pop reads an arbitrary voxel then checks 6 scattered neighbors. For in-coreDataStorethis is fast, but forZarrStore-backed arrays every random access can trigger a chunk load/evict cycle, causing 3,000x–8,000x slowdowns on 200^3 datasets.Solution: Chunk-Sequential CCL with Union-Find (Segment Features)
A multi-phase scanline algorithm that processes the grid in strict Z-Y-X order:
Phase 1 — Forward Labeling: Iterate every voxel in scanline order. For each valid voxel, check only backward neighbors (-X, -Y, -Z for face connectivity, or all 13 backward neighbors for FaceEdgeVertex). Assign provisional labels and union equivalences in a Union-Find structure. A 2-slice rolling buffer (current + previous Z-slice) keeps all neighbor reads in RAM.
Phase 1b — Periodic Boundary Merge (when isPeriodic=true): The forward pass cannot detect periodic connections because the wrapped neighbor has a higher linear index and hasn't been processed yet. This phase reads back provisional labels from the data store and unites labels of similar voxels on opposite boundary faces. Face connectivity checks the 3 axis-aligned face pairs; FaceEdgeVertex checks all 26-neighbor pairs that wrap across any boundary.
Phase 2 — Resolution and Relabeling: Flatten the Union-Find, then two sequential passes remap provisional labels to final feature IDs (assigned in seed-discovery order for compatibility with the original DFS numbering).
Rolling Slice Buffer Pre-Loading
Even after converting to CCL, the
areNeighborsSimilar()virtual method still reads input arrays (quaternions, phases, mask, scalar data) through the OOCoperator[]path. Each access costs ~50–100 ns due to virtual dispatch → mutex lock → N-D position calculation → chunk cache search → element access → mutex unlock. For EBSD with face connectivity, each voxel triggers ~36 OOC reads (8 quat + 2 phase + 2 validity per backward neighbor × 3 neighbors), totaling 300M+ OOC accesses for a 200^3 dataset.The
prepareForSlice()virtual hook inexecuteCCL()allows subclasses to pre-load input data for the current Z-slice into localstd::vectorbuffers at the start of each slice iteration. All neighbor comparisons then read from these buffers with zero OOC overhead. The buffers use a rolling 2-slot scheme (slot = iz % 2) so only current + previous Z-slice are in memory. Phase 1b disables buffering viaprepareForSlice(-1)since it accesses arbitrary Z-slices, but Phase 1b is O(dim²) so the overhead is negligible.Memory cost for a 2000³ EBSD dataset: 128 MB (quats) + 32 MB (phases) + 8 MB (mask) = 168 MB for the rolling buffers — acceptable for a workstation processing a 128 GB quats array.
Why This Algorithm
executeCCL()inSegmentFeaturesserves all 3 subclasses viaisValidVoxel()andareNeighborsSimilar()virtual methodsDispatch
Each filter checks
IsOutOfCore(*featureIdsArray) || ForceOocAlgorithm():executeCCL()(chunk-sequential CCL)execute()(original DFS flood-fill)FillBadData Algorithm
Split into
FillBadDataBFS.cpp(in-core) andFillBadDataCCL.cpp(OOC) with dispatch in the main algorithm class:IdentifySample Algorithm
Split into
IdentifySampleBFS.cpp(in-core) andIdentifySampleCCL.cpp(OOC) with shared logic inIdentifySampleCommon.hpp:DispatchAlgorithm<IdentifySampleBFS, IdentifySampleCCL>selects the path at runtime based on mask array residency.Bug Fixes
SegmentFeatures— them_IsPeriodicflag was set but never read. Add periodic wrapping to both DFS (getFaceNeighbors/getAllNeighbors) and CCL (Phase 1b boundary merge) paths.seed=0in DFSexecute()— usegetSeed()to skip masked voxels, preventing inflated feature counts and ghost Active entries.Test Restructuring
SegmentFeaturesTestUtils.hppwith shared builder and verification functionsGENERATE(false, true)for both algorithm pathsGENERATE(false, true)for both algorithm pathssegment_features_exemplars.tar.gz,fill_bad_data_exemplars.tar.gz,identify_sample_exemplars.tar.gzOOC Performance (200^3 dataset, arm64 Release, 1 Z-slice per chunk)
All timings measure the filter's
execute()method only, excluding test setup (data generation, exemplar loading, result comparison). This reflects the actual cost a user would see when running the filter in a pipeline.FillBadData baseline was already partially OOC-friendly, explaining the modest improvement.
Test Plan
GENERATE(false, true)