GH-50026: [C++][Parquet] SIMD-accelerate SBBF probe via branchless autovec #50030
dmatth1 wants to merge 2 commits into
…ess autovec

Rewrite BlockSplitBloomFilter::FindHash from a short-circuit early-exit loop to a branchless OR-accumulator reduction. The early `return false` blocked compilers from collapsing the 8-lane probe to a horizontal block test; the reduction autovectorizes to a single SSE/NEON block test on clang, gcc, and MSVC.

Wire the probe through CpuInfo runtime dispatch, mirroring the existing level_comparison_avx2 pattern. The shared body in bloom_filter_block_inc.h is built once at the baseline (SSE on x86, NEON on aarch64) and once in bloom_filter_avx2.cc compiled with `-mavx2`. The AVX2 TU spells the reduction in xsimd rather than relying on autovec: clang lowers the autovec body to a single vptest, but gcc/MSVC emit a longer horizontal vpor reduction that costs ~20% out-of-L3. xsimd is guaranteed available under ARROW_HAVE_RUNTIME_AVX2.

A new cross-target diff test calls both probe bodies directly across 20K random + 200 production-populated blocks per CI run, so neither path can silently drift. A static_assert ties the 8-lane assumption to BlockSplitBloomFilter::kBitsSetPerBlock.

On-disk format unchanged. SALT, XXH64, bucket index unchanged. Bit-identical to the scalar reference.

End-to-end FindHash perf via parquet/benches/bloom_filter_benchmark.cc.

M1 (Apple clang -O3, NEON via autovec, 10 reps, CV<=0.4%):

| Bench | upstream/main (scalar) | simd-sbbf-autovec | Speedup |
|---|---|---|---|
| BM_FindExistingHash (hit-heavy) | 3.85 ns/probe (259.6 M/s) | 2.41 ns/probe (415.1 M/s) | 1.60x |
| BM_FindNonExistingHash (miss-heavy) | 9.04 ns/probe (110.6 M/s) | 2.41 ns/probe (415.4 M/s) | 3.75x |

x86-64 (gcc 13.3, -O2 -mavx2 via AVX2 dispatch TU, 5 reps, CV<=0.6%):

| Bench | upstream/main (scalar) | simd-sbbf-autovec | Speedup |
|---|---|---|---|
| BM_FindExistingHash (hit-heavy) | 8.62 ns/probe (116.0 M/s) | 4.32 ns/probe (231.6 M/s) | 2.00x |
| BM_FindNonExistingHash (miss-heavy) | 15.29 ns/probe (65.4 M/s) | 4.33 ns/probe (230.8 M/s) | 3.53x |

The scalar miss path stalls on the data-dependent early-exit (slower than its own hit path on both archs); the branchless reduction is constant-time across hit and miss. Miss-heavy is the common case for Parquet row-group skipping.

Insert/ComputeHash/batch paths unchanged (16 benches within +/-0.6%). Cache-regime sweep in the PR description. Insert path uses the same loop shape and follows in a separate PR.
Pull request overview
This PR accelerates Parquet’s BlockSplitBloomFilter::FindHash probe by reshaping the scalar short-circuit loop into a branchless reduction that autovectorizes, and by adding an AVX2 runtime-dispatched probe kernel for x86 targets.
Changes:
- Rework `BlockSplitBloomFilter::FindHash` to call a dispatchable per-block probe (`FindHashBlockImpl`) implemented as a branchless OR-accumulator reduction.
- Add an AVX2-specific probe implementation in a separate translation unit (`bloom_filter_avx2.cc`) using xsimd, wired through `DynamicDispatch`.
- Add a kernel agreement test that compares baseline vs AVX2 implementations on AVX2-capable hosts.
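To make the two probe shapes concrete, here is a minimal scalar sketch. It is not the code from this PR: the helper names (`LaneMask`, `ProbeShortCircuit`, `ProbeBranchless`, `InsertIntoBlock`) are hypothetical, and only the salt constants and the `(key * salt) >> 27` bit selection are taken from the Parquet split-block Bloom filter specification. The first function returns at the first missing bit; the second accumulates every missing bit and tests once, which is the loop shape a compiler can vectorize.

```cpp
#include <cstdint>

// Salt constants from the Parquet split-block Bloom filter specification.
constexpr uint32_t kSalt[8] = {0x47b6137bU, 0x44974d91U, 0x8824ad5bU, 0xa2b7289dU,
                               0x705495c7U, 0x2df1424bU, 0x9efc4947U, 0x5c6bfb31U};

// Hypothetical helper: the single bit that lane `lane` tests for `key`.
inline uint32_t LaneMask(uint32_t key, int lane) {
  return uint32_t{1} << ((key * kSalt[lane]) >> 27);
}

// Short-circuit shape: leaves the loop as soon as one lane misses.
bool ProbeShortCircuit(const uint32_t* block, uint32_t key) {
  for (int lane = 0; lane < 8; ++lane) {
    if ((block[lane] & LaneMask(key, lane)) == 0) return false;
  }
  return true;
}

// Branchless shape: OR together every required-but-unset bit, then test once.
bool ProbeBranchless(const uint32_t* block, uint32_t key) {
  uint32_t missing = 0;
  for (int lane = 0; lane < 8; ++lane) {
    missing |= ~block[lane] & LaneMask(key, lane);
  }
  return missing == 0;
}

// Hypothetical helper for experiments: set the eight bits that `key` maps to.
void InsertIntoBlock(uint32_t* block, uint32_t key) {
  for (int lane = 0; lane < 8; ++lane) block[lane] |= LaneMask(key, lane);
}
```

Both functions return the same answer for every block and key; the difference is only in control flow, so the branchless version does the same work on hits and misses.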
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| cpp/src/parquet/CMakeLists.txt | Adds bloom_filter_avx2.cc to Parquet sources under runtime-AVX2 builds and applies AVX2 compile flags. |
| cpp/src/parquet/bloom_filter.cc | Introduces DynamicDispatch plumbing and routes FindHash through the new per-block probe kernels. |
| cpp/src/parquet/bloom_filter_test.cc | Adds an AVX2-only cross-kernel agreement test and includes the baseline/AVX2 probe entrypoints. |
| cpp/src/parquet/bloom_filter_block_inc.h | New header containing the baseline branchless per-block probe implementation. |
| cpp/src/parquet/bloom_filter_avx2.cc | New AVX2 probe kernel implementation using xsimd. |
| cpp/src/parquet/bloom_filter_avx2_internal.h | New internal header declaring the AVX2 probe entrypoint (exported for Windows/MinGW test usage). |
ARROW_DISPATCH_TARGET_NONE(&standard::FindHashBlockImpl)  //
ARROW_DISPATCH_TARGET_AVX2(&FindHashBlockAvx2)  //
Is this correct? I don't see other similar places have #if defined(ARROW_HAVE_RUNTIME_AVX2) protection. cc @AntoinePrv
> ARROW_DISPATCH_TARGET_AVX2 expands when either ARROW_HAVE_RUNTIME_AVX2 or ARROW_HAVE_AVX2 is defined (see arrow/util/dispatch_internal.h).

This yes.

> In a build with an AVX2 baseline (-DARROW_HAVE_AVX2) but with runtime dispatch disabled (ARROW_HAVE_RUNTIME_AVX2 unset), this targets list will still try to reference FindHashBlockAvx2

This too.

> even though bloom_filter_avx2_internal.h isn’t included and bloom_filter_avx2.cc isn’t built

I thought in practice CMake will force ARROW_HAVE_RUNTIME_AVX2 with ARROW_HAVE_AVX2, but I cannot find any code to support this claim. If so, then this is a problem general to Arrow.

Regardless of the build tool, a defensive pattern `defined(ARROW_HAVE_AVX2) || defined(ARROW_HAVE_RUNTIME_AVX2)` would be welcome.
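The defensive guard suggested above could look roughly like the following standalone sketch. It does not use Arrow's `dispatch_internal.h`; `ProbeFn`, `ProbeBaseline`, `ProbeAvx2`, and `ResolveProbe` are hypothetical names, the kernel bodies are placeholders, and the only thing borrowed from the thread is the macro condition. The point it illustrates is that the AVX2 symbol is declared and referenced under the same condition, so a build that defines neither macro never mentions it.

```cpp
#include <cstdint>

using ProbeFn = bool (*)(const uint32_t* block, uint32_t key);

// Placeholder baseline kernel (not the real probe): tests one bit of lane 0.
bool ProbeBaseline(const uint32_t* block, uint32_t key) {
  return (block[0] & (uint32_t{1} << (key & 31u))) != 0;
}

// The AVX2 entry point exists only when some AVX2 build mode is enabled.
#if defined(ARROW_HAVE_AVX2) || defined(ARROW_HAVE_RUNTIME_AVX2)
// Stand-in body; a real kernel would live in its own -mavx2 translation unit.
bool ProbeAvx2(const uint32_t* block, uint32_t key) {
  return ProbeBaseline(block, key);
}
#endif

// Hypothetical resolver: picks the best kernel for the CPU it is told about.
ProbeFn ResolveProbe(bool cpu_has_avx2) {
  ProbeFn best = &ProbeBaseline;
#if defined(ARROW_HAVE_AVX2) || defined(ARROW_HAVE_RUNTIME_AVX2)
  // Same condition as the declaration, so the reference can never dangle.
  if (cpu_has_avx2) best = &ProbeAvx2;
#else
  (void)cpu_has_avx2;  // No AVX2 target was compiled in.
#endif
  return best;
}
```

With neither macro defined, `ResolveProbe` always returns the baseline kernel and the translation unit links without any AVX2 object file.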
> I don't see other similar places have #if defined(ARROW_HAVE_RUNTIME_AVX2) protection

It is not needed, ARROW_DISPATCH_TARGET_AVX2 handles that.
Thanks @AntoinePrv for confirming it. So it looks like we have issues in other similar parts.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
I wonder how the AVX2 path would be faster than the scalar path 🤔
Branchless body alone (no xsimd kernel) on AVX2:
Cache regime sweep: scalar vs xsimd, post-hash probe latency:
These numbers are with the Can re-bench in-tree with the commit if you want directly-comparable numbers.
// PARQUET_EXPORT so the symbol is visible from parquet_shared on Windows MinGW
// (default visibility is hidden) -- the cross-target diff test calls this
// directly.
I don't think we need this comment
namespace parquet::internal::PARQUET_IMPL_NAMESPACE {

// Branchless OR-accumulator reduction: the short-circuit `return false` shape
This comment refers to the original implementation, which is now outdated, so it may confuse future readers. Should we remove it or simplify it to be more straightforward?
AntoinePrv left a comment
If we add an xsimd implementation, I wonder if it is worth using it for Neon/SSE.
- On the one hand the current autovec works and is minimal to maintain/test.
- On the other hand autovec is a black box.
Though with a bit more work, the xsimd implementation could be generic and also support AVX512, SVE, and future targets too.
I have no intuition how xsimd compares to autovec. Given the compiler also optimizes xsimd's code, I'd say slightly better, but again it's possible (and it has been the case) some things are not properly expressed in xsimd as well.
// bloom_filter_block_inc.h: only clang lowers that body to a single vptest;
// gcc and MSVC emit a longer horizontal vpor reduction.
bool FindHashBlockAvx2(const uint32_t* block, const uint32_t* salt, uint32_t key) {
  using batch = xsimd::batch<uint32_t, xsimd::avx2>;
Suggested change: replace `using batch = xsimd::batch<uint32_t, xsimd::avx2>;` with `using batch = xsimd::batch<uint32_t>;`.
This is all it takes to make this code run with all SIMD types.
bool FindHashBlockAvx2(const uint32_t* block, const uint32_t* salt, uint32_t key) {
  using batch = xsimd::batch<uint32_t, xsimd::avx2>;
  const batch mask = batch(uint32_t{1})
                     << ((batch(key) * batch::load_unaligned(salt)) >> 27);
Consider using `xsimd::bitwise_rshift<27>(...)` instead of `>> 27`.
In the former case, 27 is deduced as a compile-time constant, so the code path is potentially more performant.
Rationale for this change
`BlockSplitBloomFilter::FindHash` ships the scalar reference probe, an 8-iteration short-circuit loop. The short-circuit blocks autovectorization, and on miss-heavy workloads (Parquet row-group skipping) the per-lane branch-mispredict dominates probe latency.
Closes #50026. Dev list discussion: https://lists.apache.org/thread/omof0fq47tndfd80g5hwp2bvjmzvpb40. Sibling change in Rust: apache/arrow-rs#10011.
What changes are included in this PR?
- `FindHash` as a branchless OR-accumulator reduction. The new shape autovectorizes to SSE on x86 and NEON on aarch64 at the baseline.
- `bloom_filter_avx2.cc` (xsimd kernel built with `-mavx2`) behind `CpuInfo`-based `DynamicDispatch`, mirroring the existing `level_comparison_avx2` pattern. xsimd was a requirement from the dev thread; the AVX2 target spells the reduction explicitly because gcc/MSVC don't lower the autovec body to a single `vptest`.
Performance
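End-to-end `FindHash` via `parquet/benches/bloom_filter_benchmark.cc`.

M1 (Apple clang -O3, NEON via autovec, 10 reps, CV ≤ 0.4%):

| Bench | upstream/main (scalar) | simd-sbbf-autovec | Speedup |
|---|---|---|---|
| `BM_FindExistingHash` (hit-heavy) | 3.85 ns/probe (259.6 M/s) | 2.41 ns/probe (415.1 M/s) | 1.60x |
| `BM_FindNonExistingHash` (miss-heavy) | 9.04 ns/probe (110.6 M/s) | 2.41 ns/probe (415.4 M/s) | 3.75x |

x86-64 (gcc 13.3, -O2 -mavx2 via AVX2 dispatch TU, 5 reps, CV ≤ 0.6%):

| Bench | upstream/main (scalar) | simd-sbbf-autovec | Speedup |
|---|---|---|---|
| `BM_FindExistingHash` (hit-heavy) | 8.62 ns/probe (116.0 M/s) | 4.32 ns/probe (231.6 M/s) | 2.00x |
| `BM_FindNonExistingHash` (miss-heavy) | 15.29 ns/probe (65.4 M/s) | 4.33 ns/probe (230.8 M/s) | 3.53x |

The scalar miss path stalls on the data-dependent early-exit (slower than its own hit path on both archs); the branchless reduction is constant-time across hit/miss.
`InsertHash`, `BatchInsertHash`, `ComputeHash`, `BatchComputeHash` all unchanged (16 benches within ±0.6%, inside CV).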
Are these changes tested?
Yes. New `BloomFilterProbeKernel` test calls both dispatch targets directly across 20K random blocks + 200 production-populated blocks per CI run, asserting bit-identical output. `DynamicDispatch` resolves once at static init, so without this test the un-picked target would never be exercised in CI.
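The agreement check described above can be sketched in plain C++ as follows. This is an illustration under assumptions, not the `BloomFilterProbeKernel` test itself: the two kernels are scalar stand-ins for the dispatch targets, and `CountFailures`, the 16 keys per populated block, and the seed are all hypothetical choices. It calls both kernels directly through function pointers, first on blocks of random bits and then on blocks populated by inserting keys.

```cpp
#include <array>
#include <cstdint>
#include <random>

using Block = std::array<uint32_t, 8>;
using ProbeFn = bool (*)(const uint32_t* block, uint32_t key);

// Salt constants from the Parquet split-block Bloom filter specification.
constexpr uint32_t kSalt[8] = {0x47b6137bU, 0x44974d91U, 0x8824ad5bU, 0xa2b7289dU,
                               0x705495c7U, 0x2df1424bU, 0x9efc4947U, 0x5c6bfb31U};

inline uint32_t LaneMask(uint32_t key, int lane) {
  return uint32_t{1} << ((key * kSalt[lane]) >> 27);
}

// Stand-in kernel A: short-circuit reference probe.
bool ProbeReference(const uint32_t* block, uint32_t key) {
  for (int lane = 0; lane < 8; ++lane) {
    if ((block[lane] & LaneMask(key, lane)) == 0) return false;
  }
  return true;
}

// Stand-in kernel B: branchless OR-accumulator probe.
bool ProbeBranchless(const uint32_t* block, uint32_t key) {
  uint32_t missing = 0;
  for (int lane = 0; lane < 8; ++lane) missing |= ~block[lane] & LaneMask(key, lane);
  return missing == 0;
}

// Counts probes on which the kernels disagree, plus probes on which an
// inserted key is reported absent by either kernel (a false negative).
int CountFailures(ProbeFn a, ProbeFn b, int random_blocks, int populated_blocks,
                  uint32_t seed) {
  std::mt19937 rng(seed);
  auto next = [&rng]() { return static_cast<uint32_t>(rng()); };
  int failures = 0;

  // Random blocks: arbitrary bit patterns probed with arbitrary keys.
  for (int n = 0; n < random_blocks; ++n) {
    Block block;
    for (auto& lane : block) lane = next();
    const uint32_t key = next();
    if (a(block.data(), key) != b(block.data(), key)) ++failures;
  }

  // Populated blocks: insert keys, then probe the inserted keys and one other.
  for (int n = 0; n < populated_blocks; ++n) {
    Block block{};
    std::array<uint32_t, 16> keys;
    for (auto& key : keys) {
      key = next();
      for (int lane = 0; lane < 8; ++lane) block[lane] |= LaneMask(key, lane);
    }
    for (uint32_t key : keys) {
      if (!a(block.data(), key) || !b(block.data(), key)) ++failures;
    }
    const uint32_t other = next();
    if (a(block.data(), other) != b(block.data(), other)) ++failures;
  }
  return failures;
}
```

A call such as `CountFailures(&ProbeReference, &ProbeBranchless, 20000, 200, 42u)` should return 0 for two equivalent kernels. Passing the kernels explicitly is the design point: each one is exercised regardless of which target a runtime dispatcher would have picked on the host.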
Existing `BasicTest`, `FPPTest`, and `CompatibilityTest` continue to pass on both the scalar baseline and the AVX2 dispatch path.
Are there any user-facing changes?
No. Read-path implementation change only.