[SPARK-57024][SQL] Use bulk fill APIs to materialize RLE runs in Parquet vectorized reader#56072
Open
viirya wants to merge 1 commit into
…uet vectorized reader
VectorizedRleValuesReader materializes RLE runs of nulls and definition
levels with degenerate per-element loops:
    for (int k = 0; k < runLen; k++) {
      nulls.putNull(valueOff + k);
    }
    for (int k = 0; k < runLen; k++) {
      defLevels.putInt(levelIdx + k, runValue);
    }
WritableColumnVector already exposes the bulk equivalents
putNulls(rowId, count) and putInts(rowId, count, value), but their
implementations were also degenerate loops. This commit switches the
callers to the bulk APIs and reimplements the bulk APIs with intrinsics:
- OnHeapColumnVector.putNulls -> Arrays.fill(byte[], ..., (byte) 1)
- OnHeapColumnVector.putInts(rowId, count, value)
-> Arrays.fill(int[], ..., value)
- OffHeapColumnVector.putNulls -> Platform.setMemory(addr, (byte) 1, count)
with a small-count fallback to an inline byte loop
Arrays.fill is a JIT intrinsic and Unsafe.setMemory lowers to a native
memset; both are faster than the unrolled-by-JIT loops they replace once
runLen grows beyond a handful of elements.
For OffHeap.putNulls, Unsafe.setMemory has a non-trivial JNI fixed cost,
so it loses to the inline byte loop for very short fills. A threshold
of 128 elements is used to pick between the two paths — this avoids a
regression at small counts (where the inline loop is faster) while
retaining the asymptotic gain (~10x at count >= 4096 in OffHeap fills).
Co-authored-by: Claude Code
What changes were proposed in this pull request?
VectorizedRleValuesReader materializes RLE runs of nulls and definition levels with degenerate per-element loops, namely the putNull and putInt loops quoted in the commit message above.
WritableColumnVector already exposes the bulk equivalents putNulls(rowId, count) and putInts(rowId, count, value). This PR switches the three caller sites to the bulk APIs, and reimplements the bulk APIs themselves (which were also degenerate loops) using JIT intrinsics:
- OnHeapColumnVector.putNulls -> Arrays.fill(byte[], ..., (byte) 1)
- OnHeapColumnVector.putInts(rowId, count, value) -> Arrays.fill(int[], ..., value)
- OffHeapColumnVector.putNulls -> Platform.setMemory(addr, (byte) 1, count) with a small-count fallback to an inline byte loop
Arrays.fill is backed by HotSpot's _jbyte_fill/_jint_fill intrinsic stubs and Unsafe.setMemory lowers to a native memset; both are faster than the byte/int loops they replace once runLen grows beyond a handful of elements.
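As a rough illustration of the on-heap half of this change, the Java sketch below contrasts a per-element loop with the Arrays.fill form on a toy vector. TinyOnHeapVector, its fields, and the putNullsLoop/putIntsLoop names are hypothetical stand-ins and not Spark's OnHeapColumnVector; only the two Arrays.fill overloads are real JDK APIs. It shows that both forms write the same bytes and ints, and makes no claim about speed.

```java
import java.util.Arrays;

/** Hypothetical toy vector for illustration; NOT Spark's OnHeapColumnVector. */
final class TinyOnHeapVector {
    final byte[] nulls;   // assumption: 1 marks a null slot, 0 a non-null slot
    final int[] intData;  // assumption: backing storage for int values / def levels

    TinyOnHeapVector(int capacity) {
        nulls = new byte[capacity];
        intData = new int[capacity];
    }

    /** Per-element form, the shape of loop the description calls degenerate. */
    void putNullsLoop(int rowId, int count) {
        for (int k = 0; k < count; k++) {
            nulls[rowId + k] = (byte) 1;
        }
    }

    /** Bulk form: one range fill over [rowId, rowId + count). */
    void putNulls(int rowId, int count) {
        Arrays.fill(nulls, rowId, rowId + count, (byte) 1);
    }

    /** Per-element form for a repeated int value. */
    void putIntsLoop(int rowId, int count, int value) {
        for (int k = 0; k < count; k++) {
            intData[rowId + k] = value;
        }
    }

    /** Bulk form: one range fill with the repeated value. */
    void putInts(int rowId, int count, int value) {
        Arrays.fill(intData, rowId, rowId + count, value);
    }
}
```

Note that Arrays.fill takes an exclusive end index, so the count-based signature has to be translated to rowId + count.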
For OffHeap.putNulls, Unsafe.setMemory has a non-trivial JNI fixed cost, so it loses to the inline byte loop for very short fills (which are common in random null patterns). A threshold of 128 is used to pick between the two paths.
Why are the changes needed?
The bulk-fill APIs on WritableColumnVector were the obviously-correct calls to make in VectorizedRleValuesReader, but their implementations were not actually bulk: both the callers and the implementations have been small per-element loops.
Caller-side (Parquet RLE materialization)
Measured on Apple M4 Max + OpenJDK 21.0.8 using VectorizedRleValuesReaderBenchmark (Group C, "Nullable batch decode with def-level materialization", 1M rows, BATCH_SIZE=4096), ns/row:
Gains concentrate on clustered null patterns (long RLE runs), which are
common in real workloads — sparse dimension columns, ETL-staged nulls,
time-bucketed missing metrics. Random null patterns produce short runs
where the bulk-API call cost matches the original loop, hence the
no-op-to-noise band there.
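To make the run-length argument concrete, here is a small Java sketch that counts runs in a null mask. NullRunStats and its helpers are hypothetical names introduced only for this illustration, and the strictly alternating mask is an assumed worst case standing in for a random pattern. The average run length is the number of elements a single bulk call would cover.

```java
/** Hypothetical helper for illustration; not part of Spark or this PR. */
final class NullRunStats {
    /** Counts maximal runs of equal values in the mask. */
    static int countRuns(boolean[] isNull) {
        if (isNull.length == 0) {
            return 0;
        }
        int runs = 1;
        for (int i = 1; i < isNull.length; i++) {
            if (isNull[i] != isNull[i - 1]) {
                runs++;
            }
        }
        return runs;
    }

    /** Average number of elements per run, i.e. per bulk fill call. */
    static double averageRunLength(boolean[] isNull) {
        int runs = countRuns(isNull);
        return runs == 0 ? 0.0 : (double) isNull.length / runs;
    }

    /** Clustered pattern: blocks of `cluster` nulls alternate with blocks of non-nulls. */
    static boolean[] clustered(int rows, int cluster) {
        boolean[] mask = new boolean[rows];
        for (int i = 0; i < rows; i++) {
            mask[i] = (i / cluster) % 2 == 0;
        }
        return mask;
    }

    /** Assumed worst case: every element differs from its neighbour. */
    static boolean[] alternating(int rows) {
        boolean[] mask = new boolean[rows];
        for (int i = 0; i < rows; i++) {
            mask[i] = i % 2 == 0;
        }
        return mask;
    }
}
```

With 8192 rows, clustered(8192, 4096) has 2 runs averaging 4096 elements each, while alternating(8192) has 8192 runs of length 1, so in the second case each bulk call degenerates to a single-element write.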
Implementation-side (OffHeap putNulls)
A separate micro-benchmark of OffHeapColumnVector.putNulls (run via WritableColumnVectorBulkFillBenchmark, not included in this PR) shows the threshold matters: a naive unconditional Platform.setMemory regresses small-count fills (count <= 64) by up to 7x against the original byte loop due to JNI fixed cost, while the count=4096+ path gains ~10x. The 128-element threshold picks the right path for both regimes; the crossover on the benchmarked hardware sits between 64 and 512, so 128 is conservative.
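The threshold dispatch can be sketched in Java roughly as follows. This is a minimal illustration under stated assumptions, not the PR's code: ThresholdFill and its members are hypothetical names, an on-heap byte[] with Arrays.fill stands in for off-heap memory and Platform.setMemory so that the example stays self-contained, and whether the real cutoff comparison is strict or inclusive at exactly 128 is not stated in the description.

```java
import java.util.Arrays;

/** Hypothetical sketch of size-threshold dispatch; not Spark's OffHeapColumnVector. */
final class ThresholdFill {
    // Assumption: mirrors the 128-element cutoff quoted in the description.
    static final int BULK_THRESHOLD = 128;

    /** True when a fill of `count` elements would take the bulk path in this sketch. */
    static boolean usesBulkPath(int count) {
        return count >= BULK_THRESHOLD; // assumption: inclusive at the cutoff
    }

    /** Writes (byte) 1 into target[offset, offset + count). */
    static void fillOnes(byte[] target, int offset, int count) {
        if (!usesBulkPath(count)) {
            // Short fill: inline loop, avoiding the fixed cost of the bulk call.
            for (int k = 0; k < count; k++) {
                target[offset + k] = (byte) 1;
            }
        } else {
            // Long fill: single bulk call (stand-in for a native memset).
            Arrays.fill(target, offset, offset + count, (byte) 1);
        }
    }
}
```

Both branches must produce identical contents; only the cost profile differs, which is why the choice of cutoff is a tuning decision rather than a correctness one.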
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing tests; no behavior change. Ran locally:
- VectorizedRleValuesReaderSuite (covers the modified caller paths)
- ColumnVectorSuite and ColumnarBatchSuite (cover the modified OnHeap/OffHeapColumnVector.putNulls/putInts bulk APIs)
- ParquetIOSuite (end-to-end vectorized reader coverage)
237 tests, all pass.
Benchmark numbers above produced by the existing VectorizedRleValuesReaderBenchmark (no benchmark changes in this PR).
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Claude Opus 4.7)