[SPARK-57036][SQL] Use intrinsic bulk-fill APIs for constant-value WritableColumnVector methods #56081
Closed
viirya wants to merge 2 commits into
…uet vectorized reader
VectorizedRleValuesReader materializes RLE runs of nulls and definition
levels with degenerate per-element loops:
    for (int k = 0; k < runLen; k++) {
      nulls.putNull(valueOff + k);
    }
    for (int k = 0; k < runLen; k++) {
      defLevels.putInt(levelIdx + k, runValue);
    }
WritableColumnVector already exposes the bulk equivalents
putNulls(rowId, count) and putInts(rowId, count, value), but their
implementations were also degenerate loops. This commit switches the
callers to the bulk APIs and reimplements the bulk APIs with intrinsics:
- OnHeapColumnVector.putNulls -> Arrays.fill(byte[], ..., (byte) 1)
- OnHeapColumnVector.putInts(rowId, count, value)
-> Arrays.fill(int[], ..., value)
- OffHeapColumnVector.putNulls -> Platform.setMemory(addr, (byte) 1, count)
with a small-count fallback to an inline byte loop
Arrays.fill is a JIT intrinsic and Unsafe.setMemory lowers to a native
memset; both are faster than the unrolled-by-JIT loops they replace once
runLen grows beyond a handful of elements.
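As a rough illustration of the on-heap substitution described above, the sketch below contrasts a per-element loop with Arrays.fill on plain arrays. The class name OnHeapFillSketch and its static helpers are hypothetical stand-ins that operate on bare arrays; they are not the actual OnHeapColumnVector code.

```java
import java.util.Arrays;

// Hypothetical stand-in for the on-heap fill paths; not Spark's OnHeapColumnVector.
public class OnHeapFillSketch {

    // Per-element form: one store per iteration (1 marks a null slot).
    public static void putNullsLoop(byte[] nulls, int rowId, int count) {
        for (int i = 0; i < count; i++) {
            nulls[rowId + i] = (byte) 1;
        }
    }

    // Bulk form: Arrays.fill takes a half-open range [fromIndex, toIndex).
    public static void putNulls(byte[] nulls, int rowId, int count) {
        Arrays.fill(nulls, rowId, rowId + count, (byte) 1);
    }

    // Same idea for a constant int value.
    public static void putInts(int[] data, int rowId, int count, int value) {
        Arrays.fill(data, rowId, rowId + count, value);
    }

    public static void main(String[] args) {
        byte[] viaLoop = new byte[8];
        byte[] viaFill = new byte[8];
        putNullsLoop(viaLoop, 2, 4);
        putNulls(viaFill, 2, 4);
        // Both paths should mark exactly slots 2..5 as null.
        System.out.println(Arrays.equals(viaLoop, viaFill));

        int[] levels = new int[8];
        putInts(levels, 1, 3, 7);
        System.out.println(Arrays.toString(levels));
    }
}
```

The only detail that is easy to get wrong is the range convention: the bulk APIs in the description take (rowId, count), while Arrays.fill takes (fromIndex, toIndex), so the end index is rowId + count.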
For OffHeap.putNulls, Unsafe.setMemory has a non-trivial JNI fixed cost,
so it loses to the inline byte loop for very short fills. A threshold
of 128 elements is used to pick between the two paths — this avoids a
regression at small counts (where the inline loop is faster) while
retaining the asymptotic gain (~10x at count >= 4096 in OffHeap fills).
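The threshold dispatch described above might look roughly like the following sketch. It calls sun.misc.Unsafe directly (an unsupported JDK API, obtained here via reflection) in place of Spark's Platform wrapper, and the class name OffHeapFillSketch and helper fillAndRead are hypothetical. Only the threshold value of 128 comes from the description; the rest is an assumption about shape, not the actual OffHeapColumnVector code.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Hypothetical sketch of a threshold-dispatched off-heap byte fill.
public class OffHeapFillSketch {
    static final Unsafe UNSAFE = loadUnsafe();
    // Value taken from the description above; the dispatch shape is assumed.
    static final int SET_MEMORY_THRESHOLD = 128;

    static Unsafe loadUnsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("sun.misc.Unsafe unavailable", e);
        }
    }

    // Short fills use an inline loop; longer fills use setMemory.
    // Note: Unsafe.setMemory takes (address, bytes, value).
    public static void fillBytes(long addr, int count, byte value) {
        if (count < SET_MEMORY_THRESHOLD) {
            for (int i = 0; i < count; i++) {
                UNSAFE.putByte(addr + i, value);
            }
        } else {
            UNSAFE.setMemory(addr, count, value);
        }
    }

    // Allocates a zeroed off-heap buffer, fills [offset, offset + count),
    // copies the buffer to a heap array for inspection, then frees it.
    public static byte[] fillAndRead(int capacity, int offset, int count, byte value) {
        long base = UNSAFE.allocateMemory(capacity);
        try {
            UNSAFE.setMemory(base, capacity, (byte) 0);
            fillBytes(base + offset, count, value);
            byte[] out = new byte[capacity];
            for (int i = 0; i < capacity; i++) {
                out[i] = UNSAFE.getByte(base + i);
            }
            return out;
        } finally {
            UNSAFE.freeMemory(base);
        }
    }

    public static void main(String[] args) {
        byte[] small = fillAndRead(512, 3, 5, (byte) 1);   // loop path
        byte[] large = fillAndRead(512, 3, 300, (byte) 1); // setMemory path
        System.out.println(small[3] + " " + small[8]);
        System.out.println(large[3] + " " + large[303]);
    }
}
```

Both paths write the same bytes, so the threshold is purely a performance knob; where the crossover sits depends on the JVM and hardware, which is why the description ties 128 to measurements rather than to a fixed rule.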
Co-authored-by: Claude Code
…itableColumnVector methods
SPARK-57024 fixed the degenerate per-element loops in three bulk-fill
methods (OnHeap.putNulls, OnHeap.putInts(rowId, count, value), and
OffHeap.putNulls). The same pattern still exists in six sibling
methods. This change replaces them with the same intrinsic
substitutions used in SPARK-57024:
OnHeapColumnVector:
putBooleans -> Arrays.fill(byte[], ..., (byte) v)
putBytes -> Arrays.fill(byte[], ...)
putShorts -> Arrays.fill(short[], ...)
putLongs -> Arrays.fill(long[], ...)
OffHeapColumnVector:
putBooleans -> Platform.setMemory with SET_MEMORY_THRESHOLD fallback
putBytes -> Platform.setMemory with SET_MEMORY_THRESHOLD fallback
The OffHeap methods reuse the same threshold introduced for
OffHeap.putNulls in SPARK-57024: below 128 elements, an inline byte
loop avoids the JNI fixed cost of Unsafe.setMemory; at or above 128,
setMemory dominates the loop and gains accelerate up to ~10x at
count=4096+.
OffHeap multi-byte fills (putShorts / putInts / putLongs / putFloats /
putDoubles) are out of scope: Platform.setMemory is byte-only and the
value=0 short-circuit alternative was prototyped under SPARK-57024 and
showed no measurable gain.
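The byte-only limitation can be seen on a plain heap array. A memset-style fill writes the same byte into every position, so it can reproduce a multi-byte value only when all of that value's bytes are equal, which is why value=0 is the natural short-circuit candidate. The sketch below is illustrative only; ByteOnlyFillSketch and its methods are hypothetical names, not part of Spark.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical illustration of why a byte fill cannot write arbitrary ints.
public class ByteOnlyFillSketch {

    // True when all four bytes of `value` are identical, i.e. a byte-wise
    // fill with that byte would reproduce the int exactly.
    public static boolean isByteFillable(int value) {
        int low = value & 0xFF;
        return value == low * 0x01010101;
    }

    // Fills a 4-byte region with one byte and reads it back as an int.
    // Byte order is irrelevant here because every byte is the same.
    public static int intFromByteFill(byte b) {
        byte[] raw = new byte[4];
        Arrays.fill(raw, b);
        return ByteBuffer.wrap(raw).getInt();
    }

    public static void main(String[] args) {
        // Filling with byte 1 yields 0x01010101, not the int value 1.
        System.out.println(Integer.toHexString(intFromByteFill((byte) 1)));
        System.out.println(isByteFillable(0));
        System.out.println(isByteFillable(1));
    }
}
```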
Co-authored-by: Claude Code
What changes were proposed in this pull request?
Follow-up to #56072 (SPARK-57024). That PR fixed the degenerate
per-element loops in three bulk-fill methods (OnHeap.putNulls,
OnHeap.putInts(rowId, count, value), OffHeap.putNulls). The same
pattern exists in six sibling methods; this PR applies the same
intrinsic substitutions:

- OnHeapColumnVector.putBooleans(rowId, count, value) -> Arrays.fill(byte[], ..., (byte) v)
- OnHeapColumnVector.putBytes(rowId, count, value) -> Arrays.fill(byte[], ...)
- OnHeapColumnVector.putShorts(rowId, count, value) -> Arrays.fill(short[], ...)
- OnHeapColumnVector.putLongs(rowId, count, value) -> Arrays.fill(long[], ...)
- OffHeapColumnVector.putBooleans(rowId, count, value) -> Platform.setMemory with SET_MEMORY_THRESHOLD fallback
- OffHeapColumnVector.putBytes(rowId, count, value) -> Platform.setMemory with SET_MEMORY_THRESHOLD fallback

The two OffHeap methods reuse the SET_MEMORY_THRESHOLD = 128 constant
introduced in #56072 for OffHeap.putNulls. Below the threshold, an
inline byte loop avoids the JNI fixed cost of Unsafe.setMemory; at or
above, setMemory dominates and the gain accelerates up to ~10x at
count >= 4096.

This PR is based on top of #56072 since the threshold constant is
defined there. If #56072 lands first, this PR rebases cleanly onto
master.
Why are the changes needed?
The bulk-fill APIs on WritableColumnVector are the natural call to
make from any column writer, but their implementations were per-element
loops. Switching to intrinsics:

- Arrays.fill is backed by HotSpot's _jbyte_fill / _jshort_fill /
  _jlong_fill intrinsic stubs; on byte/short arrays C2 can usually
  auto-vectorize the original loop and gains are modest, but for
  long[] and at small counts the intrinsic is meaningfully faster.
- Unsafe.setMemory lowers to a native memset. For OffHeap byte fills
  this is dramatic at large counts because the original per-byte
  Platform.putByte loop cannot be vectorized through the JNI call.

Measured on Apple M4 Max + OpenJDK 21.0.8, using a new
WritableColumnVectorBulkFillBenchmark (added in a separate change,
not part of this PR), Rate (M elements/s):
OffHeap byte fills (putBytes / putBooleans), threshold path:
OnHeap byte fills:
OnHeap longs: +1-14% in the small/medium range, saturated by
memory bandwidth at large counts. Included for consistency with the
byte methods.
OffHeap multi-byte fills (putShorts / putInts / putLongs / putFloats /
putDoubles) are out of scope: Platform.setMemory is byte-only and the
value=0 short-circuit alternative was prototyped under SPARK-57024 and
showed no measurable gain.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing tests; no behavior change. Ran locally on top of #56072:
- VectorizedRleValuesReaderSuite
- ColumnVectorSuite
- ColumnarBatchSuite
- ParquetIOSuite

237 tests, all pass.
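One way to spot-check the "no behavior change" claim outside the Spark suites is to compare the loop form and the bulk form of a fill over a range of counts. The sketch below does this for on-heap boolean and long fills on plain arrays; FillEquivalenceSketch and its helpers are hypothetical and are not drawn from the test suites listed above.

```java
import java.util.Arrays;

// Hypothetical equivalence check between per-element and bulk fills.
public class FillEquivalenceSketch {

    public static void putBooleansLoop(byte[] data, int rowId, int count, boolean value) {
        byte v = (byte) (value ? 1 : 0);
        for (int i = 0; i < count; i++) {
            data[rowId + i] = v;
        }
    }

    public static void putBooleansBulk(byte[] data, int rowId, int count, boolean value) {
        Arrays.fill(data, rowId, rowId + count, (byte) (value ? 1 : 0));
    }

    public static void putLongsLoop(long[] data, int rowId, int count, long value) {
        for (int i = 0; i < count; i++) {
            data[rowId + i] = value;
        }
    }

    public static void putLongsBulk(long[] data, int rowId, int count, long value) {
        Arrays.fill(data, rowId, rowId + count, value);
    }

    // Runs both forms on fresh arrays and reports whether the contents match.
    public static boolean sameResult(int rowId, int count) {
        int capacity = rowId + count + 1; // one trailing slot must stay untouched
        byte[] b1 = new byte[capacity];
        byte[] b2 = new byte[capacity];
        putBooleansLoop(b1, rowId, count, true);
        putBooleansBulk(b2, rowId, count, true);
        long[] l1 = new long[capacity];
        long[] l2 = new long[capacity];
        putLongsLoop(l1, rowId, count, 42L);
        putLongsBulk(l2, rowId, count, 42L);
        return Arrays.equals(b1, b2) && Arrays.equals(l1, l2);
    }

    public static void main(String[] args) {
        int[] counts = {0, 1, 127, 128, 129, 4096};
        for (int count : counts) {
            System.out.println(count + " -> " + sameResult(3, count));
        }
    }
}
```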
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Claude Opus 4.7)