[SPARK-57036][SQL] Use intrinsic bulk-fill APIs for constant-value WritableColumnVector methods by viirya · Pull Request #56081 · apache/spark

viirya · 2026-05-24T00:15:23Z

What changes were proposed in this pull request?

Follow-up to #56072 (SPARK-57024). That PR fixed the degenerate
per-element loops in three bulk-fill methods (OnHeap.putNulls,
OnHeap.putInts(rowId, count, value), OffHeap.putNulls). The same
pattern exists in six sibling methods; this PR applies the same
intrinsic substitutions:

| Method | Substitution |
| --- | --- |
| `OnHeapColumnVector.putBooleans(rowId, count, value)` | `Arrays.fill(byte[], ..., (byte) v)` |
| `OnHeapColumnVector.putBytes(rowId, count, value)` | `Arrays.fill(byte[], ...)` |
| `OnHeapColumnVector.putShorts(rowId, count, value)` | `Arrays.fill(short[], ...)` |
| `OnHeapColumnVector.putLongs(rowId, count, value)` | `Arrays.fill(long[], ...)` |
| `OffHeapColumnVector.putBooleans(rowId, count, value)` | `Platform.setMemory` with `SET_MEMORY_THRESHOLD` fallback |
| `OffHeapColumnVector.putBytes(rowId, count, value)` | `Platform.setMemory` with `SET_MEMORY_THRESHOLD` fallback |
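
As a rough illustration of the OnHeap substitutions, here is a minimal sketch. `OnHeapFillSketch` and its fields are hypothetical stand-ins, not Spark's actual `OnHeapColumnVector`, and the one-byte-per-boolean layout (1 = true) is an assumption made for the example.

```java
import java.util.Arrays;

// Hypothetical stand-in for an on-heap column vector; not Spark's actual class.
final class OnHeapFillSketch {
    private final byte[] byteData;
    private final long[] longData;

    OnHeapFillSketch(int capacity) {
        byteData = new byte[capacity];
        longData = new long[capacity];
    }

    // Before: one array store per element.
    void putLongsLoop(int rowId, int count, long value) {
        for (int i = 0; i < count; i++) {
            longData[rowId + i] = value;
        }
    }

    // After: Arrays.fill over the half-open range [rowId, rowId + count).
    void putLongs(int rowId, int count, long value) {
        Arrays.fill(longData, rowId, rowId + count, value);
    }

    // Assumed layout: one byte per boolean, 1 for true and 0 for false.
    void putBooleans(int rowId, int count, boolean value) {
        Arrays.fill(byteData, rowId, rowId + count, (byte) (value ? 1 : 0));
    }

    long getLong(int rowId) { return longData[rowId]; }

    boolean getBoolean(int rowId) { return byteData[rowId] == 1; }
}
```

The range overload of `Arrays.fill` takes an exclusive end index, so the `(rowId, count)` pair maps to `(rowId, rowId + count)`.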

The two OffHeap methods reuse the SET_MEMORY_THRESHOLD = 128 constant
introduced in #56072 for OffHeap.putNulls. Below the threshold, an
inline byte loop avoids the JNI fixed cost of Unsafe.setMemory; at or
above it, setMemory wins and the speedup grows with count, from ~3x at
512 to ~7x at 4,096 and ~10x at 65,536.
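
The threshold dispatch can be sketched as below. This is not the PR's code: it calls `sun.misc.Unsafe` directly (obtained reflectively) in place of Spark's `Platform` wrapper, and `OffHeapFillSketch`, `fillBytes`, `allocate`, `free`, and `read` are hypothetical names. Only the threshold value 128 comes from the description above.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Hypothetical sketch of the small-count fallback; not Spark's OffHeapColumnVector.
final class OffHeapFillSketch {
    // Value taken from the PR description.
    static final int SET_MEMORY_THRESHOLD = 128;

    private static final Unsafe UNSAFE = loadUnsafe();

    private static Unsafe loadUnsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("sun.misc.Unsafe is unavailable", e);
        }
    }

    // Fills `count` bytes starting at `address` with `value`.
    static void fillBytes(long address, int count, byte value) {
        if (count < SET_MEMORY_THRESHOLD) {
            // Short fills: per-byte stores avoid the fixed cost of the bulk call.
            for (int i = 0; i < count; i++) {
                UNSAFE.putByte(address + i, value);
            }
        } else {
            // Long fills: one bulk call covers the whole range.
            UNSAFE.setMemory(address, count, value);
        }
    }

    static long allocate(long bytes) { return UNSAFE.allocateMemory(bytes); }

    static void free(long address) { UNSAFE.freeMemory(address); }

    static byte read(long address) { return UNSAFE.getByte(address); }
}
```

Both branches write the same bytes, so the threshold only affects speed, never the result.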

This PR is based on top of #56072 since the threshold constant is
defined there. If #56072 lands first, this PR rebases cleanly onto
master.

Why are the changes needed?

The bulk-fill APIs on WritableColumnVector are the natural call to
make from any column writer, but their implementations were per-element
loops. Switching to intrinsics:

- Arrays.fill is backed by HotSpot's _jbyte_fill / _jshort_fill /
  _jlong_fill intrinsic stubs; on byte/short arrays C2 can usually
  auto-vectorize the original loop and gains are modest, but for
  long[] and at small counts the intrinsic is meaningfully faster.
- Unsafe.setMemory lowers to a native memset. For OffHeap byte fills
  this is dramatic at large counts because the original per-byte
  Platform.putByte loop cannot be vectorized through the JNI call.

Measured on Apple M4 Max + OpenJDK 21.0.8, using a new
WritableColumnVectorBulkFillBenchmark (added in a separate change,
not part of this PR). Rates are in millions of elements per second:

OffHeap byte fills (putBytes / putBooleans), threshold path:

| count | baseline | patched | delta |
| --- | --- | --- | --- |
| 8 | ~1,900 | ~1,840 | parity (small-count fallback) |
| 64 | ~3,800 | ~3,760 | parity |
| 512 | ~4,150 | ~13,100 | +3.2x |
| 4,096 | ~4,340 | ~31,900 | +7.4x |
| 65,536 | ~4,275 | ~43,700 | +10.2x |

OnHeap byte fills:

| count | baseline | patched | delta |
| --- | --- | --- | --- |
| 8 | ~2,620 | ~3,230 | +23% |
| 64 | ~19,000 | ~25,400 | +33% |
| 512 | ~68,800 | ~86,200 | +25% |
| 4,096 | ~128,400 | ~133,300 | +4% |
| 65,536 | ~143,200 | ~143,600 | saturated (byte memory bandwidth) |
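
For readers who want to check the delta columns, the sketch below recomputes them from the rounded rates in the two tables. It assumes the OffHeap deltas are the ratio patched / baseline and the OnHeap deltas are the percentage change; `FillSpeedup` is a hypothetical helper, not part of the benchmark. Because the inputs are rounded, the results can differ from the table in the last digit.

```java
// Hypothetical helper for re-deriving the delta columns from the rounded rates.
final class FillSpeedup {
    // Ratio of patched to baseline throughput; above 1.0 means the patch is faster.
    static double speedup(double baseline, double patched) {
        return patched / baseline;
    }

    // Percentage change relative to baseline.
    static double percentGain(double baseline, double patched) {
        return (patched / baseline - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        int[] counts = {8, 64, 512, 4096, 65536};
        // Rounded rates (M elements/s) copied from the tables above.
        double[] offHeapBase = {1900, 3800, 4150, 4340, 4275};
        double[] offHeapPatched = {1840, 3760, 13100, 31900, 43700};
        double[] onHeapBase = {2620, 19000, 68800, 128400, 143200};
        double[] onHeapPatched = {3230, 25400, 86200, 133300, 143600};
        for (int i = 0; i < counts.length; i++) {
            System.out.printf("count=%d offHeap=%.2fx onHeap=%+.1f%%%n",
                counts[i],
                speedup(offHeapBase[i], offHeapPatched[i]),
                percentGain(onHeapBase[i], onHeapPatched[i]));
        }
    }
}
```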

OnHeap longs: +1-14% in the small/medium range, saturated by
memory bandwidth at large counts. Included for consistency with the
byte methods.

OffHeap multi-byte fills (putShorts / putInts / putLongs / putFloats /
putDoubles) are out of scope: Platform.setMemory is byte-only and the
value=0 short-circuit alternative was prototyped under SPARK-57024 and
showed no measurable gain.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests; no behavior change. Ran locally on top of #56072:

- VectorizedRleValuesReaderSuite
- ColumnVectorSuite
- ColumnarBatchSuite
- ParquetIOSuite

237 tests, all pass.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.7)

…uet vectorized reader

VectorizedRleValuesReader materializes RLE runs of nulls and definition levels with degenerate per-element loops:

    for (int k = 0; k < runLen; k++) { nulls.putNull(valueOff + k); }
    for (int k = 0; k < runLen; k++) { defLevels.putInt(levelIdx + k, runValue); }

WritableColumnVector already exposes the bulk equivalents putNulls(rowId, count) and putInts(rowId, count, value), but their implementations were also degenerate loops. This commit switches the callers to the bulk APIs and reimplements the bulk APIs with intrinsics:

- OnHeapColumnVector.putNulls -> Arrays.fill(byte[], ..., (byte) 1)
- OnHeapColumnVector.putInts(rowId, count, value) -> Arrays.fill(int[], ..., value)
- OffHeapColumnVector.putNulls -> Platform.setMemory(addr, (byte) 1, count) with a small-count fallback to an inline byte loop

Arrays.fill is a JIT intrinsic and Unsafe.setMemory lowers to a native memset; both are faster than the unrolled-by-JIT loops they replace once runLen grows beyond a handful of elements.

For OffHeap.putNulls, Unsafe.setMemory has a non-trivial JNI fixed cost, so it loses to the inline byte loop for very short fills. A threshold of 128 elements is used to pick between the two paths; this avoids a regression at small counts (where the inline loop is faster) while retaining the asymptotic gain (~10x at count >= 4096 in OffHeap fills).

Co-authored-by: Claude Code
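
The caller-side change in the commit message above can be illustrated with a minimal sketch. `RleRunSketch` and `NullFlags` are hypothetical stand-ins for the reader and the column vector; only the `putNull` / `putNulls` method names and the loop shape come from the commit message.

```java
import java.util.Arrays;

// Hypothetical sketch of materializing an RLE run of nulls; not Spark's reader.
final class RleRunSketch {
    // Stand-in null-flag vector: 1 marks a null slot, 0 a non-null slot.
    static final class NullFlags {
        final byte[] nulls;

        NullFlags(int capacity) { nulls = new byte[capacity]; }

        void putNull(int rowId) { nulls[rowId] = (byte) 1; }

        // Bulk variant backed by Arrays.fill over [rowId, rowId + count).
        void putNulls(int rowId, int count) {
            Arrays.fill(nulls, rowId, rowId + count, (byte) 1);
        }
    }

    // Before: one call per element of the run.
    static void materializePerElement(NullFlags v, int valueOff, int runLen) {
        for (int k = 0; k < runLen; k++) {
            v.putNull(valueOff + k);
        }
    }

    // After: a single bulk call for the whole run.
    static void materializeBulk(NullFlags v, int valueOff, int runLen) {
        v.putNulls(valueOff, runLen);
    }
}
```

Both methods leave the vector in the same state; the bulk form moves the loop into the vector, where a fill intrinsic can replace it.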

…itableColumnVector methods

SPARK-57024 fixed the degenerate per-element loops in three bulk-fill methods (OnHeap.putNulls, OnHeap.putInts(rowId, count, value), and OffHeap.putNulls). The same pattern still exists in six sibling methods. This change replaces them with the same intrinsic substitutions used in SPARK-57024:

OnHeapColumnVector:
- putBooleans -> Arrays.fill(byte[], ..., (byte) v)
- putBytes -> Arrays.fill(byte[], ...)
- putShorts -> Arrays.fill(short[], ...)
- putLongs -> Arrays.fill(long[], ...)

OffHeapColumnVector:
- putBooleans -> Platform.setMemory with SET_MEMORY_THRESHOLD fallback
- putBytes -> Platform.setMemory with SET_MEMORY_THRESHOLD fallback

The OffHeap methods reuse the same threshold introduced for OffHeap.putNulls in SPARK-57024: below 128 elements, an inline byte loop avoids the JNI fixed cost of Unsafe.setMemory; at or above 128, setMemory dominates the loop and gains accelerate up to ~10x at count=4096+.

OffHeap multi-byte fills (putShorts / putInts / putLongs / putFloats / putDoubles) are out of scope: Platform.setMemory is byte-only and the value=0 short-circuit alternative was prototyped under SPARK-57024 and showed no measurable gain.

Co-authored-by: Claude Code

viirya added 2 commits May 23, 2026 16:51

viirya closed this May 24, 2026

viirya deleted the SPARK-57036 branch May 24, 2026 00:16

viirya wants to merge 2 commits into apache:master from viirya:SPARK-57036
