[SPARK-57036][SQL] Use intrinsic bulk-fill APIs for constant-value WritableColumnVector methods #56081

Closed: viirya wants to merge 2 commits into apache:master from viirya:SPARK-57036

@viirya viirya commented May 24, 2026

What changes were proposed in this pull request?

Follow-up to #56072 (SPARK-57024). That PR fixed the degenerate
per-element loops in three bulk-fill methods (OnHeap.putNulls,
OnHeap.putInts(rowId, count, value), OffHeap.putNulls). The same
pattern exists in six sibling methods; this PR applies the same
intrinsic substitutions:

| Method | Substitution |
|---|---|
| OnHeapColumnVector.putBooleans(rowId, count, value) | Arrays.fill(byte[], ..., (byte) v) |
| OnHeapColumnVector.putBytes(rowId, count, value) | Arrays.fill(byte[], ...) |
| OnHeapColumnVector.putShorts(rowId, count, value) | Arrays.fill(short[], ...) |
| OnHeapColumnVector.putLongs(rowId, count, value) | Arrays.fill(long[], ...) |
| OffHeapColumnVector.putBooleans(rowId, count, value) | Platform.setMemory with SET_MEMORY_THRESHOLD fallback |
| OffHeapColumnVector.putBytes(rowId, count, value) | Platform.setMemory with SET_MEMORY_THRESHOLD fallback |
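
As a rough illustration of the on-heap rows, the sketch below contrasts the per-element loop with the range overload of Arrays.fill. OnHeapFillSketch and its fields are hypothetical stand-ins, not Spark's OnHeapColumnVector; only the (rowId, count, value) shape is taken from the table.

```java
import java.util.Arrays;

// Hypothetical stand-in for an on-heap vector's backing arrays; not Spark's class.
final class OnHeapFillSketch {
    final byte[] byteData;
    final long[] longData;

    OnHeapFillSketch(int capacity) {
        byteData = new byte[capacity];
        longData = new long[capacity];
    }

    // Original shape: one store per element.
    void putBooleansLoop(int rowId, int count, boolean value) {
        byte v = (byte) (value ? 1 : 0);
        for (int i = 0; i < count; i++) {
            byteData[rowId + i] = v;
        }
    }

    // Bulk shape: fill the half-open range [rowId, rowId + count).
    void putBooleans(int rowId, int count, boolean value) {
        Arrays.fill(byteData, rowId, rowId + count, (byte) (value ? 1 : 0));
    }

    // Same substitution for a wider element type.
    void putLongs(int rowId, int count, long value) {
        Arrays.fill(longData, rowId, rowId + count, value);
    }
}
```

One edge to keep in mind with a substitution like this: the range overload of Arrays.fill throws IllegalArgumentException when fromIndex > toIndex, so a negative count fails loudly where the loop would write nothing.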

The two OffHeap methods reuse the SET_MEMORY_THRESHOLD = 128 constant
introduced in #56072 for OffHeap.putNulls. Below the threshold, an
inline byte loop avoids the JNI fixed cost of Unsafe.setMemory; at or
above it, setMemory wins, and the gain grows with count: ~3x at 512,
~7x at 4,096, and ~10x at 65,536.
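
The threshold dispatch can be sketched as follows. This is a minimal illustration under stated assumptions: ThresholdFillSketch is hypothetical, a byte[] stands in for the off-heap region, and the private putByte / setMemory helpers only mimic the roles of Platform.putByte and Platform.setMemory rather than calling them. The bulkCalls counter exists solely to make the chosen path observable.

```java
import java.util.Arrays;

// Hypothetical stand-in for off-heap memory; the real code works on raw
// addresses through Platform / Unsafe, which this sketch does not use.
final class ThresholdFillSketch {
    // Value taken from the PR description.
    static final int SET_MEMORY_THRESHOLD = 128;

    final byte[] memory;   // pretend this is the off-heap region
    int bulkCalls = 0;     // how many times the memset-style path ran

    ThresholdFillSketch(int size) {
        memory = new byte[size];
    }

    // Stand-in for a per-byte store.
    private void putByte(int offset, byte value) {
        memory[offset] = value;
    }

    // Stand-in for a memset-style bulk call with a fixed per-call cost.
    private void setMemory(int offset, byte value, int length) {
        bulkCalls++;
        Arrays.fill(memory, offset, offset + length, value);
    }

    void putBytes(int rowId, int count, byte value) {
        if (count >= SET_MEMORY_THRESHOLD) {
            // At or above the threshold: one bulk call.
            setMemory(rowId, value, count);
        } else {
            // Below the threshold: inline loop, no bulk-call overhead.
            for (int i = 0; i < count; i++) {
                putByte(rowId + i, value);
            }
        }
    }
}
```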

This PR is based on top of #56072 since the threshold constant is
defined there. If #56072 lands first, this PR rebases cleanly onto
master.

Why are the changes needed?

The bulk-fill APIs on WritableColumnVector are the natural call to
make from any column writer, but their implementations were per-element
loops. Switching to intrinsics:

  • Arrays.fill is backed by HotSpot's _jbyte_fill / _jshort_fill /
    _jlong_fill intrinsic stubs; on byte/short arrays C2 can usually
    auto-vectorize the original loop and gains are modest, but for
    long[] and at small counts the intrinsic is meaningfully faster.
  • Unsafe.setMemory lowers to a native memset. For OffHeap byte fills
    this is dramatic at large counts because the original per-byte
    Platform.putByte loop is not auto-vectorized by the JIT.

Measured on Apple M4 Max with OpenJDK 21.0.8, using a new
WritableColumnVectorBulkFillBenchmark (added in a separate change,
not part of this PR). Rates are in millions of elements per second.

OffHeap byte fills (putBytes / putBooleans), threshold path:

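| count | baseline (M elements/s) | patched (M elements/s) | delta |
|---|---|---|---|
| 8 | ~1,900 | ~1,840 | parity (small-count fallback) |
| 64 | ~3,800 | ~3,760 | parity |
| 512 | ~4,150 | ~13,100 | +3.2x |
| 4,096 | ~4,340 | ~31,900 | +7.4x |
| 65,536 | ~4,275 | ~43,700 | +10.2x |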

OnHeap byte fills:

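| count | baseline (M elements/s) | patched (M elements/s) | delta |
|---|---|---|---|
| 8 | ~2,620 | ~3,230 | +23% |
| 64 | ~19,000 | ~25,400 | +33% |
| 512 | ~68,800 | ~86,200 | +25% |
| 4,096 | ~128,400 | ~133,300 | +4% |
| 65,536 | ~143,200 | ~143,600 | saturated (byte memory bandwidth) |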

OnHeap longs: +1-14% in the small/medium range, saturated by
memory bandwidth at large counts. Included for consistency with the
byte methods.
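
For readers who want to reproduce the arithmetic behind the rate and delta columns, here is a small sketch. It is not the WritableColumnVectorBulkFillBenchmark mentioned above; FillRateSketch and its method names are hypothetical, and the timing loop in main is deliberately naive (a real measurement would use a proper benchmark harness).

```java
import java.util.Arrays;

final class FillRateSketch {
    // Millions of elements per second, from elements written and elapsed
    // nanoseconds: (elements / nanos) is elements per ns, i.e. 1e3 M elements/s.
    static double rateMElementsPerSec(long elements, long nanos) {
        return elements * 1e3 / nanos;
    }

    // Speedup factor in the style of the "x" deltas: patched over baseline.
    static double speedup(double baselineRate, double patchedRate) {
        return patchedRate / baselineRate;
    }

    public static void main(String[] args) {
        byte[] data = new byte[4096];
        int iterations = 100_000;
        // Warm up so the fill is JIT-compiled before timing starts.
        for (int i = 0; i < iterations; i++) {
            Arrays.fill(data, (byte) 1);
        }
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            Arrays.fill(data, (byte) (i & 1));
        }
        long elapsed = System.nanoTime() - start;
        // Read the array afterwards so the fills are not trivially dead.
        System.out.printf("%.0f M elements/s (last byte %d)%n",
            rateMElementsPerSec((long) data.length * iterations, elapsed), data[0]);
    }
}
```

With the OffHeap figures at count 512, speedup(4150, 13100) comes out at roughly 3.16, which matches the +3.2x entry after rounding.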

OffHeap multi-byte fills (putShorts / putInts / putLongs / putFloats /
putDoubles) are out of scope: Platform.setMemory is byte-only and the
value=0 short-circuit alternative was prototyped under SPARK-57024 and
showed no measurable gain.
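
To illustrate why a byte-only fill does not generalize to wider types: a memset-style call repeats one byte, so it can only produce multi-byte values whose bytes are all identical (0 and -1 being the obvious cases). The sketch below is hypothetical and uses a heap ByteBuffer purely to show the bit pattern; it does not touch Platform or Unsafe.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class ByteOnlyFillSketch {
    // True when both bytes of the 16-bit value are identical, i.e. when a
    // byte-wise fill would reproduce the value in every 2-byte slot.
    static boolean fillableByteWise(short value) {
        return (byte) value == (byte) (value >> 8);
    }

    // Fill `count` short slots by writing one repeated byte, memset-style,
    // then read the region back as shorts.
    static short[] byteFillAsShorts(int count, byte b) {
        ByteBuffer buf = ByteBuffer.allocate(count * 2).order(ByteOrder.nativeOrder());
        for (int i = 0; i < buf.capacity(); i++) {
            buf.put(i, b);
        }
        short[] out = new short[count];
        buf.asShortBuffer().get(out);
        return out;
    }
}
```

Filling with the byte 1 yields shorts equal to 0x0101 (257), not 1, which is why only special values such as 0 could take a byte-fill shortcut.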

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests; no behavior change. Ran locally on top of #56072:

  • VectorizedRleValuesReaderSuite
  • ColumnVectorSuite
  • ColumnarBatchSuite
  • ParquetIOSuite

237 tests, all pass.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.7)

viirya added 2 commits May 23, 2026 16:51
…uet vectorized reader

VectorizedRleValuesReader materializes RLE runs of nulls and definition
levels with degenerate per-element loops:

  for (int k = 0; k < runLen; k++) {
    nulls.putNull(valueOff + k);
  }
  for (int k = 0; k < runLen; k++) {
    defLevels.putInt(levelIdx + k, runValue);
  }

WritableColumnVector already exposes the bulk equivalents
putNulls(rowId, count) and putInts(rowId, count, value), but their
implementations were also degenerate loops. This commit switches the
callers to the bulk APIs and reimplements the bulk APIs with intrinsics:

  - OnHeapColumnVector.putNulls -> Arrays.fill(byte[], ..., (byte) 1)
  - OnHeapColumnVector.putInts(rowId, count, value)
        -> Arrays.fill(int[], ..., value)
  - OffHeapColumnVector.putNulls -> Platform.setMemory(addr, (byte) 1, count)
    with a small-count fallback to an inline byte loop

Arrays.fill is a JIT intrinsic and Unsafe.setMemory lowers to a native
memset; both are faster than the unrolled-by-JIT loops they replace once
runLen grows beyond a handful of elements.

For OffHeap.putNulls, Unsafe.setMemory has a non-trivial JNI fixed cost,
so it loses to the inline byte loop for very short fills. A threshold
of 128 elements is used to pick between the two paths — this avoids a
regression at small counts (where the inline loop is faster) while
retaining the asymptotic gain (~10x at count >= 4096 in OffHeap fills).

Co-authored-by: Claude Code

…itableColumnVector methods

SPARK-57024 fixed the degenerate per-element loops in three bulk-fill
methods (OnHeap.putNulls, OnHeap.putInts(rowId, count, value), and
OffHeap.putNulls). The same pattern still exists in six sibling
methods. This change replaces them with the same intrinsic
substitutions used in SPARK-57024:

  OnHeapColumnVector:
    putBooleans -> Arrays.fill(byte[], ..., (byte) v)
    putBytes    -> Arrays.fill(byte[], ...)
    putShorts   -> Arrays.fill(short[], ...)
    putLongs    -> Arrays.fill(long[], ...)
  OffHeapColumnVector:
    putBooleans -> Platform.setMemory with SET_MEMORY_THRESHOLD fallback
    putBytes    -> Platform.setMemory with SET_MEMORY_THRESHOLD fallback

The OffHeap methods reuse the same threshold introduced for
OffHeap.putNulls in SPARK-57024: below 128 elements, an inline byte
loop avoids the JNI fixed cost of Unsafe.setMemory; at or above 128,
setMemory beats the loop, and the gain grows with count to ~10x at
count=65,536.

OffHeap multi-byte fills (putShorts / putInts / putLongs / putFloats /
putDoubles) are out of scope: Platform.setMemory is byte-only and the
value=0 short-circuit alternative was prototyped under SPARK-57024 and
showed no measurable gain.

Co-authored-by: Claude Code
@viirya viirya closed this May 24, 2026
@viirya viirya deleted the SPARK-57036 branch May 24, 2026 00:16