[SPARK-57036][SQL] Use intrinsic bulk-fill APIs for constant-value WritableColumnVector methods #56082

Open
viirya wants to merge 2 commits into apache:master from viirya:SPARK-57036

@viirya viirya commented May 24, 2026

What changes were proposed in this pull request?

Six bulk-fill methods on the column vectors implement constant-value
fills as plain per-element loops. This PR replaces each loop with an
intrinsic-backed bulk call:

| Method | Substitution |
|---|---|
| OnHeapColumnVector.putBooleans(rowId, count, value) | Arrays.fill(byte[], ..., (byte) v) |
| OnHeapColumnVector.putBytes(rowId, count, value) | Arrays.fill(byte[], ...) |
| OnHeapColumnVector.putShorts(rowId, count, value) | Arrays.fill(short[], ...) |
| OnHeapColumnVector.putLongs(rowId, count, value) | Arrays.fill(long[], ...) |
| OffHeapColumnVector.putBooleans(rowId, count, value) | Platform.setMemory with small-count fallback |
| OffHeapColumnVector.putBytes(rowId, count, value) | Platform.setMemory with small-count fallback |
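
For readers unfamiliar with the shape of the change, here is a minimal sketch of the OnHeap substitution. OnHeapFillSketch and its fields are hypothetical stand-ins, not Spark's OnHeapColumnVector, and the 1/0 byte encoding for booleans is an assumption based on the (byte) v cast in the table above.

```java
import java.util.Arrays;

/** Hypothetical stand-in for an on-heap column's backing arrays. */
final class OnHeapFillSketch {
    final byte[] byteData;
    final short[] shortData;
    final long[] longData;

    OnHeapFillSketch(int capacity) {
        byteData = new byte[capacity];
        shortData = new short[capacity];
        longData = new long[capacity];
    }

    // "Before" shape: one store per element.
    void putBytesLoop(int rowId, int count, byte value) {
        for (int i = 0; i < count; ++i) {
            byteData[rowId + i] = value;
        }
    }

    // "After" shape: a range fill over [rowId, rowId + count).
    void putBytes(int rowId, int count, byte value) {
        Arrays.fill(byteData, rowId, rowId + count, value);
    }

    // Assumption: booleans are stored one per byte, as 1 or 0.
    void putBooleans(int rowId, int count, boolean value) {
        Arrays.fill(byteData, rowId, rowId + count, (byte) (value ? 1 : 0));
    }

    void putShorts(int rowId, int count, short value) {
        Arrays.fill(shortData, rowId, rowId + count, value);
    }

    void putLongs(int rowId, int count, long value) {
        Arrays.fill(longData, rowId, rowId + count, value);
    }
}
```

Note that Arrays.fill takes an exclusive end index, so the (rowId, count) pair maps to (rowId, rowId + count).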

The two OffHeap methods share a SET_MEMORY_THRESHOLD = 128 constant.
Below the threshold, an inline byte loop avoids the fixed native-call
cost of Unsafe.setMemory; at or above it, setMemory wins and its
advantage grows with count (about 7x at count=512 and 16x at
count=65,536 in the benchmark numbers below).
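
The threshold dispatch can be sketched as follows. This is an illustration only, not the patch itself: it uses sun.misc.Unsafe obtained by reflection as a stand-in for Spark's Platform wrapper, and OffHeapFillSketch, fillBytes, allocate, free and read are hypothetical names. It targets JDK 17; newer JDKs deprecate these Unsafe memory-access methods.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

/** Hypothetical sketch of a threshold-gated off-heap byte fill. */
final class OffHeapFillSketch {
    // Value taken from the PR description.
    static final int SET_MEMORY_THRESHOLD = 128;

    private static final Unsafe UNSAFE = loadUnsafe();

    private static Unsafe loadUnsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    /** Writes `value` into `count` consecutive bytes starting at `address`. */
    static void fillBytes(long address, int count, byte value) {
        if (count < SET_MEMORY_THRESHOLD) {
            // Small fills: an inline loop avoids the fixed call cost.
            for (int i = 0; i < count; ++i) {
                UNSAFE.putByte(address + i, value);
            }
        } else {
            // Large fills: one bulk call.
            UNSAFE.setMemory(address, count, value);
        }
    }

    static long allocate(long bytes) { return UNSAFE.allocateMemory(bytes); }
    static void free(long address) { UNSAFE.freeMemory(address); }
    static byte read(long address) { return UNSAFE.getByte(address); }
}
```

Both branches must produce identical memory contents; only the cost profile differs, which is why the threshold can be tuned without affecting behavior.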

This PR also adds WritableColumnVectorBulkFillBenchmark to measure
these constant-value bulk-fill APIs across a count sweep covering both
the small-count (call-overhead dominated) and large-count (memory
bandwidth dominated) regimes.
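
To make the sweep concrete, below is a rough sketch of a count sweep that reports a rate in M elements/s. It is not the benchmark added by this PR and does not use Spark's benchmark framework; FillSweepSketch and its methods are hypothetical. It has no warm-up phase or JIT safeguards, so its numbers are only indicative.

```java
import java.util.Arrays;

/** Hypothetical count sweep; not WritableColumnVectorBulkFillBenchmark. */
final class FillSweepSketch {
    // Counts taken from the commit message.
    static final int[] COUNTS = {1, 8, 64, 512, 4096, 65536};

    /** Converts an element count and elapsed time into millions of elements per second. */
    static double millionElementsPerSecond(long elements, long elapsedNanos) {
        return elements / (elapsedNanos / 1e9) / 1e6;
    }

    /** Times `iterations` fills of `count` bytes and returns the rate. */
    static double measure(int count, int iterations) {
        byte[] data = new byte[count];
        long start = System.nanoTime();
        for (int i = 0; i < iterations; ++i) {
            Arrays.fill(data, 0, count, (byte) i);
        }
        // Guard against a zero interval on coarse timers.
        long elapsed = Math.max(1, System.nanoTime() - start);
        return millionElementsPerSecond((long) count * iterations, elapsed);
    }

    public static void main(String[] args) {
        for (int count : COUNTS) {
            System.out.printf("count=%d rate=%.0f M elements/s%n",
                count, measure(count, 100_000));
        }
    }
}
```

At small counts the per-call overhead dominates the measured rate, while at large counts the fill itself does, which is the distinction the sweep is meant to expose.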

Why are the changes needed?

The bulk-fill APIs on WritableColumnVector are the natural call to
make from any column writer that emits a run of identical values, but
they were implemented as per-element loops. Switching to intrinsics
helps in two ways:

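  • Arrays.fill is a loop shape HotSpot optimizes well: C2 can
    vectorize it or, for byte and short arrays, replace it with the
    _jbyte_fill / _jshort_fill stubs. (HotSpot has no _jlong_fill
    stub; the long case relies on vectorization.)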
  • Unsafe.setMemory lowers to a native memset. For OffHeap byte
    fills the original loop issues one Platform.putByte store per
    element and is not vectorized, so the gain is dramatic at large
    counts.

Benchmark numbers (GitHub Actions, JDK 17, Scala 2.13)

Measured by running WritableColumnVectorBulkFillBenchmark via the
Run benchmarks workflow on both the baseline (#56084) and this PR's
branch, so the two runs use identical hardware and JDK. Rate (M
elements/s):

OffHeap byte fills (putBytes / putBooleans) — the headline win:

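| count | baseline (M elements/s) | patched (M elements/s) | delta |
|---|---|---|---|
| 1 | ~290 | ~240 | within run-to-run noise (~30%) |
| 8 | ~1,390 | ~1,280 | within run-to-run noise (~10%) |
| 64 | ~2,550 | ~2,450 | parity |
| 512 | ~2,700 | ~19,500 | +7.2x |
| 4,096 | ~2,770 | ~39,200 | +14.1x |
| 65,536 | ~2,780 | ~44,500 | +16.0x |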

(Numbers averaged across putBytes and putBooleans since they share
the same code path.)
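
As a sanity check on the delta column, the ratios can be recomputed from the rounded rates in the table. SpeedupSketch is a hypothetical helper; because the inputs are approximate, the recomputed ratios agree with the table to within about 0.1x rather than exactly.

```java
/** Hypothetical helper that recomputes patched/baseline ratios from the table. */
final class SpeedupSketch {
    // Rounded rates in M elements/s, copied from the rows above the threshold.
    static final int[] COUNTS = {512, 4096, 65536};
    static final double[] BASELINE = {2700, 2770, 2780};
    static final double[] PATCHED = {19500, 39200, 44500};

    /** Returns patched / baseline for each count. */
    static double[] speedups() {
        double[] out = new double[COUNTS.length];
        for (int i = 0; i < COUNTS.length; ++i) {
            out[i] = PATCHED[i] / BASELINE[i];
        }
        return out;
    }

    public static void main(String[] args) {
        double[] s = speedups();
        for (int i = 0; i < COUNTS.length; ++i) {
            System.out.printf("count=%d speedup=%.2fx%n", COUNTS[i], s[i]);
        }
    }
}
```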

At the measured counts above the 128-element threshold (512 and up),
setMemory produces a 7-16x improvement that grows with run length,
consistent with the fixed cost of the memset call being amortized over
longer fills. Below the threshold, both runs use the same inline byte
loop, so the small differences at count=1 and count=8 are GHA
run-to-run variance rather than a structural change.

OnHeap fills: on the GHA runner (Linux + Zulu JDK 17) the C2
compiler already auto-vectorizes the original loops to near the
memory-bandwidth ceiling, so Arrays.fill is at parity: ~2,790 M
elements/s for putBooleans / putBytes / putShorts / putLongs at all
counts, in both the baseline and patched runs. On Apple M4 Max +
OpenJDK 21 the same change yields +5-33% in the small/medium count
range. The OnHeap changes are kept for consistency with the OffHeap
fixes and to avoid future divergence between platforms.

OffHeap multi-byte fills (putShorts / putInts / putLongs /
putFloats / putDoubles) are out of scope: Platform.setMemory is
byte-only and a value=0 short-circuit alternative was tried and showed
no measurable gain.
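
To illustrate why a byte-only fill does not generalize to multi-byte elements, the sketch below emulates a memset over a short column with a plain byte[] and reads the result back as a short. ByteFillLimitSketch and its methods are hypothetical; the point is only that a byte-wise fill can produce just those values whose bytes are all identical, such as 0.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Hypothetical demonstration of what a byte-wise fill does to short elements. */
final class ByteFillLimitSketch {
    /**
     * Fills the storage for `count` shorts with a single byte, as a memset
     * would, and returns the first element read back as a short.
     */
    static short firstShortAfterByteFill(int count, byte b) {
        byte[] raw = new byte[count * Short.BYTES];
        Arrays.fill(raw, b);
        return ByteBuffer.wrap(raw).getShort(0);
    }

    /** True when both bytes of `value` are equal, i.e. a byte-wise fill can produce it. */
    static boolean isByteUniform(short value) {
        return (byte) (value >> 8) == (byte) value;
    }
}
```

Filling with the byte 1 yields the short 0x0101 rather than 1, regardless of byte order, so a general putShorts(rowId, count, value) cannot be expressed as a single byte fill.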

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests; no behavior change. Ran locally:

  • VectorizedRleValuesReaderSuite
  • ColumnVectorSuite
  • ColumnarBatchSuite
  • ParquetIOSuite

237 tests, all pass.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.7)

…itableColumnVector methods

Six bulk-fill methods on the column vectors implement constant-value
fills with degenerate per-element loops:

  OnHeapColumnVector:
    putBooleans(int rowId, int count, boolean value)
    putBytes(int rowId, int count, byte value)
    putShorts(int rowId, int count, short value)
    putLongs(int rowId, int count, long value)
  OffHeapColumnVector:
    putBooleans(int rowId, int count, boolean value)
    putBytes(int rowId, int count, byte value)

Replace them with intrinsic substitutions:

  - OnHeap variants -> Arrays.fill on the typed array.
  - OffHeap variants -> Platform.setMemory with a small-count fallback
    to an inline byte loop, gated by a SET_MEMORY_THRESHOLD of 128.
    Below the threshold, the inline loop wins because Unsafe.setMemory
    carries a fixed call cost; at or above, setMemory dominates and the
    speedup grows to ~10x at count=4096+.

Also adds WritableColumnVectorBulkFillBenchmark for measuring the
constant-value bulk-fill APIs across a count sweep (1, 8, 64, 512,
4096, 65536), covering both OnHeap and OffHeap paths. This is the
benchmark used to produce the numbers in the PR description.

OffHeap multi-byte fills (putShorts / putInts / putLongs / putFloats /
putDoubles) are out of scope: Platform.setMemory is byte-only and a
value=0 short-circuit alternative was tried and showed no measurable
gain on Apple M4 Max + OpenJDK 21.

Co-authored-by: Claude Code
@viirya viirya requested review from cloud-fan and gengliangwang May 24, 2026 03:00