[SPARK-57024][SQL] Use bulk fill APIs to materialize RLE runs in Parquet vectorized reader by viirya · Pull Request #56072 · apache/spark

viirya · 2026-05-23T05:23:41Z

What changes were proposed in this pull request?

VectorizedRleValuesReader materializes RLE runs of nulls and
definition levels with degenerate per-element loops:

// VectorizedRleValuesReader.java
for (int k = 0; k < runLen; k++) {
  nulls.putNull(valueOff + k);
}
for (int k = 0; k < runLen; k++) {
  defLevels.putInt(levelIdx + k, runValue);
}

WritableColumnVector already exposes the bulk equivalents
putNulls(rowId, count) and putInts(rowId, count, value). This PR
switches the three caller sites to the bulk APIs and reimplements the
bulk APIs themselves (which were also per-element loops) on top of
intrinsic-backed fills:

- OnHeapColumnVector.putNulls -> Arrays.fill(byte[], ..., (byte) 1)
- OnHeapColumnVector.putInts(rowId, count, value) -> Arrays.fill(int[], ..., value)
- OffHeapColumnVector.putNulls -> Platform.setMemory(addr, (byte) 1, count)
  with a small-count fallback to an inline byte loop

Arrays.fill is backed by HotSpot's _jbyte_fill / _jint_fill
intrinsic stubs and Unsafe.setMemory lowers to a native memset; both
are faster than the byte/int loops they replace once runLen grows
beyond a handful of elements.
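
To make the on-heap change concrete, here is a minimal sketch of the two forms side by side. OnHeapFillSketch and its fields are hypothetical stand-ins, not Spark's OnHeapColumnVector, and the sketch ignores any null-count bookkeeping the real class may do; it only shows that a range Arrays.fill writes the same bytes as the per-element loop.

```java
import java.util.Arrays;

/** Hypothetical stand-in for an on-heap column vector; not Spark's OnHeapColumnVector. */
final class OnHeapFillSketch {
  final byte[] nulls;   // 1 marks a null slot, 0 a non-null slot
  final int[] intData;  // e.g. definition levels or int values

  OnHeapFillSketch(int capacity) {
    nulls = new byte[capacity];
    intData = new int[capacity];
  }

  // Per-element form: one array store per loop iteration.
  void putNullsLoop(int rowId, int count) {
    for (int i = 0; i < count; i++) {
      nulls[rowId + i] = (byte) 1;
    }
  }

  // Bulk form: a single range fill over [rowId, rowId + count).
  void putNulls(int rowId, int count) {
    Arrays.fill(nulls, rowId, rowId + count, (byte) 1);
  }

  // Bulk form for a constant int value over [rowId, rowId + count).
  void putInts(int rowId, int count, int value) {
    Arrays.fill(intData, rowId, rowId + count, value);
  }

  public static void main(String[] args) {
    OnHeapFillSketch v = new OnHeapFillSketch(16);
    v.putNulls(4, 8);    // mark rows 4..11 as null
    v.putInts(0, 4, 1);  // write the value 1 into rows 0..3
    System.out.println(Arrays.toString(v.nulls));
    System.out.println(Arrays.toString(v.intData));
  }
}
```

Both forms are semantically identical; any speed difference comes from how the JIT compiles the range fill, so it should be measured rather than assumed on a given JDK and CPU.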

For OffHeapColumnVector.putNulls, Unsafe.setMemory has a non-trivial
fixed JNI call cost, so it loses to the inline byte loop for very short
fills (which are common in random null patterns). A threshold of 128
elements selects between the two paths.
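
The thresholded dispatch can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: OffHeapFillSketch, BULK_FILL_THRESHOLD, and the helper methods are hypothetical names, sun.misc.Unsafe is used directly in place of Spark's Platform wrapper (its memory-access methods are deprecated on recent JDKs), and whether the boundary at 128 is inclusive is a guess.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

/** Hypothetical sketch of a thresholded off-heap null fill; not Spark's OffHeapColumnVector. */
final class OffHeapFillSketch {
  // Assumed cutoff taken from the PR text; the best value is hardware-dependent.
  static final int BULK_FILL_THRESHOLD = 128;

  private static final Unsafe UNSAFE = loadUnsafe();

  private static Unsafe loadUnsafe() {
    try {
      Field f = Unsafe.class.getDeclaredField("theUnsafe");
      f.setAccessible(true);
      return (Unsafe) f.get(null);
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("sun.misc.Unsafe is not available", e);
    }
  }

  // Allocates `bytes` of native memory and zeroes it; the caller must call free().
  static long allocateZeroed(long bytes) {
    long addr = UNSAFE.allocateMemory(bytes);
    UNSAFE.setMemory(addr, bytes, (byte) 0);
    return addr;
  }

  static void free(long addr) {
    UNSAFE.freeMemory(addr);
  }

  static byte byteAt(long base, int rowId) {
    return UNSAFE.getByte(base + rowId);
  }

  // Marks rows [rowId, rowId + count) as null in the byte-per-row null buffer.
  static void putNulls(long nullsBase, int rowId, int count) {
    long start = nullsBase + rowId;
    if (count < BULK_FILL_THRESHOLD) {
      // Short fill: a plain byte loop avoids the fixed cost of the bulk call.
      for (int i = 0; i < count; i++) {
        UNSAFE.putByte(start + i, (byte) 1);
      }
    } else {
      // Long fill: one bulk call covering the whole range.
      UNSAFE.setMemory(start, count, (byte) 1);
    }
  }
}
```

Both branches write the same bytes, so the threshold only affects speed, never results; that is what makes it safe to tune per platform.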

Why are the changes needed?

The bulk-fill APIs on WritableColumnVector are the natural calls to
make in VectorizedRleValuesReader, but until now neither side was
actually bulk: the callers looped per element instead of calling them,
and the implementations themselves were small per-element loops.

Caller-side (Parquet RLE materialization)

Measured on Apple M4 Max + OpenJDK 21.0.8 using
VectorizedRleValuesReaderBenchmark (Group C, "Nullable batch decode
with def-level materialization", 1M rows, BATCH_SIZE=4096). Values are
ns/row (lower is better); a positive delta means the patched build is
faster:

nullRatio	shape	baseline	patched	delta
0.1	random	4.0	4.2	noise
0.1	clustered	2.8	2.7	+4%
0.3	random	6.2	6.3	noise
0.3	clustered	2.8	2.7	+4%
0.5	random	7.1	7.1	0%
0.5	clustered	2.8	2.6	+7%
0.9	random	3.9	3.5	+10%
0.9	clustered	2.6	2.3	+12%

Gains concentrate on clustered null patterns (long RLE runs), which are
common in real workloads: sparse dimension columns, ETL-staged nulls,
time-bucketed missing metrics. Random null patterns produce short runs,
where the bulk-API call costs about the same as the original loop, so
results there stay within noise.
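
One way to see why pattern shape matters is to compare null-run lengths. The sketch below is illustrative only: NullRunSketch and its mask generators are hypothetical, and how the benchmark actually builds its "random" and "clustered" shapes is not stated here. If each row is null independently with probability p, the expected null-run length is 1/(1-p), which is 2 rows at p = 0.5 and 10 rows at p = 0.9. That is consistent with the random shape only showing a gain at nullRatio 0.9.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical helper for reasoning about null-run lengths; not part of this PR. */
final class NullRunSketch {
  // Mean length of maximal runs of nulls in the mask; 0.0 if there are no nulls.
  static double meanNullRunLength(boolean[] isNull) {
    int runs = 0;
    int nullCount = 0;
    boolean previousWasNull = false;
    for (boolean current : isNull) {
      if (current) {
        nullCount++;
        if (!previousWasNull) {
          runs++;  // a new run starts here
        }
      }
      previousWasNull = current;
    }
    return runs == 0 ? 0.0 : (double) nullCount / runs;
  }

  // Assumed "clustered" shape: all nulls form one contiguous block at the start.
  static boolean[] clusteredMask(int n, double nullRatio) {
    boolean[] mask = new boolean[n];
    Arrays.fill(mask, 0, (int) Math.round(n * nullRatio), true);
    return mask;
  }

  // Assumed "random" shape: each row is null independently with probability nullRatio.
  static boolean[] randomMask(int n, double nullRatio, long seed) {
    Random rng = new Random(seed);
    boolean[] mask = new boolean[n];
    for (int i = 0; i < n; i++) {
      mask[i] = rng.nextDouble() < nullRatio;
    }
    return mask;
  }

  public static void main(String[] args) {
    System.out.println(meanNullRunLength(clusteredMask(4096, 0.9)));
    System.out.println(meanNullRunLength(randomMask(4096, 0.9, 42L)));
  }
}
```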

Implementation-side (OffHeap putNulls)

A separate micro-benchmark of OffHeapColumnVector.putNulls (run via
WritableColumnVectorBulkFillBenchmark, not included in this PR) shows
that the threshold matters: a naive unconditional Platform.setMemory
regresses small-count fills (count <= 64) by up to 7x against the
original byte loop due to JNI fixed cost, while fills of count >= 4096
gain ~10x. The 128-element threshold picks the right path for both
regimes; the crossover on the benchmarked hardware sits between 64 and
512, so 128 is conservative.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests; no behavior change. Ran locally:

- VectorizedRleValuesReaderSuite (covers the modified caller paths)
- ColumnVectorSuite and ColumnarBatchSuite (cover the modified
  OnHeap/OffHeapColumnVector.putNulls / putInts bulk APIs)
- ParquetIOSuite (end-to-end vectorized reader coverage)

237 tests, all pass.

Benchmark numbers above produced by the existing
VectorizedRleValuesReaderBenchmark (no benchmark changes in this PR).

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.7)

…uet vectorized reader

VectorizedRleValuesReader materializes RLE runs of nulls and
definition levels with degenerate per-element loops:

for (int k = 0; k < runLen; k++) {
  nulls.putNull(valueOff + k);
}
for (int k = 0; k < runLen; k++) {
  defLevels.putInt(levelIdx + k, runValue);
}

WritableColumnVector already exposes the bulk equivalents
putNulls(rowId, count) and putInts(rowId, count, value), but their
implementations were also degenerate loops. This commit switches the
callers to the bulk APIs and reimplements the bulk APIs with intrinsics:

- OnHeapColumnVector.putNulls -> Arrays.fill(byte[], ..., (byte) 1)
- OnHeapColumnVector.putInts(rowId, count, value) -> Arrays.fill(int[], ..., value)
- OffHeapColumnVector.putNulls -> Platform.setMemory(addr, (byte) 1, count)
  with a small-count fallback to an inline byte loop

Arrays.fill is a JIT intrinsic and Unsafe.setMemory lowers to a native
memset; both are faster than the unrolled-by-JIT loops they replace once
runLen grows beyond a handful of elements.

For OffHeap.putNulls, Unsafe.setMemory has a non-trivial JNI fixed
cost, so it loses to the inline byte loop for very short fills. A
threshold of 128 elements is used to pick between the two paths; this
avoids a regression at small counts (where the inline loop is faster)
while retaining the asymptotic gain (~10x at count >= 4096 in OffHeap
fills).

Co-authored-by: Claude Code

viirya requested review from cloud-fan and gengliangwang May 23, 2026 18:38

viirya force-pushed the SPARK-57024 branch from 3258640 to 34c00df Compare May 23, 2026 23:51

viirya mentioned this pull request May 24, 2026

[SPARK-57036][SQL] Use intrinsic bulk-fill APIs for constant-value WritableColumnVector methods #56081

Closed

viirya wants to merge 1 commit into apache:master from viirya:SPARK-57024
