[SPARK-57024][SQL] Use bulk fill APIs to materialize RLE runs in Parquet vectorized reader#56072

Open
viirya wants to merge 1 commit into apache:master from viirya:SPARK-57024

viirya commented May 23, 2026

What changes were proposed in this pull request?

VectorizedRleValuesReader materializes RLE runs of nulls and
definition levels with degenerate per-element loops:

// VectorizedRleValuesReader.java
for (int k = 0; k < runLen; k++) {
  nulls.putNull(valueOff + k);
}
for (int k = 0; k < runLen; k++) {
  defLevels.putInt(levelIdx + k, runValue);
}

WritableColumnVector already exposes the bulk equivalents
putNulls(rowId, count) and putInts(rowId, count, value). This PR
switches the three call sites to the bulk APIs, and reimplements the
bulk APIs themselves (which were also per-element loops) on top of
intrinsic or native fill routines:

  • OnHeapColumnVector.putNulls -> Arrays.fill(byte[], ..., (byte) 1)
  • OnHeapColumnVector.putInts(rowId, count, value) ->
    Arrays.fill(int[], ..., value)
  • OffHeapColumnVector.putNulls -> Platform.setMemory(addr, (byte) 1, count)
    with a small-count fallback to an inline byte loop

Arrays.fill is backed by HotSpot's _jbyte_fill / _jint_fill
intrinsic stubs and Unsafe.setMemory lowers to a native memset; both
are faster than the byte/int loops they replace once runLen grows
beyond a handful of elements.
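
As a rough illustration of the on-heap change, the sketch below contrasts a per-element loop with the Arrays.fill form. SketchOnHeapVector and its fields are hypothetical, simplified stand-ins and not Spark's OnHeapColumnVector; only the java.util.Arrays.fill overloads are real APIs.

```java
import java.util.Arrays;

// Hypothetical, simplified stand-in for an on-heap column vector.
// This is NOT Spark's OnHeapColumnVector; names and fields are assumptions.
final class SketchOnHeapVector {
  final byte[] nulls;   // 1 = null, 0 = non-null
  final int[] intData;  // int payload, e.g. definition levels
  int numNulls;

  SketchOnHeapVector(int capacity) {
    nulls = new byte[capacity];
    intData = new int[capacity];
  }

  // Per-element form: one store per row.
  void putNullsLoop(int rowId, int count) {
    for (int i = 0; i < count; i++) {
      nulls[rowId + i] = (byte) 1;
    }
    numNulls += count;
  }

  // Bulk form: a single range fill over [rowId, rowId + count).
  void putNulls(int rowId, int count) {
    Arrays.fill(nulls, rowId, rowId + count, (byte) 1);
    numNulls += count;
  }

  // Bulk form for a repeated int value, e.g. one RLE run of a level.
  void putInts(int rowId, int count, int value) {
    Arrays.fill(intData, rowId, rowId + count, value);
  }
}
```

Both putNulls variants leave the vector in the same state; the difference is only in how the range is written.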

For OffHeap.putNulls, Unsafe.setMemory has a non-trivial JNI fixed
cost, so it loses to the inline byte loop for very short fills (which
are common in random null patterns). A threshold of 128 is used to pick
between the two paths.
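
A minimal sketch of the threshold dispatch described above, under stated assumptions: ByteMemory is a hypothetical abstraction standing in for raw off-heap writes (in the PR these are Platform calls), CountingMemory is an array-backed test double, and whether the real code compares with < or <= at 128 is not stated in this description.

```java
import java.util.Arrays;

// Hypothetical abstraction over raw memory writes; a stand-in for the
// per-byte and memset-style calls the PR makes through Platform.
interface ByteMemory {
  void putByte(long offset, byte value);
  void setMemory(long offset, byte value, long size);
}

final class ThresholdFill {
  // 128 comes from the PR description; the strict "<" is an assumption.
  static final int BULK_THRESHOLD = 128;

  static void putNulls(ByteMemory mem, long offset, int count) {
    if (count < BULK_THRESHOLD) {
      // Short fill: an inline loop avoids the fixed cost of the bulk call.
      for (int i = 0; i < count; i++) {
        mem.putByte(offset + i, (byte) 1);
      }
    } else {
      // Long fill: one bulk call amortizes its fixed cost over many bytes.
      mem.setMemory(offset, (byte) 1, count);
    }
  }
}

// Array-backed test double that records which path was taken.
final class CountingMemory implements ByteMemory {
  final byte[] bytes;
  int putByteCalls;
  int setMemoryCalls;

  CountingMemory(int size) {
    bytes = new byte[size];
  }

  @Override
  public void putByte(long offset, byte value) {
    bytes[(int) offset] = value;
    putByteCalls++;
  }

  @Override
  public void setMemory(long offset, byte value, long size) {
    Arrays.fill(bytes, (int) offset, (int) (offset + size), value);
    setMemoryCalls++;
  }
}
```

With this double, a fill of 8 elements goes through putByte eight times, while a fill of 512 elements results in a single setMemory call.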

Why are the changes needed?

The bulk-fill APIs on WritableColumnVector were the obviously-correct
calls to make in VectorizedRleValuesReader, but their implementations
were not actually bulk: both the callers and the implementations were
small per-element loops.

Caller-side (Parquet RLE materialization)

Measured on Apple M4 Max + OpenJDK 21.0.8 using
VectorizedRleValuesReaderBenchmark (Group C, "Nullable batch decode
with def-level materialization", 1M rows, BATCH_SIZE=4096), ns/row:

nullRatio  shape      baseline (ns/row)  patched (ns/row)  delta
0.1        random     4.0                4.2               noise
0.1        clustered  2.8                2.7               +4%
0.3        random     6.2                6.3               noise
0.3        clustered  2.8                2.7               +4%
0.5        random     7.1                7.1               0%
0.5        clustered  2.8                2.6               +7%
0.9        random     3.9                3.5               +10%
0.9        clustered  2.6                2.3               +12%

Gains concentrate on clustered null patterns (long RLE runs), which are
common in real workloads such as sparse dimension columns, ETL-staged
nulls, and time-bucketed missing metrics. Random null patterns produce
short runs, where the bulk-API call costs about as much as the original
loop, so those rows mostly show no change beyond noise. The exception is
nullRatio 0.9, where even random placement presumably yields longer null
runs (+10%).
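
To make the link between null shape and run length concrete, here is a small sketch that measures the average run length of a null mask. The two generators are hypothetical and are not the ones used by VectorizedRleValuesReaderBenchmark; they only assume that "random" means independent per-row nulls and "clustered" means alternating fixed-size blocks.

```java
import java.util.Random;

final class NullRunStats {
  // Average length of the maximal runs of identical values in the mask.
  static double averageRunLength(boolean[] isNull) {
    if (isNull.length == 0) {
      return 0.0;
    }
    int runs = 1;
    for (int i = 1; i < isNull.length; i++) {
      if (isNull[i] != isNull[i - 1]) {
        runs++;
      }
    }
    return (double) isNull.length / runs;
  }

  // Assumed "random" shape: each row is null independently with
  // probability nullRatio.
  static boolean[] randomMask(int n, double nullRatio, long seed) {
    Random rng = new Random(seed);
    boolean[] mask = new boolean[n];
    for (int i = 0; i < n; i++) {
      mask[i] = rng.nextDouble() < nullRatio;
    }
    return mask;
  }

  // Assumed "clustered" shape: alternating blocks of blockLen nulls and
  // blockLen non-nulls (an overall null ratio of about 0.5).
  static boolean[] clusteredMask(int n, int blockLen) {
    boolean[] mask = new boolean[n];
    for (int i = 0; i < n; i++) {
      mask[i] = (i / blockLen) % 2 == 0;
    }
    return mask;
  }
}
```

For a batch of 4096 rows, clusteredMask(4096, 512) has runs of exactly 512 rows, whereas an independent mask at nullRatio 0.5 has runs of only a few rows each, so each bulk call there covers very little.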

Implementation-side (OffHeap putNulls)

A separate micro-benchmark of OffHeapColumnVector.putNulls (run via
WritableColumnVectorBulkFillBenchmark, not included in this PR) shows
that the threshold matters: a naive unconditional Platform.setMemory
regresses small-count fills (count <= 64) by up to 7x against the
original byte loop due to JNI fixed cost, while fills with
count >= 4096 gain ~10x. The 128-element threshold picks the right path
for both regimes; the crossover on the benchmarked hardware sits between
64 and 512, so 128 is conservative.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests; no behavior change. Ran locally:

  • VectorizedRleValuesReaderSuite (covers the modified caller paths)
  • ColumnVectorSuite and ColumnarBatchSuite (cover the modified
    OnHeap/OffHeapColumnVector.putNulls / putInts bulk APIs)
  • ParquetIOSuite (end-to-end vectorized reader coverage)

237 tests, all pass.

Benchmark numbers above produced by the existing
VectorizedRleValuesReaderBenchmark (no benchmark changes in this PR).

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.7)

viirya requested review from cloud-fan and gengliangwang May 23, 2026 18:38
…uet vectorized reader

VectorizedRleValuesReader materializes RLE runs of nulls and definition
levels with degenerate per-element loops:

  for (int k = 0; k < runLen; k++) {
    nulls.putNull(valueOff + k);
  }
  for (int k = 0; k < runLen; k++) {
    defLevels.putInt(levelIdx + k, runValue);
  }

WritableColumnVector already exposes the bulk equivalents
putNulls(rowId, count) and putInts(rowId, count, value), but their
implementations were also degenerate loops. This commit switches the
callers to the bulk APIs and reimplements the bulk APIs with intrinsics:

  - OnHeapColumnVector.putNulls -> Arrays.fill(byte[], ..., (byte) 1)
  - OnHeapColumnVector.putInts(rowId, count, value)
        -> Arrays.fill(int[], ..., value)
  - OffHeapColumnVector.putNulls -> Platform.setMemory(addr, (byte) 1, count)
    with a small-count fallback to an inline byte loop

Arrays.fill is a JIT intrinsic and Unsafe.setMemory lowers to a native
memset; both are faster than the unrolled-by-JIT loops they replace once
runLen grows beyond a handful of elements.

For OffHeap.putNulls, Unsafe.setMemory has a non-trivial JNI fixed cost,
so it loses to the inline byte loop for very short fills. A threshold
of 128 elements is used to pick between the two paths — this avoids a
regression at small counts (where the inline loop is faster) while
retaining the asymptotic gain (~10x at count >= 4096 in OffHeap fills).

Co-authored-by: Claude Code