[WIP] Rust Native POC #18570

Draft
siddharthteotia wants to merge 5 commits into apache:master from siddharthteotia:feat/rust-native-poc

siddharthteotia (Contributor) commented on May 22, 2026

WIP - Not yet Ready For Review

Summary

Draft / RFC PR for native (Rust + JNI) acceleration of Pinot's query execution. Design docs are included in this PR and will be migrated to Google Docs for broader community discussion:

  • RUST_REWRITE_DESIGN.md
  • docs/native/phase-1-design.md

POC scope (Phase 1) - Implement a GROUP BY Aggregation Kernel and push down operations

SUM, COUNT, MIN, MAX, DISTINCT_COUNT on primitive fixed-width types (INT/LONG/FLOAT/DOUBLE), plus single-column group-by with a vectorized SwissTable-style hash table and state-of-the-art SIMD kernels for the aggregations. This PR delivers the first vertical slice (SUM(LONG) end-to-end + plumbing); the remaining surface is sub-phased in docs/native/phase-1-design.md §14.

What's in this PR

  • New pinot-native module — Cargo workspace + Maven module + JNI bindings; builds the native library during the normal ./mvnw flow when Cargo is on PATH
  • sum_i64_to_f64 kernel with explicit SIMD intrinsics and runtime ISA dispatch: NEON (aarch64), AVX2 (x86_64), AVX-512DQ (x86_64), scalar fallback
  • NativeAggregationRouter + NativeSumAggregationFunction in pinot-core, plugged into AggregationFunctionFactory at a single point that covers all six aggregation contexts in Pinot (SSE V1, MSE leaf, MSE intermediate, star-tree, realtime, MV refresh)
  • TestNG integration tests: factory routing on/off, null-handling guardrail, JNI plumbing, native↔Java result equivalence

Status

  • Phase 1.A complete (POC plumbing + first SIMD kernel)
  • Native engine is opt-in via -Dpinot.native.aggregation.enabled=true. Default behavior is unchanged; Java path is the fallback and remains authoritative.
  • Library load failures are non-fatal — PinotNativeAgg.isAvailable() returns false and the factory routes to the Java path. Builds without Cargo work via -DskipNativeBuild=true.

Testing

  • Building a test harness to exercise the NEON, AVX2, and AVX-512 paths.

Siddharth Teotia and others added 5 commits May 20, 2026 17:48
Strategic doc (RUST_REWRITE_DESIGN.md) and Phase 1 detailed design
(docs/native/phase-1-design.md) for progressively accelerating Pinot's
query execution with Rust kernels exposed via JNI.

Key settled decisions captured in the per-doc decision logs:
- Phase 1 target is aggregation + group-by, not filter scan
- Integration point is AggregationFunctionFactory (covers SSE V1, MSE
  leaf, MSE intermediate, star-tree, realtime, and MV refresh in one
  hook)
- POC scope is SUM(LONG) end-to-end via NativeSumAggregationFunction
- HLL deferred to Phase 1.E to keep clearspring parity off the critical
  path
- Classic JNI for now; FFM/Panama planned when Java 22 becomes the floor

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Scaffolds the Phase 0 / 1.A POC plumbing for native acceleration:

- Cargo workspace under pinot-native/native/ with two crates:
  - kernels/: pure-Rust SIMD kernels (sum_i64_to_f64, 4-way unrolled),
    no JNI dep, testable standalone
  - ffi/: cdylib that re-exports kernels via JNI, with panic-catching
    wrappers and GetPrimitiveArrayCritical zero-copy pinning
- Maven module pinot-native/ targeting Java 11 bytecode, with
  exec-maven-plugin invoking cargo during process-resources and test
  phases. OS+arch profiles set ${native.lib.filename} so surefire can
  pass -Dpinot.native.lib.path=... to the JVM.
- PinotNativeAgg (Java) declares the static native entry points. Class
  initializer runs NativeLibLoader, which resolves the library via:
    1. -Dpinot.native.lib.path system property (dev)
    2. classpath resource /native/<os>-<arch>/lib<name>.<ext> (packaged)
    3. System.loadLibrary fallback to java.library.path
  Load failure sets isAvailable() = false and callers fall back to Java.
- Five TestNG smoke tests pass on darwin-aarch64: probe magic number,
  empty input, small-range sum, length-arg respected, 1M-element random
  sum matching Java reference within float tolerance.
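
For illustration, the 4-way unrolled scalar kernel and the "within float tolerance" comparison might look roughly like the sketch below. This is not the PR's code: only the 4-way unrolling and the i64-to-f64 summation come from the commit message, while the function names, the `approx_eq` helper, and the tolerance scaling are assumptions. Because the unrolled loop adds values in a different order than a sequential loop, the two results can differ in the last bits on large inputs, which is why an equivalence test compares within a tolerance instead of exactly.

```rust
/// Sequential reference: convert each i64 to f64 and add in input order.
pub fn sum_i64_to_f64_sequential(values: &[i64]) -> f64 {
    values.iter().map(|&v| v as f64).sum()
}

/// 4-way unrolled variant (hypothetical name). Four independent accumulators
/// break the dependency chain between consecutive floating-point additions.
pub fn sum_i64_to_f64_unrolled(values: &[i64]) -> f64 {
    let mut acc = [0.0f64; 4];
    let mut chunks = values.chunks_exact(4);
    for c in &mut chunks {
        acc[0] += c[0] as f64;
        acc[1] += c[1] as f64;
        acc[2] += c[2] as f64;
        acc[3] += c[3] as f64;
    }
    // Fewer than four elements can remain after the unrolled loop.
    let mut tail = 0.0f64;
    for &v in chunks.remainder() {
        tail += v as f64;
    }
    (acc[0] + acc[1]) + (acc[2] + acc[3]) + tail
}

/// Relative-tolerance comparison (assumed helper, not part of the PR).
pub fn approx_eq(a: f64, b: f64, rel_tol: f64) -> bool {
    let scale = a.abs().max(b.abs()).max(1.0);
    (a - b).abs() <= rel_tol * scale
}
```

A test in this style would generate random input, compute both sums, and assert `approx_eq(sequential, unrolled, tol)` for a tolerance chosen by the test author.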

Required explicit jsr305 + jspecify deps since the root pom's
package-info-maven-plugin generates @ParametersAreNonnullByDefault and
@NullMarked annotations on every package, and other modules pick those
up transitively via Guava — which this module deliberately doesn't pull
in.

Verified: ./mvnw -pl pinot-native test passes end-to-end in 4.3 s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds the single integration point: AggregationFunctionFactory checks
NativeAggregationRouter.shouldAccelerate() at the top of
getAggregationFunction() and, on eligibility, constructs the native
subclass instead of the standard Java AggregationFunction. Because all
aggregation contexts in Pinot (SSE V1, MSE leaf, MSE intermediate,
star-tree, realtime, MV refresh) obtain functions from this factory,
this single fork accelerates all of them.

Eligibility rules (short-circuiting):
1. -Dpinot.native.aggregation.enabled=true
2. PinotNativeAgg.isAvailable()
3. nullHandlingEnabled == false (no native null path yet)
4. function name in {SUM, SUM0}  (POC scope; expands in Phase 1.B)
5. single-argument IDENTIFIER expression (no transforms)

NativeSumAggregationFunction extends SumAggregationFunction and
overrides aggregate() to call PinotNativeAgg.sumLong for LONG-typed
single-value columns; falls through to super for all other type /
encoding combinations and for kernel failures (NaN sentinel). Group-by
remains on the Java path for now (Phase 1.D will add it).
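
The NaN-sentinel contract can be sketched from the native side as below. This is a hedged illustration, not the PR's FFI code: `sum_prefix`, `sum_prefix_or_nan`, and `sum_with_fallback` are hypothetical names, and mapping a caught panic to NaN is an assumption about how the panic-catching wrapper and the sentinel fit together. The sentinel is unambiguous for this kernel because a sum of finite i64 values converted to f64 can never be NaN.

```rust
use std::panic;

/// Hypothetical kernel that can panic: slicing past the end of the input
/// triggers a bounds-check panic.
fn sum_prefix(values: &[i64], len: usize) -> f64 {
    values[..len].iter().map(|&v| v as f64).sum()
}

/// Wrapper in the style of an FFI entry point. Unwinding must not cross the
/// language boundary, so a panic is caught here and reported as NaN.
pub fn sum_prefix_or_nan(values: &[i64], len: usize) -> f64 {
    panic::catch_unwind(|| sum_prefix(values, len)).unwrap_or(f64::NAN)
}

/// Caller-side view: NaN means "kernel failed", so use the fallback path.
/// In the PR the fallback is the Java superclass; here it is a closure.
pub fn sum_with_fallback(values: &[i64], len: usize, fallback: impl Fn() -> f64) -> f64 {
    let result = sum_prefix_or_nan(values, len);
    if result.is_nan() {
        fallback()
    } else {
        result
    }
}
```

With this shape, the caller never needs to distinguish failure causes: any NaN result routes to the fallback.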

NativeSumAggregationFunctionTest exercises:
- factory returns NativeSumAggregationFunction with flag on
- factory returns plain SumAggregationFunction with flag off or when
  null handling is enabled
- aggregate() on LONG matches Java reference on 100k random values
- aggregate() falls through to super on INT (out of POC scope)

WIP — test currently fails checkstyle on cosmetic LeftCurly violations
in the StubBlockValSet inner class. Functional path (router + native
function) is implemented; only the test formatting needs cleanup.

See docs/native/phase-1-design.md sections 2, 3, 8 for the engine
landscape and the rationale for routing at the factory layer rather
than the originally-proposed plan-maker fork.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Reformats the StubBlockValSet inner class so each method body sits on
its own line (Pinot's LeftCurly rule rejects single-line method bodies)
and drops two unused imports. Also picks up a trivial whitespace
cleanup in PinotNativeAgg.

With this commit, ./mvnw -pl pinot-core -am -Dtest=NativeSumAggregationFunctionTest
-Dsurefire.failIfNoSpecifiedTests=false test passes all five cases:

  factoryReturnsNativeImplWhenFlagOnAndEligible
  factoryReturnsJavaImplWhenFlagOff
  factoryReturnsJavaImplWhenNullHandlingEnabled
  aggregateLongMatchesJavaReference  (100k random longs)
  aggregateFallsThroughForIntColumn

Phase 1.A POC is now demonstrated end-to-end: a SUM(LONG) query routes
through AggregationFunctionFactory → NativeAggregationRouter →
NativeSumAggregationFunction → PinotNativeAgg.sumLong → Rust kernel,
with identical results to the Java path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The Phase 1.A POC kernel was scalar Rust with manual 4-way unrolling,
relying on LLVM auto-vectorization. On Apple Silicon the compiler did
emit NEON instructions for the load and convert, but the reduction was
scalarized (extracting one f64 lane at a time and adding into a scalar
accumulator), and on AVX2 x86 the same source would not vectorize the
i64->f64 conversion at all. That fell short of the "state of the art
kernels" Phase 1 spec.

Rewrites sum_i64_to_f64 with four explicit backends behind runtime ISA
dispatch:

  AVX-512DQ (x86_64):  _mm512_cvtepi64_pd + 4 × 512-bit accumulators
                       (8 lanes each = 32-wide ILP)
  AVX2     (x86_64):  scalar vcvtsi2sd + _mm256_set_pd packing into
                       4 × 256-bit accumulators (16-wide ILP). AVX2 has
                       no vcvtqq2pd; the conversion is the bottleneck.
                       Mysticial bit-trick is a future optimization.
  NEON     (aarch64): vld1q_s64 + vcvtq_f64_s64 + vaddq_f64 with 4 ×
                       128-bit accumulators (8-wide ILP). Native i64->f64
                       vector convert exists on ARM.
  scalar   (any):     existing 4-way unrolled fallback, also serves as
                       the reference for property-based equivalence tests.

All non-scalar paths are #[target_feature(...)] unsafe fns called only
after is_x86_feature_detected! / is_aarch64_feature_detected! returns
true. Detection results are cached by std::arch after the first call,
so the per-call dispatch cost is one atomic load.
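
The dispatch pattern described above can be sketched as follows. This is a simplified illustration, not the PR's kernel: it uses fewer accumulators than the four per backend listed above, omits the AVX-512DQ path, and the helper names `sum_scalar`, `sum_avx2`, and `sum_neon` are assumptions. Only the intrinsics named in the commit message and the detect-then-call structure are taken from the source.

```rust
/// Portable fallback, also usable as the reference for the vector paths.
fn sum_scalar(values: &[i64]) -> f64 {
    values.iter().map(|&v| v as f64).sum()
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn sum_avx2(values: &[i64]) -> f64 {
    use std::arch::x86_64::*;
    let mut chunks = values.chunks_exact(4);
    let mut lanes = [0.0f64; 4];
    // SAFETY: the caller has verified AVX2 support; `lanes` holds 4 f64s.
    unsafe {
        let mut acc = _mm256_setzero_pd();
        for c in &mut chunks {
            // No packed i64->f64 convert on AVX2: convert each lane as a
            // scalar, then pack the four doubles into one vector.
            let v = _mm256_set_pd(c[3] as f64, c[2] as f64, c[1] as f64, c[0] as f64);
            acc = _mm256_add_pd(acc, v);
        }
        _mm256_storeu_pd(lanes.as_mut_ptr(), acc);
    }
    let tail: f64 = chunks.remainder().iter().map(|&v| v as f64).sum();
    (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]) + tail
}

#[cfg(target_arch = "aarch64")]
#[target_feature(enable = "neon")]
unsafe fn sum_neon(values: &[i64]) -> f64 {
    use std::arch::aarch64::*;
    let mut chunks = values.chunks_exact(4);
    // SAFETY: the caller has verified NEON support; each chunk has exactly
    // 4 elements, so both 2-lane loads stay in bounds.
    let vector_sum = unsafe {
        let mut acc0 = vdupq_n_f64(0.0);
        let mut acc1 = vdupq_n_f64(0.0);
        for c in &mut chunks {
            let p = c.as_ptr();
            acc0 = vaddq_f64(acc0, vcvtq_f64_s64(vld1q_s64(p)));
            acc1 = vaddq_f64(acc1, vcvtq_f64_s64(vld1q_s64(p.add(2))));
        }
        // Horizontal collapse of the combined accumulator.
        vaddvq_f64(vaddq_f64(acc0, acc1))
    };
    let tail: f64 = chunks.remainder().iter().map(|&v| v as f64).sum();
    vector_sum + tail
}

/// Runtime dispatch: pick the widest verified backend, else fall back.
pub fn sum_i64_to_f64(values: &[i64]) -> f64 {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was just confirmed at runtime.
            return unsafe { sum_avx2(values) };
        }
    }
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            // SAFETY: NEON support was just confirmed at runtime.
            return unsafe { sum_neon(values) };
        }
    }
    sum_scalar(values)
}
```

Keeping the `#[target_feature]` functions private and reachable only through the dispatcher is what makes the `unsafe` calls sound: no caller can reach a vector path on a CPU that lacks the feature.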

Generated assembly verified: the hot loop is now full-vector through
the reduction (fadd.2d across all four NEON accumulators, then faddp.2d
for horizontal collapse), where the previous version reduced into a
scalar register.

Java integration test (NativeSumAggregationFunctionTest, 5 cases
including 100k-element correctness vs Java reference) still passes —
the JNI signature is unchanged and result semantics are preserved
within the documented float tolerance.

Updates docs/native/phase-1-design.md §6.1 and adds a decision log
entry noting this work moves from Phase 1.B back to 1.A.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
siddharthteotia changed the title from "Rust Native POC" to "[WIP] Rust Native POC" on May 22, 2026