[WIP] Rust Native POC by siddharthteotia · Pull Request #18570 · apache/pinot

siddharthteotia · 2026-05-22T20:19:14Z

WIP - Not yet Ready For Review

Summary

Draft / RFC PR for native (Rust + JNI) acceleration of Pinot's query execution. Design docs are included in this PR and will be migrated to Google Docs for broader community discussion:

RUST_REWRITE_DESIGN.md
docs/native/phase-1-design.md

POC scope (Phase 1) - Implement a GROUP BY Aggregation Kernel and push down operations

SUM, COUNT, MIN, MAX, DISTINCT_COUNT on primitive fixed-width types (INT/LONG/FLOAT/DOUBLE), plus single-column group-by with a vectorized SwissTable-style hash table and state-of-the-art SIMD kernels for the aggregations. This PR delivers the first vertical slice (SUM(LONG) end-to-end + plumbing); the remaining surface is sub-phased in docs/native/phase-1-design.md §14.

What's in this PR

New pinot-native module — Cargo workspace + Maven module + JNI bindings; builds the native library during the normal ./mvnw flow when Cargo is on PATH
sum_i64_to_f64 kernel with explicit SIMD intrinsics and runtime ISA dispatch: NEON (aarch64), AVX2 (x86_64), AVX-512DQ (x86_64), scalar fallback
NativeAggregationRouter + NativeSumAggregationFunction in pinot-core, plugged into AggregationFunctionFactory at a single point that covers all six aggregation contexts in Pinot (SSE V1, MSE leaf, MSE intermediate, star-tree, realtime, MV refresh)
TestNG integration tests: factory routing on/off, null-handling guardrail, JNI plumbing, native↔Java result equivalence

Status

Phase 1.A complete (POC plumbing + first SIMD kernel)
Native engine is opt-in via -Dpinot.native.aggregation.enabled=true. Default behavior is unchanged; Java path is the fallback and remains authoritative.
Library load failures are non-fatal — PinotNativeAgg.isAvailable() returns false and the factory routes to the Java path. Builds without Cargo work via -DskipNativeBuild=true.

Testing

In progress: building a test harness and testing on NEON, AVX, and AVX-512.

Strategic doc (RUST_REWRITE_DESIGN.md) and Phase 1 detailed design (docs/native/phase-1-design.md) for progressively accelerating Pinot's query execution with Rust kernels exposed via JNI.

Key settled decisions captured in the per-doc decision logs:
- Phase 1 target is aggregation + group-by, not filter scan
- Integration point is AggregationFunctionFactory (covers SSE V1, MSE leaf, MSE intermediate, star-tree, realtime, and MV refresh in one hook)
- POC scope is SUM(LONG) end-to-end via NativeSumAggregationFunction
- HLL deferred to Phase 1.E to keep clearspring parity off the critical path
- Classic JNI for now; FFM/Panama planned when Java 22 becomes the floor

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@NullMarked

Scaffolds the Phase 0 / 1.A POC plumbing for native acceleration:
- Cargo workspace under pinot-native/native/ with two crates:
  - kernels/: pure-Rust SIMD kernels (sum_i64_to_f64, 4-way unrolled), no JNI dep, testable standalone
  - ffi/: cdylib that re-exports kernels via JNI, with panic-catching wrappers and GetPrimitiveArrayCritical zero-copy pinning
- Maven module pinot-native/ targeting Java 11 bytecode, with exec-maven-plugin invoking cargo during process-resources and test phases. OS+arch profiles set ${native.lib.filename} so surefire can pass -Dpinot.native.lib.path=... to the JVM.
- PinotNativeAgg (Java) declares the static native entry points. Class initializer runs NativeLibLoader, which resolves the library via:
  1. -Dpinot.native.lib.path system property (dev)
  2. classpath resource /native/<os>-<arch>/lib<name>.<ext> (packaged)
  3. System.loadLibrary fallback to java.library.path
  Load failure sets isAvailable() = false and callers fall back to Java.
- Five TestNG smoke tests pass on darwin-aarch64: probe magic number, empty input, small-range sum, length-arg respected, 1M-element random sum matching Java reference within float tolerance.

Required explicit jsr305 + jspecify deps since the root pom's package-info-maven-plugin generates @ParametersAreNonnullByDefault and @NullMarked annotations on every package, and other modules pick those up transitively via Guava, which this module deliberately doesn't pull in.

Verified: ./mvnw -pl pinot-native test passes end-to-end in 4.3 s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
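The panic-catching wrapper idea mentioned for the ffi/ crate can be illustrated with a minimal sketch. This is not the PR's code: it uses a plain C-ABI function instead of the real JNI signature, and the names `sum_reference`, `guard_f64`, and `sum_i64_guarded` are hypothetical. It assumes the NaN-sentinel convention that a later commit message in this PR describes for kernel failures.

```rust
use std::panic::{self, AssertUnwindSafe};
use std::slice;

/// Plain scalar stand-in for the real kernel (hypothetical name).
fn sum_reference(values: &[i64]) -> f64 {
    values.iter().map(|&v| v as f64).sum()
}

/// Runs `kernel` and converts a panic into a NaN sentinel, so that no
/// unwind ever crosses the FFI boundary.
fn guard_f64<F: FnOnce() -> f64>(kernel: F) -> f64 {
    panic::catch_unwind(AssertUnwindSafe(kernel)).unwrap_or(f64::NAN)
}

/// Hypothetical C-ABI entry point. A real cdylib export would also carry
/// a no_mangle attribute and the JNI-mandated name and parameters.
///
/// # Safety
/// `ptr` must be valid for `len` consecutive reads of `i64`.
pub unsafe extern "C" fn sum_i64_guarded(ptr: *const i64, len: usize) -> f64 {
    if ptr.is_null() {
        // Assumption: a null buffer is reported with the same sentinel.
        return f64::NAN;
    }
    guard_f64(|| {
        // SAFETY: the caller promises `ptr` is valid for `len` reads.
        let values = unsafe { slice::from_raw_parts(ptr, len) };
        sum_reference(values)
    })
}
```

On the Java side, a caller following this convention would check the result with Double.isNaN and fall back to the Java path, which matches the fallback behavior described in this PR.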

Adds the single integration point: AggregationFunctionFactory checks NativeAggregationRouter.shouldAccelerate() at the top of getAggregationFunction() and, on eligibility, constructs the native subclass instead of the standard Java AggregationFunction. Because all aggregation contexts in Pinot (SSE V1, MSE leaf, MSE intermediate, star-tree, realtime, MV refresh) obtain functions from this factory, this single fork accelerates all of them.

Eligibility rules (short-circuiting):
1. -Dpinot.native.aggregation.enabled=true
2. PinotNativeAgg.isAvailable()
3. nullHandlingEnabled == false (no native null path yet)
4. function name in {SUM, SUM0} (POC scope; expands in Phase 1.B)
5. single-argument IDENTIFIER expression (no transforms)

NativeSumAggregationFunction extends SumAggregationFunction and overrides aggregate() to call PinotNativeAgg.sumLong for LONG-typed single-value columns; falls through to super for all other type / encoding combinations and for kernel failures (NaN sentinel). Group-by remains on the Java path for now (Phase 1.D will add).

NativeSumAggregationFunctionTest exercises:
- factory returns NativeSumAggregationFunction with flag on
- factory returns plain SumAggregationFunction with flag off or when null handling is enabled
- aggregate() on LONG matches Java reference on 100k random values
- aggregate() falls through to super on INT (out of POC scope)

WIP: test currently fails checkstyle on cosmetic LeftCurly violations in the StubBlockValSet inner class. Functional path (router + native function) is implemented; only the test formatting needs cleanup.

See docs/native/phase-1-design.md sections 2, 3, 8 for the engine landscape and the rationale for routing at the factory layer rather than the originally-proposed plan-maker fork.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
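The five eligibility rules above amount to a short-circuiting conjunction. The actual router is Java code in pinot-core and is not shown in this description, so the following is only a language-neutral model written in Rust; the struct `RouteRequest`, its fields, and the function `should_accelerate` are hypothetical, and exact-case matching of the function name is an assumption.

```rust
/// Hypothetical inputs to the routing decision (illustrative field names).
#[derive(Clone, Copy)]
pub struct RouteRequest<'a> {
    pub flag_enabled: bool,          // -Dpinot.native.aggregation.enabled=true
    pub native_available: bool,      // PinotNativeAgg.isAvailable()
    pub null_handling_enabled: bool, // no native null path yet
    pub function_name: &'a str,      // e.g. "SUM"
    pub single_identifier_arg: bool, // one IDENTIFIER argument, no transforms
}

/// Evaluates the rules in the listed order; `&&` stops at the first
/// rule that fails, so later checks are skipped.
pub fn should_accelerate(r: &RouteRequest) -> bool {
    r.flag_enabled
        && r.native_available
        && !r.null_handling_enabled
        && matches!(r.function_name, "SUM" | "SUM0")
        && r.single_identifier_arg
}
```

Putting the system-property check first means a deployment that never sets the flag pays only for one boolean test per factory call.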

Reformats the StubBlockValSet inner class so each method body sits on its own line (Pinot's LeftCurly rule rejects single-line method bodies) and drops two unused imports. Also picks up a trivial whitespace cleanup in PinotNativeAgg.

With this commit, ./mvnw -pl pinot-core -am -Dtest=NativeSumAggregationFunctionTest -Dsurefire.failIfNoSpecifiedTests=false test passes all five cases:
- factoryReturnsNativeImplWhenFlagOnAndEligible
- factoryReturnsJavaImplWhenFlagOff
- factoryReturnsJavaImplWhenNullHandlingEnabled
- aggregateLongMatchesJavaReference (100k random longs)
- aggregateFallsThroughForIntColumn

Phase 1.A POC is now demonstrated end-to-end: a SUM(LONG) query routes through AggregationFunctionFactory → NativeAggregationRouter → NativeSumAggregationFunction → PinotNativeAgg.sumLong → Rust kernel, with identical results to the Java path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The Phase 1.A POC kernel was scalar Rust with manual 4-way unrolling, relying on LLVM auto-vectorization. On Apple Silicon the compiler did emit NEON instructions for the load and convert, but the reduction was scalarized (extracting one f64 lane at a time and adding into a scalar accumulator), and on AVX2 x86 the same source would not vectorize the i64->f64 conversion at all. That fell short of the "state of the art kernels" Phase 1 spec.

Rewrites sum_i64_to_f64 with four explicit backends behind runtime ISA dispatch:
- AVX-512DQ (x86_64): _mm512_cvtepi64_pd + 4 × 512-bit accumulators (8 lanes each = 32-wide ILP)
- AVX2 (x86_64): scalar vcvtsi2sd + _mm256_set_pd packing into 4 × 256-bit accumulators (16-wide ILP). AVX2 has no vcvtqq2pd; the conversion is the bottleneck. Mysticial bit-trick is a future optimization.
- NEON (aarch64): vld1q_s64 + vcvtq_f64_s64 + vaddq_f64 with 4 × 128-bit accumulators (8-wide ILP). Native i64->f64 vector convert exists on ARM.
- scalar (any): existing 4-way unrolled fallback, also serves as the reference for property-based equivalence tests.

All non-scalar paths are #[target_feature(...)] unsafe fns called only after is_x86_feature_detected! / is_aarch64_feature_detected! returns true. Detection results are cached by std::arch after the first call, so the per-call dispatch cost is one atomic load.

Generated assembly verified: the hot loop is now full-vector through the reduction (fadd.2d across all four NEON accumulators, then faddp.2d for horizontal collapse), where the previous version reduced into a scalar register.

Java integration test (NativeSumAggregationFunctionTest, 5 cases including 100k-element correctness vs Java reference) still passes; the JNI signature is unchanged and result semantics are preserved within the documented float tolerance.

Updates docs/native/phase-1-design.md §6.1 and adds a decision log entry noting this work moves from Phase 1.B back to 1.A.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
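The dispatch pattern described in this commit can be sketched as follows. This is an illustration under stated assumptions, not the PR's kernel: the backend names `sum_scalar`, `sum_avx2`, and `sum_neon` are hypothetical, the AVX2 sketch uses one accumulator and the NEON sketch two (the commit describes four each), and the AVX-512DQ backend is omitted to keep the example short.

```rust
/// Scalar reference: four independent accumulators (4-way unrolled).
pub fn sum_scalar(values: &[i64]) -> f64 {
    let mut acc = [0.0f64; 4];
    let mut chunks = values.chunks_exact(4);
    for c in &mut chunks {
        acc[0] += c[0] as f64;
        acc[1] += c[1] as f64;
        acc[2] += c[2] as f64;
        acc[3] += c[3] as f64;
    }
    let tail: f64 = chunks.remainder().iter().map(|&v| v as f64).sum();
    (acc[0] + acc[1]) + (acc[2] + acc[3]) + tail
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn sum_avx2(values: &[i64]) -> f64 {
    use std::arch::x86_64::*;
    let mut acc = _mm256_setzero_pd();
    let mut chunks = values.chunks_exact(4);
    for c in &mut chunks {
        // AVX2 has no packed i64->f64 convert: convert each lane as a
        // scalar, then pack the four doubles into one 256-bit vector.
        let v = _mm256_set_pd(c[3] as f64, c[2] as f64, c[1] as f64, c[0] as f64);
        acc = _mm256_add_pd(acc, v);
    }
    let mut lanes = [0.0f64; 4];
    // SAFETY: `lanes` has room for exactly four f64 values.
    unsafe { _mm256_storeu_pd(lanes.as_mut_ptr(), acc) };
    let tail: f64 = chunks.remainder().iter().map(|&v| v as f64).sum();
    (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]) + tail
}

#[cfg(target_arch = "aarch64")]
#[target_feature(enable = "neon")]
unsafe fn sum_neon(values: &[i64]) -> f64 {
    use std::arch::aarch64::*;
    let mut chunks = values.chunks_exact(4);
    // SAFETY: each chunk holds four i64 values, so both 2-lane loads
    // stay in bounds.
    let vector_sum = unsafe {
        let mut acc0 = vdupq_n_f64(0.0);
        let mut acc1 = vdupq_n_f64(0.0);
        for c in &mut chunks {
            acc0 = vaddq_f64(acc0, vcvtq_f64_s64(vld1q_s64(c.as_ptr())));
            acc1 = vaddq_f64(acc1, vcvtq_f64_s64(vld1q_s64(c.as_ptr().add(2))));
        }
        // Vector add of the accumulators, then one horizontal reduction.
        vaddvq_f64(vaddq_f64(acc0, acc1))
    };
    let tail: f64 = chunks.remainder().iter().map(|&v| v as f64).sum();
    vector_sum + tail
}

/// Picks a backend at runtime and falls back to the scalar kernel.
pub fn sum_i64_to_f64(values: &[i64]) -> f64 {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was just confirmed at runtime.
            return unsafe { sum_avx2(values) };
        }
    }
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            // SAFETY: NEON support was just confirmed at runtime.
            return unsafe { sum_neon(values) };
        }
    }
    sum_scalar(values)
}
```

Because each backend adds in a different order, results for large or wide-ranging inputs can differ in the last bits, which is why equivalence with the scalar reference is checked within a float tolerance rather than bit-for-bit.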

Siddharth Teotia and others added 5 commits May 20, 2026 17:48

siddharthteotia changed the title ~~Rust Native POC~~ [WIP] Rust Native POC May 22, 2026

siddharthteotia wants to merge 5 commits into apache:master from siddharthteotia:feat/rust-native-poc
