[WIP] Rust Native POC #18570
Draft
siddharthteotia wants to merge 5 commits into
Strategic doc (RUST_REWRITE_DESIGN.md) and Phase 1 detailed design (docs/native/phase-1-design.md) for progressively accelerating Pinot's query execution with Rust kernels exposed via JNI.
Key settled decisions captured in the per-doc decision logs:
- Phase 1 target is aggregation + group-by, not filter scan
- Integration point is AggregationFunctionFactory (covers SSE V1, MSE leaf, MSE intermediate, star-tree, realtime, and MV refresh in one hook)
- POC scope is SUM(LONG) end-to-end via NativeSumAggregationFunction
- HLL deferred to Phase 1.E to keep clearspring parity off the critical path
- Classic JNI for now; FFM/Panama planned when Java 22 becomes the floor
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Scaffolds the Phase 0 / 1.A POC plumbing for native acceleration:
- Cargo workspace under pinot-native/native/ with two crates:
  - kernels/: pure-Rust SIMD kernels (sum_i64_to_f64, 4-way unrolled),
    no JNI dep, testable standalone
  - ffi/: cdylib that re-exports kernels via JNI, with panic-catching
    wrappers and GetPrimitiveArrayCritical zero-copy pinning
- Maven module pinot-native/ targeting Java 11 bytecode, with
  exec-maven-plugin invoking cargo during process-resources and test
  phases. OS+arch profiles set ${native.lib.filename} so surefire can
  pass -Dpinot.native.lib.path=... to the JVM.
- PinotNativeAgg (Java) declares the static native entry points. Class
  initializer runs NativeLibLoader, which resolves the library via:
  1. -Dpinot.native.lib.path system property (dev)
  2. classpath resource /native/<os>-<arch>/lib<name>.<ext> (packaged)
  3. System.loadLibrary fallback to java.library.path
  Load failure sets isAvailable() = false and callers fall back to Java.
- Five TestNG smoke tests pass on darwin-aarch64: probe magic number,
  empty input, small-range sum, length-arg respected, 1M-element random
  sum matching Java reference within float tolerance.
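The panic-catching wrappers in ffi/ might look roughly like the sketch below. It is an illustration only, not the PR's code: it uses a plain C-ABI function instead of the real JNI signature (no JNIEnv, no GetPrimitiveArrayCritical), and the names guard_kernel, sum_i64_guarded, and FAILURE_SENTINEL are hypothetical. The NaN sentinel follows the fall-through behavior described for kernel failures elsewhere in this PR.

```rust
use std::panic::{self, UnwindSafe};

/// Sentinel returned when the kernel panics. The caller is assumed to treat
/// NaN as "fall back to the Java path" (hypothetical constant name).
pub const FAILURE_SENTINEL: f64 = f64::NAN;

/// Runs `kernel` and converts any panic into the sentinel, so that unwinding
/// never crosses the FFI boundary.
pub fn guard_kernel<F>(kernel: F) -> f64
where
    F: FnOnce() -> f64 + UnwindSafe,
{
    panic::catch_unwind(kernel).unwrap_or(FAILURE_SENTINEL)
}

/// Hypothetical C-ABI entry point standing in for the JNI export.
///
/// # Safety
/// `ptr` must point to `len` readable, aligned `i64` values, or `len` must be 0.
pub unsafe extern "C" fn sum_i64_guarded(ptr: *const i64, len: usize) -> f64 {
    if len == 0 {
        return 0.0;
    }
    if ptr.is_null() {
        return FAILURE_SENTINEL;
    }
    guard_kernel(move || {
        // SAFETY: guaranteed by the caller per the contract above.
        let values = unsafe { std::slice::from_raw_parts(ptr, len) };
        values.iter().map(|&v| v as f64).sum::<f64>()
    })
}
```

Keeping the guard separate from the entry point means every exported kernel can share one panic policy; a real JNI wrapper would additionally pin and release the Java array around the call.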
Required explicit jsr305 + jspecify deps since the root pom's
package-info-maven-plugin generates @ParametersAreNonnullByDefault and
@NullMarked annotations on every package, and other modules pick those
up transitively via Guava — which this module deliberately doesn't pull
in.
Verified: ./mvnw -pl pinot-native test passes end-to-end in 4.3 s.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the single integration point: AggregationFunctionFactory checks
NativeAggregationRouter.shouldAccelerate() at the top of
getAggregationFunction() and, on eligibility, constructs the native
subclass instead of the standard Java AggregationFunction. Because all
aggregation contexts in Pinot (SSE V1, MSE leaf, MSE intermediate,
star-tree, realtime, MV refresh) obtain functions from this factory,
this single fork accelerates all of them.
Eligibility rules (short-circuiting):
1. -Dpinot.native.aggregation.enabled=true
2. PinotNativeAgg.isAvailable()
3. nullHandlingEnabled == false (no native null path yet)
4. function name in {SUM, SUM0} (POC scope; expands in Phase 1.B)
5. single-argument IDENTIFIER expression (no transforms)
NativeSumAggregationFunction extends SumAggregationFunction and
overrides aggregate() to call PinotNativeAgg.sumLong for LONG-typed
single-value columns; falls through to super for all other type /
encoding combinations and for kernel failures (NaN sentinel). Group-by
remains on the Java path for now (Phase 1.D will add it).
NativeSumAggregationFunctionTest exercises:
- factory returns NativeSumAggregationFunction with flag on
- factory returns plain SumAggregationFunction with flag off or when
null handling is enabled
- aggregate() on LONG matches Java reference on 100k random values
- aggregate() falls through to super on INT (out of POC scope)
WIP — test currently fails checkstyle on cosmetic LeftCurly violations
in the StubBlockValSet inner class. Functional path (router + native
function) is implemented; only the test formatting needs cleanup.
See docs/native/phase-1-design.md sections 2, 3, 8 for the engine
landscape and the rationale for routing at the factory layer rather
than the originally-proposed plan-maker fork.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reformats the StubBlockValSet inner class so each method body sits on its own line (Pinot's LeftCurly rule rejects single-line method bodies) and drops two unused imports. Also picks up a trivial whitespace cleanup in PinotNativeAgg.
With this commit, ./mvnw -pl pinot-core -am -Dtest=NativeSumAggregationFunctionTest -Dsurefire.failIfNoSpecifiedTests=false test passes all five cases:
- factoryReturnsNativeImplWhenFlagOnAndEligible
- factoryReturnsJavaImplWhenFlagOff
- factoryReturnsJavaImplWhenNullHandlingEnabled
- aggregateLongMatchesJavaReference (100k random longs)
- aggregateFallsThroughForIntColumn
Phase 1.A POC is now demonstrated end-to-end: a SUM(LONG) query routes through AggregationFunctionFactory → NativeAggregationRouter → NativeSumAggregationFunction → PinotNativeAgg.sumLong → Rust kernel, with identical results to the Java path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Phase 1.A POC kernel was scalar Rust with manual 4-way unrolling,
relying on LLVM auto-vectorization. On Apple Silicon the compiler did
emit NEON instructions for the load and convert, but the reduction was
scalarized (extracting one f64 lane at a time and adding into a scalar
accumulator), and on AVX2 x86 the same source would not vectorize the
i64->f64 conversion at all. That fell short of the "state of the art
kernels" Phase 1 spec.
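For reference, a scalar kernel of the kind described above could look like the following. This is a reconstruction under assumptions, not the code from the PR: the function name sum_i64_to_f64_scalar and the exact accumulator layout are hypothetical. It shows the 4-way unrolling pattern, where four independent accumulators break the floating-point add dependency chain and leave vectorization to the compiler.

```rust
/// Scalar sum of i64 values into an f64, unrolled four ways.
/// Hypothetical reconstruction; the PR's kernel may differ in detail.
pub fn sum_i64_to_f64_scalar(values: &[i64]) -> f64 {
    // Four independent accumulators so consecutive adds do not depend
    // on each other.
    let mut acc = [0.0f64; 4];
    let mut chunks = values.chunks_exact(4);
    for c in &mut chunks {
        acc[0] += c[0] as f64;
        acc[1] += c[1] as f64;
        acc[2] += c[2] as f64;
        acc[3] += c[3] as f64;
    }
    // Handle the 0..=3 trailing elements that did not fill a chunk.
    let mut tail = 0.0f64;
    for &v in chunks.remainder() {
        tail += v as f64;
    }
    // Pairwise collapse of the accumulators, then the tail.
    (acc[0] + acc[1]) + (acc[2] + acc[3]) + tail
}
```

Because the additions are reassociated relative to a simple left-to-right loop, results on large inputs can differ from a naive Java sum in the last bits, which is why the tests compare within a float tolerance.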
Rewrites sum_i64_to_f64 with four explicit backends behind runtime ISA
dispatch:
- AVX-512DQ (x86_64): _mm512_cvtepi64_pd + 4 × 512-bit accumulators
  (8 lanes each = 32-wide ILP)
- AVX2 (x86_64): scalar vcvtsi2sd + _mm256_set_pd packing into
  4 × 256-bit accumulators (16-wide ILP). AVX2 has no vcvtqq2pd; the
  conversion is the bottleneck. Mysticial bit-trick is a future
  optimization.
- NEON (aarch64): vld1q_s64 + vcvtq_f64_s64 + vaddq_f64 with 4 ×
  128-bit accumulators (8-wide ILP). Native i64->f64 vector convert
  exists on ARM.
- scalar (any): existing 4-way unrolled fallback, also serves as the
  reference for property-based equivalence tests.
All non-scalar paths are #[target_feature(...)] unsafe fns called only
after is_x86_feature_detected! / is_aarch64_feature_detected! returns
true. Detection results are cached by std::arch after the first call,
so the per-call dispatch cost is one atomic load.
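The dispatch pattern described above can be sketched as follows. This is a simplified illustration, not the PR's kernel: each backend uses a single accumulator instead of four, the AVX-512DQ path is omitted, and the function names (sum_avx2, sum_neon, sum_scalar) are hypothetical. The intrinsics and feature-detection macros are the standard std::arch ones.

```rust
/// Portable fallback; also usable as the reference in equivalence tests.
fn sum_scalar(values: &[i64]) -> f64 {
    values.iter().map(|&v| v as f64).sum::<f64>()
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn sum_avx2(values: &[i64]) -> f64 {
    use std::arch::x86_64::*;
    let mut acc = _mm256_setzero_pd();
    let mut chunks = values.chunks_exact(4);
    for c in &mut chunks {
        // No packed i64 -> f64 convert on AVX2: convert lanes one by one,
        // then pack them into a 256-bit vector.
        let v = _mm256_set_pd(c[3] as f64, c[2] as f64, c[1] as f64, c[0] as f64);
        acc = _mm256_add_pd(acc, v);
    }
    let mut lanes = [0.0f64; 4];
    _mm256_storeu_pd(lanes.as_mut_ptr(), acc);
    let mut total = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
    for &v in chunks.remainder() {
        total += v as f64;
    }
    total
}

#[cfg(target_arch = "aarch64")]
#[target_feature(enable = "neon")]
unsafe fn sum_neon(values: &[i64]) -> f64 {
    use std::arch::aarch64::*;
    let mut acc = vdupq_n_f64(0.0);
    let mut chunks = values.chunks_exact(2);
    for c in &mut chunks {
        // Load two i64 lanes, convert both to f64 in one instruction, add.
        let v = vld1q_s64(c.as_ptr());
        acc = vaddq_f64(acc, vcvtq_f64_s64(v));
    }
    // Horizontal add of the two lanes.
    let mut total = vaddvq_f64(acc);
    for &v in chunks.remainder() {
        total += v as f64;
    }
    total
}

/// Public entry point: picks a backend at runtime, falling back to scalar.
pub fn sum_i64_to_f64(values: &[i64]) -> f64 {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was just confirmed at runtime.
            return unsafe { sum_avx2(values) };
        }
    }
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            // SAFETY: NEON support was just confirmed at runtime.
            return unsafe { sum_neon(values) };
        }
    }
    sum_scalar(values)
}
```

The `#[cfg(target_arch = ...)]` guards keep the crate compiling on every target, while the runtime check decides whether the `#[target_feature]` function is safe to call on the machine actually running the code.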
Generated assembly verified: the hot loop is now full-vector through
the reduction (fadd.2d across all four NEON accumulators, then faddp.2d
for horizontal collapse), where the previous version reduced into a
scalar register.
Java integration test (NativeSumAggregationFunctionTest, 5 cases
including 100k-element correctness vs Java reference) still passes —
the JNI signature is unchanged and result semantics are preserved
within the documented float tolerance.
Updates docs/native/phase-1-design.md §6.1 and adds a decision log
entry noting this work moves from Phase 1.B back to 1.A.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
WIP - Not yet Ready For Review
Summary
Draft / RFC PR for native (Rust + JNI) acceleration of Pinot's query execution. Design docs are included in this PR and will be migrated to Google Docs for broader community discussion:
- RUST_REWRITE_DESIGN.md
- docs/native/phase-1-design.md
POC scope (Phase 1) - Implement a GROUP BY Aggregation Kernel and push down operations
SUM, COUNT, MIN, MAX, DISTINCT_COUNT on primitive fixed-width types (INT/LONG/FLOAT/DOUBLE), plus single-column group-by with a vectorized SwissTable-style hash table and state-of-the-art SIMD kernels for the aggregations. This PR delivers the first vertical slice (SUM(LONG) end-to-end + plumbing); the remaining surface is sub-phased in docs/native/phase-1-design.md §14.
What's in this PR
- pinot-native module: Cargo workspace + Maven module + JNI bindings; builds the native library during the normal ./mvnw flow when Cargo is on PATH
- sum_i64_to_f64 kernel with explicit SIMD intrinsics and runtime ISA dispatch: NEON (aarch64), AVX2 (x86_64), AVX-512DQ (x86_64), scalar fallback
- NativeAggregationRouter + NativeSumAggregationFunction in pinot-core, plugged into AggregationFunctionFactory at a single point that covers all six aggregation contexts in Pinot (SSE V1, MSE leaf, MSE intermediate, star-tree, realtime, MV refresh)
Status
- Enabled only with -Dpinot.native.aggregation.enabled=true. Default behavior is unchanged; Java path is the fallback and remains authoritative.
- If the native library fails to load, PinotNativeAgg.isAvailable() returns false and the factory routes to the Java path. Builds without Cargo work via -DskipNativeBuild=true.
Testing