Skip to content

Refine index-based DISTINCT operators (JSON / inverted)#18588

Open
Jackie-Jiang wants to merge 1 commit into
apache:masterfrom
Jackie-Jiang:index_distinct_operator
Open

Refine index-based DISTINCT operators (JSON / inverted)#18588
Jackie-Jiang wants to merge 1 commit into
apache:masterfrom
Jackie-Jiang:index_distinct_operator

Conversation

@Jackie-Jiang
Copy link
Copy Markdown
Contributor

Summary

Refines the two index-based DISTINCT operators added in #17872 / #17820 and rewrites their tests to drive the full broker → operator path.

JsonIndexDistinctOperator

  • Argument validation moves into the constructor, mirroring JsonExtractIndexTransformFunction.init. The operator accepts 3-/4-/5-arg jsonExtractIndex calls (path, type, optional default, optional JSON_MATCH filter expression) and MV _ARRAY types (INT_ARRAY, STRING_ARRAY, etc.).
  • canUseJsonIndexDistinct is simplified to a function-name check; the planner routes every jsonExtractIndex call through the operator and lets the constructor surface invalid arguments as IllegalArgumentException.
  • The runtime path intersects per-value doc ids from the JSON index with the WHERE-clause filter through a single remainingDocs bitmap, so cross-path JSON_MATCH, base-column filters, and IS NULL filters all use the same code path.
  • numDocsScanned is populated from the filtered doc set (or total docs when the filter matches everything), so execution statistics line up with the scan/projection path.
  • New query option jsonIndexDistinctSkipMissingPath: when set, the operator skips parsing the 4-arg default, skips remainingDocs tracking, and suppresses the "Illegal Json Path" throw for the 3-arg form. Useful when the caller knows every doc has the path (or doesn't care about misses).

InvertedIndexDistinctOperator

  • Caches _totalDocs in the constructor instead of recomputing per call.
  • DESC-sorted path short-circuits with intersects (boolean) rather than getLongCardinality, which is orders of magnitude cheaper on dense bitmaps.
  • Drops redundant advanceIfNeeded(startDocId) on the ASC sorted path and the redundant inner filterIter.hasNext() check.
  • Reports a correct numDocsScanned on the sorted / inverted paths (previously zero).
  • Inlines the FilterPreparation helper and renames _numEntriesExamined_numEntriesExaminedPostFilter so the stats name matches its meaning.

Tests

  • Both operators get a new queries-based test (JsonIndexDistinctOperatorQueriesTest, InvertedIndexDistinctOperatorQueriesTest) that drives SELECT DISTINCT through BaseQueriesTest, asserts on result tables, explain strings, and execution statistics (numDocsScanned, numEntriesScannedInFilter, numEntriesScannedPostFilter, numTotalDocs).
  • The older mock-based unit tests are removed — the queries tests cover the same behaviors against real segments.
  • All OPTION(...) syntax in the suites is converted to standard SET a=b; prefixes; repeated query strings are extracted into shared constants.

@Jackie-Jiang Jackie-Jiang added enhancement Improvement to existing functionality query Related to query processing labels May 26, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refines Pinot’s opt-in index-based DISTINCT execution paths for JSON and inverted/sorted indexes, adds a JSON missing-path query option, and replaces lower-level mock tests with query-driven coverage.

Changes:

  • Moves JSON DISTINCT validation/execution into JsonIndexDistinctOperator, adds 5-arg and skip-missing-path handling.
  • Refines inverted/sorted DISTINCT path selection, execution stats, and bitmap/range checks.
  • Reworks tests around broker/operator query execution and removes older mock-based unit tests.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
pinot-spi/src/main/java/org/apache/pinot/spi/utils/CommonConstants.java Adds the jsonIndexDistinctSkipMissingPath query option key.
pinot-common/src/main/java/org/apache/pinot/common/utils/config/QueryOptionsUtils.java Adds parsing helper for the new JSON DISTINCT skip option.
pinot-core/src/main/java/org/apache/pinot/core/plan/DistinctPlanNode.java Routes jsonExtractIndex DISTINCT expressions to the JSON index operator by function name.
pinot-core/src/main/java/org/apache/pinot/core/operator/transform/function/JsonExtractIndexTransformFunction.java Refactors JSON index transform argument handling and filter expression naming.
pinot-core/src/main/java/org/apache/pinot/core/operator/query/JsonIndexDistinctOperator.java Reworks JSON index DISTINCT validation, filtering, missing-path handling, MV type acceptance, and statistics.
pinot-core/src/main/java/org/apache/pinot/core/operator/query/InvertedIndexDistinctOperator.java Refines inverted/sorted index DISTINCT paths and statistics reporting.
pinot-core/src/test/java/org/apache/pinot/queries/JsonIndexDistinctOperatorQueriesTest.java Adds query-path coverage for JSON index DISTINCT behavior.
pinot-core/src/test/java/org/apache/pinot/queries/InvertedIndexDistinctOperatorQueriesTest.java Updates inverted index DISTINCT query tests and statistics assertions.
pinot-core/src/test/java/org/apache/pinot/core/operator/query/JsonIndexDistinctOperatorTest.java Removes older mock-based JSON operator unit tests.
pinot-core/src/test/java/org/apache/pinot/core/operator/query/InvertedIndexDistinctOperatorUnitTest.java Removes older mock-based inverted operator unit tests.

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 26, 2026

Codecov Report

❌ Patch coverage is 52.46637% with 106 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.28%. Comparing base (baccdcc) to head (8351441).

Files with missing lines Patch % Lines
...core/operator/query/JsonIndexDistinctOperator.java 44.44% 74 Missing and 21 partials ⚠️
.../operator/query/InvertedIndexDistinctOperator.java 81.08% 2 Missing and 5 partials ⚠️
...rm/function/JsonExtractIndexTransformFunction.java 66.66% 2 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             master   #18588    +/-   ##
==========================================
  Coverage     64.28%   64.28%            
  Complexity     1137     1137            
==========================================
  Files          3335     3335            
  Lines        205898   205787   -111     
  Branches      32129    32100    -29     
==========================================
- Hits         132355   132286    -69     
+ Misses        62894    62887     -7     
+ Partials      10649    10614    -35     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-21 64.28% <52.46%> (+<0.01%) ⬆️
temurin 64.28% <52.46%> (+<0.01%) ⬆️
unittests 64.28% <52.46%> (+<0.01%) ⬆️
unittests1 56.78% <52.46%> (+0.01%) ⬆️
unittests2 36.84% <0.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

- JsonIndexDistinctOperator: validate 3/4/5-arg jsonExtractIndex in the
  constructor (mirroring JsonExtractIndexTransformFunction.init), accept
  MV `_ARRAY` types, intersect per-value doc ids with the WHERE-clause
  filter through a unified `remainingDocs` bitmap, and surface
  numDocsScanned in execution statistics.
- New `jsonIndexDistinctSkipMissingPath` query option: when true, the
  operator skips parsing the 4-arg default, skips `remainingDocs`
  tracking, and never throws "Illegal Json Path".
- `canUseJsonIndexDistinct` simplified to a function-name check; planner
  routes any `jsonExtractIndex` call through the operator and lets the
  constructor validate arguments.
- InvertedIndexDistinctOperator: cache `_totalDocs`, short-circuit the
  DESC sorted path with `intersects` instead of `getLongCardinality`,
  drop redundant `advanceIfNeeded` / inner `hasNext`, and report a
  correct numDocsScanned for sorted / inverted paths.
- Tests rewritten as queries-based suites
  (`JsonIndexDistinctOperatorQueriesTest`,
  `InvertedIndexDistinctOperatorQueriesTest`) that drive the full
  broker -> operator path and assert on execution statistics; the older
  mock-based unit tests are removed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Jackie-Jiang Jackie-Jiang force-pushed the index_distinct_operator branch from dea8cbd to 8351441 Compare May 26, 2026 23:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement Improvement to existing functionality query Related to query processing

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants