Fall back to raw-value REGEXP_LIKE evaluator when no dict-consuming index is available #18599

Open
deepthi912 wants to merge 8 commits into apache:master from deepthi912:deepthi/approach4-filter-plan-node

deepthi912 (Collaborator) commented May 27, 2026

  • When no inverted index, range index, or dictionary encoding is enabled on the column, the query should fall back to RawValueBasedRegexpLikePredicateEvaluator instead of throwing the following exception:
2026/05/20 12:58:09.599 ERROR [BaseCombineOperator] [pqw-6] Caught exception while processing query: QueryContext{_tableName='parquet_datatype_logical_types_OFFLINE', _subquery=null, _selectExpressions=[count(*)], _distinct=false, _aliasList=[null], _filter=regexp_like(col_string,'abc','i'), _groupByExpressions=null, _havingFilter=null, _orderByExpressions=null, _limit=10, _offset=0, _queryOptions={useMultistageEngine=false, serverReturnFinalResult=true, timeoutMs=60000}, _expressionOverrideHints={}, _explain=NONE}
org.apache.pinot.spi.exception.QueryException: Caught exception while doing operator: class org.apache.pinot.core.operator.AcquireReleaseColumnsSegmentOperator on segment 37aa1ac19d2979ca369ed42b42187063: null
	at org.apache.pinot.spi.exception.QueryErrorCode.asException(QueryErrorCode.java:171)
	at org.apache.pinot.core.operator.combine.BaseCombineOperator.wrapOperatorException(BaseCombineOperator.java:307)
	at org.apache.pinot.core.operator.combine.BaseSingleBlockCombineOperator.processSegments(BaseSingleBlockCombineOperator.java:84)
	at org.apache.pinot.core.operator.combine.BaseCombineOperator$1.runJob(BaseCombineOperator.java:218)
	at org.apache.pinot.core.util.trace.TraceRunnable.run(TraceRunnable.java:40)
	at org.apache.pinot.spi.query.QueryThreadContext$1.lambda$decorate$1(QueryThreadContext.java:273)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
	at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:128)
	at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:74)
	at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:80)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1583)
Caused by: java.lang.UnsupportedOperationException
	at org.apache.pinot.core.operator.filter.predicate.BaseDictionaryBasedPredicateEvaluator.applySV(BaseDictionaryBasedPredicateEvaluator.java:133)
	at org.apache.pinot.core.operator.dociditerators.SVScanDocIdIterator$StringMatcher.doesValueMatch(SVScanDocIdIterator.java:308)
	at org.apache.pinot.core.operator.dociditerators.SVScanDocIdIterator$ValueMatcher.matchValues(SVScanDocIdIterator.java:208)
	at org.apache.pinot.core.operator.dociditerators.SVScanDocIdIterator.next(SVScanDocIdIterator.java:86)
	at org.apache.pinot.core.operator.DocIdSetOperator.getNextBlock(DocIdSetOperator.java:76)
	at org.apache.pinot.core.operator.DocIdSetOperator.getNextBlock(DocIdSetOperator.java:40)
	at org.apache.pinot.core.operator.BaseOperator.nextBlock(BaseOperator.java:42)
	at org.apache.pinot.core.operator.ProjectionOperator.getNextBlock(ProjectionOperator.java:87)
	at org.apache.pinot.core.operator.ProjectionOperator.getNextBlock(ProjectionOperator.java:39)
	at org.apache.pinot.core.operator.BaseOperator.nextBlock(BaseOperator.java:42)
	at org.apache.pinot.core.operator.query.AggregationOperator.getNextBlock(AggregationOperator.java:73)
	at org.apache.pinot.core.operator.query.AggregationOperator.getNextBlock(AggregationOperator.java:43)
	at org.apache.pinot.core.operator.BaseOperator.nextBlock(BaseOperator.java:42)
	at org.apache.pinot.core.operator.AcquireReleaseColumnsSegmentOperator.getNextBlock(AcquireReleaseColumnsSegmentOperator.java:74)
	at org.apache.pinot.core.operator.AcquireReleaseColumnsSegmentOperator.getNextBlock(AcquireReleaseColumnsSegmentOperator.java:43)
	at org.apache.pinot.core.operator.BaseOperator.nextBlock(BaseOperator.java:42)
	at org.apache.pinot.core.operator.combine.BaseSingleBlockCombineOperator.processSegments(BaseSingleBlockCombineOperator.java:82)
	... 12 more
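
The `Caused by` frames show a scan iterator handing a raw string to an evaluator that only understands dictionary ids. The sketch below reproduces that mismatch in miniature. All class names are hypothetical, simplified stand-ins rather than Pinot's actual classes, and the use of `find()` for matching is an assumption made for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical stand-ins for illustration only; these are not Pinot's classes.
final class EvaluatorMismatchSketch {
  interface Evaluator {
    boolean applySV(int dictId);

    boolean applySV(String value);
  }

  /** Answers dict-id lookups only, like a dictionary-based evaluator. */
  static final class DictIdOnlyEvaluator implements Evaluator {
    private final boolean[] _matches;

    DictIdOnlyEvaluator(String[] dictionary, Pattern pattern) {
      // Pre-compute the match result for every dictionary entry.
      _matches = new boolean[dictionary.length];
      for (int i = 0; i < dictionary.length; i++) {
        _matches[i] = pattern.matcher(dictionary[i]).find();
      }
    }

    @Override
    public boolean applySV(int dictId) {
      return _matches[dictId];
    }

    @Override
    public boolean applySV(String value) {
      // Raw values are not supported, which is what the scan path trips over.
      throw new UnsupportedOperationException();
    }
  }

  /** Evaluates the regex directly against each raw value. */
  static final class RawValueEvaluator implements Evaluator {
    private final Pattern _pattern;

    RawValueEvaluator(Pattern pattern) {
      _pattern = pattern;
    }

    @Override
    public boolean applySV(int dictId) {
      throw new UnsupportedOperationException();
    }

    @Override
    public boolean applySV(String value) {
      return _pattern.matcher(value).find();
    }
  }

  /** Scan over a RAW forward index: only raw strings are available. */
  static int countMatches(String[] rawValues, Evaluator evaluator) {
    int count = 0;
    for (String value : rawValues) {
      if (evaluator.applySV(value)) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    Pattern pattern = Pattern.compile("abc", Pattern.CASE_INSENSITIVE);
    String[] rawValues = {"ABC", "xyz", "xabcx"};
    System.out.println(countMatches(rawValues, new RawValueEvaluator(pattern)));
    try {
      countMatches(rawValues, new DictIdOnlyEvaluator(rawValues, pattern));
    } catch (UnsupportedOperationException e) {
      System.out.println("dict-id evaluator cannot scan raw values");
    }
  }
}
```

In this toy version the raw-value evaluator completes the scan, while the dict-id-only evaluator fails on the first value, mirroring the exception above.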

deepthi912 and others added 7 commits May 27, 2026 12:51
…ndex is available

When FST/IFST exists but the column has no sorted/inverted index that can consume a
dict-id-based predicate evaluator, FilterPlanNode previously built the FST/IFST
evaluator unconditionally. With a RAW forward index, FilterOperatorUtils then fell
through to ScanBasedFilterOperator, which calls applySV(String) on the dict-id
evaluator — that throws UnsupportedOperationException
(BaseDictionaryBasedPredicateEvaluator), crashing queries such as
`regexp_like(col, 'pat', 'i')` and `LIKE 'pat'` on external/iceberg-backed tables
with `encodingType: RAW` + `dictionary: {}` + `ifst: { enabled: true }`.

Add canConsumeDictIdEvaluator() — only construct the FST/IFST dict-id evaluator
when a sorted or inverted index is available for this data source (matching the
operator-routing logic in FilterOperatorUtils#getLeafFilterOperator). Otherwise
fall through to PredicateEvaluatorProvider, which returns
RawValueBasedRegexpLikePredicateEvaluator; that evaluator already implements
applySV(String) correctly. No changes to base classes or scan iterator.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
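
A minimal sketch of the guard this commit message describes, not the code in the PR. `ColumnIndexes`, `EvaluatorKind`, and `select` are hypothetical names, and pairing IFST with case-insensitive patterns and FST with case-sensitive ones is an assumption. Only `canConsumeDictIdEvaluator` takes its name from the message.

```java
// Hypothetical model of the evaluator choice; not the actual FilterPlanNode code.
final class RegexEvaluatorSelectionSketch {
  enum EvaluatorKind { FST_DICT_ID, IFST_DICT_ID, PROVIDER_DEFAULT }

  /** Hypothetical summary of which indexes exist on a column. */
  static final class ColumnIndexes {
    final boolean _hasFst;
    final boolean _hasIfst;
    final boolean _hasSorted;
    final boolean _hasInverted;

    ColumnIndexes(boolean hasFst, boolean hasIfst, boolean hasSorted, boolean hasInverted) {
      _hasFst = hasFst;
      _hasIfst = hasIfst;
      _hasSorted = hasSorted;
      _hasInverted = hasInverted;
    }
  }

  /** A dict-id evaluator is only useful if some index can consume dict ids. */
  static boolean canConsumeDictIdEvaluator(ColumnIndexes indexes) {
    return indexes._hasSorted || indexes._hasInverted;
  }

  static EvaluatorKind select(ColumnIndexes indexes, boolean caseInsensitive) {
    if (canConsumeDictIdEvaluator(indexes)) {
      if (caseInsensitive && indexes._hasIfst) {
        return EvaluatorKind.IFST_DICT_ID;
      }
      if (!caseInsensitive && indexes._hasFst) {
        return EvaluatorKind.FST_DICT_ID;
      }
    }
    // Otherwise defer to the provider, which per the commit message returns
    // the raw-value evaluator for the RAW-forward-index case.
    return EvaluatorKind.PROVIDER_DEFAULT;
  }

  public static void main(String[] args) {
    // IFST present, but nothing that can consume dict ids: the crashing case.
    ColumnIndexes ifstOnly = new ColumnIndexes(false, true, false, false);
    System.out.println(select(ifstOnly, true));
  }
}
```

The point of the guard is that the decision depends on what the downstream operator can consume, not merely on whether an FST/IFST index exists.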
… selection

- FilterPlanNode: hoist getDictionaryUsableForFiltering call into a local
  `dictUsable` variable so both case-insensitive/case-sensitive branches stay
  under the 120-char line limit (and the dictionary check runs once instead of
  twice per predicate).

- FilterPlanNodeTest: add 5 tests covering the regex evaluator-selection logic:
  - IFST + dict + inverted (RAW forward) → dict-id evaluator (IFST)
  - IFST + dict + no inverted (RAW forward) → raw-value evaluator (the bug)
  - FST + dict + inverted (RAW forward) → dict-id evaluator (FST)
  - FST + dict + no inverted (RAW forward) → raw-value evaluator
  - IFST + dict + dict-encoded forward → dict-id evaluator (scan w/ DictIdMatcher)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mockTextIndexReader() internally calls Mockito.when(...).thenReturn(...). Invoking
it as an argument inside an outer Mockito.when(...).thenReturn(...) chain confuses
Mockito's pending-stubbing tracker and surfaces as UnfinishedStubbing failures on
all 5 new tests. Build the inner mocks into locals first, then pass to the outer
when().thenReturn() calls.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
deepthi912 added the labels "index" (Related to indexing (general)) and "text-search" (Related to text/Lucene indexing and search) on May 27, 2026
Consolidates all REGEXP_LIKE evaluator-selection logic in PredicateEvaluatorProvider
so FilterPlanNode just calls the standard getPredicateEvaluator(predicate, dataSource,
queryContext). The dict-based switch's REGEXP_LIKE case prefers the FST/IFST text
index when present on the data source, otherwise falls back to the existing
RegexpLikePredicateEvaluatorFactory.newDictionaryBasedEvaluator. No evaluator is
built and discarded — the upgrade decision happens before any construction.

- buildEvaluator gains a @Nullable DataSource parameter; the Dictionary-based public
  overload passes null (no DataSource to read text indexes from).
- FilterPlanNode REGEXP_LIKE case collapses from 25 lines to 5 and drops three
  imports (RegexpLikePredicate, FSTBasedRegexpPredicateEvaluatorFactory,
  IFSTBasedRegexpPredicateEvaluatorFactory).
- getDictionaryUsableForFiltering reverts to package-private — no external caller.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
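
The overload shape described in this commit message could look roughly like the sketch below. It is an illustration under assumptions, not the PR's code: the `Dictionary` and `DataSource` interfaces, the `hasFstIndex`/`hasIfstIndex` accessors, and the `Kind` enum are hypothetical, and an enum stands in for the evaluator objects so the choice is visible without constructing anything.

```java
// Hypothetical shape of a provider with a nullable data-source parameter.
final class ProviderOverloadSketch {
  enum Kind { FST, IFST, DICTIONARY }

  interface Dictionary { }

  /** Hypothetical view of a data source; not Pinot's DataSource interface. */
  interface DataSource {
    Dictionary getDictionary();

    boolean hasFstIndex();

    boolean hasIfstIndex();
  }

  /** Decides which evaluator to build before building it; dataSource may be null. */
  static Kind buildEvaluator(boolean caseInsensitive, Dictionary dictionary, DataSource dataSource) {
    if (dataSource != null) {
      if (caseInsensitive && dataSource.hasIfstIndex()) {
        return Kind.IFST;
      }
      if (!caseInsensitive && dataSource.hasFstIndex()) {
        return Kind.FST;
      }
    }
    // No data source, or no matching text index: plain dictionary-based evaluator.
    return Kind.DICTIONARY;
  }

  /** Dictionary-only overload: there is no data source to read text indexes from. */
  static Kind getPredicateEvaluator(boolean caseInsensitive, Dictionary dictionary) {
    return buildEvaluator(caseInsensitive, dictionary, null);
  }

  /** Data-source overload: text indexes are visible to the decision. */
  static Kind getPredicateEvaluator(boolean caseInsensitive, DataSource dataSource) {
    return buildEvaluator(caseInsensitive, dataSource.getDictionary(), dataSource);
  }

  /** Small helper for building a hypothetical data source in examples. */
  static DataSource dataSource(boolean hasFst, boolean hasIfst) {
    Dictionary dictionary = new Dictionary() { };
    return new DataSource() {
      @Override
      public Dictionary getDictionary() {
        return dictionary;
      }

      @Override
      public boolean hasFstIndex() {
        return hasFst;
      }

      @Override
      public boolean hasIfstIndex() {
        return hasIfst;
      }
    };
  }

  public static void main(String[] args) {
    System.out.println(getPredicateEvaluator(true, dataSource(false, true)));
    System.out.println(getPredicateEvaluator(true, new Dictionary() { }));
  }
}
```

Keeping the choice in one method means callers such as a plan node only need the standard entry point, which matches the stated goal of collapsing the REGEXP_LIKE case in FilterPlanNode.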
codecov-commenter commented May 27, 2026

Codecov Report

❌ Patch coverage is 0% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 36.81%. Comparing base (baccdcc) to head (0b30fd3).
⚠️ Report is 6 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...r/filter/predicate/PredicateEvaluatorProvider.java | 0.00% | 11 Missing ⚠️ |
| ...ava/org/apache/pinot/core/plan/FilterPlanNode.java | 0.00% | 2 Missing ⚠️ |

❗ There is a different number of reports uploaded between BASE (baccdcc) and HEAD (0b30fd3).

HEAD has 4 uploads less than BASE

| Flag | BASE (baccdcc) | HEAD (0b30fd3) |
|---|---|---|
| java-21 | 5 | 4 |
| unittests1 | 1 | 0 |
| unittests | 2 | 1 |
| temurin | 5 | 4 |

Additional details and impacted files
@@              Coverage Diff              @@
##             master   #18599       +/-   ##
=============================================
- Coverage     64.28%   36.81%   -27.47%     
+ Complexity     1137     1136        -1     
=============================================
  Files          3335     3335               
  Lines        205898   206038      +140     
  Branches      32129    32142       +13     
=============================================
- Hits         132355    75859    -56496     
- Misses        62894   123315    +60421     
+ Partials      10649     6864     -3785     

| Flag | Coverage Δ |
|---|---|
| custom-integration1 | 100.00% <ø> (ø) |
| integration | 100.00% <ø> (ø) |
| integration1 | 100.00% <ø> (ø) |
| integration2 | 0.00% <ø> (ø) |
| java-21 | 36.81% <0.00%> (-27.47%) ⬇️ |
| temurin | 36.81% <0.00%> (-27.47%) ⬇️ |
| unittests | 36.81% <0.00%> (-27.47%) ⬇️ |
| unittests1 | ? |
| unittests2 | 36.81% <0.00%> (-0.03%) ⬇️ |
