-
Notifications
You must be signed in to change notification settings - Fork 3.7k
[fix](parquet)fix parquet reader cannot push down conjuncts for min-max filter #60197
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
3 similar comments
|
run buildall |
|
run buildall |
|
run buildall |
|
⏺ Code Review: PR #60197 - Fix Parquet Reader Cannot Push Down Conjuncts for Min-Max Filter Overview This PR refactors the predicate pushdown mechanism for Parquet files, specifically addressing how min-max filters and runtime predicates are handled. The main changes include:
Analysis Strengths
Critical Issues
Severity: High In multiple files, when get_stat_func returns false (indicating no page index stats), the code now adds row_group_range to include all rows: Files affected: comparison_predicate.h:229, in_list_predicate.h:313, null_predicate.h:99, shared_predicate.h:184 Issue: The original behavior when stats are unavailable was to return true (conservative - read everything). Now it explicitly adds the row group range. While semantically similar, this changes behavior and could have performance implications if the calling code doesn't expect row ranges to be modified when stats are missing. Recommendation: Verify this behavioral change is intentional and document why row ranges should be explicitly added when stats are unavailable.
Severity: Medium Lines 527-533 in tablet_reader.cpp removed the registration of TopN filter tablet schemas: Issue: This code was called for internal tablet reads (OLAP engine). The removal suggests TopN filters for internal tables might be broken, or this was dead code. The PR description doesn't explain this change. Recommendation: Clarify whether this affects internal table TopN filtering. If it does, add test coverage.
Severity: Medium In accept_null_predicate.h:110-121, the new evaluate_and implementation adds rows where has_null[page_id] is true: Issue: This only adds pages with nulls but doesn't call the nested predicate's evaluate_and. The logic should be:
Recommendation: The implementation appears incomplete. It should combine results from the nested predicate with null page ranges.
Severity: High vparquet_reader.cpp removed significant logic including:
Issue: The PR claims to "fix parquet reader cannot push down conjuncts" but actually removes pushdown logic for:
Recommendation: This appears contradictory to the PR title. Need clarification on whether this is: Minor Issues
Missing Elements
Recommendations Must Fix Before Merge
Should Consider
Risk Assessment Overall Risk: High
Recommendation: Request changes - This PR needs substantial clarification, testing, and potentially rework before merge. |
TPC-H: Total hot run time: 30987 ms |
TPC-DS: Total hot run time: 172086 ms |
ClickBench: Total hot run time: 26.69 s |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
e0bfef4 to
ccb2128
Compare
|
run buildall |
TPC-H: Total hot run time: 30500 ms |
TPC-DS: Total hot run time: 171543 ms |
ClickBench: Total hot run time: 26.55 s |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
What problem does this PR solve?
Problem Summary:
Fixed an issue where the Parquet reader's pushdown predicate row group min-max filter was ineffective due to a refactoring of the column predicate [refactor](predicate) Normalize predicates generation #59187.
Removed tablet_schema from RuntimePredicate when performing topn runtime filters, making RuntimePredicate not limited to olap scan node.
Removed the feature in [Enhancement](parquet)update runtime filter when read next parquet row group. #59053 where the Parquet reader could obtain runtime filters in real time, for the purpose of unifying the implementation of olap and file scan node logic in the future.
Temporarily disable the topn runtime filter for VARBINARY type, as the implementation of this type in column predicate is incomplete, affecting pr [improve](varbinary) support varbinary type with topn runtime filter #58721.
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)