Skip to content

Conversation

@hubgeter
Copy link
Contributor

@hubgeter hubgeter commented Jan 23, 2026

What problem does this PR solve?

Problem Summary:

  1. Fixed an issue where the Parquet reader's pushdown predicate row group min-max filter was ineffective due to a refactoring of the column predicate [refactor](predicate) Normalize predicates generation #59187.

  2. Removed tablet_schema from RuntimePredicate when performing topn runtime filters, making RuntimePredicate not limited to olap scan node.

  3. Removed the feature in [Enhancement](parquet)update runtime filter when read next parquet row group. #59053 where the Parquet reader could obtain runtime filters in real time, for the purpose of unifying the implementation of olap and file scan node logic in the future.

  4. Temporarily disable the topn runtime filter for VARBINARY type, as the implementation of this type in column predicate is incomplete, affecting pr [improve](varbinary) support varbinary type with topn runtime filter #58721.

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Copy link
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@hubgeter
Copy link
Contributor Author

run buildall

3 similar comments
@hubgeter
Copy link
Contributor Author

run buildall

@hubgeter
Copy link
Contributor Author

run buildall

@hubgeter
Copy link
Contributor Author

run buildall

@morningman
Copy link
Contributor

⏺ Code Review: PR #60197 - Fix Parquet Reader Cannot Push Down Conjuncts for Min-Max Filter

Overview

This PR refactors the predicate pushdown mechanism for Parquet files, specifically addressing how min-max filters and runtime predicates are handled. The main changes include:

  1. Simplified predicate handling: Removes the complex late runtime filter callback mechanism from ParquetReader
  2. Enhanced predicate evaluation: Adds missing evaluate_and methods for Parquet-specific column statistics in multiple predicate classes
  3. Runtime predicate refactoring: Removes tablet_schema dependency from RuntimePredicate, replacing it with direct DataType storage
  4. Improved filter handling: Adds proper handling for null statistics and row group ranges

Analysis

Strengths

  1. Architectural improvement: Removing the late RF callback mechanism (_call_late_rf_func) simplifies the code flow and reduces complexity
  2. Consistent null handling: Properly handles is_all_null cases in comparison and in-list predicates (comparison_predicate.h:201, in_list_predicate.h:285)
  3. Better abstraction: Using DataTypePtr instead of TabletSchema in RuntimePredicate is cleaner and more appropriate for external tables

Critical Issues

  1. Missing Row Group Range Addition

Severity: High

In multiple files, when get_stat_func returns false (indicating no page index stats), the code now adds row_group_range to include all rows:

  if (!(statistic->get_stat_func)(&stat, column_id())) {
      row_ranges->add(statistic->row_group_range);
      return true;
  }

Files affected: comparison_predicate.h:229, in_list_predicate.h:313, null_predicate.h:99, shared_predicate.h:184

Issue: The original behavior when stats are unavailable was to return true (conservative - read everything). Now it explicitly adds the row group range. While semantically similar, this changes behavior and could have performance implications if the calling code doesn't expect row ranges to be modified when stats are missing.

Recommendation: Verify this behavioral change is intentional and document why row ranges should be explicitly added when stats are unavailable.

  1. Removed Code in tablet_reader.cpp

Severity: Medium

Lines 527-533 in tablet_reader.cpp removed the registration of TopN filter tablet schemas:

  -    for (int id : read_params.topn_filter_source_node_ids) {
  -        auto& runtime_predicate =
  -                read_params.runtime_state->get_query_ctx()->get_runtime_predicate(id);
  -        RETURN_IF_ERROR(runtime_predicate.set_tablet_schema(read_params.topn_filter_target_node_id,
  -                                                            _tablet_schema));
  -    }

Issue: This code was called for internal tablet reads (OLAP engine). The removal suggests TopN filters for internal tables might be broken, or this was dead code. The PR description doesn't explain this change.

Recommendation: Clarify whether this affects internal table TopN filtering. If it does, add test coverage.

  1. AcceptNullPredicate Row Range Logic

Severity: Medium

In accept_null_predicate.h:110-121, the new evaluate_and implementation adds rows where has_null[page_id] is true:

  for (int page_id = 0; page_id < stat->num_of_pages; page_id++) {
      if (stat->has_null[page_id]) {
          row_ranges->add(stat->ranges[page_id]);
      }
  }

Issue: This only adds pages with nulls but doesn't call the nested predicate's evaluate_and. The logic should be:

  • Add pages that match the nested predicate OR have nulls
  • Currently, it only adds null pages regardless of nested predicate result

Recommendation: The implementation appears incomplete. It should combine results from the nested predicate with null page ranges.

  1. Removed Lazy Read Context Update

Severity: High

vparquet_reader.cpp removed significant logic including:

  • Runtime filter pushdown (lines 450-478)
  • Late arrival RF handling (lines 714-718)
  • The entire _update_lazy_read_ctx method refactoring

Issue: The PR claims to "fix parquet reader cannot push down conjuncts" but actually removes pushdown logic for:

  • Binary predicates (GE, LE) from runtime filters
  • IN predicates from runtime filters
  • Late-arriving runtime filters

Recommendation: This appears contradictory to the PR title. Need clarification on whether this is:
a) Moving pushdown logic elsewhere (where?)
b) Intentionally removing broken logic (why?)
c) A regression

Minor Issues

  1. Inconsistent Error Handling: comparison_predicate.h:198 and in_list_predicate.h:282 now check is_all_null before parsing min/max, but this check wasn't in the original code. Verify this is correct for all Parquet files.
  2. Missing Documentation: The PR description is incomplete - no problem summary, no release note justification, no test plan.
  3. Unused Code Path: file_scan_operator.h:69-73 adds _push_down_topn that always returns true, with a comment saying the reader will determine if it can be pushed. But the reader logic was just removed.

Missing Elements

  1. No Tests: PR checklist shows no test coverage mentioned
  2. No Problem Statement: The PR description is empty
  3. No Verification: How was this tested? What queries benefit?
  4. Breaking Changes: Potential behavior changes for internal tables (TopN filters) not addressed

Recommendations

Must Fix Before Merge

  1. Add comprehensive test coverage including:
    - Parquet files with/without page indexes
    - Null handling in various predicate types
    - TopN filter behavior for both internal and external tables
    - Runtime filter pushdown scenarios
  2. Fill out PR description explaining:
    - What bug this fixes (with example)
    - Why removing pushdown logic is correct
    - Impact on existing functionality
  3. Fix AcceptNullPredicate logic - the row range addition appears incorrect
  4. Verify tablet_reader.cpp changes - ensure TopN filters still work for internal tables

Should Consider

  1. Add comments explaining why row group ranges are added when stats are unavailable
  2. Document the behavior change from callback-based to upfront predicate evaluation
  3. Consider whether _push_down_topn returning true unconditionally is safe

Risk Assessment

Overall Risk: High

  • Significant removal of existing functionality (RF pushdown, late arrival handling)
  • Potential regression in query performance for Parquet files
  • Possible breaking change for internal table TopN filters
  • No test coverage to validate correctness
  • PR description provides no context for reviewing safety

Recommendation: Request changes - This PR needs substantial clarification, testing, and potentially rework before merge.

@doris-robot
Copy link

TPC-H: Total hot run time: 30987 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit e0bfef4f7abd7780a1f3576d4e9f66c52849c90c, data reload: false

------ Round 1 ----------------------------------
q1	17692	4885	4623	4623
q2	2028	325	201	201
q3	10219	1330	717	717
q4	10206	831	313	313
q5	7521	2106	1876	1876
q6	200	178	148	148
q7	876	726	599	599
q8	9302	1400	1101	1101
q9	5328	4707	4606	4606
q10	6822	1648	1324	1324
q11	521	302	277	277
q12	344	382	233	233
q13	17774	3859	3083	3083
q14	227	228	210	210
q15	604	521	526	521
q16	657	628	599	599
q17	645	815	528	528
q18	6949	6507	6409	6409
q19	1414	1001	614	614
q20	405	349	231	231
q21	2681	2137	1786	1786
q22	1028	1009	988	988
Total cold run time: 103443 ms
Total hot run time: 30987 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4855	4809	4740	4740
q2	328	387	320	320
q3	2171	2648	2254	2254
q4	1307	1797	1318	1318
q5	4070	4028	3989	3989
q6	217	176	136	136
q7	1929	1876	2105	1876
q8	2630	2481	2469	2469
q9	7179	7094	7164	7094
q10	2492	2736	2348	2348
q11	584	490	458	458
q12	709	757	635	635
q13	3669	4658	3589	3589
q14	312	322	280	280
q15	581	513	506	506
q16	665	721	651	651
q17	1234	1323	1461	1323
q18	8118	7983	8113	7983
q19	858	857	869	857
q20	2060	2079	1907	1907
q21	4736	4198	4043	4043
q22	1086	1031	967	967
Total cold run time: 51790 ms
Total hot run time: 49743 ms

@doris-robot
Copy link

TPC-DS: Total hot run time: 172086 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit e0bfef4f7abd7780a1f3576d4e9f66c52849c90c, data reload: false

query5	4387	637	488	488
query6	329	212	198	198
query7	4239	445	249	249
query8	329	247	222	222
query9	8706	2844	2826	2826
query10	450	304	277	277
query11	15151	15192	14803	14803
query12	165	116	109	109
query13	1239	455	380	380
query14	5588	3169	2743	2743
query14_1	2680	2648	2626	2626
query15	204	191	176	176
query16	960	498	408	408
query17	908	666	574	574
query18	2425	435	333	333
query19	199	187	155	155
query20	119	118	117	117
query21	222	134	115	115
query22	4027	4238	4079	4079
query23	16158	15701	15360	15360
query23_1	15416	15462	15556	15462
query24	7142	1583	1195	1195
query24_1	1171	1170	1163	1163
query25	552	463	408	408
query26	1246	271	153	153
query27	2761	449	277	277
query28	4551	2147	2144	2144
query29	792	553	452	452
query30	308	241	219	219
query31	803	624	546	546
query32	91	74	72	72
query33	530	378	311	311
query34	883	874	522	522
query35	729	762	668	668
query36	880	909	865	865
query37	139	99	90	90
query38	2682	2667	2559	2559
query39	774	746	721	721
query39_1	723	724	729	724
query40	221	136	118	118
query41	87	67	68	67
query42	95	98	94	94
query43	417	479	413	413
query44	1314	744	755	744
query45	193	187	179	179
query46	844	941	616	616
query47	1392	1528	1398	1398
query48	312	320	240	240
query49	602	427	345	345
query50	668	263	203	203
query51	3759	3794	3714	3714
query52	91	86	80	80
query53	204	219	174	174
query54	272	251	238	238
query55	80	78	73	73
query56	301	293	320	293
query57	1059	1011	955	955
query58	270	258	255	255
query59	2013	2038	2124	2038
query60	327	328	303	303
query61	151	142	140	140
query62	387	360	311	311
query63	190	164	161	161
query64	4946	1163	822	822
query65	3833	3708	3753	3708
query66	1448	416	312	312
query67	15425	15481	15334	15334
query68	2421	1058	716	716
query69	384	306	271	271
query70	1003	957	949	949
query71	303	289	273	273
query72	5244	3124	3206	3124
query73	599	721	321	321
query74	8677	8696	8459	8459
query75	2273	2319	1888	1888
query76	2264	1058	657	657
query77	344	362	293	293
query78	9732	9976	9096	9096
query79	1066	884	567	567
query80	1273	515	422	422
query81	539	262	228	228
query82	1001	146	117	117
query83	334	257	238	238
query84	293	118	102	102
query85	866	504	404	404
query86	408	296	328	296
query87	2817	2840	2763	2763
query88	3549	2601	2573	2573
query89	291	260	238	238
query90	1999	171	165	165
query91	166	155	128	128
query92	74	71	71	71
query93	1130	1029	641	641
query94	641	310	298	298
query95	595	326	376	326
query96	639	500	228	228
query97	2324	2391	2294	2294
query98	220	198	195	195
query99	620	576	497	497
Total cold run time: 244448 ms
Total hot run time: 172086 ms

@doris-robot
Copy link

ClickBench: Total hot run time: 26.69 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit e0bfef4f7abd7780a1f3576d4e9f66c52849c90c, data reload: false

query1	0.06	0.05	0.05
query2	0.10	0.05	0.05
query3	0.26	0.09	0.09
query4	1.61	0.12	0.11
query5	0.27	0.26	0.25
query6	1.16	0.66	0.65
query7	0.03	0.03	0.03
query8	0.05	0.04	0.04
query9	0.57	0.51	0.49
query10	0.54	0.55	0.55
query11	0.15	0.10	0.09
query12	0.14	0.11	0.11
query13	0.59	0.62	0.58
query14	0.94	0.94	0.95
query15	0.79	0.78	0.78
query16	0.42	0.39	0.41
query17	1.07	1.03	1.04
query18	0.23	0.21	0.22
query19	1.89	1.90	1.80
query20	0.02	0.01	0.02
query21	15.40	0.25	0.15
query22	5.24	0.05	0.04
query23	15.89	0.30	0.10
query24	1.34	0.25	0.66
query25	0.08	0.10	0.06
query26	0.14	0.13	0.12
query27	0.12	0.04	0.06
query28	4.46	1.08	0.89
query29	12.50	3.94	3.14
query30	0.27	0.14	0.11
query31	2.83	0.64	0.40
query32	3.24	0.56	0.46
query33	3.00	3.00	3.01
query34	16.21	5.02	4.42
query35	4.44	4.52	4.43
query36	0.66	0.48	0.49
query37	0.12	0.07	0.06
query38	0.08	0.04	0.04
query39	0.04	0.03	0.02
query40	0.17	0.15	0.15
query41	0.09	0.04	0.03
query42	0.04	0.03	0.02
query43	0.04	0.03	0.04
Total cold run time: 97.29 s
Total hot run time: 26.69 s

@doris-robot
Copy link

BE UT Coverage Report

Increment line coverage 21.13% (15/71) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.45% (19114/36441)
Line Coverage 35.82% (177563/495648)
Region Coverage 32.28% (137152/424930)
Branch Coverage 33.24% (59420/178780)

@hello-stephen
Copy link
Contributor

BE Regression && UT Coverage Report

Increment line coverage 87.32% (62/71) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.09% (26109/35720)
Line Coverage 56.18% (278140/495116)
Region Coverage 54.01% (231931/429443)
Branch Coverage 55.64% (99902/179551)

@hello-stephen
Copy link
Contributor

BE Regression && UT Coverage Report

Increment line coverage 87.32% (62/71) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.08% (26104/35720)
Line Coverage 56.15% (277994/495116)
Region Coverage 53.96% (231720/429443)
Branch Coverage 55.59% (99814/179551)

@hubgeter hubgeter force-pushed the fix_parquet_push_down branch from e0bfef4 to ccb2128 Compare January 24, 2026 11:20
@hubgeter
Copy link
Contributor Author

run buildall

@doris-robot
Copy link

TPC-H: Total hot run time: 30500 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit ccb2128cf3202070351b4f493c33dcd3ba4b7341, data reload: false

------ Round 1 ----------------------------------
q1	17626	4791	4569	4569
q2	2031	304	193	193
q3	10244	1289	717	717
q4	10201	797	298	298
q5	7513	2069	1797	1797
q6	194	178	143	143
q7	859	720	574	574
q8	9293	1353	1081	1081
q9	4877	4546	4588	4546
q10	6779	1645	1253	1253
q11	509	279	272	272
q12	337	364	213	213
q13	17771	3749	3103	3103
q14	225	235	213	213
q15	609	525	528	525
q16	636	664	573	573
q17	638	780	473	473
q18	6599	6507	6304	6304
q19	1251	953	604	604
q20	395	337	225	225
q21	2586	1876	1890	1876
q22	1018	1011	948	948
Total cold run time: 102191 ms
Total hot run time: 30500 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4776	4730	4713	4713
q2	323	399	319	319
q3	2163	2664	2253	2253
q4	1340	1718	1283	1283
q5	4101	3984	3980	3980
q6	211	172	132	132
q7	1926	1821	1663	1663
q8	2726	2504	2367	2367
q9	7206	7213	7137	7137
q10	2572	2688	2248	2248
q11	536	480	461	461
q12	722	721	623	623
q13	3558	4117	3472	3472
q14	298	304	274	274
q15	643	593	519	519
q16	660	671	621	621
q17	1110	1298	1361	1298
q18	8274	7901	7638	7638
q19	856	809	818	809
q20	1967	2059	1905	1905
q21	4684	4646	4393	4393
q22	1056	1025	965	965
Total cold run time: 51708 ms
Total hot run time: 49073 ms

@doris-robot
Copy link

TPC-DS: Total hot run time: 171543 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit ccb2128cf3202070351b4f493c33dcd3ba4b7341, data reload: false

query5	4381	617	515	515
query6	327	212	189	189
query7	4226	446	248	248
query8	335	242	239	239
query9	8709	2813	2842	2813
query10	435	307	270	270
query11	15313	15185	14858	14858
query12	177	123	120	120
query13	1255	464	368	368
query14	5871	3065	2778	2778
query14_1	2640	2637	2647	2637
query15	194	189	174	174
query16	990	475	443	443
query17	1094	666	570	570
query18	2430	432	329	329
query19	203	179	150	150
query20	123	117	113	113
query21	217	134	114	114
query22	3981	3992	3969	3969
query23	15998	15460	15303	15303
query23_1	15383	15397	15436	15397
query24	7178	1537	1167	1167
query24_1	1167	1173	1157	1157
query25	547	461	409	409
query26	1234	267	148	148
query27	2769	448	274	274
query28	4573	2125	2131	2125
query29	821	533	442	442
query30	306	240	206	206
query31	807	634	540	540
query32	83	76	70	70
query33	553	358	309	309
query34	874	865	525	525
query35	713	748	674	674
query36	876	925	829	829
query37	145	99	85	85
query38	2745	2629	2627	2627
query39	775	762	744	744
query39_1	733	700	707	700
query40	220	135	120	120
query41	72	68	66	66
query42	95	94	87	87
query43	421	467	410	410
query44	1304	736	740	736
query45	190	184	217	184
query46	812	940	589	589
query47	1391	1486	1396	1396
query48	317	306	234	234
query49	592	419	327	327
query50	664	260	197	197
query51	3718	3749	3689	3689
query52	87	93	81	81
query53	212	215	165	165
query54	269	244	239	239
query55	79	82	74	74
query56	285	297	297	297
query57	1049	1049	925	925
query58	260	254	242	242
query59	1989	2080	1981	1981
query60	328	331	309	309
query61	155	137	136	136
query62	397	346	302	302
query63	190	162	159	159
query64	4983	1136	810	810
query65	3788	3719	3742	3719
query66	1438	416	302	302
query67	15392	15500	15370	15370
query68	2387	1037	699	699
query69	392	308	268	268
query70	1024	934	957	934
query71	298	276	262	262
query72	5249	3174	3160	3160
query73	611	723	309	309
query74	8672	8615	8561	8561
query75	2259	2315	1849	1849
query76	2273	1065	650	650
query77	370	374	304	304
query78	9698	9941	9087	9087
query79	1058	891	573	573
query80	1281	517	447	447
query81	543	256	231	231
query82	1129	153	119	119
query83	331	256	245	245
query84	262	120	93	93
query85	878	480	401	401
query86	407	298	286	286
query87	2841	2847	2746	2746
query88	3450	2559	2544	2544
query89	299	260	249	249
query90	1984	165	164	164
query91	167	153	131	131
query92	73	76	67	67
query93	1054	989	646	646
query94	656	278	284	278
query95	578	387	308	308
query96	628	501	222	222
query97	2332	2375	2301	2301
query98	216	202	207	202
query99	584	588	502	502
Total cold run time: 245519 ms
Total hot run time: 171543 ms

@doris-robot
Copy link

ClickBench: Total hot run time: 26.55 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit ccb2128cf3202070351b4f493c33dcd3ba4b7341, data reload: false

query1	0.05	0.04	0.04
query2	0.10	0.05	0.05
query3	0.26	0.09	0.08
query4	1.61	0.12	0.11
query5	0.27	0.24	0.25
query6	1.14	0.64	0.65
query7	0.03	0.03	0.03
query8	0.05	0.04	0.04
query9	0.56	0.50	0.49
query10	0.55	0.54	0.54
query11	0.14	0.09	0.10
query12	0.14	0.10	0.11
query13	0.60	0.58	0.58
query14	0.95	0.95	0.94
query15	0.79	0.76	0.78
query16	0.40	0.40	0.39
query17	1.08	0.96	1.08
query18	0.22	0.21	0.20
query19	1.95	1.87	1.84
query20	0.02	0.01	0.01
query21	15.41	0.27	0.13
query22	5.04	0.05	0.05
query23	15.71	0.28	0.10
query24	1.04	0.24	0.69
query25	0.08	0.11	0.06
query26	0.15	0.14	0.13
query27	0.08	0.08	0.05
query28	4.80	1.08	0.87
query29	12.52	3.88	3.12
query30	0.27	0.14	0.12
query31	2.82	0.61	0.39
query32	3.24	0.55	0.44
query33	2.94	2.98	3.05
query34	16.62	5.05	4.46
query35	4.43	4.44	4.49
query36	0.65	0.50	0.49
query37	0.11	0.07	0.06
query38	0.06	0.04	0.03
query39	0.05	0.03	0.03
query40	0.17	0.15	0.14
query41	0.09	0.04	0.03
query42	0.04	0.02	0.03
query43	0.05	0.04	0.04
Total cold run time: 97.28 s
Total hot run time: 26.55 s

@doris-robot
Copy link

BE UT Coverage Report

Increment line coverage 20.78% (16/77) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.45% (19114/36441)
Line Coverage 35.83% (177569/495654)
Region Coverage 32.28% (137180/424937)
Branch Coverage 33.24% (59419/178784)

@hello-stephen
Copy link
Contributor

BE Regression && UT Coverage Report

Increment line coverage 28.57% (22/77) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.41% (25507/35720)
Line Coverage 53.98% (267280/495122)
Region Coverage 51.51% (221226/429450)
Branch Coverage 53.01% (95176/179555)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants