fix: Correct the number of pruned/matched Parquet pages by nuno-faria · Pull Request #22031 · apache/datafusion

nuno-faria · 2026-05-05T21:07:43Z

Which issue does this PR close?

Closes EXPLAIN ANALYZE returns a wrong number of pruned and matched Parquet pages #22030.

Rationale for this change

Show the correct metrics in the execution plan.

What changes are included in this PR?

Keep track of the pages matched by each pruning predicate to avoid double counting them.
Added unit test and updated existing ones.

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

nuno-faria · 2026-05-05T21:10:18Z

 Plan with Metrics
 01)SortExec: TopK(fetch=3), expr=[species@0 ASC NULLS LAST], preserve_partitioning=[false], filter=[species@0 < Nlpine Sheep], metrics=[output_rows=3]
-02)--DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/explain_analyze/data.parquet]]}, projection=[species, s], file_type=parquet, predicate=species@0 > M AND s@1 >= 50 AND DynamicFilter [ species@0 < Nlpine Sheep ], pruning_predicate=species_null_count@1 != row_count@2 AND species_max@0 > M AND s_null_count@4 != row_count@2 AND s_max@3 >= 50 AND species_null_count@1 != row_count@2 AND species_min@5 < Nlpine Sheep, required_guarantees=[], metrics=[output_rows=3, files_ranges_pruned_statistics=1 total → 1 matched, row_groups_pruned_statistics=4 total → 3 matched -> 1 fully matched, row_groups_pruned_bloom_filter=3 total → 3 matched, page_index_pages_pruned=6 total → 6 matched, limit_pruned_row_groups=0 total → 0 matched, scan_efficiency_ratio=22.13% (521/2.35 K)]
+02)--DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/explain_analyze/data.parquet]]}, projection=[species, s], file_type=parquet, predicate=species@0 > M AND s@1 >= 50 AND DynamicFilter [ species@0 < Nlpine Sheep ], pruning_predicate=species_null_count@1 != row_count@2 AND species_max@0 > M AND s_null_count@4 != row_count@2 AND s_max@3 >= 50 AND species_null_count@1 != row_count@2 AND species_min@5 < Nlpine Sheep, required_guarantees=[], metrics=[output_rows=3, files_ranges_pruned_statistics=1 total → 1 matched, row_groups_pruned_statistics=4 total → 3 matched -> 1 fully matched, row_groups_pruned_bloom_filter=3 total → 3 matched, page_index_pages_pruned=3 total → 3 matched, limit_pruned_row_groups=0 total → 0 matched, scan_efficiency_ratio=22.13% (521/2.35 K)]


The file used contains 4 row groups (1 data page each) and the query matched only 3. So the previous 6 was due to double counting by having two predicates.

Also the same reasoning applies to the remaining changes.

2010YOUY01

LGTM, thank you!

For performance considerations: it does not appear to have any performance issues for now. However, if we want to reduce the overhead of metric updates in the future, we could change this metric to a ratio (for rows passing page pruning), which can be cheaply derived from overall_selection.

adriangb

Thanks @nuno-faria, I checked the change and it looks like a real bug and correct fix. I asked Codex to review and it did point out a related issue, but that can be fixed in a followup if you prefer.

adriangb · 2026-05-06T06:23:10Z

                    predicate.predicate_expr(),
                );

+                total_pages_in_group = pages.len();


Codex suggested:

If every predicate skips due to a missing column index on line 239, the total_pages_in_group remains at 0 and matched_pages_in_group remains None. This causes the row group to silently contribute 0 total, 0 matched, and 0 pruned to the metrics, even though N pages will actually be scanned. While this behavior is not a new regression, it is an inaccuracy worth addressing while modifying this part of the codebase.

To fix this, total_pages_in_group should be derived from the offset index upfront as a property of the row group rather than within the predicate loop. By initializing matched_pages_in_group to (0..total_pages_in_group).collect(), an abstaining pruner will correctly report $N \to N$ matched and 0 pruned. This refactor also allows the Option<HashSet<_>> to be simplified into a plain HashSet<_> and removes the redundant assignment of total_pages_in_group inside the loop.

I thought returning 0 was expected, as it meant that the page index was not used. This is the same behavior as the number of rows.

If it's not supposed to be 0 I think it's better to leave as a follow up, since the number of rows also needs to be fixed. Any thoughts @2010YOUY01?

Yes this is the same behavior as other metrics. I've found this behavior confusing. This seemed like a case where there's no good reason for this to be the case. I do agree it should be a followup.

fix: Correct the number of pruned/matched Parquet pages

d015b06

github-actions Bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) datasource Changes to the datasource crate labels May 5, 2026

nuno-faria commented May 5, 2026

View reviewed changes

2010YOUY01 approved these changes May 6, 2026

View reviewed changes

adriangb approved these changes May 6, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: Correct the number of pruned/matched Parquet pages#22031

fix: Correct the number of pruned/matched Parquet pages#22031
nuno-faria wants to merge 1 commit intoapache:mainfrom
nuno-faria:fix_parquet_matched_pages

nuno-faria commented May 5, 2026

Uh oh!

nuno-faria May 5, 2026 •

edited

Loading

Uh oh!

2010YOUY01 left a comment

Uh oh!

adriangb left a comment •

edited

Loading

Uh oh!

adriangb May 6, 2026

Uh oh!

nuno-faria May 6, 2026

Uh oh!

adriangb May 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

nuno-faria commented May 5, 2026

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

nuno-faria May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

2010YOUY01 left a comment

Choose a reason for hiding this comment

Uh oh!

adriangb left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

adriangb May 6, 2026

Choose a reason for hiding this comment

Uh oh!

nuno-faria May 6, 2026

Choose a reason for hiding this comment

Uh oh!

adriangb May 6, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

nuno-faria May 5, 2026 •

edited

Loading

adriangb left a comment •

edited

Loading