Skip to content

fix: Correct the number of pruned/matched Parquet pages#22031

Open
nuno-faria wants to merge 1 commit intoapache:mainfrom
nuno-faria:fix_parquet_matched_pages
Open

fix: Correct the number of pruned/matched Parquet pages#22031
nuno-faria wants to merge 1 commit intoapache:mainfrom
nuno-faria:fix_parquet_matched_pages

Conversation

@nuno-faria
Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Rationale for this change

Show the correct metrics in the execution plan.

What changes are included in this PR?

  • Keep track of the pages matched by each pruning predicate to avoid double counting them.
  • Added unit test and updated existing ones.

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

@github-actions github-actions Bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) datasource Changes to the datasource crate labels May 5, 2026
Plan with Metrics
01)SortExec: TopK(fetch=3), expr=[species@0 ASC NULLS LAST], preserve_partitioning=[false], filter=[species@0 < Nlpine Sheep], metrics=[output_rows=3]
02)--DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/explain_analyze/data.parquet]]}, projection=[species, s], file_type=parquet, predicate=species@0 > M AND s@1 >= 50 AND DynamicFilter [ species@0 < Nlpine Sheep ], pruning_predicate=species_null_count@1 != row_count@2 AND species_max@0 > M AND s_null_count@4 != row_count@2 AND s_max@3 >= 50 AND species_null_count@1 != row_count@2 AND species_min@5 < Nlpine Sheep, required_guarantees=[], metrics=[output_rows=3, files_ranges_pruned_statistics=1 total → 1 matched, row_groups_pruned_statistics=4 total → 3 matched -> 1 fully matched, row_groups_pruned_bloom_filter=3 total → 3 matched, page_index_pages_pruned=6 total → 6 matched, limit_pruned_row_groups=0 total → 0 matched, scan_efficiency_ratio=22.13% (521/2.35 K)]
02)--DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/explain_analyze/data.parquet]]}, projection=[species, s], file_type=parquet, predicate=species@0 > M AND s@1 >= 50 AND DynamicFilter [ species@0 < Nlpine Sheep ], pruning_predicate=species_null_count@1 != row_count@2 AND species_max@0 > M AND s_null_count@4 != row_count@2 AND s_max@3 >= 50 AND species_null_count@1 != row_count@2 AND species_min@5 < Nlpine Sheep, required_guarantees=[], metrics=[output_rows=3, files_ranges_pruned_statistics=1 total → 1 matched, row_groups_pruned_statistics=4 total → 3 matched -> 1 fully matched, row_groups_pruned_bloom_filter=3 total → 3 matched, page_index_pages_pruned=3 total → 3 matched, limit_pruned_row_groups=0 total → 0 matched, scan_efficiency_ratio=22.13% (521/2.35 K)]
Copy link
Copy Markdown
Contributor Author

@nuno-faria nuno-faria May 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The file used contains 4 row groups (1 data page each) and the query matched only 3. So the previous 6 was due to double counting by having two predicates.

Also the same reasoning applies to the remaining changes.

Copy link
Copy Markdown
Contributor

@2010YOUY01 2010YOUY01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you!

For performance considerations: it does not appear to have any performance issues for now. However, if we want to reduce the overhead of metric updates in the future, we could change this metric to a ratio (for rows passing page pruning), which can be cheaply derived from overall_selection.

Copy link
Copy Markdown
Contributor

@adriangb adriangb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @nuno-faria, I checked the change and it looks like a real bug and correct fix. I asked Codex to review and it did point out a related issue, but that can be fixed in a followup if you prefer.

predicate.predicate_expr(),
);

total_pages_in_group = pages.len();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Codex suggested:

If every predicate skips due to a missing column index on line 239, the total_pages_in_group remains at 0 and matched_pages_in_group remains None. This causes the row group to silently contribute 0 total, 0 matched, and 0 pruned to the metrics, even though N pages will actually be scanned. While this behavior is not a new regression, it is an inaccuracy worth addressing while modifying this part of the codebase.

To fix this, total_pages_in_group should be derived from the offset index upfront as a property of the row group rather than within the predicate loop. By initializing matched_pages_in_group to (0..total_pages_in_group).collect(), an abstaining pruner will correctly report $N \to N$ matched and 0 pruned. This refactor also allows the Option<HashSet<_>> to be simplified into a plain HashSet<_> and removes the redundant assignment of total_pages_in_group inside the loop.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought returning 0 was expected, as it meant that the page index was not used. This is the same behavior as the number of rows.

If it's not supposed to be 0 I think it's better to leave as a follow up, since the number of rows also needs to be fixed. Any thoughts @2010YOUY01?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this is the same behavior as other metrics. I've found this behavior confusing. This seemed like a case where there's no good reason for this to be the case. I do agree it should be a followup.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate datasource Changes to the datasource crate sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

EXPLAIN ANALYZE returns a wrong number of pruned and matched Parquet pages

3 participants