feat: Add constant column extraction and rewriting for projections in ParquetOpener #19136

Weijun-H · 2025-12-06T19:35:35Z

Which issue does this PR close?

Closes Replace constant columns with literals #19089

Rationale for this change

Use file/group statistics to detect constant (including all-NULL) columns so we can avoid reading/decoding useless data and simplify predicates for earlier pruning.

What changes are included in this PR?

Detect constant columns during Parquet scan setup, rewrite those columns to Literals, shrink the projection mask, and fold constants into predicates.
Rebuild file pruner after constant folding to allow early pruning when possible.

Are these changes tested?

Yes

Are there any user-facing changes?

No

… ParquetOpener

adriangb · 2025-12-07T09:49:03Z

So quick!

adriangb

This will create some conflicts with #19128 but this is small and self contained so I see no reason to not proceed with it now and I can deal with the conflicts later... would help to get a review over there and associated PRs 🎣

adriangb · 2025-12-07T09:55:05Z

datafusion/datasource-parquet/src/opener.rs

+            if !constant_columns.is_empty() {
+                predicate = predicate
+                    .map(|expr| {
+                        if is_dynamic_physical_expr(&expr) {


Why do we need this clause? What breaks if we remove it? I'd think that rewriting the dynamic expression would work - it would try to rewrite it's children, which shouldn't cause any issues. Once snapshot is called the produces expression should have the remapped children.

adriangb · 2025-12-07T09:56:19Z

datafusion/datasource-parquet/src/opener.rs

+    projection.try_map_exprs(|expr| rewrite_physical_expr_with_constants(expr, constants))
+}
+
+fn rewrite_physical_expr_with_constants(


This looks very similar to https://github.com/apache/datafusion/pull/19128/files#diff-6bad7e4ee6dbc3a498e3fee746f2c3c18bdcf237d7cd12226e392f9b9c3d2fbe, we should be able to use it for partition values as well 😄

adriangb · 2025-12-08T21:24:13Z

@Weijun-H let me know if you want to merge or chat about the feedback, I don't want to merge something that conflicts and then have this good work go stale

This PR does some refactoring of `PhysicalExprAdapter` and `PhysicalExprSimplifier` that I found necessary and/or beneficial while working on #19111. ## Changes made ### Replace `PhysicalExprAdapter::with_partition_values` with `replace_columns_with_literals` This is a nice improvement because it: 1. Makes the `PhysicalExprAdapter` trait that users might need to implement simpler (less boilerplate for users). 2. Decouples these two transformations so that we can replace partition values and then apply a projection without having to also do the schema mapping (it would be from the logical schema to the logical schema, confusing and a waste of compute). I ran into this need in #19111. I think there may be other ways of doing it (e.g. piping in the expected output schema from ParquetSource) but it felt nicer this way and I expect other places may also need the decoupled transformation. 3. I think we can use it in the future to implement #19089 (edit: evidently I was right, see identical function in #19136). 4. It's less lines of code 😄 This will require any users calling `PhysicalExprAdapter` directly to change their code, I can add an entry to the upgrade guide. ### Remove partition pruning logic from `FilePruner` and deprecate now unused `PrunableStatistics` and `CompositePruningStatistics`. Since we replace partition values with literals we no longer need these structures, they get handled like any other literals. This is a good chunk of code / complexity that we can bin off. ### Use `TableSchema` instead of `SchemaRef` + `Vec<FieldRef>` in `ParquetOpener` `TableSchema` is basically `SchemaRef` + `Vec<FieldRef>` already and since `ParquetSource` has a `TableSchema` its less code and less clones to propagate that into `ParquetOpener` --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

feat: Add constant column extraction and rewriting for projections in…

28392d7

… ParquetOpener

github-actions bot added the datasource Changes to the datasource crate label Dec 6, 2025

Weijun-H added 3 commits December 6, 2025 23:04

fix

f01f1b7

fix clippy

fe5f2b0

fix clippy

5104e8e

adriangb approved these changes Dec 7, 2025

View reviewed changes

adriangb mentioned this pull request Dec 8, 2025

Move partition handling out of PhysicalExprAdapter #19128

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: Add constant column extraction and rewriting for projections in ParquetOpener #19136

feat: Add constant column extraction and rewriting for projections in ParquetOpener #19136

Uh oh!

Weijun-H commented Dec 6, 2025 •

edited

Loading

Uh oh!

adriangb commented Dec 7, 2025

Uh oh!

adriangb left a comment

Uh oh!

adriangb Dec 7, 2025

Uh oh!

adriangb Dec 7, 2025

Uh oh!

adriangb commented Dec 8, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

feat: Add constant column extraction and rewriting for projections in ParquetOpener #19136

Are you sure you want to change the base?

feat: Add constant column extraction and rewriting for projections in ParquetOpener #19136

Uh oh!

Conversation

Weijun-H commented Dec 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

adriangb commented Dec 7, 2025

Uh oh!

adriangb left a comment

Choose a reason for hiding this comment

Uh oh!

adriangb Dec 7, 2025

Choose a reason for hiding this comment

Uh oh!

adriangb Dec 7, 2025

Choose a reason for hiding this comment

Uh oh!

adriangb commented Dec 8, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Weijun-H commented Dec 6, 2025 •

edited

Loading