Skip to content

Conversation

@Weijun-H
Copy link
Member

@Weijun-H Weijun-H commented Dec 6, 2025

Which issue does this PR close?

Rationale for this change

Use file/group statistics to detect constant (including all-NULL) columns so we can avoid reading/decoding useless data and simplify predicates for earlier pruning.

What changes are included in this PR?

  • Detect constant columns during Parquet scan setup, rewrite those columns to Literals, shrink the projection mask, and fold constants into predicates.
  • Rebuild file pruner after constant folding to allow early pruning when possible.

Are these changes tested?

Yes

Are there any user-facing changes?

No

@github-actions github-actions bot added the datasource Changes to the datasource crate label Dec 6, 2025
@adriangb
Copy link
Contributor

adriangb commented Dec 7, 2025

So quick!

Copy link
Contributor

@adriangb adriangb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will create some conflicts with #19128 but this is small and self contained so I see no reason to not proceed with it now and I can deal with the conflicts later... would help to get a review over there and associated PRs 🎣

if !constant_columns.is_empty() {
predicate = predicate
.map(|expr| {
if is_dynamic_physical_expr(&expr) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this clause? What breaks if we remove it? I'd think that rewriting the dynamic expression would work - it would try to rewrite it's children, which shouldn't cause any issues. Once snapshot is called the produces expression should have the remapped children.

projection.try_map_exprs(|expr| rewrite_physical_expr_with_constants(expr, constants))
}

fn rewrite_physical_expr_with_constants(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks very similar to https://github.com/apache/datafusion/pull/19128/files#diff-6bad7e4ee6dbc3a498e3fee746f2c3c18bdcf237d7cd12226e392f9b9c3d2fbe, we should be able to use it for partition values as well 😄

@adriangb
Copy link
Contributor

adriangb commented Dec 8, 2025

@Weijun-H let me know if you want to merge or chat about the feedback, I don't want to merge something that conflicts and then have this good work go stale

github-merge-queue bot pushed a commit that referenced this pull request Dec 9, 2025
This PR does some refactoring of `PhysicalExprAdapter` and
`PhysicalExprSimplifier` that I found necessary and/or beneficial while
working on #19111.

## Changes made

### Replace `PhysicalExprAdapter::with_partition_values` with
`replace_columns_with_literals`

This is a nice improvement because it:
1. Makes the `PhysicalExprAdapter` trait that users might need to
implement simpler (less boilerplate for users).
2. Decouples these two transformations so that we can replace partition
values and then apply a projection without having to also do the schema
mapping (it would be from the logical schema to the logical schema,
confusing and a waste of compute). I ran into this need in
#19111. I think there may be
other ways of doing it (e.g. piping in the expected output schema from
ParquetSource) but it felt nicer this way and I expect other places may
also need the decoupled transformation.
3. I think we can use it in the future to implement #19089 (edit:
evidently I was right, see identical function in
#19136).
4. It's less lines of code 😄

This will require any users calling `PhysicalExprAdapter` directly to
change their code, I can add an entry to the upgrade guide.


### Remove partition pruning logic from `FilePruner` and deprecate now
unused `PrunableStatistics` and `CompositePruningStatistics`.

Since we replace partition values with literals we no longer need these
structures, they get handled like any other literals.
This is a good chunk of code / complexity that we can bin off.


### Use `TableSchema` instead of `SchemaRef` + `Vec<FieldRef>` in
`ParquetOpener`

`TableSchema` is basically `SchemaRef` + `Vec<FieldRef>` already and
since `ParquetSource` has a `TableSchema` its less code and less clones
to propagate that into `ParquetOpener`

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

datasource Changes to the datasource crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Replace constant columns with literals

2 participants