Conversation

@adriangb (Contributor) commented Dec 6, 2025

This PR does some refactoring of PhysicalExprAdapter and PhysicalExprSimplifier that I found necessary and/or beneficial while working on #19111.

Changes made

Replace PhysicalExprAdapter::with_partition_values with replace_columns_with_literals

This is a nice improvement because it:

  1. Makes the PhysicalExprAdapter trait that users might need to implement simpler (less boilerplate for users).
  2. Decouples the two transformations, so that we can replace partition values and then apply a projection without also doing the schema mapping (which would be from the logical schema to itself: confusing and a waste of compute). I ran into this need in #19111 (Push down projection expressions into ParquetOpener). There may be other ways of doing it (e.g. piping the expected output schema in from ParquetSource), but this felt nicer and I expect other places may also need the decoupled transformation.
  3. I think we can use it in the future to implement #19089 (Replace constant columns with literals). Edit: evidently I was right, see the identical function in #19136 (feat: Add constant column extraction and rewriting for projections in ParquetOpener).
  4. It's fewer lines of code 😄

This will require any users calling PhysicalExprAdapter directly to change their code; I can add an entry to the upgrade guide.
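
To make the migration concrete, here is a minimal sketch of what a direct call looks like under the new API. The signature is the one quoted later in this PR; the column name, value, and wrapper function are invented for illustration, and the import for replace_columns_with_literals is omitted because its module location is defined by this PR's diff.

use std::collections::HashMap;
use std::sync::Arc;

use datafusion_common::{Result, ScalarValue};
use datafusion_physical_expr::PhysicalExpr;

// Hypothetical helper: bind one partition column ("date") to its literal value
// for a single file before the predicate is simplified or pushed down.
fn bind_partition_value(predicate: Arc<dyn PhysicalExpr>) -> Result<Arc<dyn PhysicalExpr>> {
    let date = ScalarValue::from("2025-12-06"); // made-up partition value
    let mut replacements: HashMap<&str, &ScalarValue> = HashMap::new();
    replacements.insert("date", &date);

    // Every reference to the column "date" in `predicate` becomes the literal
    // '2025-12-06'; all other columns are left untouched.
    replace_columns_with_literals(predicate, &replacements)
}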

Remove partition pruning logic from FilePruner and deprecate the now-unused PrunableStatistics and CompositePruningStatistics.

Since we replace partition values with literals we no longer need these structures; the partition values get handled like any other literal.
This is a good chunk of code and complexity that we can bin off.

Use TableSchema instead of SchemaRef + Vec<FieldRef> in ParquetOpener

TableSchema is basically SchemaRef + Vec<FieldRef> already, and since ParquetSource has a TableSchema it's less code and fewer clones to propagate that into ParquetOpener.
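
For context, here is a rough sketch of the construction, based on how the type is used in the review hunks quoted later; the partition column is made up and the import path for TableSchema is an assumption left out of the snippet.

use std::sync::Arc;

use arrow::datatypes::{DataType, Field, FieldRef, Schema, SchemaRef};
// `TableSchema` import omitted: its module path isn't spelled out in this
// conversation; it is the same type that ParquetSource already holds.

// Sketch only: the partition column here ("date") is invented for illustration.
fn example_table_schema() -> TableSchema {
    let file_schema: SchemaRef = Arc::new(Schema::new(vec![
        Field::new("a", DataType::Int64, true),
    ]));
    let partition_fields: Vec<FieldRef> =
        vec![Arc::new(Field::new("date", DataType::Utf8, false))];
    // Constructor as it appears in the hunks quoted later in this review;
    // `TableSchema::from_file_schema(schema)` covers the no-partition-column case.
    TableSchema::new(file_schema, partition_fields)
}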

@github-actions github-actions bot added the core, datasource, and physical-expr labels Dec 6, 2025
@adriangb adriangb changed the title from "Split out partition column handling from PhysicalExprAdapter trait" to "Clean up PhysicalExprAdapter and PhysicalExprSimplifier" Dec 6, 2025
-        &mut self,
-        expr: Arc<dyn PhysicalExpr>,
-    ) -> Result<Arc<dyn PhysicalExpr>> {
+    pub fn simplify(&self, expr: Arc<dyn PhysicalExpr>) -> Result<Arc<dyn PhysicalExpr>> {
@adriangb (Contributor, Author):

This is backwards compatible (aside from compiler warnings that the variable doesn't need to be mutable) and means we can keep multiple references, wrap it in an Arc, etc.

@adriangb (Contributor, Author) commented Dec 6, 2025

This is blocked by one of #19129 or #19130; I think either should make it work. Currently some tests fail because we can't prune based on 1 = 2.

As a follow-up to this we should be able to delete PartitionPruningStatistics and hence remove CompositePruningStatistics and simplify FilePruner. They'll need to go through the deprecation process, and we'll need to document that you should call replace_columns_with_literals instead.

@github-actions github-actions bot added the common label Dec 6, 2025
@adriangb (Contributor, Author) commented Dec 6, 2025

I've merged #19129 into this branch to get CI passing. I'll fix the conflicts once we merge that.

@adriangb (Contributor, Author) commented Dec 8, 2025

> I've merged #19129 into this branch to get CI passing. I'll fix the conflicts once we merge that.

I've reverted this since #19130 got merged and covers this functionality.

/// - `Result<Arc<dyn PhysicalExpr>>`: The rewritten physical expression with columns replaced by literals.
pub fn replace_columns_with_literals(
expr: Arc<dyn PhysicalExpr>,
replacements: &HashMap<&str, &ScalarValue>,
@adriangb (Contributor, Author):

Open to accepting HashMap<String, ScalarValue> or something if that makes lifetimes easier
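
For what it's worth, the borrowed form seems workable when the caller owns the ScalarValues and builds the map on the spot, which is roughly what the opener does in the hunk quoted further down; this is a sketch with approximate names, not the exact code.

use std::collections::HashMap;

use arrow::datatypes::FieldRef;
use datafusion_common::ScalarValue;

// Build the borrowed replacement map from the file's partition values.
// The &str keys and &ScalarValue values borrow from `partition_values`,
// so the map only needs to live for the duration of the rewrite call.
fn build_replacements(
    partition_values: &[(FieldRef, ScalarValue)],
) -> HashMap<&str, &ScalarValue> {
    partition_values
        .iter()
        .map(|(field, value)| (field.name().as_str(), value))
        .collect()
}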

@adriangb adriangb force-pushed the move-adapter-partition-col-logic branch from 569010f to 1949a45 on December 8, 2025 15:27
@github-actions github-actions bot removed the physical-expr label Dec 8, 2025
@adriangb adriangb force-pushed the move-adapter-partition-col-logic branch from 387fc6d to c981725 on December 8, 2025 20:25
@adriangb adriangb requested review from alamb and xudong963 December 8, 2025 21:15
@adriangb adriangb changed the title from "Clean up PhysicalExprAdapter and PhysicalExprSimplifier" to "Clean up PhysicalExprAdapter" Dec 9, 2025
@adriangb adriangb changed the title from "Clean up PhysicalExprAdapter" to "Move partition handling out of PhysicalExprAdapter" Dec 9, 2025
@adriangb adriangb force-pushed the move-adapter-partition-col-logic branch from 6147632 to 5595f37 on December 9, 2025 00:32
@alamb (Contributor) commented Dec 9, 2025

Starting to check this one out

@alamb (Contributor) left a review:

I went through this one carefully and it makes a lot of sense to me -- "Partition Columns" is a higher level concept in my mind than expression rewriting. Thank you @adriangb

Replacing column references with expressions is a better match.

While reviewing this PR, I worked some on the documentation and made a PR targeting this one for your contemplation (I can update it to target main if you want to merge this PR as-is).

Finally, should we mark this PR as an API change / update the upgrading.md doc? I can't keep track of which versions these APIs were changed in.


/// Replace column references in the given physical expression with literal values.
///
/// This is used to substitute partition column references with their literal values during expression rewriting.
@alamb (Contributor):

Another potential use case is filling in values for columns with default values (rather than NULL).

@adriangb (Contributor, Author):

Great point! I'm going to update datafusion-examples/examples/custom_data_source/default_column_values.rs to use this new method precisely for that.
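
A hedged sketch of that direction, assuming some map of configured defaults exists; the real example may derive its defaults differently (for instance from field metadata), so treat the names here as placeholders.

use std::collections::HashMap;

use arrow::datatypes::Schema;
use datafusion_common::ScalarValue;

// Keep a default only for columns the file does not actually provide; the
// resulting map can then be fed to `replace_columns_with_literals` exactly
// like partition values.
fn defaults_for_missing_columns<'a>(
    file_schema: &Schema,
    defaults: &'a HashMap<String, ScalarValue>,
) -> HashMap<&'a str, &'a ScalarValue> {
    defaults
        .iter()
        .filter(|(name, _)| file_schema.field_with_name(name.as_str()).is_err())
        .map(|(name, value)| (name.as_str(), value))
        .collect()
}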

replacements: &HashMap<&str, &ScalarValue>,
) -> Result<Arc<dyn PhysicalExpr>> {
expr.transform(|expr| {
if let Some(column) = expr.as_any().downcast_ref::<Column>()
@alamb (Contributor):

the let chains are quite cool
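
For readers skimming the conversation, here is the full chain reconstructed roughly from the quoted fragments; it is not copied verbatim from the diff, so details may differ.

use std::collections::HashMap;
use std::sync::Arc;

use datafusion_common::tree_node::{Transformed, TreeNode};
use datafusion_common::{Result, ScalarValue};
use datafusion_physical_expr::expressions::{lit, Column};
use datafusion_physical_expr::PhysicalExpr;

// Rough reconstruction: walk the expression and, when a node is a `Column`
// whose name has a replacement, swap it for the literal value.
fn replace_columns_with_literals(
    expr: Arc<dyn PhysicalExpr>,
    replacements: &HashMap<&str, &ScalarValue>,
) -> Result<Arc<dyn PhysicalExpr>> {
    expr.transform(|expr| {
        // let chain: both bindings must succeed for the rewrite to fire
        if let Some(column) = expr.as_any().downcast_ref::<Column>()
            && let Some(value) = replacements.get(column.name())
        {
            return Ok(Transformed::yes(lit((*value).clone())));
        }
        Ok(Transformed::no(expr))
    })
    .map(|t| t.data)
}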

.map(|(field, value)| (field.name().as_str(), value))
.collect();

self.predicate
@alamb (Contributor):

is it worth skipping this clone/rewrite if there are no partition_values (i.e. if partition_values.is_empty())?

@adriangb (Contributor, Author):

added
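
Roughly, the added guard has this shape; it is a sketch under assumed names, with `build_replacements` being the helper sketched in an earlier comment above rather than a confirmed function in the diff.

use std::sync::Arc;

use arrow::datatypes::FieldRef;
use datafusion_common::{Result, ScalarValue};
use datafusion_physical_expr::PhysicalExpr;

// When there are no partition values there is nothing to substitute, so the
// predicate is reused as-is and the expression tree is never cloned/rewritten.
fn bind_if_needed(
    predicate: &Arc<dyn PhysicalExpr>,
    partition_values: &[(FieldRef, ScalarValue)],
) -> Result<Arc<dyn PhysicalExpr>> {
    if partition_values.is_empty() {
        return Ok(Arc::clone(predicate));
    }
    let replacements = build_replacements(partition_values); // see earlier sketch
    replace_columns_with_literals(Arc::clone(predicate), &replacements)
}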

// Make a FilePruner only if there is either a dynamic expr in the predicate or the file has file-level statistics.
// File-level statistics may allow us to prune the file without loading any row groups or metadata.
// If there is a dynamic filter we may be able to prune the file later as the dynamic filter is updated.
// This does allow the case where there is a dynamic filter but no statistics, in which case
@alamb (Contributor):

It took me a bit to parse this comment

I left a suggestion of how we might be able to improve it

limit: None,
predicate: Some(predicate),
logical_file_schema: schema.clone(),
table_schema: TableSchema::from_file_schema(schema.clone()),
@alamb (Contributor):

Suggested change:
-    table_schema: TableSchema::from_file_schema(schema.clone()),
+    table_schema: TableSchema::from_file_schema(Arc::clone(&schema)),

@adriangb (Contributor, Author):

not sure why clippy didn't catch this

predicate: Some(predicate),
logical_file_schema: file_schema.clone(),
table_schema: TableSchema::new(
file_schema.clone(),
@alamb (Contributor):

Suggested change:
-    file_schema.clone(),
+    Arc::clone(file_schema),

@alamb (Contributor):

(and several other examples below)

@github-actions github-actions bot added the documentation label Dec 9, 2025
return Ok(Transformed::yes(default_literal));
}
}
// Pre-compute replacements for missing columns with default values
@alamb (Contributor):

this is great

@adriangb adriangb removed the request for review from xudong963 December 9, 2025 17:18
@adriangb adriangb added this pull request to the merge queue Dec 9, 2025
Merged via the queue into apache:main with commit dc78613 Dec 9, 2025
32 checks passed