-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Description
Is your feature request related to a problem or challenge?
When preserve_file_partitions is enabled, we currently disable dynamic filtering to avoid incorrect assumptions about hash partitioning. This avoids a bug but removes critical filtering. We want to retain dynamic filtering without breaking file‑partitioning guarantees.
Describe the solution you'd like
Enable dynamic filtering for file‑partitioned scans by pruning file groups based on partition values when the dynamic filter keys are a subset (which also implies exact match) of the partition columns.
If keys do not align, apply dynamic filtering as a row‑level predicate but do not prune file groups.
Here are scenarios:
All Cases: Partition columns = [id, date]
Case 1: Exact match
- Dynamic filter keys: [id, date]
- Example filter: id IN ('A', 'B') AND date IN ('2024-01-01')
- Action: Prune file groups
- Reason: each file group has a single (id, date) tuple
Case 2: Subset (still safe)
- Dynamic filter keys: [id]
- Example filter: id IN ('A','B')
- Action: Prune file groups
- Reason: if a group’s id isn’t in the filter, the whole group can’t match
Case 3: Superset (not safe for group pruning)
- Dynamic filter keys: [id, date, service]
- Example filter: id IN ('A') AND date IN ('2024-01-01') AND service IN ('log')
- Action: No group pruning, apply filter at row level
- Reason: service is not a partition column, so partition values don’t determine whether rows match
Case 4: Partial overlap (not safe for group pruning)
- Dynamic filter keys: [date, service]
- Example filter: date IN ('2024-01-01') AND service IN ('log')
- Action: No group pruning, apply filter at row level
- Reason: includes non‑partition column service
This should allow dynamic filtering to work safely without requiring repartition or changing join planning.
Describe alternatives you've considered
Introduce a new partitioning implementation via a generic trait‑based scheme (hash/value/range/custom), where users (and datafusion) can implement any type of partitioning scheme they desire.
This would provide the interface needed to determine how partitioning schemes are compatible, what satisfies what, etc. The exact details are not fleshed out but this would be a powerful addition and clear up ambiguities in DataFusion's partitioning modes today.