Skip to content

support dynamic filtering on partitioned data from file source #20195

@gene-bordegaray

Description

@gene-bordegaray

Is your feature request related to a problem or challenge?

When preserve_file_partitions is enabled, we currently disable dynamic filtering to avoid incorrect assumptions about hash partitioning. This avoids a bug but removes critical filtering. We want to retain dynamic filtering without breaking file‑partitioning guarantees.

Describe the solution you'd like

Enable dynamic filtering for file‑partitioned scans by pruning file groups based on partition values when the dynamic filter keys are a subset (which also implies exact match) of the partition columns.

If keys do not align, apply dynamic filtering as a row‑level predicate but do not prune file groups.

Here are scenarios:

All Cases: Partition columns = [id, date]

Case 1: Exact match
- Dynamic filter keys: [id, date]
- Example filter: id IN ('A', 'B') AND date IN ('2024-01-01')
- Action: Prune file groups
- Reason: each file group has a single (id, date) tuple

Case 2: Subset (still safe)
- Dynamic filter keys: [id]
- Example filter: id IN ('A','B')
- Action: Prune file groups
- Reason: if a group’s id isn’t in the filter, the whole group can’t match

Case 3: Superset (not safe for group pruning)
- Dynamic filter keys: [id, date, service]
- Example filter: id IN ('A') AND date IN ('2024-01-01') AND service IN ('log')
- Action: No group pruning, apply filter at row level
- Reason: service is not a partition column, so partition values don’t determine whether rows match

Case 4: Partial overlap (not safe for group pruning)
- Dynamic filter keys: [date, service]
- Example filter: date IN ('2024-01-01') AND service IN ('log')
- Action: No group pruning, apply filter at row level
- Reason: includes non‑partition column service

This should allow dynamic filtering to work safely without requiring repartition or changing join planning.

Describe alternatives you've considered

Introduce a new partitioning implementation via a generic trait‑based scheme (hash/value/range/custom), where users (and datafusion) can implement any type of partitioning scheme they desire.

This would provide the interface needed to determine how partitioning schemes are compatible, what satisfies what, etc. The exact details are not fleshed out but this would be a powerful addition and clear up ambiguities in DataFusion's partitioning modes today.

Additional context

cc: @adriangb @NGA-TRAN @fmonjalet @gabotechs @LiaCastaneda

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions