Predicate Pushdown in Spark Structured Streaming (DataSource V2) #55679
Draft
jalpan-randeri wants to merge 1 commit into apache:master from
This PR allows DSv2 connectors (like Apache Iceberg) to apply metadata-level file pruning and reduce I/O for streaming micro-batches. Currently, Spark Structured Streaming via the DSv2 API does not push down predicates. As a result, more data is scanned and then filtered out at the engine layer, which causes excessive I/O, driver bottlenecks, and increased latency. Testing: added PushDownPredicateInMicroBatchExecutionSuite tests; manual testing.
This was referenced May 5, 2026
Deleted some comments I made because I didn't realize there were two PRs (one Iceberg, one Spark) to address this issue. Sorry!
Author
No problem at all, Scott. I completely understand. It’s a bit of a moving target across the two PRs. I'd still love to incorporate the subquery insight you had; could you re-share that test query? I want to make sure I’ve got full coverage in both the Spark and Iceberg logic before I refresh the PRs.
What changes were proposed in this pull request?
This PR introduces support for Predicate Pushdown in Spark Structured Streaming (DataSource V2).
This allows DSv2 connectors (like Apache Iceberg) to apply metadata-level file pruning and reduce I/O for streaming micro-batches.
Fixes #55680
Why are the changes needed?
Currently, Spark Structured Streaming via the DSv2 API does not push down predicates. As a result, more data is scanned and then filtered out at the engine layer, which causes excessive I/O, driver bottlenecks, and increased latency.
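To illustrate the difference, here is a minimal, self-contained sketch of metadata-level file pruning for one micro-batch. It is not the code in this PR and does not use Spark's or Iceberg's APIs: `DataFile`, `GreaterOrEqual`, `planWithoutPushdown`, and `planWithPushdown` are hypothetical names, and the per-file min/max statistics are assumed to be available from the connector's metadata.

```java
import java.util.ArrayList;
import java.util.List;

public class PushdownSketch {
    /** Hypothetical per-file metadata: path plus min/max of one long column. */
    public static final class DataFile {
        public final String path;
        public final long minTs;
        public final long maxTs;

        public DataFile(String path, long minTs, long maxTs) {
            this.path = path;
            this.minTs = minTs;
            this.maxTs = maxTs;
        }
    }

    /** Hypothetical pushed-down predicate of the form: ts >= lowerBound. */
    public static final class GreaterOrEqual {
        public final long lowerBound;

        public GreaterOrEqual(long lowerBound) {
            this.lowerBound = lowerBound;
        }

        /** A file may contain matching rows only if its max value reaches the bound. */
        public boolean mightMatch(DataFile file) {
            return file.maxTs >= lowerBound;
        }
    }

    /** Without pushdown, every file in the micro-batch is scanned and filtered by the engine. */
    public static List<DataFile> planWithoutPushdown(List<DataFile> batch) {
        return new ArrayList<>(batch);
    }

    /** With pushdown, files whose statistics rule out a match are skipped before any read. */
    public static List<DataFile> planWithPushdown(List<DataFile> batch, GreaterOrEqual predicate) {
        List<DataFile> kept = new ArrayList<>();
        for (DataFile file : batch) {
            if (predicate.mightMatch(file)) {
                kept.add(file);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<DataFile> batch = List.of(
            new DataFile("f1.parquet", 0, 99),
            new DataFile("f2.parquet", 100, 199),
            new DataFile("f3.parquet", 200, 299));
        GreaterOrEqual predicate = new GreaterOrEqual(150);

        System.out.println("files scanned without pushdown: " + planWithoutPushdown(batch).size());
        System.out.println("files scanned with pushdown: " + planWithPushdown(batch, predicate).size());
    }
}
```

In this sketch the engine would still apply the filter to rows from the surviving files, since min/max statistics can only rule files out, not guarantee that every row matches.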
Does this PR introduce any user-facing change?
No. There is no change to the user-facing API.
This change improves performance when a filter is applied at the partition level.
How was this patch tested?
- Added PushDownPredicateInMicroBatchExecutionSuite tests
- Manual testing
Was this patch authored or co-authored using generative AI tooling?
No