Add struct pushdown query benchmark and projection pushdown tests #19962
Pull request overview
This PR extracts benchmarks and sqllogictest cases from PR #19538 for easier review, focusing on testing struct field access projection pushdown optimization in DataFusion.
Changes:
- Added comprehensive benchmark suite for SQL queries on struct columns in Parquet files with 20 different query patterns
- Added 1000+ line SQLLogicTest file covering projection pushdown behavior with get_field expressions through various operators
- Updated Cargo.toml to register the new benchmark
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| datafusion/core/benches/parquet_struct_query.rs | New benchmark file testing struct field queries on Parquet data with various SQL patterns (filters, joins, aggregations, etc.) |
| datafusion/core/Cargo.toml | Added benchmark entry for parquet_struct_query with parquet feature requirement |
| datafusion/sqllogictest/test_files/projection_pushdown.slt | Comprehensive test suite for get_field projection pushdown through Filter, Sort, TopK, and multi-partition scenarios |
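To make the term "struct field access" concrete, here is a minimal sketch of why reading one field of a struct column can be cheaper than reading the whole struct. It assumes a columnar layout in which each child field is stored as its own array. The types `Column` and `StructColumn` and the helper `sample_struct_column` are hypothetical, use only the standard library, and are not DataFusion or Arrow APIs. Only the name `get_field` and the `id`/`value` field names come from this PR.

```rust
use std::collections::BTreeMap;

/// Toy column type with just the two data types mentioned in this PR.
/// (Hypothetical; not an Arrow or DataFusion type.)
#[derive(Debug, Clone, PartialEq)]
enum Column {
    Int32(Vec<i32>),
    Utf8(Vec<String>),
}

/// Toy struct column: every child field is stored as a separate array.
#[derive(Debug, Clone)]
struct StructColumn {
    children: BTreeMap<String, Column>,
}

impl StructColumn {
    /// Rough analogue of `get_field(s, 'value')`: look up one child array
    /// by name without reading the sibling fields.
    fn get_field(&self, name: &str) -> Option<&Column> {
        self.children.get(name)
    }
}

/// Build a tiny column shaped like the `s` column described in this PR:
/// an Int32 `id` child and a Utf8 `value` child. The values are made up.
fn sample_struct_column() -> StructColumn {
    let mut children = BTreeMap::new();
    children.insert("id".to_string(), Column::Int32(vec![1, 2, 3]));
    children.insert(
        "value".to_string(),
        Column::Utf8(vec!["a".into(), "b".into(), "c".into()]),
    );
    StructColumn { children }
}
```

In this toy layout, a scan that is told about the field access up front needs to load only the requested child array, which is the kind of saving a pushed-down `get_field` projection is meant to enable.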
Extract benchmarks and sqllogictest cases from apache#19538 for easier review.
Includes a new benchmark for SQL queries on struct columns in Parquet files, covering struct access, filtering, joins, and aggregations with 524K rows and 8 row groups.
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Force-pushed: 414b451 to 30b5888
alamb left a comment:
Makes sense to me -- thank you @adriangb
logical_plan
01)Projection: simple_struct.id, get_field(simple_struct.s, Utf8("value"))
02)--TableScan: simple_struct projection=[id, s]
physical_plan DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/projection_pushdown/simple.parquet]]}, projection=[id, get_field(s@1, value) as simple_struct.s[value]], file_type=parquet
It is interesting that these expressions have already been pushed down to the datasource
Yep, in some cases (no sort, no repartition, etc.) it already works, but only because all projections are pushed down.
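The distinction in this reply can be sketched with a toy model. The code below is an illustration only: `Node` and `push_projection_into_scan` are hypothetical names, the plan is simplified to a linear top-down chain, and none of it reflects how DataFusion's optimizer is actually implemented. It assumes a projection reaches the scan only when nothing sits between the two.

```rust
/// Toy plan node, listed top-down in a linear chain.
/// (Hypothetical model for illustration; not DataFusion's plan types.)
#[derive(Debug, Clone, PartialEq)]
enum Node {
    Projection(Vec<String>),
    Sort,
    Repartition,
    Scan { projection: Vec<String> },
}

/// Fold a projection into the scan only when it sits directly on top of it.
/// Any other plan shape (for example with a Sort or Repartition in between)
/// is returned unchanged.
fn push_projection_into_scan(plan: &[Node]) -> Vec<Node> {
    match plan {
        [Node::Projection(exprs), Node::Scan { .. }] => vec![Node::Scan {
            projection: exprs.clone(),
        }],
        _ => plan.to_vec(),
    }
}
```

In this model a plan of `[Projection, Scan]` collapses to a single scan that carries the `get_field` expression, much like the `DataSourceExec` line above, while `[Projection, Sort, Scan]` is left as it was. Pushing only the field access through intermediate operators is the harder case that the tests in this PR are set up to cover.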
[[bench]]
harness = false
name = "parquet_query_sql"
required-features = ["parquet"]
Is there any reason not to just add the benchmarks to parquet_query_sql?
I could, but it’s kind of nice to be able to run them in isolation easily, at least for now while we’re developing just these. And in some sense the feature we’re working on needn’t be Parquet-specific (e.g. Vortex). We can always fold them in later.
Thanks @alamb!
Summary
Extract benchmarks and sqllogictest cases from #19538 for easier review.
This PR includes:
New Benchmark:
- parquet_struct_query.rs - Benchmarks SQL queries on struct columns in Parquet files
- id (Int32) and s (Struct with id/Int32 and value/Utf8 fields)
SQLLogicTest:
- projection_pushdown.slt - Tests for projection pushdown optimization
Changes
- datafusion/core/benches/parquet_struct_query.rs
- datafusion/core/Cargo.toml with benchmark entry
- datafusion/sqllogictest/test_files/projection_pushdown.slt
Test Plan
- cargo bench --profile dev --bench parquet_struct_query
🤖 Generated with Claude Code