chore: Add TPC-* queries to repo #3562

Merged
andygrove merged 2 commits into apache:main from andygrove:bundle-tpc-queries
Feb 23, 2026
Conversation

andygrove (Member) commented Feb 21, 2026

Which issue does this PR close?

N/A

Rationale for this change

The benchmark scripts in benchmarks/tpc currently require the user to provide the queries. It is more convenient to add them to the repository.

What changes are included in this PR?

Add query files. These are copied from the datafusion-benchmarks repo.

How are these changes tested?

mbutrovich (Contributor) left a comment
Thanks @andygrove!

@@ -0,0 +1,26 @@
-- CometBench-DS query 1 derived from TPC-DS query 1 under the terms of the TPC Fair Use Policy.
-- TPC-DS queries are Copyright 2021 Transaction Processing Performance Council.
-- This query was generated at scale factor 1.
Contributor

How hard is it to parameterize this in the future? I wonder what values change, considering we usually run SF100 or 1000.

Member Author

We could try regenerating at different scale factors and doing a diff.
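The diff idea above could be sketched as follows. This is a hypothetical helper (not part of the PR) that compares the same query generated at two scale factors and reports only the numeric literals that differ, which is one way to see which values are scale-dependent; the two SQL strings are made-up examples, not real generator output.

```python
import re

def numeric_literals(sql: str) -> list[str]:
    """Extract numeric literals from a SQL string, in order of appearance."""
    return re.findall(r"\b\d+(?:\.\d+)?\b", sql)

def scale_dependent_literals(sql_a: str, sql_b: str) -> list[tuple[str, str]]:
    """Pair literals positionally and keep only the pairs that changed.

    Positional pairing assumes both queries came from the same template,
    so the literals line up one-to-one.
    """
    return [
        (a, b)
        for a, b in zip(numeric_literals(sql_a), numeric_literals(sql_b))
        if a != b
    ]

# Made-up example: same template, two scale factors.
sf1 = "SELECT * FROM store_sales WHERE ss_quantity BETWEEN 1 AND 20 AND d_year = 2000"
sf100 = "SELECT * FROM store_sales WHERE ss_quantity BETWEEN 1 AND 20 AND d_year = 2001"

print(scale_dependent_literals(sf1, sf100))  # [('2000', '2001')]
```

Running this across all query files for two generated sets would show whether parameterizing by scale factor is a small substitution job or a larger rewrite.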

comphead (Contributor) left a comment

Thanks @andygrove, one question though.
A while back, @mbutrovich and I investigated the efficiency of pregenerated queries, and at that time 40% of the TPC-DS queries returned no results, which might affect benchmarks. We managed to improve the set so that only 18% of the queries return no results.

For this TPC-* set, how many queries return 0 rows?

andygrove (Member Author)
Thanks @andygrove, one question though. A while back, @mbutrovich and I investigated the efficiency of pregenerated queries, and at that time 40% of the TPC-DS queries returned no results, which might affect benchmarks. We managed to improve the set so that only 18% of the queries return no results.

For this TPC-* set, how many queries return 0 rows?

I don't know. The goal of this PR is just to move them from datafusion-benchmarks to this repo so that we can include them in Docker images for docker-compose and k8s without depending on another repo.

When I do the next benchmark run, I will record how many rows are returned.
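Once row counts are recorded, the zero-row check comphead asks about is a simple tally. A minimal sketch, assuming the recorded counts are available as a query-name-to-row-count mapping (the names and numbers below are made up for illustration):

```python
def zero_row_fraction(row_counts: dict[str, int]) -> float:
    """Return the fraction of queries whose recorded result had 0 rows."""
    zero = sum(1 for n in row_counts.values() if n == 0)
    return zero / len(row_counts)

# Example data only; real counts would come from a benchmark run.
counts = {"q1": 100, "q2": 0, "q3": 57, "q4": 0}
print(f"{zero_row_fraction(counts):.0%} of queries returned no rows")  # 50%
```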

andygrove (Member Author)

Thanks for the reviews @mbutrovich @comphead. I'll have the next PR up today to add support for docker-compose.

@andygrove andygrove merged commit d2e3c26 into apache:main Feb 23, 2026
112 checks passed
@andygrove andygrove deleted the bundle-tpc-queries branch February 23, 2026 16:22
andygrove (Member Author)

@comphead I created #3582 to start recording row counts and result hashes when running benchmarks
