chore: Add consistency checks and result hashing to TPC benchmarks #3582

Merged
andygrove merged 10 commits into apache:main from andygrove:tpc-benchmark-consistency-checks on Feb 24, 2026

@andygrove (Member) commented Feb 24, 2026

Summary

  • Add result hash (MD5) and row count tracking to tpcbench.py to detect correctness differences between benchmark runs
  • Update generate-comparison.py to support both old (list) and new (dict) result formats
  • Add cross-file consistency validation (row count and result hash mismatches) with warnings
  • Handle missing queries gracefully by computing the common query set across all result files
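The format handling and cross-file checks listed above could be sketched roughly as follows. This is an illustration only, not the code in generate-comparison.py: the function names `normalize`, `common_queries`, and `check_consistency` are hypothetical, and the sketch assumes each result file has already been loaded into a dict that maps query ids to either a list of durations (old format) or a dict with `durations`, `row_count`, and `result_hash` (new format).

```python
def normalize(entry):
    """Return one uniform dict for either result format."""
    if isinstance(entry, list):
        # Old format: a bare list of durations, no correctness metadata.
        return {"durations": entry, "row_count": None, "result_hash": None}
    return {
        "durations": entry.get("durations", []),
        "row_count": entry.get("row_count"),
        "result_hash": entry.get("result_hash"),
    }


def common_queries(results_by_file):
    """Query ids that appear in every result file."""
    key_sets = [set(results) for results in results_by_file.values()]
    return set.intersection(*key_sets) if key_sets else set()


def check_consistency(results_by_file):
    """Return one warning message per field that differs between files."""
    messages = []
    # Sort "2" before "10" by comparing length first (ids are digit strings).
    for query in sorted(common_queries(results_by_file), key=lambda q: (len(q), q)):
        entries = {
            name: normalize(results[query])
            for name, results in results_by_file.items()
        }
        for field in ("row_count", "result_hash"):
            # Old-format files carry no metadata, so they are skipped here.
            values = {
                name: entry[field]
                for name, entry in entries.items()
                if entry[field] is not None
            }
            if len(set(values.values())) > 1:
                messages.append(f"query {query}: {field} mismatch: {values}")
    return messages
```

Restricting the comparison to the common query set means a query that failed or was skipped in one run drops out of the comparison instead of raising a `KeyError`.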

Example result file format:

  "1": {
      "durations": [
          1.65323805809021
      ],
      "row_count": 4,
      "result_hash": "a6d3abeac576021dcf8f68d02fe03073"
  }
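A minimal sketch of how the `row_count` and `result_hash` fields might be derived from collected query results. The helper name `hash_rows`, the `|` separator, the `NULL` marker, and the decision to sort rows before hashing are all assumptions made for illustration; the PR description only states that an MD5 hash and a row count are recorded.

```python
import hashlib


def hash_rows(rows):
    """Return (row_count, md5_hex) for an iterable of row tuples."""
    # Render each row to a single line of text. None becomes "NULL" so it
    # cannot be confused with an empty string.
    rendered = [
        "|".join("NULL" if value is None else str(value) for value in row)
        for row in rows
    ]
    # Assumption: sort the rendered rows so the hash does not depend on the
    # order in which the engine happened to return them.
    rendered.sort()
    digest = hashlib.md5()
    for line in rendered:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return len(rendered), digest.hexdigest()
```

With this approach, two runs that return the same rows in a different order produce the same hash. Values that differ only in floating-point rounding would still hash differently, so a real implementation may need to round or format such columns first.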

@andygrove andygrove changed the title from "Add consistency checks and result hashing to TPC benchmarks" to "chore: Add consistency checks and result hashing to TPC benchmarks" Feb 24, 2026
@andygrove andygrove force-pushed the tpc-benchmark-consistency-checks branch from d869058 to f1783e5 February 24, 2026 14:26
andygrove and others added 3 commits February 24, 2026 07:58
Change `+ N days` to `+ interval N days` in 15 TPC-DS queries
(q5, q12, q16, q20, q21, q32, q37, q40, q77, q80, q82, q92, q94,
q95, q98) so they parse correctly in Spark SQL.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
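
To make the `+ N days` to `+ interval N days` change concrete, the rewrite can be expressed as a small regular-expression substitution. The PR does not say how the 15 queries were edited, so this is only an illustration of the before and after forms; `use_interval_syntax` is a hypothetical helper.

```python
import re

# Matches "+ 30 days" style date arithmetic: a plus sign, an integer, "days".
_PLUS_N_DAYS = re.compile(r"\+\s*(\d+)\s+days\b", re.IGNORECASE)


def use_interval_syntax(sql):
    """Rewrite '+ N days' as '+ interval N days'."""
    return _PLUS_N_DAYS.sub(r"+ interval \1 days", sql)
```

Text that already uses `+ interval N days` is left unchanged, because the pattern requires the number to follow the plus sign directly.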

@mbutrovich (Contributor) left a comment

Makes sense to me, thanks @andygrove!

Rename comet.toml to comet-hashjoin.toml (keeps replaceSortMergeJoin=true)
and create a new comet.toml without it, since replacing SMJ with hash
join causes OOM on TPC-DS workloads.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@andygrove andygrove marked this pull request as ready for review February 24, 2026 15:11
andygrove and others added 5 commits February 24, 2026 09:01
Same approach as comet: comet-iceberg no longer replaces SMJ, and
comet-iceberg-hashjoin is available for when hash join is desired.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The spark.eventLog.dir was hardcoded to /results/spark-events, a
Docker-only path. Make it configurable via SPARK_EVENT_LOG_DIR env var,
defaulting to /tmp/spark-events so local runs work out of the box.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
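
The environment-variable fallback described in this commit could look something like the sketch below. Only `SPARK_EVENT_LOG_DIR`, `spark.eventLog.dir`, and the `/tmp/spark-events` default come from the commit message; the function name `event_log_conf` and the returned dict shape are assumptions for illustration.

```python
import os

DEFAULT_EVENT_LOG_DIR = "/tmp/spark-events"


def event_log_conf(env=None):
    """Build the event log settings, honoring SPARK_EVENT_LOG_DIR if set."""
    env = os.environ if env is None else env
    # An unset or empty variable falls back to the local default.
    log_dir = env.get("SPARK_EVENT_LOG_DIR") or DEFAULT_EVENT_LOG_DIR
    return {
        "spark.eventLog.enabled": "true",
        "spark.eventLog.dir": log_dir,
    }
```

A Docker setup can still export `SPARK_EVENT_LOG_DIR=/results/spark-events` to keep the previous location, while local runs need no extra configuration.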
…in TPC-DS q30

The customer table column is c_last_review_date, not c_last_review_date_sk.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@andygrove andygrove merged commit 77fdeb7 into apache:main Feb 24, 2026
108 checks passed
@andygrove andygrove deleted the tpc-benchmark-consistency-checks branch February 24, 2026 19:43