I'm opening this epic to track improvements and changes we want to make to our benchmarking setup.
I'll start by collecting some relevant issues:
- Run all benchmarks on merge to main branch #15511
- Run DataFusion benchmarks regularly and track performance history over time #5504
- Support multiple (>2) results comparison in benchmark scripts #13446
- [DISCUSS] Release retrospective #21034
- Fix memory accounting in Datafusion #20714
I think we should discuss in this issue what we want from our benchmarking setup and use that to guide how we improve it.
Trackable over time
This is important for gating releases and catching regressions early. Currently we only really run benchmarks in a PR to compare against main, or when we go to update ClickBench. My proposal would be that (cost permitting) we run benchmarks on every merge to main and post a comment on the PR if they regressed (so one run per PR rather than every commit), or at the very least run them on RC branches and compare against the previous release.
Proposal
I think we should target Codspeed compatibility. They are generous with open source and offer a great platform to track results over time.
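If we go the GitHub Actions route, the on-merge runs could be triggered with something like the following workflow. This is a sketch; the file name, job contents, and the plain `cargo bench` step are all illustrative, not an existing setup:

```yaml
# .github/workflows/bench-main.yml (sketch; names and steps illustrative)
name: benchmarks-on-main
on:
  push:
    branches: [main]
jobs:
  bench:
    runs-on: ubuntu-latest   # or a self-hosted runner, per the proposal further down
    steps:
      - uses: actions/checkout@v4
      - run: cargo bench
```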
Can run slow/complex SQL benchmarks
Think ClickBench. I think this might rule out criterion, at least for this style of benchmark (see bheisler/criterion.rs#320). We can still use criterion for smaller, faster benchmarks.
We also need a harness that supports loading data and can give us both cold and hot numbers.
Proposal
Because of the limitations with criterion, I think we should try divan, at least for the slow tests. It might make sense to use it for all tests, although I have found its APIs to be a bit less flexible than criterion's. Divan is recommended by Codspeed.
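For illustration, a minimal Divan benchmark could look like the following. This is a sketch, not a real DataFusion bench: the bench name and body are hypothetical, `divan` would need to be a dev-dependency, and CodSpeed publishes a `codspeed-divan-compat` drop-in crate for instrumented runs:

```rust
// benches/example.rs — requires `divan` as a dev-dependency in Cargo.toml.
// Bench name and body are illustrative only.

fn main() {
    // Runs all benchmarks registered via #[divan::bench] in this binary.
    divan::main();
}

#[divan::bench]
fn plan_simple_query() {
    // The planning/execution work under test would go here;
    // black_box keeps the compiler from optimizing the work away.
    divan::black_box(1 + 1);
}
```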
Reliable, deterministic results
Although Codspeed's simulation mode can combat some noisy-neighbor effects, it doesn't solve the issue completely. We also have benchmarks like ClickBench (anything over ~100ms, really) that would be too slow to run instrumented this way.
Proposal
We set up self-hosted runners on GKE that launch each run as a k8s Job on a Performance-class (isolated) node and use these for benchmark runs. This would let us contain the entire setup in the datafusion repo and sidestep most of the faff with auth. We could also use these runners for anything else that needs more oomph.
Small/quick benchmarks would be run in Codspeed simulation mode, slow ones in Codspeed walltime mode.
Can write SQL benchmarks
Currently writing benchmarks requires using dataframe APIs and quite a bit of ceremony (recent example: #21180).
I would like it to be possible to write SQL benchmarks, including with some SQL or non-SQL setup (this could be a bash script to download data), even if a bit of Rust is required (e.g. `sql_bench!("../q1/")`).
This has several advantages:
- Less code / boilerplate.
- Benchmarks are more in line with real world usage.
- Can tweak benchmarks without recompiling.
Proposal
Macros/harness code to easily point at a directory of SQL files and generate test cases with setup, etc.
Can run w/ different configs without recompile
We often find it useful to run e.g. with a memory limit, filter pushdown on or off or other DataFusion configs.
Our solution should be able to run with custom settings, including different settings for the base and the branch (e.g. if adding a feature-flagged change, we should be able to test main without the feature flag vs. the change with the feature flag on).
Changing configs should not require a recompile; they should be settable via SQL or via env vars.
Can do a "quick run"
We want to be able to run benchmarks just to verify the results are correct, or as a test during development.
Proposal
We should have CI run benchmarks in non-release mode with one iteration/sample of each benchmark. TBD whether Divan supports setting one sample via CLI args.
Triggerable on PRs
We need to be able to manually trigger a benchmark run on PRs that we think might impact performance.
Proposal
If we have self-hosted runners, we can do comment scraping easily within GitHub Actions.
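GitHub Actions' `issue_comment` event covers the comment-scraping part. A sketch, with the `/benchmark` command string, runner labels, and `cargo bench` step as assumptions:

```yaml
# Sketch: run benchmarks when someone comments `/benchmark` on a PR.
name: pr-benchmark
on:
  issue_comment:
    types: [created]
jobs:
  bench:
    # issue_comment fires on issues too; this restricts it to PR comments.
    if: github.event.issue.pull_request && contains(github.event.comment.body, '/benchmark')
    runs-on: [self-hosted, benchmarks]   # labels illustrative
    steps:
      - uses: actions/checkout@v4
      - run: cargo bench
```

We'd likely also want a permission check (e.g. restrict to committers) before running untrusted PR code on self-hosted runners.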
Memory profiling / allocation tracking
We don't really have any benchmarks for memory use, and there have been multiple memory-use regressions.
Proposal
Divan has a DHAT-like allocation-tracking mode.