[EPIC] Benchmark improvements #21165

@adriangb

Description

I'm opening this epic to track improvements and changes we want to make to our benchmarking setup.

I'll start by collecting some relevant issues:

I think we should discuss in this issue what we want from our benchmarking setup and use that to guide how we improve it.

Trackable over time

This is important for gating releases and catching regressions early. Currently we only run benchmarks ad hoc in a PR to compare against main, or when we go to update ClickBench. My proposal would be that (cost permitting) we run benchmarks on every merge to main and post a comment on the PR if they regressed (so one run per PR rather than per commit), or at the very least that we run them on RC branches, comparing against the previous release.

Proposal

I think we should target Codspeed compatibility. They are generous with open source projects and offer a great platform for tracking results over time.

Can run slow/complex SQL benchmarks

Think ClickBench. I think this rules out criterion, at least for this style of benchmark (see bheisler/criterion.rs#320). We can still use criterion for smaller, faster benchmarks.

We also need a harness that supports loading data and can give us both cold and hot numbers.

Proposal

Because of the limitations with criterion, I think we should try divan, at least for the slow tests. It might make sense to just use it for all tests, although I have found its APIs to be a bit less flexible than criterion's. Divan is also recommended by Codspeed.

Reliable, deterministic results

Although Codspeed's simulation mode can combat some noisy-neighbor effects, it doesn't solve the issue completely. We also have benchmarks like ClickBench (really anything over ~100ms) that would be too slow to run instrumented this way.

Proposal

We set up self-hosted runners on GKE that launch each run as a Kubernetes Job on a Performance-class (isolated) node and use these for benchmark runs. This would let us contain the entire setup in the datafusion repo and sidestep most of the faff with auth. We could also use these runners for anything else that needs more oomph.

Small/quick benchmarks would be run in Codspeed simulation mode, slow ones in Codspeed walltime mode.

Can write SQL benchmarks

Currently, writing benchmarks requires using DataFrame APIs and quite a bit of ceremony (recent example: #21180).
I would like it to be possible to write SQL benchmarks, including some SQL or non-SQL setup (this could be a bash script that downloads data), even if a bit of Rust is required (e.g. `sql_bench!("../q1/")`).

This has several advantages:

  1. Less code / boilerplate.
  2. Benchmarks are more in line with real world usage.
  3. Can tweak benchmarks without recompiling.

Proposal

Macros/harness code that makes it easy to point at a directory of SQL files and generate benchmark cases with setup, etc.
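A rough sketch of what the discovery step of such a harness could look like, using only the standard library. The directory layout, file names, and `discover_sql_cases` are all made up for illustration.

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Collect every `.sql` file under `dir`, sorted by name, so a macro like
/// the hypothetical `sql_bench!("../q1/")` could turn each one into a case.
fn discover_sql_cases(dir: &Path) -> std::io::Result<Vec<PathBuf>> {
    let mut cases: Vec<PathBuf> = fs::read_dir(dir)?
        .filter_map(|entry| entry.ok().map(|e| e.path()))
        .filter(|p| p.extension().map_or(false, |ext| ext == "sql"))
        .collect();
    cases.sort();
    Ok(cases)
}

fn main() -> std::io::Result<()> {
    // Example: write a query file and a setup script, then discover cases.
    let dir = std::env::temp_dir().join("sql_bench_demo");
    fs::create_dir_all(&dir)?;
    fs::write(dir.join("q1.sql"), "SELECT 1;")?;
    fs::write(dir.join("setup.sh"), "echo downloading data")?;
    for case in discover_sql_cases(&dir)? {
        // Only the .sql file is picked up, not the setup script.
        println!("case: {}", case.file_name().unwrap().to_string_lossy());
    }
    Ok(())
}
```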

Can run w/ different configs without recompile

We often find it useful to run with, e.g., a memory limit, filter pushdown on or off, or other DataFusion configs.
Our solution should be able to run with custom settings, including different settings for the base and the branch (e.g. when adding a feature-flagged change, we should be able to benchmark main without the feature flag against the branch with the feature flag on).

Changing configs should not require a recompile; they should be settable via SQL or environment variables.
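As a sketch of the pattern (not DataFusion's actual config machinery), the harness could collect overrides from prefixed environment variables so the same compiled binary runs under different settings. The `DF_BENCH_` prefix and function name are made up.

```rust
use std::collections::BTreeMap;

/// Collect config overrides from environment-style `(key, value)` pairs
/// carrying a given prefix. In the real harness the input would be
/// `std::env::vars()`; a plain Vec is used here so the example is testable.
fn config_overrides<I>(vars: I, prefix: &str) -> BTreeMap<String, String>
where
    I: IntoIterator<Item = (String, String)>,
{
    vars.into_iter()
        .filter_map(|(key, value)| {
            key.strip_prefix(prefix)
                .map(|rest| (rest.to_ascii_lowercase(), value))
        })
        .collect()
}

fn main() {
    // Simulated environment: one override plus an unrelated variable.
    let simulated = vec![
        ("DF_BENCH_TARGET_PARTITIONS".to_string(), "4".to_string()),
        ("PATH".to_string(), "/usr/bin".to_string()),
    ];
    for (key, value) in config_overrides(simulated, "DF_BENCH_") {
        println!("override: {key} = {value}"); // prints "override: target_partitions = 4"
    }
}
```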

Can do a "quick run"

We want to be able to run benchmarks just to verify the results are correct, or as a test during development.

Proposal

We should have CI run the benchmarks in non-release mode with one iteration/sample of each benchmark. TBD whether divan supports setting a single sample via CLI args.

Triggerable on PRs

We need to be able to manually trigger a benchmark run on PRs that we suspect might impact performance.

Proposal

If we have self-hosted runners, we can do comment scraping easily within GitHub Actions.
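A rough sketch of what the trigger could look like as a workflow; the runner labels, the `/benchmark` command, and the bench invocation are all placeholders, and a real workflow would also need to check out the PR's head ref and gate on commenter permissions.

```yaml
# Hypothetical workflow: run benchmarks when someone comments `/benchmark`
# on a PR. issue_comment fires on PR comments too, hence the guard below.
name: pr-benchmark
on:
  issue_comment:
    types: [created]
jobs:
  benchmark:
    if: github.event.issue.pull_request && startsWith(github.event.comment.body, '/benchmark')
    runs-on: [self-hosted, performance]   # placeholder labels for the GKE runners
    steps:
      - uses: actions/checkout@v4
      - run: cargo bench                  # placeholder for the real harness invocation
```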

Memory profiling / allocation tracking

We don't really have any benchmarks for memory use, and there have been multiple memory-use regressions.

Proposal

Divan has a dhat-like allocation-profiling mode.
