I'm opening this epic to track improvements and changes we want to make to our benchmarking setup.
I'll start by collecting some relevant issues:
- Run all benchmarks on merge to main branch #15511
- Run DataFusion benchmarks regularly and track performance history over time #5504
- Support multiple (>2) results comparison in benchmark scripts #13446
- [DISCUSS] Release retrospective #21034
- Fix memory accounting in Datafusion #20714
I think we should discuss in this issue what we want from our benchmarking setup and use that to guide how we improve it.
Trackable over time
This is important for gating releases and catching regressions early. Currently we only really run benchmarks in a PR to compare against main, or when we go to update ClickBench. My proposal would be that (cost permitting) we run benchmarks on every merge to main and post a comment on the PR if they regressed (so one run per PR rather than every commit), or at the very least run them on RC branches and compare against the previous release.
Proposal
I think we should target Codspeed compatibility. They are generous with open source and offer a great platform to track results over time.
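If we go the GitHub Actions route, the on-merge runs could be triggered with something like the following workflow. This is a sketch; the file name, job contents, and the plain `cargo bench` step are all illustrative, not an existing setup:

```yaml
# .github/workflows/bench-main.yml (sketch; names and steps illustrative)
name: benchmarks-on-main
on:
  push:
    branches: [main]
jobs:
  bench:
    runs-on: ubuntu-latest   # or a self-hosted runner, per the proposal further down
    steps:
      - uses: actions/checkout@v4
      - run: cargo bench
```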
Can run slow/complex SQL benchmarks
Think ClickBench. I think this might rule out criterion, at least for this style of benchmark (see bheisler/criterion.rs#320). We can still use criterion for smaller, faster benchmarks.
We also need a harness that supports loading data and can give us both cold and hot numbers.
Proposal
Because of the limitations with criterion, I think we should try divan, at least for the slow tests. It might make sense to use it for all tests, although I have found its APIs to be a bit less flexible than criterion's. Divan is recommended by Codspeed.
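For illustration, a minimal Divan benchmark could look like the following. This is a sketch, not a real DataFusion bench: the bench name and body are hypothetical, `divan` would need to be a dev-dependency, and CodSpeed publishes a `codspeed-divan-compat` drop-in crate for instrumented runs:

```rust
// benches/example.rs — requires `divan` as a dev-dependency in Cargo.toml.
// Bench name and body are illustrative only.

fn main() {
    // Runs all benchmarks registered via #[divan::bench] in this binary.
    divan::main();
}

#[divan::bench]
fn plan_simple_query() {
    // The planning/execution work under test would go here;
    // black_box keeps the compiler from optimizing the work away.
    divan::black_box(1 + 1);
}
```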
Reliable, deterministic results
Although Codspeed's simulation mode can combat some noisy-neighbor effects, it doesn't solve the issue completely. We also have benchmarks like ClickBench (anything over ~100ms, really) that would be too slow to run instrumented this way.
Proposal
We set up self-hosted runners on GKE that launch each run as a k8s Job on a Performance-class (isolated) node and use these for benchmark runs. This would let us contain the entire setup in the datafusion repo and sidestep most of the faff with auth. We could also use these runners for anything else that needs more oomph.
Small/quick benchmarks would be run in Codspeed simulation mode, slow ones in Codspeed walltime mode.
Can write SQL benchmarks
Currently writing benchmarks requires using dataframe APIs and quite a bit of ceremony (recent example: #21180).
I would like it to be possible to write SQL benchmarks, including with some SQL or non-SQL setup (this could be a bash script to download data), even if a bit of Rust is required (e.g. `sql_bench!("../q1/")`).
This has several advantages:
- Less code / boilerplate.
- Benchmarks are more in line with real world usage.
- Can tweak benchmarks without recompiling.
Proposal
Macros/harness code to easily point at a directory of SQL files and generate test cases with setup, etc.
Can run w/ different configs without recompile
We often find it useful to run e.g. with a memory limit, filter pushdown on or off or other DataFusion configs.
Our solution should be able to run with custom settings, including different settings for the base and the branch (e.g. if adding a feature-flagged change, we should be able to test main without the feature flag vs. the change with the feature flag on).
Changing configs should not require a recompile; they should be settable via SQL or via env vars.
Can do a "quick run"
We want to be able to run benchmarks just to verify the results are correct, or as a test during development.
Proposal
We should have CI run benchmarks in non-release mode with one iteration/sample of each benchmark. TBD whether Divan supports setting one sample via CLI args.
Triggerable on PRs
We need to be able to manually trigger a benchmark run on PRs that we think might impact performance.
Proposal
If we have self-hosted runners, we can do comment scraping easily within GitHub Actions.
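GitHub Actions' `issue_comment` event covers the comment-scraping part. A sketch, with the `/benchmark` command string, runner labels, and `cargo bench` step as assumptions:

```yaml
# Sketch: run benchmarks when someone comments `/benchmark` on a PR.
name: pr-benchmark
on:
  issue_comment:
    types: [created]
jobs:
  bench:
    # issue_comment fires on issues too; this restricts it to PR comments.
    if: github.event.issue.pull_request && contains(github.event.comment.body, '/benchmark')
    runs-on: [self-hosted, benchmarks]   # labels illustrative
    steps:
      - uses: actions/checkout@v4
      - run: cargo bench
```

We'd likely also want a permission check (e.g. restrict to committers) before running untrusted PR code on self-hosted runners.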
Memory profiling / allocation tracking
We don't really have any benchmarks for memory use, and there have been multiple memory-use regressions.
Proposal
Divan has a DHAT-like allocation-tracking mode.