40 changes: 24 additions & 16 deletions datafusion-partitioned/README.md
@@ -1,38 +1,46 @@
# DataFusion

DataFusion is an extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format. For more information, please check <https://arrow.apache.org/datafusion/user-guide/introduction.html>
[Apache DataFusion] is an extensible query execution framework, written in Rust, that uses [Apache Arrow] as its in-memory format. For more information, please check <https://arrow.apache.org/datafusion/user-guide/introduction.html>

[Apache DataFusion]: https://arrow.apache.org/datafusion/
[Apache Arrow]: https://arrow.apache.org/

This benchmark uses Parquet files: it creates an external table over them and then executes the queries.

## Generate benchmark results
## Cookbook: Generate benchmark results

The benchmark should complete in under an hour. On-demand pricing is $0.6 per hour, while spot pricing is only $0.2 to $0.3 per hour (us-east-2).

1. manually start an AWS EC2 instance
- `c6a.4xlarge`
- Ubuntu 22.04 or later
- Root 500GB gp2 SSD
- no EBS optimized
- no instance store
1. wait for status check passed, then ssh to EC2 `ssh ubuntu@{ip}`
1. `git clone https://github.com/ClickHouse/ClickBench`
1. `cd ClickBench/datafusion`
1. `vi benchmark.sh` and modify following line to target Datafusion version
1. Manually start an AWS EC2 instance. Results for the following environments are included in this directory:

| Instance Type | OS | Disk | Arch |
| :-----------: | :---------------------: | :----------------: | :---: |
| `c6a.xlarge` | `Ubuntu 24.04` or later | Root 500GB gp2 SSD | AMD64 |
| `c6a.2xlarge` | | | AMD64 |
| `c6a.4xlarge` | | | AMD64 |
| `c8g.4xlarge` | | | ARM64 |

None of the instances use EBS optimization or an instance store. The `c6a.xlarge` instance does not have enough memory to compile DataFusion, so enabling swap is recommended, for example a 4 GB swap file (change `4G` to `8G` for an 8 GB one): ```sudo fallocate -l 4G /swapfile && sudo chmod 600 /swapfile && sudo mkswap /swapfile && sudo swapon /swapfile```.

2. Wait for the status checks to pass, then SSH into the instance: `ssh ubuntu@{ip}`
3. `git clone https://github.com/ClickHouse/ClickBench`
4. `cd ClickBench/datafusion-partitioned`
5. `vi benchmark.sh` and modify the following line to target the desired DataFusion version

```bash
git checkout 46.0.0
```

1. `bash benchmark.sh`
6. `bash benchmark.sh`
7. Update the corresponding `.json` file under `results`, or run `./save-result.sh` with the instance type, e.g. `./save-result.sh c6a.4xlarge`
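
The swap setup mentioned in step 1 can be sketched as a small helper. The four commands are the ones quoted in the README; the function names, the size argument, and the `DRY_RUN` switch are illustrative assumptions and are not part of the repository. By default the helper only prints what it would run.

```shell
#!/bin/bash
# Sketch only: create and enable a swap file of a chosen size.
# `run`, `enable_swap`, and DRY_RUN are hypothetical names for illustration.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"        # dry run: print the command instead of executing it
    else
        "$@"
    fi
}

enable_swap() {
    local size="$1"                 # e.g. 4G or 8G
    local file="${2:-/swapfile}"    # swap file path, /swapfile as in the README
    run sudo fallocate -l "$size" "$file" &&
    run sudo chmod 600 "$file" &&
    run sudo mkswap "$file" &&
    run sudo swapon "$file"
}

# Prints the four commands; nothing on the machine is changed.
enable_swap 4G
```

Running `DRY_RUN=0 enable_swap 4G` would execute the commands for real and requires sudo rights on the instance.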

### Known Issues

1. Importing Parquet with `datafusion-cli` does not apply a schema, so some casts are needed in `queries.sql` (e.g. converting `EventTime` from Int to Timestamp via `to_timestamp_seconds`).
2. Importing Parquet with `datafusion-cli` makes column names case-sensitive, so all column names in `queries.sql` are written as double-quoted identifiers (e.g. `EventTime` -> `"EventTime"`).
3. Comparing binary with UTF-8 and grouping by binary do not work on macOS; queries that involve binary columns fail there with errors (see apache/arrow-datafusion#3050).
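
As an illustration of the first two issues, the sketch below writes a one-query SQL file that double-quotes the column name and casts it with `to_timestamp_seconds`. The table name `hits`, the column alias, and the temporary file are assumptions made for this example; the query is not taken from `queries.sql`.

```shell
#!/bin/bash
# Sketch only: a query that applies both workarounds.
# Assumes the external table is called `hits`.

sql_file="$(mktemp)"

cat > "$sql_file" <<'EOF'
-- Double quotes keep the mixed-case column name intact;
-- to_timestamp_seconds casts the integer EventTime to a timestamp.
SELECT to_timestamp_seconds("EventTime") AS event_time
FROM hits
LIMIT 5;
EOF

cat "$sql_file"

# With datafusion-cli installed and the data downloaded, the file could be
# run after the table definition, mirroring the README invocation:
#   datafusion-cli -f create_single.sql "$sql_file"
```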

## Generate full human readable results (for debugging)

1. install datafusion-cli
2. download the parquet ```wget --continue --progress=dot:giga https://datasets.clickhouse.com/hits_compatible/hits.parquet```
3. execute it ```datafusion-cli -f create_single.sql queries.sql``` or ```bash run2.sh```
2. download the partitioned parquet files ```seq 0 99 | xargs -P100 -I{} bash -c 'wget --directory-prefix partitioned --continue --progress=dot:giga https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_{}.parquet'```
3. execute the queries with ```datafusion-cli -f create_single.sql queries.sql``` or ```PATH="$(pwd)/datafusion/target/release:$PATH" ./run.sh```
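
The download step can also be written as two small functions, which makes the list of 100 partition URLs easy to inspect before fetching anything. This is a sketch: the function names and the default of 8 parallel downloads are assumptions (the README command uses 100), while the URL pattern and `wget` flags are the ones shown above.

```shell
#!/bin/bash
# Sketch only: list and download the 100 partitioned Parquet files.
# `partition_urls` and `download_partitions` are hypothetical helper names.

BASE_URL="https://datasets.clickhouse.com/hits_compatible/athena_partitioned"

partition_urls() {
    local i
    for i in $(seq 0 99); do
        echo "${BASE_URL}/hits_${i}.parquet"
    done
}

download_partitions() {
    local jobs="${1:-8}"    # assumed default; pass 100 to match the README
    mkdir -p partitioned
    # xargs appends one URL to the end of each wget command line.
    partition_urls | xargs -P "$jobs" -n 1 \
        wget --directory-prefix partitioned --continue --progress=dot:giga
}

# Inspect the first two URLs; nothing is downloaded until
# download_partitions is called.
partition_urls | head -n 2
```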
31 changes: 17 additions & 14 deletions datafusion-partitioned/benchmark.sh
@@ -1,22 +1,25 @@
#!/bin/bash

echo "Install Rust"
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs > rust-init.sh
bash rust-init.sh -y
export HOME=${HOME:=~}
source ~/.cargo/env

echo "Install Dependencies"
sudo apt-get update -y
sudo apt-get install -y gcc

echo "Install DataFusion main branch"
git clone https://github.com/apache/arrow-datafusion.git
cd arrow-datafusion/
git checkout 47.0.0
CARGO_PROFILE_RELEASE_LTO=true RUSTFLAGS="-C codegen-units=1" cargo build --release --package datafusion-cli --bin datafusion-cli
export PATH="`pwd`/target/release:$PATH"
cd ..
echo "Install Homebrew"
# This requires password input for sudo, which is not set by default.
# You may need to run the following command to set a password first:
# ```
# sudo su
# passwd ubuntu
# exit
# ```
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
echo >> /home/ubuntu/.bashrc
echo 'eval "$(/home/linuxbrew/.linuxbrew/bin/brew shellenv bash)"' >> /home/ubuntu/.bashrc
eval "$(/home/linuxbrew/.linuxbrew/bin/brew shellenv bash)"

echo "Install datafusion-cli"
# or use `brew install datafusion@52` to install a specific version
brew install datafusion
datafusion-cli --version

echo "Download benchmark target data, partitioned"
mkdir -p partitioned
90 changes: 45 additions & 45 deletions datafusion-partitioned/results/c6a.2xlarge.json
@@ -1,6 +1,6 @@
{
"system": "DataFusion (Parquet, partitioned)",
"date": "2025-07-10",
"date": "2026-01-15",
"machine": "c6a.2xlarge",
"cluster_size": 1,
"proprietary": "no",
@@ -10,48 +10,48 @@
"load_time": 0,
"data_size": 14737666736,
"result": [
[0.068, 0.022, 0.021],
[0.167, 0.06, 0.059],
[0.362, 0.144, 0.147],
[0.523, 0.109, 0.113],
[1.644, 1.224, 1.334],
[1.719, 1.167, 1.174],
[0.13, 0.037, 0.038],
[0.181, 0.07, 0.065],
[1.803, 1.414, 1.398],
[2.079, 1.591, 1.617],
[0.875, 0.396, 0.381],
[1.016, 0.452, 0.44],
[1.702, 1.216, 1.197],
[3.255, 1.883, 1.93],
[1.629, 1.124, 1.237],
[1.816, 1.529, 1.51],
[3.179, 2.585, 2.593],
[2.891, 2.197, 2.287],
[6.073, 4.78, 4.877],
[0.597, 0.1, 0.101],
[9.674, 1.35, 1.344],
[11.432, 1.673, 1.652],
[22.163, 3.015, 3.05],
[55.44, 46.286, 43.371],
[2.831, 0.611, 0.604],
[1.025, 0.535, 0.558],
[2.845, 0.724, 0.724],
[9.733, 2.09, 2.088],
[19.263, 18.559, 18.21],
[0.953, 0.806, 0.774],
[2.548, 1.265, 1.166],
[6.191, 1.162, 1.161],
[5.003, 4.177, 4.193],
[10.349, 4.795, 4.817],
[10.307, 4.831, 4.884],
[2.14, 1.835, 1.843],
[0.352, 0.121, 0.111],
[0.217, 0.056, 0.058],
[0.328, 0.11, 0.109],
[0.47, 0.156, 0.157],
[0.201, 0.05, 0.046],
[0.186, 0.046, 0.046],
[0.174, 0.041, 0.044]
]
[0.052, 0.002, 0.002],
[0.117, 0.040, 0.038],
[0.950, 0.116, 0.111],
[2.713, 0.100, 0.108],
[2.921, 1.162, 1.009],
[3.116, 1.176, 1.047],
[0.055, 0.002, 0.002],
[0.126, 0.041, 0.043],
[3.124, 1.198, 1.194],
[4.286, 1.531, 1.493],
[2.358, 0.276, 0.275],
[2.714, 0.312, 0.290],
[3.249, 1.089, 0.965],
[6.469, 1.600, 1.630],
[3.244, 1.031, 1.036],
[2.522, 1.228, 1.260],
[6.138, 2.155, 2.165],
[6.118, 2.022, 2.108],
[11.294, 4.265, 4.152],
[1.706, 0.091, 0.091],
[20.960, 1.253, 1.267],
[23.958, 1.558, 1.453],
[45.677, 2.494, 2.559],
[108.672, 95.195, 91.845],
[1.474, 0.157, 0.159],
[3.367, 0.327, 0.323],
[1.546, 0.156, 0.155],
[21.312, 1.754, 1.709],
[19.173, 15.870, 15.832],
[0.859, 0.756, 0.750],
[7.448, 0.959, 1.028],
[15.002, 1.040, 1.054],
[11.322, 3.872, 3.830],
[20.749, 4.133, 4.390],
[20.763, 4.043, 4.438],
[1.892, 1.689, 1.658],
[0.170, 0.049, 0.055],
[0.126, 0.037, 0.033],
[0.179, 0.058, 0.058],
[0.464, 0.076, 0.074],
[0.122, 0.020, 0.024],
[0.133, 0.017, 0.021],
[0.094, 0.020, 0.016]
]
}
90 changes: 45 additions & 45 deletions datafusion-partitioned/results/c6a.4xlarge.json
@@ -1,6 +1,6 @@
{
"system": "DataFusion (Parquet, partitioned)",
"date": "2025-07-10",
"date": "2026-01-15",
"machine": "c6a.4xlarge",
"cluster_size": 1,
"proprietary": "no",
@@ -10,48 +10,48 @@
"load_time": 0,
"data_size": 14737666736,
"result": [
[0.058, 0.017, 0.015],
[0.116, 0.035, 0.037],
[0.2, 0.084, 0.088],
[0.43, 0.081, 0.084],
[1.086, 0.78, 0.799],
[0.977, 0.751, 0.756],
[0.086, 0.026, 0.026],
[0.125, 0.04, 0.037],
[1.011, 0.882, 0.862],
[1.349, 0.971, 0.983],
[0.565, 0.231, 0.24],
[0.677, 0.264, 0.265],
[1.062, 0.816, 0.82],
[2.769, 1.346, 1.201],
[1.135, 0.792, 0.78],
[1.021, 0.926, 0.916],
[2.638, 1.639, 1.63],
[2.585, 1.555, 1.592],
[5.159, 3.238, 3.24],
[0.26, 0.077, 0.077],
[10.045, 1.067, 1.082],
[11.424, 1.291, 1.269],
[22.117, 2.487, 2.511],
[55.492, 9.765, 9.851],
[2.825, 0.432, 0.423],
[0.853, 0.328, 0.33],
[2.837, 0.508, 0.504],
[9.744, 1.469, 1.478],
[9.444, 9.445, 9.475],
[0.515, 0.405, 0.415],
[2.433, 0.729, 0.735],
[6.158, 0.884, 0.891],
[4.608, 3.342, 3.281],
[10.221, 3.481, 3.455],
[10.145, 3.486, 3.46],
[1.261, 1.188, 1.168],
[0.309, 0.114, 0.114],
[0.175, 0.05, 0.048],
[0.313, 0.099, 0.117],
[0.451, 0.166, 0.192],
[0.183, 0.04, 0.043],
[0.171, 0.04, 0.041],
[0.143, 0.035, 0.037]
]
[0.042, 0.002, 0.002],
[0.082, 0.024, 0.023],
[0.177, 0.068, 0.064],
[0.615, 0.076, 0.073],
[1.198, 0.703, 0.718],
[1.059, 0.727, 0.723],
[0.054, 0.002, 0.002],
[0.100, 0.025, 0.026],
[0.996, 0.824, 0.840],
[1.713, 0.942, 0.981],
[0.632, 0.193, 0.192],
[0.849, 0.228, 0.220],
[1.156, 0.736, 0.745],
[2.658, 1.245, 1.244],
[1.188, 0.753, 0.749],
[0.977, 0.810, 0.818],
[2.701, 1.527, 1.521],
[2.655, 1.522, 1.538],
[5.484, 3.126, 3.143],
[0.275, 0.070, 0.065],
[10.288, 0.958, 0.937],
[11.562, 1.139, 1.109],
[22.298, 2.243, 2.250],
[52.816, 8.052, 8.039],
[0.247, 0.115, 0.129],
[1.284, 0.206, 0.208],
[0.481, 0.121, 0.126],
[10.408, 1.285, 1.342],
[9.295, 8.614, 8.565],
[0.487, 0.401, 0.401],
[3.186, 0.721, 0.691],
[6.936, 0.867, 0.894],
[5.055, 3.304, 3.237],
[10.231, 3.302, 3.297],
[10.289, 3.304, 3.270],
[1.182, 1.097, 1.115],
[0.158, 0.058, 0.054],
[0.112, 0.033, 0.035],
[0.161, 0.057, 0.054],
[0.224, 0.088, 0.086],
[0.093, 0.021, 0.024],
[0.092, 0.018, 0.018],
[0.090, 0.016, 0.016]
]
}
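
To compare result files like the two above, it can help to reduce each one to a few totals. The sketch below assumes the layout shown in the diff, with one `[run1, run2, run3]` row per line. It treats the first run as the cold time and the faster of the other two as the hot time, which is an assumption made for this example rather than a rule stated in the PR. The function name `summarize_results` is hypothetical.

```shell
#!/bin/bash
# Sketch only: sum cold and hot timings from a results JSON file.
# Relies on one "[a, b, c]" row per line; it is not a general JSON parser.

summarize_results() {
    awk '
        /^[ \t]*\[[0-9., \t]+\],?[ \t\r]*$/ {
            gsub(/[][,]/, " ")              # drop brackets and commas
            if (NF != 3) next               # keep only rows with three runs
            cold += $1                      # first run: cold
            hot  += ($2 < $3) ? $2 : $3     # faster of runs 2 and 3: hot
            n++
        }
        END {
            printf "queries=%d cold_total=%.3f hot_total=%.3f\n", n, cold, hot
        }
    ' "$1"
}

# Example usage, assuming the file exists in the working tree:
#   summarize_results results/c6a.4xlarge.json
```

Because the function works on whole lines, it should be applied to an actual results file rather than to a diff that interleaves old and new rows.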