Update DataFusion instructions / Enable swap on small machines #804
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,38 +1,26 @@ | ||
| # DataFusion | ||
|
|
||
| DataFusion is an extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format. For more information, please check <https://arrow.apache.org/datafusion/user-guide/introduction.html> | ||
| Partitioned (100-file) Parquet dataset | ||
|
|
||
| We use parquet file here and create an external table for it; and then execute the queries. | ||
| ## Cookbook: Generate benchmark results | ||
|
|
||
| ## Generate benchmark results | ||
| Follow instructions in the [datafusion](../datafusion/README.md) directory. | ||
|
|
||
| The benchmark should be completed in under an hour. On-demand pricing is $0.6 per hour while spot pricing is only $0.2 to $0.3 per hour (us-east-2). | ||
| ### Known Issues | ||
|
|
||
| 1. manually start a AWS EC2 instance | ||
| - `c6a.4xlarge` | ||
| - Ubuntu 22.04 or later | ||
| - Root 500GB gp2 SSD | ||
| - no EBS optimized | ||
| - no instance store | ||
| 1. wait for status check passed, then ssh to EC2 `ssh ubuntu@{ip}` | ||
| 1. `git clone https://github.com/ClickHouse/ClickBench` | ||
| 1. `cd ClickBench/datafusion` | ||
| 1. `vi benchmark.sh` and modify following line to target Datafusion version | ||
| 1. DataFusion follows the SQL standard with case-sensitive identifiers, so all column names in `queries.sql` use double-quoted literals (e.g. `EventTime` -> `"EventTime"`). | ||
|
|
||
| ```bash | ||
| git checkout 46.0.0 | ||
| ``` | ||
| 2. You must set the option `('binary_as_string' 'true')` due to an incorrect logical type | ||
| annotation in the partitioned files. See [Issue#7](https://github.com/ClickHouse/ClickBench/issues/7) | ||
|
|
||
| 1. `bash benchmark.sh` | ||
| ## Generate full human-readable results (for debugging) | ||
|
|
||
| ### Know Issues | ||
| 1. Install/build `datafusion-cli`. | ||
|
|
||
| 1. importing parquet by `datafusion-cli` doesn't support schema, need to add some casting in queries.sql (e.g. converting EventTime from Int to Timestamp via `to_timestamp_seconds`) | ||
| 2. importing parquet by `datafusion-cli` make column name column name case-sensitive, i change all column name in queries.sql to double quoted literal (e.g. `EventTime` -> `"EventTime"`) | ||
| 3. `comparing binary with utf-8` and `group by binary` don't work in mac, if you run these queries in mac, you'll get some errors for queries contain binary format apache/arrow-datafusion#3050 | ||
| 2. Download the parquet files: | ||
|
|
||
| ## Generate full human readable results (for debugging) | ||
| ``` | ||
| seq 0 99 | xargs -P100 -I{} bash -c 'wget --directory-prefix partitioned --continue --progress=dot:giga https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_{}.parquet' | ||
| ``` | ||
|
|
||
| 1. install datafusion-cli | ||
| 2. download the parquet ```wget --continue --progress=dot:giga https://datasets.clickhouse.com/hits_compatible/hits.parquet``` | ||
| 3. execute it ```datafusion-cli -f create_single.sql queries.sql``` or ```bash run2.sh``` | ||
| 3. Run the queries: `datafusion-cli -f create.sql -f queries.sql` or `PATH="$(pwd)/arrow-datafusion/target/release:$PATH" ./run.sh`. | ||
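The parallel download step above fans out one `wget` per partitioned file via `xargs`. As a quick offline sketch, the same `seq` pipeline can first generate the expected URL list so the `hits_{0..99}.parquet` naming can be verified before fetching anything (file names taken from the command above):

```shell
#!/bin/bash
# Generate the 100 partitioned-file URLs that the wget/xargs one-liner fetches,
# without downloading anything, so the naming scheme can be checked first.
base="https://datasets.clickhouse.com/hits_compatible/athena_partitioned"
seq 0 99 | sed "s|.*|${base}/hits_&.parquet|" > urls.txt

wc -l < urls.txt        # should report 100
head -n 1 urls.txt      # .../hits_0.parquet
tail -n 1 urls.txt      # .../hits_99.parquet
```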
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -6,6 +6,20 @@ bash rust-init.sh -y | |
| export HOME=${HOME:=~} | ||
| source ~/.cargo/env | ||
|
|
||
| if [ "$(free -g | awk '/^Mem:/{print $2}')" -lt 12 ]; then | ||
| echo "LOW MEMORY MODE" | ||
|
Review comment (author): This (and the corresponding change in
||
| # Enable swap if not already enabled. This is needed both for rustc and until we have a better | ||
| # solution for low memory machines, see | ||
| # https://github.com/apache/datafusion/issues/18473 | ||
| if [ "$(swapon --noheadings --show | wc -l)" -eq 0 ]; then | ||
| echo "Enabling 8G swap" | ||
| sudo fallocate -l 8G /swapfile | ||
| sudo chmod 600 /swapfile | ||
| sudo mkswap /swapfile | ||
| sudo swapon /swapfile | ||
| fi | ||
| fi | ||
|
|
||
| echo "Install Dependencies" | ||
| sudo apt-get update -y | ||
| sudo apt-get install -y gcc | ||
|
|
||
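The low-memory check added above keys off the total-GiB column of `free -g`. A sketch isolating just that parsing, fed from a captured sample so it can be tried without a small EC2 instance (the sample numbers are made up):

```shell
#!/bin/bash
# Same awk extraction as the benchmark script, applied to canned `free -g`
# output; a real run would pipe `free -g` in directly.
sample='              total        used        free      shared
Mem:               7           1           5          0
Swap:              0           0           0'
mem_gb=$(printf '%s\n' "$sample" | awk '/^Mem:/{print $2}')
echo "total memory: ${mem_gb} GiB"   # prints: total memory: 7 GiB
if [ "$mem_gb" -lt 12 ]; then
    echo "LOW MEMORY MODE"           # the branch that enables the 8G swapfile
fi
```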
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,37 @@ | ||
| #!/bin/bash | ||
|
|
||
| # This script converts the raw `result.csv` data from `benchmark.sh` into the | ||
|
Review comment (author): This is the helper script from #749 (with a different name) for our convenience when producing new run results
||
| # final json format used by the benchmark dashboard. | ||
| # | ||
| # usage : ./make-json.sh <machine> | ||
| # | ||
| # example (save results/c6a.4xlarge.json) | ||
| # ./make-json.sh c6a.4xlarge | ||
|
|
||
| MACHINE=$1 | ||
| OUTPUT_FILE="results/${MACHINE}.json" | ||
| SYSTEM_NAME="DataFusion (Parquet, partitioned)" | ||
| DATE=$(date +%Y-%m-%d) | ||
|
|
||
|
|
||
| # Read the CSV and build the result array using awk | ||
| RESULT_ARRAY=$(awk -F, '{arr[$1]=arr[$1]","$3} END {for (i=1;i<=length(arr);i++) {gsub(/^,/, "", arr[i]); printf " ["arr[i]"]"; if (i<length(arr)) printf ",\n"}}' result.csv) | ||
|
|
||
| # form the final JSON structure from the template | ||
| cat <<EOF > "$OUTPUT_FILE" | ||
| { | ||
| "system": "$SYSTEM_NAME", | ||
| "date": "$DATE", | ||
| "machine": "$MACHINE", | ||
| "cluster_size": 1, | ||
| "proprietary": "no", | ||
| "tuned": "no", | ||
| "hardware": "cpu", | ||
| "tags": ["Rust","column-oriented","embedded","stateless"], | ||
| "load_time": 0, | ||
| "data_size": 14737666736, | ||
| "result": [ | ||
| $RESULT_ARRAY | ||
| ] | ||
| } | ||
| EOF | ||
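The awk one-liner above groups `result.csv` rows of the form `query,run,seconds` into one bracketed array per query. A standalone sketch of the same transformation on invented timings, using an explicit row counter instead of gawk's `length(arr)` so it also runs under mawk (the CSV values are placeholders, not real results):

```shell
#!/bin/bash
# Group "query,run,seconds" rows into one bracketed array per query,
# mirroring the RESULT_ARRAY step of make-json.sh on sample data.
cat > sample-result.csv <<'CSV'
1,1,0.11
1,2,0.09
1,3,0.08
2,1,1.25
2,2,1.20
2,3,1.19
CSV

awk -F, '
    { times[$1] = times[$1] "," $3; if ($1 > n) n = $1 }  # append per query
    END {
        for (i = 1; i <= n; i++) {
            sub(/^,/, "", times[i])                       # strip leading comma
            printf " [%s]%s\n", times[i], (i < n ? "," : "")
        }
    }' sample-result.csv
# prints:
#  [0.11,0.09,0.08],
#  [1.25,1.20,1.19]
```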
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,38 +1,64 @@ | ||
| # DataFusion | ||
|
|
||
| DataFusion is an extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format. For more information, please check <https://arrow.apache.org/datafusion/user-guide/introduction.html> | ||
| Single (1 file) Parquet dataset | ||
|
|
||
| We use parquet file here and create an external table for it; and then execute the queries. | ||
| [Apache DataFusion] is an extensible query execution framework, written in Rust, that uses [Apache Arrow] as its in-memory format. | ||
|
|
||
| ## Generate benchmark results | ||
| [Apache DataFusion]: https://datafusion.apache.org/ | ||
| [Apache Arrow]: https://arrow.apache.org/ | ||
|
|
||
| ## Cookbook: Generate benchmark results | ||
|
|
||
| The benchmark should be completed in under an hour. On-demand pricing is $0.6 per hour while spot pricing is only $0.2 to $0.3 per hour (us-east-2). | ||
|
|
||
| 1. manually start a AWS EC2 instance | ||
| - `c6a.4xlarge` | ||
| - Ubuntu 22.04 or later | ||
| - Root 500GB gp2 SSD | ||
| - no EBS optimized | ||
| - no instance store | ||
| 1. wait for status check passed, then ssh to EC2 `ssh ubuntu@{ip}` | ||
| 1. `git clone https://github.com/ClickHouse/ClickBench` | ||
| 1. `cd ClickBench/datafusion` | ||
| 1. `vi benchmark.sh` and modify following line to target Datafusion version | ||
| 1. Manually start an AWS EC2 instance. The following environments are included in this directory: | ||
|
|
||
| | Instance Type | OS | Disk | Arch | | ||
| | :-----------: | :---------------------: | :----------------: | :---: | | ||
| | `c6a.xlarge` | `Ubuntu 24.04` or later | Root 500GB gp2 SSD | AMD64 | | ||
| | `c6a.2xlarge` | | | AMD64 | | ||
| | `c6a.4xlarge` | | | AMD64 | | ||
| | `c8g.4xlarge` | | | ARM64 | | ||
|
|
||
| All with no EBS optimization and no instance store. | ||
|
|
||
| 2. Wait for the status checks to pass, then ssh to EC2: `ssh ubuntu@{ip}` | ||
| 3. `git clone https://github.com/ClickHouse/ClickBench` | ||
| 4. `cd ClickBench/datafusion` | ||
| 5. `vi benchmark.sh` and modify the following line to target the DataFusion version | ||
|
|
||
| ```bash | ||
| git checkout 46.0.0 | ||
| git checkout 47.0.0 | ||
| ``` | ||
|
|
||
| 1. `bash benchmark.sh` | ||
| 6. `bash benchmark.sh` | ||
|
|
||
| You can update/preview the results by running: | ||
| ``` | ||
| ./make-json.sh <machine-name> # Example. ./make-json.sh c6a.xlarge | ||
| ``` | ||
|
|
||
| ### Known Issues | ||
|
|
||
| 1. DataFusion follows the SQL standard with case-sensitive identifiers, so all column names in `queries.sql` use double-quoted literals (e.g. `EventTime` -> `"EventTime"`). | ||
|
|
||
| ## Generate full human-readable results (for debugging) | ||
|
|
||
| 1. Install/build `datafusion-cli`. | ||
|
|
||
| 2. Download the parquet file: | ||
|
|
||
| ### Know Issues | ||
| ``` | ||
| wget --continue https://datasets.clickhouse.com/hits_compatible/hits.parquet | ||
| ``` | ||
|
|
||
| 1. importing parquet by `datafusion-cli` doesn't support schema, need to add some casting in queries.sql (e.g. converting EventTime from Int to Timestamp via `to_timestamp_seconds`) | ||
| 2. importing parquet by `datafusion-cli` make column name column name case-sensitive, i change all column name in queries.sql to double quoted literal (e.g. `EventTime` -> `"EventTime"`) | ||
| 3. `comparing binary with utf-8` and `group by binary` don't work in mac, if you run these queries in mac, you'll get some errors for queries contain binary format apache/arrow-datafusion#3050 | ||
| 3. Run the queries: | ||
|
|
||
| ## Generate full human readable results (for debugging) | ||
| ``` | ||
| datafusion-cli -f create.sql -f queries.sql | ||
| ``` | ||
|
|
||
| 1. install datafusion-cli | ||
| 2. download the parquet ```wget --continue --progress=dot:giga https://datasets.clickhouse.com/hits_compatible/hits.parquet``` | ||
| 3. execute it ```datafusion-cli -f create_single.sql queries.sql``` or ```bash run2.sh``` | ||
| Or use the runner script: | ||
| ``` | ||
| PATH="$(pwd)/arrow-datafusion/target/release:$PATH" ./run.sh | ||
| ``` |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,37 @@ | ||
| #!/bin/bash | ||
|
|
||
| # This script converts the raw `result.csv` data from `benchmark.sh` into the | ||
| # final json format used by the benchmark dashboard. | ||
| # | ||
| # usage : ./make-json.sh <machine> | ||
| # | ||
| # example ./make-json.sh c6a.4xlarge # saves results/c6a.4xlarge.json | ||
| # | ||
|
|
||
| MACHINE=$1 | ||
| OUTPUT_FILE="results/${MACHINE}.json" | ||
| SYSTEM_NAME="DataFusion (Parquet, single)" | ||
| DATE=$(date +%Y-%m-%d) | ||
|
|
||
|
|
||
| # Read the CSV and build the result array using awk | ||
| RESULT_ARRAY=$(awk -F, '{arr[$1]=arr[$1]","$3} END {for (i=1;i<=length(arr);i++) {gsub(/^,/, "", arr[i]); printf " ["arr[i]"]"; if (i<length(arr)) printf ",\n"}}' result.csv) | ||
|
|
||
| # form the final JSON structure from the template | ||
| cat <<EOF > "$OUTPUT_FILE" | ||
| { | ||
| "system": "$SYSTEM_NAME", | ||
| "date": "$DATE", | ||
| "machine": "$MACHINE", | ||
| "cluster_size": 1, | ||
| "proprietary": "no", | ||
| "tuned": "no", | ||
| "hardware": "cpu", | ||
| "tags": ["Rust","column-oriented","embedded","stateless"], | ||
| "load_time": 0, | ||
| "data_size": 14779976446, | ||
| "result": [ | ||
| $RESULT_ARRAY | ||
| ] | ||
| } | ||
| EOF |
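The heredoc at the end of `make-json.sh` fills the dashboard JSON from shell variables. A sketch reproducing that templating with a trimmed-down template and fixed placeholder values (machine name, date, and timings are all invented) so the output shape can be eyeballed and grepped without running a benchmark:

```shell
#!/bin/bash
# Fill a reduced version of the make-json.sh template with fixed values.
MACHINE="c6a.4xlarge"
SYSTEM_NAME="DataFusion (Parquet, single)"
DATE="2025-01-01"
RESULT_ARRAY=' [0.11,0.09,0.08]'

cat <<EOF > sample.json
{
    "system": "$SYSTEM_NAME",
    "date": "$DATE",
    "machine": "$MACHINE",
    "cluster_size": 1,
    "result": [
$RESULT_ARRAY
    ]
}
EOF

grep -c '"machine": "c6a.4xlarge"' sample.json   # prints 1
```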
Review comment: I reduced the duplication of the documentation and updated the Known Issues.