1 change: 0 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -305,7 +305,6 @@ Please help us add more systems and run the benchmarks on more types of VMs:
- [ ] Hive
- [ ] Hydrolix
- [ ] Impala
- [ ] InfluxDB
- [ ] LocustDB
- [ ] Manticore Search
- [ ] MS SQL Server with Column Store Index (without publishing)
Expand Down
37 changes: 37 additions & 0 deletions influxdb/README.md
@@ -0,0 +1,37 @@
# InfluxDB

This entry uses [InfluxDB 3 Core](https://docs.influxdata.com/influxdb3/core/), the open-source, SQL-capable
release of InfluxDB. The query engine is Apache DataFusion; storage is local Parquet files.

## Caveats

InfluxDB is a time-series database, not a general analytical database, so loading a flat 100M-row
analytical dataset into it stretches the data model:

1. **No bulk CSV/Parquet import.** The only ingestion path is line protocol over HTTP
(`/api/v3/write_lp`). `load.py` streams `hits.tsv`, converts each row to a line-protocol point, and
POSTs in batches. The conversion + ingest is the dominant cost of the load phase and is much slower
than e.g. Postgres `\copy` or DuckDB `COPY FROM`.

2. **Unique timestamp required.** Line protocol merges points that share `(measurement, tags, timestamp)`,
so to preserve every row we use the row index as the line-protocol timestamp (in nanoseconds, offset
from a fixed 2020-01-01 epoch). The original `EventTime` is stored as a regular string field and used
by the queries.

3. **No tags, all fields.** Tags are indexed at ingest time; for a wide flat schema the indexing cost
is prohibitive. Every column is written as a field instead. Numeric columns use the integer
line-protocol type (`...i`); string and date/time columns are written as strings.

4. **Query compatibility.** Most ClickBench queries run unchanged. Q19 and Q43 cast `EventTime` (stored
as string) to a `TIMESTAMP` for `extract(minute ...)` and `date_trunc('minute', ...)`. DataFusion
folds unquoted identifiers to lowercase, so `load.py` writes column names in lowercase to keep the
standard CamelCase queries portable.
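
The per-row conversion described in points 1-3 can be sketched as follows. This is a minimal
illustration, not the actual `load.py`; the helper names and the two-column row are assumptions,
while the measurement name `hits`, the `i` integer suffix, the lowercase column names, and the
2020-01-01 nanosecond epoch offset mirror the rules above:

```python
# 2020-01-01T00:00:00Z in nanoseconds; row index is added to this to make
# every point's timestamp unique.
EPOCH_2020_NS = 1577836800 * 10**9

def escape_string(value: str) -> str:
    # Line-protocol string fields are double-quoted; backslashes and
    # embedded quotes must be escaped.
    return '"' + value.replace('\\', '\\\\').replace('"', '\\"') + '"'

def row_to_line(row_index, names, types, values) -> str:
    fields = []
    for name, typ, raw in zip(names, types, values):
        if typ == "int":
            fields.append(f"{name}={raw}i")  # trailing "i" = integer field
        else:
            fields.append(f"{name}={escape_string(raw)}")
    # No tags: "measurement fields timestamp".
    return f"hits {','.join(fields)} {EPOCH_2020_NS + row_index}"

print(row_to_line(0, ["counterid", "title"], ["int", "str"], ["42", "Home"]))
# → hits counterid=42i,title="Home" 1577836800000000000
```

Batches of such lines are then POSTed to `/api/v3/write_lp`; InfluxDB derives each field's type
from the line-protocol syntax (the `i` suffix for integers, double quotes for strings) on first write.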

## Run

```bash
./benchmark.sh
```

The server listens on port 8181, stores data under `./influxdb3-data`, and runs without authentication
(`--without-auth`) for the duration of the benchmark.
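
For ad-hoc inspection outside `run.sh`, SQL can be sent to the running server's query endpoint.
The sketch below is stdlib-only and assumes the JSON body shape (`db`, `q`, `format`) documented
for InfluxDB 3 Core's `/api/v3/query_sql`; the Q43-style query illustrates the `EventTime` cast
from point 4 above:

```python
import json
import urllib.request

def build_query_request(sql, db="hits", host="http://localhost:8181"):
    # POST against InfluxDB 3 Core's SQL endpoint; no auth header is needed
    # because the server runs with --without-auth.
    body = json.dumps({"db": db, "q": sql, "format": "json"}).encode()
    return urllib.request.Request(
        f"{host}/api/v3/query_sql",
        data=body,
        headers={"Content-Type": "application/json"},
    )

# eventtime is a lowercase string field, so it is cast to TIMESTAMP
# before date_trunc can be applied.
sql = (
    "SELECT date_trunc('minute', CAST(eventtime AS TIMESTAMP)) AS m, count(*) "
    "FROM hits GROUP BY m ORDER BY m LIMIT 10"
)
req = build_query_request(sql)
# Uncomment to execute against the server started by benchmark.sh:
# with urllib.request.urlopen(req) as resp:
#     rows = json.load(resp)
```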
101 changes: 101 additions & 0 deletions influxdb/benchmark.sh
@@ -0,0 +1,101 @@
#!/bin/bash

set -eu

export DEBIAN_FRONTEND=noninteractive

# Install dependencies and the InfluxDB 3 Core binary directly. We bypass the
# upstream install_influxdb3.sh installer because it is interactive and not
# suited for unattended runs.
sudo apt-get update -qq >/dev/null
sudo apt-get install -y -qq python3 python3-requests curl jq time >/dev/null

INFLUX_VERSION=3.9.2
case "$(uname -m)" in
x86_64|amd64) INFLUX_ARTIFACT=linux_amd64 ;;
aarch64|arm64) INFLUX_ARTIFACT=linux_arm64 ;;
*) echo "Unsupported architecture: $(uname -m)" >&2; exit 1 ;;
esac

INFLUX_TGZ="influxdb3-core-${INFLUX_VERSION}_${INFLUX_ARTIFACT}.tar.gz"
wget --continue -q "https://dl.influxdata.com/influxdb/releases/${INFLUX_TGZ}"
rm -rf "influxdb3-core-${INFLUX_VERSION}"
tar -xzf "${INFLUX_TGZ}"
INFLUXDB3="${PWD}/influxdb3-core-${INFLUX_VERSION}/influxdb3"

# Start the server with local-file storage and authentication disabled.
# The --wal-* tunings reduce per-second fsync churn during the multi-hour
# load and let more write requests accumulate in memory before being
# rejected with back-pressure.
mkdir -p ./influxdb3-data
start_server() {
nohup "${INFLUXDB3}" serve \
--node-id node0 \
--object-store file \
--data-dir "${PWD}/influxdb3-data" \
--http-bind 127.0.0.1:8181 \
--without-auth \
--wal-max-write-buffer-size 1000000 \
--max-http-request-size 67108864 \
--exec-mem-pool-bytes 80% \
> influxdb3.log 2>&1 &
INFLUXDB_PID=$!
echo "InfluxDB PID: ${INFLUXDB_PID}"

for _ in $(seq 1 300); do
curl -sf http://localhost:8181/health > /dev/null && return
sleep 1
done
echo "Timed out waiting for InfluxDB to start" >&2
return 1
}

restart_server() {
# SIGTERM forces the WAL to drain into Parquet and the in-memory write
# buffers to flush; the next start comes up with no WAL to replay.
kill -TERM "${INFLUXDB_PID}" 2>/dev/null || true
wait "${INFLUXDB_PID}" 2>/dev/null || true
start_server
}

start_server

"${INFLUXDB3}" create database hits

# Download the dataset and load it via line protocol.
../download-hits-tsv

# Load in chunks, restarting the server between each chunk so the WAL drains
# into Parquet. With one monolithic load, every Parquet file ends up covering
# the same broad time range (16 parallel writers interleave timestamps across
# the whole dataset), and InfluxDB 3.9.2's regroup_files optimizer hits an
# internal "overlapping ranges within same file" assertion at query time.
# Chunking keeps each Parquet file's [min_time, max_time] bounded to a
# disjoint slice, so subsequent queries can plan successfully.
TOTAL_ROWS=99997497
CHUNKS=10
CHUNK_ROWS=$(( (TOTAL_ROWS + CHUNKS - 1) / CHUNKS ))

load_t0=$(date +%s)
for i in $(seq 0 $((CHUNKS - 1))); do
chunk_start=$((i * CHUNK_ROWS))
chunk_end=$(( (i + 1) * CHUNK_ROWS ))
if [ "$chunk_end" -gt "$TOTAL_ROWS" ]; then chunk_end=$TOTAL_ROWS; fi
echo "Chunk $((i + 1))/${CHUNKS}: rows ${chunk_start}..${chunk_end}"
python3 load.py --start-row "$chunk_start" --end-row "$chunk_end"
# Drain WAL so this chunk lands in its own Parquet files before the
# next chunk starts mixing more timestamps into the in-memory buffer.
restart_server
done
echo "Load time: $(($(date +%s) - load_t0))"

# Server is already freshly restarted from the last chunk's drain, so no
# additional restart is needed before the query phase.

# Run queries.
./run.sh | tee log.txt

echo -n "Data size: "
du -bcs ./influxdb3-data | grep total | awk '{print $1}'

kill "${INFLUXDB_PID}" || true