1 change: 0 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -305,7 +305,6 @@ Please help us add more systems and run the benchmarks on more types of VMs:
- [ ] Hive
- [ ] Hydrolix
- [ ] Impala
- [ ] InfluxDB
- [ ] LocustDB
- [ ] Manticore Search
- [ ] MS SQL Server with Column Store Index (without publishing)
Expand Down
37 changes: 37 additions & 0 deletions influxdb/README.md
@@ -0,0 +1,37 @@
# InfluxDB

This entry uses [InfluxDB 3 Core](https://docs.influxdata.com/influxdb3/core/), the open-source, SQL-capable
release of InfluxDB. The query engine is Apache DataFusion; storage is local Parquet files.

## Caveats

InfluxDB is a time-series database, not a general analytical database, so loading a flat 100M-row
analytical dataset into it stretches the data model:

1. **No bulk CSV/Parquet import.** The only ingestion path is line protocol over HTTP
(`/api/v3/write_lp`). `load.py` streams `hits.tsv`, converts each row to a line-protocol point, and
POSTs in batches. The conversion + ingest is the dominant cost of the load phase and is much slower
than e.g. Postgres `\copy` or DuckDB `COPY FROM`.

2. **Unique timestamp required.** Line protocol merges points that share `(measurement, tags, timestamp)`,
so to preserve every row we use the row index as the line-protocol timestamp (in nanoseconds, offset
from a fixed 2020-01-01 epoch). The original `EventTime` is stored as a regular string field and used
by the queries.

3. **No tags, all fields.** Tags are indexed at ingest time; for a wide flat schema the indexing cost
is prohibitive. Every column is written as a field instead. Numeric columns use the integer
line-protocol type (`...i`); string and date/time columns are written as strings.

4. **Query compatibility.** Most ClickBench queries run unchanged. Q19 and Q43 cast `EventTime` (stored
as string) to a `TIMESTAMP` for `extract(minute ...)` and `date_trunc('minute', ...)`. DataFusion
folds unquoted identifiers to lowercase, so `load.py` writes column names in lowercase to keep the
standard CamelCase queries portable.
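
The per-row conversion described in points 1-3 can be sketched as follows. This is a minimal
illustration, not the actual `load.py`; the helper names and the two-column row are assumptions,
while the measurement name `hits`, the `i` integer suffix, the lowercase column names, and the
2020-01-01 nanosecond epoch offset mirror the rules above:

```python
# 2020-01-01T00:00:00Z in nanoseconds; row index is added to this to make
# every point's timestamp unique.
EPOCH_2020_NS = 1577836800 * 10**9

def escape_string(value: str) -> str:
    # Line-protocol string fields are double-quoted; backslashes and
    # embedded quotes must be escaped.
    return '"' + value.replace('\\', '\\\\').replace('"', '\\"') + '"'

def row_to_line(row_index, names, types, values) -> str:
    fields = []
    for name, typ, raw in zip(names, types, values):
        if typ == "int":
            fields.append(f"{name}={raw}i")  # trailing "i" = integer field
        else:
            fields.append(f"{name}={escape_string(raw)}")
    # No tags: "measurement fields timestamp".
    return f"hits {','.join(fields)} {EPOCH_2020_NS + row_index}"

print(row_to_line(0, ["counterid", "title"], ["int", "str"], ["42", "Home"]))
# → hits counterid=42i,title="Home" 1577836800000000000
```

Batches of such lines are then POSTed to `/api/v3/write_lp`; InfluxDB derives each field's type
from the line-protocol syntax (the `i` suffix for integers, double quotes for strings) on first write.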

## Run

```bash
./benchmark.sh
```

The server listens on port 8181, stores data under `./influxdb3-data`, and runs without authentication
(`--without-auth`) for the duration of the benchmark.
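
For ad-hoc inspection outside `run.sh`, SQL can be sent to the running server's query endpoint.
The sketch below is stdlib-only and assumes the JSON body shape (`db`, `q`, `format`) documented
for InfluxDB 3 Core's `/api/v3/query_sql`; the Q43-style query illustrates the `EventTime` cast
from point 4 above:

```python
import json
import urllib.request

def build_query_request(sql, db="hits", host="http://localhost:8181"):
    # POST against InfluxDB 3 Core's SQL endpoint; no auth header is needed
    # because the server runs with --without-auth.
    body = json.dumps({"db": db, "q": sql, "format": "json"}).encode()
    return urllib.request.Request(
        f"{host}/api/v3/query_sql",
        data=body,
        headers={"Content-Type": "application/json"},
    )

# eventtime is a lowercase string field, so it is cast to TIMESTAMP
# before date_trunc can be applied.
sql = (
    "SELECT date_trunc('minute', CAST(eventtime AS TIMESTAMP)) AS m, count(*) "
    "FROM hits GROUP BY m ORDER BY m LIMIT 10"
)
req = build_query_request(sql)
# Uncomment to execute against the server started by benchmark.sh:
# with urllib.request.urlopen(req) as resp:
#     rows = json.load(resp)
```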
101 changes: 101 additions & 0 deletions influxdb/benchmark.sh
@@ -0,0 +1,101 @@
#!/bin/bash

set -eu

export DEBIAN_FRONTEND=noninteractive

# Install dependencies and the InfluxDB 3 Core binary directly. We bypass the
# upstream install_influxdb3.sh installer because it is interactive and not
# suited for unattended runs.
sudo apt-get update -qq >/dev/null
sudo apt-get install -y -qq python3 python3-requests curl jq time >/dev/null

INFLUX_VERSION=3.9.2
case "$(uname -m)" in
x86_64|amd64) INFLUX_ARTIFACT=linux_amd64 ;;
aarch64|arm64) INFLUX_ARTIFACT=linux_arm64 ;;
*) echo "Unsupported architecture: $(uname -m)" >&2; exit 1 ;;
esac

INFLUX_TGZ="influxdb3-core-${INFLUX_VERSION}_${INFLUX_ARTIFACT}.tar.gz"
wget --continue -q "https://dl.influxdata.com/influxdb/releases/${INFLUX_TGZ}"
rm -rf "influxdb3-core-${INFLUX_VERSION}"
tar -xzf "${INFLUX_TGZ}"
INFLUXDB3="${PWD}/influxdb3-core-${INFLUX_VERSION}/influxdb3"

# Start the server with local-file storage and authentication disabled.
# The --wal-* tunings reduce per-second fsync churn during the multi-hour
# load and let more write requests accumulate in memory before being
# rejected with back-pressure.
mkdir -p ./influxdb3-data
start_server() {
nohup "${INFLUXDB3}" serve \
--node-id node0 \
--object-store file \
--data-dir "${PWD}/influxdb3-data" \
--http-bind 127.0.0.1:8181 \
--without-auth \
--wal-max-write-buffer-size 1000000 \
--max-http-request-size 67108864 \
--exec-mem-pool-bytes 80% \
> influxdb3.log 2>&1 &
INFLUXDB_PID=$!
echo "InfluxDB PID: ${INFLUXDB_PID}"

for _ in $(seq 1 300); do
curl -sf http://localhost:8181/health > /dev/null && return
sleep 1
done
echo "Timed out waiting for InfluxDB to start" >&2
return 1
}

restart_server() {
# SIGTERM forces the WAL to drain into Parquet and the in-memory write
# buffers to flush; the next start comes up with no WAL to replay.
kill -TERM "${INFLUXDB_PID}" 2>/dev/null || true
wait "${INFLUXDB_PID}" 2>/dev/null || true
start_server
}

start_server

"${INFLUXDB3}" create database hits

# Download the dataset and load it via line protocol.
../download-hits-tsv

# Load in chunks, restarting the server between each chunk so the WAL drains
# into Parquet. With one monolithic load, every Parquet file ends up covering
# the same broad time range (16 parallel writers interleave timestamps across
# the whole dataset), and InfluxDB 3.9.2's regroup_files optimizer hits an
# internal "overlapping ranges within same file" assertion at query time.
# Chunking keeps each Parquet file's [min_time, max_time] bounded to a
# disjoint slice, so subsequent queries can plan successfully.
TOTAL_ROWS=99997497
CHUNKS=10
CHUNK_ROWS=$(( (TOTAL_ROWS + CHUNKS - 1) / CHUNKS ))

load_t0=$(date +%s)
for i in $(seq 0 $((CHUNKS - 1))); do
chunk_start=$((i * CHUNK_ROWS))
chunk_end=$(( (i + 1) * CHUNK_ROWS ))
if [ "$chunk_end" -gt "$TOTAL_ROWS" ]; then chunk_end=$TOTAL_ROWS; fi
echo "Chunk $((i + 1))/${CHUNKS}: rows ${chunk_start}..${chunk_end}"
python3 load.py --start-row "$chunk_start" --end-row "$chunk_end"
# Drain WAL so this chunk lands in its own Parquet files before the
# next chunk starts mixing more timestamps into the in-memory buffer.
restart_server
done
echo "Load time: $(($(date +%s) - load_t0))"

# Server is already freshly restarted from the last chunk's drain, so no
# additional restart is needed before the query phase.

# Run queries.
./run.sh | tee log.txt

echo -n "Data size: "
du -bcs ./influxdb3-data | grep total | awk '{print $1}'

kill "${INFLUXDB_PID}" || true