Refactor: standard install/start/check/stop/load/query interface per system #860

Open

alexey-milovidov wants to merge 150 commits into main from refactor/per-system-script-interface

Conversation

@alexey-milovidov

Summary

  • Split each local system's monolithic benchmark.sh into 7 single-purpose scripts (install, start, check, stop, load, query, data-size) with a stable contract, driven by a new shared lib/benchmark-common.sh.
  • Wrap dataframe / in-process systems (pandas, polars-dataframe, chdb-dataframe, daft-parquet*, duckdb-dataframe, sirius) in small FastAPI servers so they fit the same start/stop/query lifecycle.
  • 88 local systems refactored; cloud/managed systems and a handful of non-functional ones are intentionally untouched.

Why

Previously, every system's benchmark.sh bundled installation, server lifecycle, dataset download, data loading, and query dispatch into one script — and run.sh hard-coded the per-query orchestration. There was no programmatic per-query entry point, so:

  1. Tweaking the dataset, query set, or per-query behavior (e.g. restarting the system between queries to neutralize warm-process effects) required editing every system's scripts individually.
  2. Building an online "run query X against system Y" service was impossible.
  3. Most run.sh scripts ran all 3 tries inside a single CLI invocation, so OS-cache warmth from try 1 leaked into tries 2 and 3.

The new per-system interface

Script      Stdin       Stdout                     Stderr                                   Notes
install     -           progress                   progress                                 Idempotent. Env prep + system install.
start       -           -                          progress                                 Start daemon. Idempotent. Empty/exit-0 for stateless tools.
check       -           -                          progress                                 Trivial query (e.g. SELECT 1). Exit 0 iff responsive.
stop        -           -                          progress                                 Stop daemon. Idempotent.
load        -           progress                   progress                                 Runs create.sql + loads data; deletes source files, then sync.
query       one query   query result, any format   last line: fractional seconds (0.123)    Non-zero exit on failure.
data-size   -           bytes (one integer)        -                                        Reports the data footprint.
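
For example, a caller honoring this contract could do something like the following (an illustrative sketch, not the driver itself; file names are arbitrary):

#!/bin/bash
# Sketch: run one query through a system's ./query script per the contract above.
echo "SELECT COUNT(*) FROM hits" | ./query > result.txt 2> stderr.txt
status=$?
if [ "$status" -eq 0 ]; then
    elapsed=$(tail -n 1 stderr.txt)      # last stderr line is the runtime in fractional seconds
    echo "query took ${elapsed}s"
else
    echo "query failed (exit $status)" >&2
fi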

Each system's benchmark.sh becomes a 4-line shim that sets a couple of env vars and exec's the shared driver:

#!/bin/bash
export BENCH_DOWNLOAD_SCRIPT="download-hits-parquet-partitioned"
export BENCH_RESTARTABLE=yes
exec ../lib/benchmark-common.sh

The shared driver runs install → start+check → download → load (timed) → for each query: flush caches; if BENCH_RESTARTABLE=yes, stop+start; run the query 3× → data-size → stop. The output log shape (Load time:, one [t1,t2,t3] line per query, Data size:) is identical to the old benchmark.sh, so cloud-init.sh.in's POST to play.clickhouse.com keeps working unchanged.
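
A condensed sketch of that per-query loop (illustrative only, not the actual lib/benchmark-common.sh code; file and variable names are simplified):

# Illustrative sketch of the per-query loop described above.
while read -r query; do
    sync
    echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null      # flush OS caches
    if [ "$BENCH_RESTARTABLE" = "yes" ]; then
        ./stop  > /dev/null 2>&1
        ./start > /dev/null 2>&1
        while ! ./check > /dev/null 2>&1; do sleep 1; done      # check loop is the readiness signal
    fi
    times=()
    for _ in 1 2 3; do
        if echo "$query" | ./query > /dev/null 2> stderr.txt; then
            times+=("$(tail -n 1 stderr.txt)")                  # last stderr line = seconds
        else
            times+=("null")
        fi
    done
    echo "[${times[0]},${times[1]},${times[2]}]"
done < queries.sql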

BENCH_RESTARTABLE=no is used for embedded CLIs (duckdb, sqlite, datafusion, …) and dataframe wrappers — restarting a single CLI/Python process between queries would dominate query time. For these, OS caches are still flushed between queries.
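
For the wrapped dataframe systems, a per-system query script might look roughly like the sketch below (the /query endpoint and the "elapsed" field are assumptions for illustration; each server.py defines its own routes):

#!/bin/bash
# Hypothetical sketch of a dataframe-wrapper ./query: forward the query to the
# local FastAPI server and put its reported elapsed time on the last stderr line.
query=$(cat)
response=$(curl -sS -X POST http://127.0.0.1:8000/query --data-binary "$query") || exit 1
echo "$response"                                     # query result, any format
python3 -c 'import json,sys; print(json.load(sys.stdin)["elapsed"])' <<< "$response" >&2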

Scope

Refactored (88 systems):

  • Server, restartable: clickhouse, postgresql, mysql, mariadb, monetdb, druid, pinot, vertica, exasol, kinetica, heavyai, questdb, cockroachdb, elasticsearch, ydb, … and the postgres/clickhouse/mysql variants (timescaledb, citus, paradedb, postgresql-indexed, clickhouse-parquet*, clickhouse-datalake*, mysql-myisam, tidb, infobright, …)
  • Embedded CLI, not restartable: duckdb (and variants), sqlite, datafusion (and partitioned), glaredb (and partitioned), hyper, hyper-parquet, octosql, opteryx, sail (and partitioned), drill, turso, chdb, chdb-parquet-partitioned
  • Dataframe with FastAPI wrapper, not restartable: pandas, polars-dataframe, chdb-dataframe, daft-parquet, daft-parquet-partitioned, duckdb-dataframe, sirius
  • Spark family: spark, spark-auron, spark-comet, spark-gluten

Not refactored (intentionally out of scope):

  • Cloud / managed: alloydb, athena, aurora-{mysql,postgresql}, bigquery, clickhouse-cloud, databricks, motherduck, redshift, redshift-serverless, snowflake, hydrolix, firebolt(), hologres, tinybird, hydra, mariadb-columnstore, pg_duckdb, singlestore, supabase, tablespace, tembo-olap, timescale-cloud, crunchy-bridge-for-analytics, s3select, …
  • Non-functional: csvq, dsq, locustdb (panic on first query); exasol, spark-velox (empty dirs)
  • Non-SQL or no SQL CLI: mongodb (JS aggregation pipelines), polars (no SQL CLI; the dataframe variant is wrapped instead)

Validated end-to-end on a 96-core / 185 GB ARM machine

System       Data                             Outcome
clickhouse   14.2 GB / 100M rows              Full 43 queries × 3 tries with stop/start between queries; load 124s
duckdb       20.6 GB / 100M rows              Full 43 queries × 3 tries (no restart); load 69s
pandas       4.2 GB in-mem (5M-row subset)    42/43 queries; Q43 hit a pandas lambda bug → recorded as null (framework's error path works)
sqlite       3.9 GB (5M-row subset)           First 5 queries × 3 tries; load 68s
postgresql   100M rows / 75 GB TSV            First 3 queries × 3 tries with restart; load 829s. Cold-cache spike clearly visible (135s → 7s after warmup) — confirms per-query restart actually flushes the page cache

All 88 refactored systems pass bash -n and have executable bits set on the 7 scripts + benchmark.sh.

Bug fixes surfaced during validation

  • lib/benchmark-common.sh: data-size now runs before stop (clickhouse and pandas need the server up to report size).
  • clickhouse/start: idempotent (was erroring when already running).
  • duckdb/load, sqlite/load: rm -f hits.db/mydb for idempotent reruns.
  • postgresql/load: -v ON_ERROR_STOP=1 so COPY data errors actually fail the script instead of silently rolling back.
  • BENCH_DOWNLOAD_SCRIPT may now be empty for systems that read directly from S3 datalakes / remote services (clickhouse-datalake*, duckdb-datalake*, chyt, …).

Flagged for follow-up review

  • duckdb-memory — the :memory: database semantics force a per-query reload; this will inflate timings vs. the original single-process flow.
  • cloudberry, greenplum — multi-phase install (reboot between phases); the shim only runs phase 1.
  • sirius — GPU-dependent; long-lived duckdb CLI subprocess proxy; review the stdin/sentinel protocol.
  • paradedb*, pg_ducklake, pg_mooncake — Docker container created in install then docker cp in load (small divergence from the original docker run -v ... due to the lifecycle order: start runs before download).

Test plan

  • bash -n on all 88 systems' scripts
  • clickhouse: full 43-query benchmark.sh on 100M-row real data
  • duckdb: full 43-query benchmark.sh on 100M-row real data
  • pandas: 43-query benchmark.sh on a 5M-row subset
  • sqlite: abbreviated benchmark.sh on a 5M-row subset
  • postgresql: abbreviated benchmark.sh on full 100M-row data
  • Smoke-run on a fresh c6a.metal/equivalent VM via cloud-init for a representative system from each family before merging
  • Verify play.clickhouse.com log-ingestion sink continues to parse the output for at least one production benchmark run

🤖 Generated with Claude Code

alexey-milovidov and others added 3 commits May 7, 2026 12:14
…/data-size

Each local system now exposes a small set of single-purpose scripts with a
stable contract, so they can be driven by a shared lib/benchmark-common.sh
and reused by external tooling (e.g. an online "run query against system X"
service):

  install     env prep + system install (idempotent)
  start       start daemon (idempotent; empty for stateless tools)
  check       trivial query, exit 0 iff responsive
  stop        stop daemon (idempotent)
  load        runs create.sql + loads data, deletes source files, sync
  query       SQL on stdin; result on stdout; runtime in fractional seconds
              on the last line of stderr; non-zero exit on error
  data-size   prints data footprint in bytes (one integer to stdout)

Each system's old monolithic benchmark.sh is replaced by a 4-line shim that
sets a couple of env vars (BENCH_DOWNLOAD_SCRIPT, BENCH_RESTARTABLE) and
exec's lib/benchmark-common.sh. The shared driver runs the unified flow:
install -> start+check -> download -> load (timed) -> for each query
{flush caches; optionally stop+start to neutralize warm-process effects;
run query 3x} -> data-size -> stop. Output format ([t1,t2,t3], Load time,
Data size) matches the previous benchmark.sh exactly so cloud-init.sh.in's
log POST to play.clickhouse.com keeps working unchanged.

For dataframe/in-process systems (pandas, polars-dataframe, chdb-dataframe,
daft-parquet*, duckdb-dataframe, sirius), the engine is wrapped in a small
FastAPI server (server.py) so the start/stop/query interface still applies.
BENCH_RESTARTABLE=no for these (and for embedded CLIs like duckdb, sqlite,
datafusion, etc.) since restarting a single Python/CLI process between
queries would dominate query time.

Scope: 88 local systems refactored. Cloud/managed systems and a handful of
non-functional ones (csvq, dsq, locustdb, mongodb, polars CLI, exasol,
spark-velox) are intentionally left untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolves conflict in clickhouse-datalake{,-partitioned}: upstream switched
the datalake variants from filesystem-cache to userspace page-cache (PR #818).
The refactored install/query scripts now adopt the page-cache approach.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mongodb: query takes a MongoDB aggregation pipeline (Extended JSON, one
line) on stdin instead of SQL — these are the same canonical 43 ClickBench
queries, just expressed as mongo pipelines. queries.txt is generated from
queries.js (the source of truth) by replacing JS-only constructors
(NumberLong, ISODate, NumberDecimal) with their EJSON canonical form. The
shim sets BENCH_QUERIES_FILE=queries.txt to point the driver at it.

polars: wrapped in a FastAPI server analogous to polars-dataframe, but the
load step uses pl.scan_parquet (LazyFrame) so the parquet file remains
needed at query time — the load script does NOT delete hits.parquet.
data-size returns the on-disk parquet size since a LazyFrame has no
materialized in-memory size.

Both systems now expose the standard install/start/check/stop/load/query/
data-size scripts and a 4-line benchmark.sh shim, removing the old
benchmark.sh / run.js / query.py / formatResult.js paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…use in query

Per review: clickhouse-local persists table metadata in its --path dir, so
the CREATE TABLE only needs to run once during ./load. ./query just runs
the query against the persisted table.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
alexey-milovidov and others added 3 commits May 7, 2026 12:29
…atively

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… readiness

Per review (alexey-milovidov): clickhouse start leaves the system in the
desired state (server running) even when it returns non-zero with "already
running". Make the shared driver tolerate non-zero from ./start and rely on
bench_check_loop as the authoritative readiness signal. This lets per-system
start scripts stay simple — they just need to make a best-effort attempt to
launch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
prmoore77 added a commit to gizmodata/ClickBench that referenced this pull request May 7, 2026
…ouse#860)

Adopts the per-system 7-script interface from ClickHouse#860 for gizmosql/, and
replaces the Java sqlline-based gizmosqlline client with the C++
gizmosql_client shell that ships with gizmosql_server.

Scripts (matching the contract from lib/benchmark-common.sh):
  benchmark.sh - 4-line shim that exec's ../lib/benchmark-common.sh
  install      - apt + curl gizmosql_cli_linux_$ARCH.zip; no openjdk, no
                 separate gizmosqlline download
  start        - idempotent server bring-up (skips if port 31337 is open)
  check        - cheap TCP probe (auth-gated SQL would need credentials)
  stop         - kills tracked PID; pkill belt-and-braces fallback
  load         - rm -f clickbench.db, then create.sql + load.sql via
                 gizmosql_client; deletes hits.parquet and sync's
  query        - reads one query from stdin, runs via gizmosql_client with
                 .timer on + .mode trash; emits fractional seconds as the
                 last stderr line (parsed from "Run Time: X.XXs")
  data-size    - wc -c clickbench.db

Notes:
- BENCH_DOWNLOAD_SCRIPT=download-hits-parquet-single, BENCH_RESTARTABLE=yes
  (gizmosql is a server, so per-query restart neutralizes warm-process
  effects, matching the clickhouse/postgres pattern in ClickHouse#860).
- util.sh now exports GIZMOSQL_HOST/PORT/USER/PASSWORD - the env vars
  gizmosql_client reads natively, so query/load can call gizmosql_client
  with no flags. The server still receives the username via --username.
- PID_FILE moved to a stable /tmp path (was /tmp/gizmosql_server_$$.pid,
  which broke across the start/stop process boundary in the new layout).

This PR depends on ClickHouse#860 (which introduces lib/benchmark-common.sh and the
contract). Once ClickHouse#860 lands, this PR's diff against main will be only
the gizmosql/ files. Validated locally on macOS with gizmosql v1.22.4:
the query script produces the expected fractional-seconds last line with
correct stdout/stderr separation, and exits non-zero on error paths.

See https://docs.gizmosql.com/#/client for gizmosql_client docs.
alexey-milovidov and others added 18 commits May 9, 2026 01:22
Resolves merge conflicts:

- Removed cedardb/run.sh, gizmosql/run.sh — superseded by the standard
  query interface; the refactor branch already replaced them.
- Restored datafusion{,-partitioned}/make-json.sh, doris{,-parquet}/get-result-json.sh
  with main's dated-results version. These are independent post-run JSON
  builders, still referenced from the per-system READMEs.
- Kept the thin benchmark.sh shim in gizmosql/, spark-{auron,comet,gluten}/,
  trino/. Per-system result-JSON auto-save (added on main while this branch
  was in flight) is intentionally not carried over: under the new interface,
  result.csv is the single timing artifact and JSON construction belongs in
  separate tooling.
- gizmosql/{install,load,query,util.sh}: merge auto-took main's switch from
  gizmosqlline (Java) to gizmosql_client (CLI shipped with the server),
  but the refactor branch's load/query still referenced GIZMOSQL_SERVER_URI
  and GIZMOSQL_USERNAME. Updated install to drop openjdk + gizmosqlline,
  load to use gizmosql_client (and stop the server first to release the
  database file), and query to drive gizmosql_client with .timer/.mode trash
  and parse "Run Time:" instead of "rows selected (... seconds)".
…-system layout

These four entries were added on main while this branch was in flight (the
existing trino/ scripts here were a memory-connector stub that never worked
end-to-end). Rebuild each one against the new install/start/check/stop/load/
query/data-size contract so they share lib/benchmark-common.sh:

- trino, trino-partitioned: Hive connector + file metastore + local Parquet
  hardlinked into data/hits/ (matches main's working impl from PR #856).
- trino-datalake{,-partitioned}: same, plus the AnonymousAWSCredentials shim
  to read clickhouse-public-datasets/hits_compatible/athena from anonymous
  S3 (the published bucket size is reported by data-size since the data is
  read on demand). BENCH_DOWNLOAD_SCRIPT="" — no local dataset to fetch.
- benchmark.sh in all four becomes a 4-line shim. Old run.sh deleted.
…r-system layout

These four entries were added on main while this branch was in flight.
Adapt them to the install/start/check/stop/load/query/data-size contract:

- presto, presto-partitioned: Hive connector + file metastore + local Parquet
  hardlinked into data/hits/.
- presto-datalake{,-partitioned}: same plus the AnonymousAWSCredentials shim
  (compiled in a throwaway trinodb/trino container, since the prestodb image
  ships only a JRE) so the hive-hadoop2 plugin can read the public bucket
  anonymously. BENCH_DOWNLOAD_SCRIPT="" — schema-only load against S3.

Each benchmark.sh becomes a 4-line shim. Old run.sh deleted.
These two entries were added on main while this branch was in flight.
Adapt to the install/start/check/stop/load/query/data-size contract:

- BENCH_DOWNLOAD_SCRIPT="" — the vortex bench binary fetches Parquet and
  converts to .vortex on first invocation.
- BENCH_RESTARTABLE=no — embedded Rust CLI; per-query restart would
  dominate query time.
- query: stages stdin into a temp queries-file and passes -q 0, since the
  bench binary addresses queries by index rather than reading SQL on stdin.
- The single variant uses the `clickbench` binary (vortex 0.34.0); the
  partitioned variant uses `query_bench clickbench` (vortex 0.44.0). Old
  run.sh deleted.
Quickwit was added on main while this branch was in flight. Adapt to the
install/start/check/stop/load/query/data-size contract:

- BENCH_QUERIES_FILE="queries.json" — Quickwit accepts Elasticsearch-format
  JSON queries via the /_elastic compat API, not SQL. queries.json holds one
  ES query per line; queries not expressible in Quickwit are encoded as the
  literal "null".
- BENCH_DOWNLOAD_SCRIPT="" — the load script fetches hits.json.gz directly
  (there is no shared download-hits-json helper) and pipes it through
  `quickwit tool local-ingest`, since v0.9's sharded ingest-v2 endpoint caps
  single-node throughput at a few MB/s.
- BENCH_RESTARTABLE=yes — relies on the common driver's per-query restart
  to flush Quickwit's fast_field_cache and split_footer_cache (the result
  caches are already disabled in node-config.yaml).
- query: returns non-zero for "null" queries so the framework records null
  in the per-query timing array; otherwise reports .took (ms → seconds).

Old run.sh deleted.
The original used /tmp/gizmosql_server_$$.pid where $$ is the calling
process's PID. That worked when benchmark.sh sourced util.sh and called
start/stop in the same shell, but under the new per-system layout each of
start, stop, load, and query sources util.sh in its own subshell — so
stop_gizmosql couldn't find the PID file written by start_gizmosql. Use a
fixed path under the system directory instead. Also expose wait_for_gizmosql
so callers (like load) can wait for readiness without restarting.
Conflict only in gizmosql/benchmark.sh — kept the thin shim. Main switched
gizmosql to the official one-line installer (PR #879); fold that into
gizmosql/install so we stop hand-detecting arch and downloading the zip.

Other changes auto-merged: quickwit/index_config.yaml gained tag_fields on
CounterID + record:basic on text fields (PR #886), and assorted result
JSONs for ClickHouse Cloud / Citus / Cratedb / etc.
start/stop scripts may emit progress lines (clickhouse-server prints PID
table tracking, sudo's chown invocation, postgres's startup messages,
etc.). With BENCH_RESTARTABLE=yes those scripts run before every query,
so their output interleaves with the parseable [t1,t2,t3] / Load time /
Data size lines and breaks the cloud-init log POST to play.clickhouse.com.

Redirect both stdout and stderr from ./start and ./stop to /dev/null at
the three call sites in lib/benchmark-common.sh. The check loop is the
authoritative readiness signal, so losing start's output costs nothing
in steady state; for debugging, run ./start manually outside the driver.
The DuckDB installer at install.duckdb.org drops the binary into
~/.duckdb/cli/latest/duckdb and only suggests adding that directory to
PATH. Previously each install attempted a per-user symlink into
~/.local/bin, which silently no-ops when that directory isn't on PATH
(default for root in cloud-init). The result was ./check failing for
300s with no useful error.

Symlink to /usr/local/bin/duckdb via sudo right after install instead;
that's on PATH for every user, and the symlink is itself idempotent.
Ubuntu's docker.io ships the docker CLI without the v2 compose plugin, so
the existing `command -v docker` short-circuit skipped installation on
boxes that already had docker but no `docker compose`. ./start then ran
`docker compose up -d`, which silently failed, and ./check timed out at
300s. Fall back to docker-compose-v2 for the Ubuntu package name.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Throughput variant of ClickBench. N connections (default 10) hold open
sessions and each picks a uniformly random query from the standard
43-query set; the run goes for a fixed wall-clock window (default 600s)
after a warmup. Reports completed queries, QPS, latency p50/p95/p99,
and per-query mean.

Backends: ClickHouse over HTTP (stdlib http.client), StarRocks over the
MySQL wire protocol (pymysql). Each system's recommended path so neither
is paying a wire-format penalty the other isn't.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ned}/query: pass query via temp file

`python3 - <<'PY' ... PY` directs the heredoc into python3's stdin so the
interpreter can read its program from there. Once the heredoc is fully
consumed, sys.stdin (the same FD) is at EOF — so sys.stdin.read() inside
the heredoc returned an empty string, and chdb / hyper / sail dutifully
ran the empty query and reported ~0.000s for every try.

Stage stdin into a temp file in bash before invoking the heredoc and pass
the path as argv[1]; the python script reads the query from that file.

Also include result materialization in the timing window for chdb/query
and chdb-parquet-partitioned/query (move `end = ...` past fetchall /
str(res)) — the timer was previously stopped before the result was
realized, which would have under-counted query time even when the stdin
bug wasn't masking it entirely.
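
A minimal sketch of the fixed pattern (file handling simplified; the real per-system query scripts differ in detail):

query_file=$(mktemp)
cat > "$query_file"                    # the query arrives on this script's stdin
python3 - "$query_file" <<'PY'
import sys, time
query = open(sys.argv[1]).read()       # read from argv[1], not the exhausted stdin
start = time.time()
# ... run the query and materialize the result here before stopping the timer ...
print(f"{time.time() - start:.3f}", file=sys.stderr)
PY
rm -f "$query_file"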
Right now ./check stderr is silently dropped while the loop retries for
300s, then we report "did not succeed within 300s" with no clue why.
For deterministic failures (missing env var like YT_PROXY for chyt, an
install step that didn't run, etc.) the user wastes 5 minutes and still
has to dig through the per-system check script to find out what
happened. Capture the last attempt's stderr and print it on timeout.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
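
Roughly the shape of the change (sketch; the actual timeout handling lives in the shared driver):

    deadline=$((SECONDS + 300))
    until ./check 2> /tmp/check-last.stderr; do
        if [ "$SECONDS" -ge "$deadline" ]; then
            echo "./check did not succeed within 300s; last attempt said:" >&2
            cat /tmp/check-last.stderr >&2
            exit 1
        fi
        sleep 1
    done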
The upstream install path assumes RHEL/Rocky/Alma — yum, grubby, SELinux,
the wheel group, /data0. On Ubuntu/Debian the prereqs phase silently
half-completes (several `|| true` skips), the gpadmin user is sometimes
not created, and db-install would later die at `yum install -y go`.
Either way ./check times out at 300s with no diagnostic. Bail with a
clear "needs yum" message before doing anything destructive, and call
out the requirement in the README.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cloud-init runs scripts as root with HOME unset. Tools that follow
XDG-ish conventions then fall over: the GizmoSQL one-line installer
exits at line 32 with "HOME: parameter not set" (it runs under `sh -u`),
duckdb-vortex's `INSTALL vortex` writes to /.duckdb/extensions/... and
later fails to find it ("Extension /.duckdb/extensions/v1.5.2/..."),
and duckdb-datalake{,-partitioned} queries crash 43 times each with
"Can't find the home directory at ''" while autoloading httpfs.

Each affected install script tried to paper over this locally with
`export HOME=${HOME:=~}`, but the export only lives for that script —
the sibling load/query scripts the lib runs in fresh subprocesses still
see HOME unset. Set it once here so every per-system step inherits it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
apt's monetdb5-sql post-install creates /var/lib/monetdb as the monetdb
user's home dir, so the existing `if [ ! -d /var/lib/monetdb ]` guard
skipped `monetdbd create` and left the dbfarm uninitialized. ./check
then looped 300s on `mclient: cannot connect: control socket does not
exist` and the run died.

Probe the dbfarm marker file (.merovingian_properties) instead of the
directory, and explicitly `monetdbd start` after create — both are
idempotent, and a daemon that's already up just no-ops.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
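
A sketch of the new guard (paths as described above; the exact install script may differ in detail):

    # Probe the dbfarm marker, not the home directory that apt pre-creates.
    if [ ! -f /var/lib/monetdb/.merovingian_properties ]; then
        sudo -u monetdb monetdbd create /var/lib/monetdb
    fi
    # Per the commit message this is idempotent; a daemon that's already up just no-ops.
    sudo -u monetdb monetdbd start /var/lib/monetdb || true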
paradedb/paradedb:0.10.0 (the prior pin) was rotated out of Docker Hub —
docker pull returned "manifest not found" and ./check timed out. The
oldest tags still hosted are 0.15.x, so move both directories onto a
real Postgres-version-specific tag (latest-pg17) that paradedb still
maintains.

This unblocks the image pull. NOTE: paradedb dropped its pg_lakehouse /
parquet_fdw extension after 0.10.x (the parquet_fdw_handler() function
no longer exists), so create.sql still needs to be reworked away from
the foreign-table approach for queries to succeed end-to-end. That's a
separate change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The prior URL (qa-build.oss-cn-beijing.aliyuncs.com selectdb-doris-2.1.7-rc01)
returned 404 — SelectDB stopped publishing free standalone tarballs once
the product moved fully to a managed-cloud offering. VeloDB (the company
that now stewards SelectDB) hosts the official Apache Doris release
binaries instead, which are functionally what SelectDB ships today.

Pin to the current stable (4.0.5) and use the symmetric $dir_name path
layout that doris/install already uses, instead of the hardcoded
selectdb-doris-2.1.7 segment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ckHouse/ClickBench into refactor/per-system-script-interface
…ckHouse/ClickBench into refactor/per-system-script-interface
alexey-milovidov force-pushed the refactor/per-system-script-interface branch from d6342fd to 51cb9c6 on May 10, 2026 17:46
alexey-milovidov and others added 27 commits May 10, 2026 19:49
…ckHouse/ClickBench into refactor/per-system-script-interface
A JSON file of the form {"error": "..."} marks a failed run for that
system/machine; such entries are now excluded from data.generated.js so
the system is omitted from the report.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Combine the small filters (Open source, Hardware, Tuned) into a single
  horizontal row at the top of the selectors table.
- Top-row filters now hide options in System / Type / Machine / Cluster
  size that have no entries satisfying the criteria.
- Hovering a system in the System list, a summary row, or a details
  column header highlights that system's tags in the Type list with a
  green background (light/dark theme aware).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The stored-theme bootstrap was calling setTheme(), which calls render(),
which now references `let systems` via applyTopRowFilters() — throwing a
ReferenceError because the binding is still in its temporal dead zone.
Set the data-theme attribute directly at bootstrap; the final render()
at the end of the script handles the initial render.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The lukewarm-cold-run tag predates the May 7 refactor (1f352ad) when
the bench loop didn't reliably restart between queries, so each cold
run could re-use a warm process. After the refactor BENCH_RESTARTABLE
governs that explicitly: systems either fully restart (with the cache
drop landing on an actually-cold process tree thanks to the
stop→wait_stopped→drop_caches→start ordering) or sit out at
BENCH_RESTARTABLE=no for in-process tools where restart would dominate
the timing. Either way the "lukewarm" qualifier no longer applies to
results produced under the new driver.

Strip the tag from:
* every results/2026050[7-9]/*.json and results/20260510/*.json that
  carried it — 295 files across 29 systems (citus, clickhouse,
  clickhouse-web, databend, doris, doris-parquet, greenplum,
  mariadb-columnstore, pg_clickhouse, pg_duckdb-parquet, pg_mooncake,
  pgpro_tam, polars, polars-dataframe, presto + 3 variants, questdb,
  siglens, starrocks, timescaledb, trino + 3 variants, ursa, velodb,
  victorialogs)
* template.json of those same 29 systems, so future runs don't
  re-introduce it.

Older results (pre-refactor) keep the tag — they were produced under
the historical driver and the attribute is genuine for them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Firebolt's "wait until ready" loop did

    curl -sS http://localhost:3473/ --data-binary 'SELECT ...' \
        > /dev/null && break

which exited on the FIRST HTTP response — including the HTTP 200
that carries

    {"errors":[{"description":"Cluster not yet healthy: Node startup
                                is not yet finished"}],
     "statistics":{"elapsed":0.0}}

while the container is still warming up. So the bench would proceed
straight to CREATE TABLE, get the same Cluster-not-healthy error, run
all 43 queries (each replying with "elapsed":0.0), and emit a log
that looked fine: 43/43 timing triplets, load_time present, data_size
present.

The sink.parser MV's "good" predicate then rejected the row for

    arrayExists(x -> arrayExists(y -> toFloat64OrZero(y) > 0.1, x), runtimes)

— every timing is 0.0, so no element exceeds 0.1, the row never lands
in sink.results, and the website has had no new Firebolt result since
2026-02-21 even though the bench has been "running" successfully.

Pipe the response into grep "Firebolt is ready" and only break when the
sentinel actually appears in the body. Same fix for all three variants
(firebolt, firebolt-parquet, firebolt-parquet-partitioned).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
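
Sketch of the fixed wait loop (the probe query shown is an assumption; the point is gating on the sentinel string rather than on getting any HTTP response at all):

    until curl -sS http://localhost:3473/ --data-binary "SELECT 'Firebolt is ready'" \
            | grep -q 'Firebolt is ready'; do
        sleep 5
    done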
Add the "stateless" tag to all result files and templates of systems
that do not maintain persistent state of their own: polars, sail*,
spark*, *-parquet, *-datalake. With the recent load-metric filter
change, these systems are correctly omitted from the Load Time view.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- When the Load Time metric is selected, exclude entries tagged
  "stateless" — they have no meaningful load time.
- Hovering a tag in the Type list highlights every summary row and
  details column header for systems carrying that tag, mirroring the
  existing system-hover behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pick up the latest result entries that landed upstream during the
stateless-tag rebase.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…emplate

Follow-up to e29ce9e: more post-refactor results have landed since
(firebolt finally producing rows, new sweeps on the c6a/c8g/c7a
machines for clickhouse-web, databend, citus, presto/trino variants,
etc.), and they reinstated the tag.

Strip again across every post-refactor result dir for the 68 systems
that have at least one results/202605{07-10}/ entry. The firebolt
template still carried the tag from before this round (it wasn't
touched last time because firebolt had no post-refactor result yet);
clear it now that firebolt is back on the dashboard.

189 result files updated, plus firebolt/template.json. Pre-refactor
results keep the tag — the attribute is genuine for them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two bugs in the start/stop helpers were silently hanging the bench
after load:

1. stop_gizmosql did `kill $pid; wait $pid 2>/dev/null`, but `wait` on
   a non-child returns immediately (errno suppressed). The function
   returned before gizmosql_server was actually gone. The DuckDB fcntl
   lock on clickbench.db hadn't been released yet, so the next
   start_gizmosql kicked off a new server that crashed in ~2 s with

       terminate called after throwing duckdb::IOException
       Could not set lock on file "clickbench.db":
           Conflicting lock is held in . (PID <old-pid>)

   visible in gizmosql_server.log.

2. start_gizmosql then sat in `while ! nc -z ...; sleep 1; done` for
   the dead server's port that would never open — pstree shows ./start
   stuck on the sleep. With ./start invoked as `./start >/dev/null 2>&1`
   from bench_run_query, the user sees nothing on stdout/stderr after
   "Load time: ..." and assumes the bench is hung.

Fix both:

* stop_gizmosql polls `kill -0 $pid` until the process is actually
  gone (up to 60 s), then SIGKILLs if still alive.
* start_gizmosql retries up to 5 times. Each attempt has a 60 s
  bounded wait, and is abandoned early if the child PID dies (the
  lock-conflict case). A 2 s sleep between attempts gives the kernel
  time to release the prior lock.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
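
The fixed stop helper, in outline (sketch; the real util.sh also handles the missing-PID-file and pkill fallback cases):

    stop_gizmosql() {
        local pid
        pid=$(cat "$PID_FILE" 2>/dev/null) || return 0
        kill "$pid" 2>/dev/null || true
        for _ in $(seq 1 60); do
            # Unlike `wait`, kill -0 works for non-children; loop until the process is gone.
            kill -0 "$pid" 2>/dev/null || return 0
            sleep 1
        done
        kill -9 "$pid" 2>/dev/null || true    # still alive after 60 s
    }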
…t-parquet{,-partitioned},sirius}/load: surface OOM/crash mid-ingest

Every dataframe/server load script was the same one-liner:

    elapsed=$(curl -sS -X POST http://127.0.0.1:8000/load \
        | python3 -c 'import json,sys; print(json.load(sys.stdin)["elapsed"])')

With `set -e` and a `$(...)` capture, neither curl errors nor JSON
parse errors land in the log. When the server gets OOM-killed mid-
ingest (e.g. duckdb-memory on c6a.4xlarge — 13 GB of .tmp spill,
ingest reaches ~99 %, earlyoom takes the process), curl exits with
"connection reset", the pipe collapses silently, the script exits
non-zero with no message, cloud-init prints "Disk usage after",
sink.parser rejects the row for having no timings, and we get a 266 s
"successful-looking" run with zero results on the dashboard.

Capture the response body, branch on curl vs JSON failures, and
print the actual server output (often the OOM traceback or an
HTTPException detail) to stderr before bailing. Same pattern across
all 7 sibling load scripts; the sirius variant gets a tailored
"server may have crashed during GPU-buffer init" message since its
real ingest happens in the duckdb CLI step before the curl.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
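
The hardened shape, roughly (sketch; endpoint and field names as in the one-liner above):

    if ! body=$(curl -sS -X POST http://127.0.0.1:8000/load); then
        echo "load request failed; the server may have crashed mid-ingest (OOM?)" >&2
        exit 1
    fi
    if ! elapsed=$(echo "$body" | python3 -c 'import json,sys; print(json.load(sys.stdin)["elapsed"])'); then
        echo "unexpected /load response:" >&2
        echo "$body" >&2
        exit 1
    fi
    echo "$elapsed"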
…sults

Upstream landed new result files without the "stateless" tag (and the
"lukewarm-cold-run" cleanup pass stripped it from a few). Re-add the
tag uniformly across polars, sail*, spark*, *-parquet, *-datalake.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous commit's `git add */template.json` glob picked up two
untracked archive directories. Drop them from the index; the files
remain in the working tree as before.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… cold tries

Semantic shift: the old flag conflated "does this system have a daemon
to restart" with "does the data survive a restart". The new BENCH_DURABLE
asks the second question directly, and the driver acts on it:

  BENCH_DURABLE=yes (default) — data is on disk (daemons like clickhouse,
                                or CLI tools like duckdb operating on a
                                .db file). Cold cycle is the existing
                                stop -> wait_stopped -> drop_caches ->
                                start -> check. Each ./query then runs
                                3 times: first cold, two warm.
  BENCH_DURABLE=no            — data lives in process memory (in-process
                                servers: pandas, polars, duckdb-dataframe,
                                duckdb-memory, chdb-dataframe, daft-*,
                                sirius). The restart wipes it, so after
                                the start+check we re-run ./load and
                                roll its wall-clock into the first
                                ("cold") try. The recorded cold timing
                                is now load+query, which is the honest
                                cost of "fresh-start engine + first
                                query" instead of "warm query against a
                                freshly-loaded RAM dataset".

Migrations:
* All 108 benchmark.sh shims renamed BENCH_RESTARTABLE -> BENCH_DURABLE.
* 9 in-process server shims set DURABLE=no (chdb-dataframe, daft-parquet,
  daft-parquet-partitioned, duckdb-dataframe, duckdb-memory, pandas,
  polars, polars-dataframe, sirius).
* Previously RESTARTABLE=no CLI systems (duckdb, sqlite, datafusion,
  hyper, octosql, opteryx, etc.) become DURABLE=yes — they're stateless
  per-process so the cold cycle is essentially drop_caches + a no-op
  stop/start.
* 6 in-process load scripts (chdb-dataframe, duckdb-dataframe,
  duckdb-memory, pandas, polars-dataframe, sirius) used to rm
  hits.parquet right after the first load — drop that, since
  bench_run_query now needs the source file to reload after each
  restart. Moved the cleanup into bench_main (DURABLE=no branch only).
* Back-compat: BENCH_RESTARTABLE is still read as an alias if BENCH_DURABLE
  isn't set, so stale env/scripts keep working for one cycle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
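
Condensed sketch of the driver's cold cycle under the new flag (helper names illustrative; the real logic is in lib/benchmark-common.sh):

    ./stop > /dev/null 2>&1
    wait_until_stopped                                     # illustrative helper
    sync
    echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
    ./start > /dev/null 2>&1
    while ! ./check > /dev/null 2>&1; do sleep 1; done
    if [ "${BENCH_DURABLE:-yes}" = "no" ]; then
        reload_start=$SECONDS
        ./load > /dev/null                                 # data lived in process memory
        reload_seconds=$((SECONDS - reload_start))         # rolled into the first (cold) try
    fi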
Add the "stateless" tag to glaredb and glaredb-partitioned results and
templates so they are excluded from the Load Time view alongside the
other Parquet/data-lake systems.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Conflicts:
* gizmosql/util.sh — main added '--storage-version latest' to the
  gizmosql_server invocation; refactor branch rewrote the start/stop
  helpers with bounded waits, retry-on-DuckDB-lock-conflict, and a
  polling-based stop. Kept both: the rewritten helpers keep their
  shape, with the storage-version flag folded into the new
  start_gizmosql's nohup call.
* data.generated.js — both sides regenerated; resolved by taking
  this branch's version and re-running generate-results.sh so the
  data.generated.js reflects the union of all post-merge result
  files in tree.
… timed window

Ingest writers (postgres COPY, ClickHouse INSERT, DuckDB CTAS, etc.)
return well before their pages reach disk. Individual per-system load
scripts were inconsistent about calling `sync` themselves — some did,
some didn't, and a few of the recent dataframe scripts dropped it
when we refactored away their `rm -f hits.parquet` step. Without a
final sync the first cold query then pays the writeback as if it
were query work, which is unfair to the systems whose load script
DID flush.

Add `sync` to bench_load between `./load` and the end-of-load
timestamp. Now load_time is the honest "data is on disk" wall-clock
for every system, and the cold timer doesn't catch leftover
writeback. Per-system `sync` calls at the end of individual load
scripts become redundant but stay in place for clarity (they're
inside the same timed window, so the cost is unchanged).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirror the "ClickHouse derivative" convention: every system whose
engine is DuckDB (vanilla or wrapped) gets a "DuckDB derivative"
tag in template.json and every historical result.

Tagged 17 systems × all their results:

  duckdb, duckdb-{dataframe, memory, datalake, datalake-partitioned,
                  parquet, parquet-partitioned, vortex, vortex-partitioned}
  motherduck                  (managed DuckDB cloud)
  pg_duckdb, pg_duckdb-{indexed, parquet, motherduck}, pg_ducklake
                              (Postgres extension running DuckDB inside;
                              ducklake is built on the same extension)
  gizmosql                    (Arrow Flight SQL frontend over DuckDB)
  sirius                      (GPU-accelerated DuckDB extension via
                              call gpu_processing("..."))

Insertion is right after "column-oriented" (or, failing that, the
language tag) — matches where "ClickHouse derivative" lives.
Multi-line tags arrays get the new tag as its own indented line;
single-line arrays keep their compact shape with the tag inserted
inline.

302 files updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both run DuckDB as the underlying OLAP execution engine even though
they expose a Postgres surface:

* pg_mooncake — Postgres extension by Mooncake Labs that pushes
  columnar query execution through DuckDB.
* crunchy-bridge-for-analytics — Crunchy Bridge's analytics tier built
  on the crunchy_query_engine Postgres extension, which is a forked
  DuckDB.

Audited the rest of the recent batch: arc is a Go time-series engine
(Basekick-Labs, no DuckDB), sail / sail-partitioned are the Rust
Spark-Connect server (DataFusion under the hood, not DuckDB). Those
stay as-is.

Tag inserted right after "column-oriented" in the same surgical
shape as the previous DuckDB-derivative sweep. 16 files total
(2 templates + 14 result JSONs).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hovering a row in the summary table now reveals a × to the left of the
table; clicking it deselects that system from the System filter and
re-renders. The hit box overlaps the cell so the hover doesn't drop
while moving the cursor toward the symbol.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Render the entry's YYYY-MM-DD date at the right end of the bar cell in
the summary table, in a small monospace font with a 2px text-stroke
outline in the background color so it stays legible over bar colors
and row stripes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- generate-results.sh: when a JSON omits the .date field, inject the
  YYYY-MM-DD value derived from the YYYYMMDD directory segment.
- Backfill the .date field in 40 sail / sail-partitioned / timescale-cloud
  / timescaledb-no-columnstore result files that were missing it, so the
  source-of-truth JSON also carries the date.
- Regenerate data.generated.js.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bench_main now runs an extra step after the cold/warm sweep:
N workers (default 10) fire queries against the running system for
D seconds (default 600). Each connection picks queries from a
deterministic per-connection permutation seeded from
BENCH_CONCURRENT_SEED + connection_id (SHA-256 → integer to keep it
stable across Python versions), so connection 3 hits the same query
order on every engine — cross-system QPS becomes a comparison of
identical workloads rather than each system rolling its own shuffle.

Before the window starts, do one more stop / wait_stopped /
drop_caches / start / check (and ./load for BENCH_DURABLE=no), so the
test doesn't inherit whatever caches the last bench_run_query left.

A side watchdog polls ./check every 5 s. When the engine dies during
the window the watchdog revives it (./stop + ./start + check loop +
./load for non-durable) WITHOUT halting the workers — they keep
firing queries; errors during the dead window count toward
"Concurrent error ratio" and successes after the revive count toward
"Concurrent QPS". The QPS therefore stays a real number across
mid-test crashes; "null" only when the engine never recovered enough
to serve a single query.

Two new log lines:
  Concurrent QPS: <N.NNN | null>
  Concurrent error ratio: <0.NNN | null>

prepare-database.sql:
* sink.results gains concurrent_qps and concurrent_error_ratio
  (Nullable(Float64)).
* sink.parser extracts both via toFloat64OrNull so the literal "null"
  string lands as a SQL NULL.
* output JSON template carries the new fields so the per-system
  result.json gets them too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
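
A rough sketch of that watchdog loop (variable and helper names illustrative):

    while [ "$SECONDS" -lt "$window_end" ]; do
        if ! ./check > /dev/null 2>&1; then
            # Revive the engine; the workers keep firing and count errors meanwhile.
            ./stop  > /dev/null 2>&1
            ./start > /dev/null 2>&1
            while ! ./check > /dev/null 2>&1; do sleep 1; done
            [ "${BENCH_DURABLE:-yes}" = "no" ] && ./load > /dev/null
        fi
        sleep 5
    done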