Skip to content

SlothDB - a parody of a database (disqualified) #930

@alexey-milovidov

Description

@alexey-milovidov

This project https://github.com/SouravRoy-ETL/slothdb/ is aggressively promoted with paid GitHub stars and spam comments on the Internet.

Here is the verdict:

The names (binder, catalog, data_table, row_group, column_data, string_t, LogicalType, Vector, DataChunk, BoundExpression, ExpressionExecutor, ParallelResultCollector, "non-reserved keyword set matches DuckDB's"…) are 1:1 with DuckDB's. SlothDB is a DuckDB-shaped reimplementation: same source layout, same internal API names, same Parquet ingest shape (read_parquet(...), SELECT * REPLACE, epoch_ms, make_date).

Where the speed actually comes from

  1. A 14k-line physical_planner.cpp that pattern-matches ClickBench query shapes and routes them to hand-rolled per-query handlers. The execution directory literally has files named for ClickBench query numbers:
  src/execution/q4_dict_histogram.cpp   - Q4   SELECT AVG(UserID) FROM hits
  src/execution/q6_string_dedup.cpp     - Q6   COUNT(DISTINCT SearchPhrase)
  src/execution/q10_agg.cpp             - Q10  RegionID GROUP BY with mixed aggs
  src/execution/q11_helper.cpp          - Q11  MobilePhoneModel COUNT(DISTINCT)
  src/execution/q21_helper.cpp          - Q21  URL LIKE '%google%' COUNT(*)
  src/execution/q26_helper.cpp          - Q26  ORDER BY SearchPhrase LIMIT N
  + static TryComputeQ14_2Stage()       - Q14  inside physical_planner.cpp

Each handler bakes in assumptions about this dataset:

  • Q4 (q4_dict_histogram.cpp): "Quick gate: every RG must have a dict page on this column" - computes SUM(UserID) and COUNT(*) by walking only the Parquet dictionary page of each row group, never touching the
    data pages. Works because hits.parquet is dictionary-encoded.
  • Q10 (q10_agg.cpp): slot_of is a flat dense array indexed by RegionID - // RegionID is a small dense-ish UInt32 (Q10: 9040 distinct in [0, 131069]); a direct array beats a hash probe on the 100M-row per-row hot
    path.
  • Q21/Q22/Q23 (q21_helper.cpp and physical_planner.cpp lines 475/503): specialized 4-byte / 8-byte needle substring search - loads the first 4 or 8 bytes of the needle as a uint32/uint64 literal and compares
    with one cmp. The comments explicitly say // Q22 "google" hot path and // Q23 ".google." 8-byte needle.
  • Q26 (q26_helper.cpp): "Dict-trust" variant of ORDER BY VARCHAR LIMIT - extracts top-K from the Parquet dictionary alone, skipping the data scan. Comment: // For ClickBench hits.parquet SearchPhrase, dict has
    no orphans - verified empirically. Toggled by SLOTH_Q26_TRUST_DICT.
  • Q14: a dedicated TryComputeQ14_2Stage two-stage COUNT(DISTINCT) path.
  • Numerous shape-specific comments throughout physical_planner: // Q22 URL falls here, // Q31/Q32 hot path, // Q35: 9.76 M groups, // Q27 ORDER BY EventTime, SearchPhrase LIMIT 10, etc.
  1. The "generic" path is markedly slower than ClickHouse (and almost anything else, including DuckDB). Look at src/execution/parallel_executor.cpp::ParallelScanFilter - it builds rows with chunk.GetValue(col, i) per cell, then SetValue(col, result_count, ...), into a std::vector<std::vector> under a mutex. That's row-at-a-time Value boxing - orders of magnitude slower than DuckDB's vectorised pipeline. The "wins" only show up when a query matches one of the recognised shapes and bypasses this path entirely.

  2. Optimizer is three passes. optimizer.cpp only does ConstantFolding, FilterPushdown, TopNOptimization. There's no proper cost model, join reorder, or general aggregate fusion - the planner relies on the per-query handlers to win, and the comments admit it: Q31/Q32 (two-column high-cardinality GROUP BY without a custom handler) are noted as the project's weak spot (the README cites 0.29× and 0.12× of DuckDB).

Summary

The architecture is a DuckDB-clone shell (parser/binder/planner/catalog/storage layered identically) with a flat, three-pass optimiser and a generic execution path that's slower than DuckDB, plus a 14k-line physical planner whose job is to recognise ClickBench query shapes and dispatch to per-query handlers that hard-code dataset assumptions (column cardinalities, dict-page completeness, specific needle bytes, ClickBench RG layout). The README's "33 of 43 beat DuckDB, up to 5×" comes from those handlers firing on the queries they were written for; the queries with no matching handler (Q29, Q31, Q32, Q33) lose by 3–8×.

So it's not a general-purpose engine that happens to win ClickBench - it's a generic engine plus a ClickBench-shape lookup table. The benchmark is essentially part of the codebase's input.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions