A DataFusion extension that integrates USearch HNSW approximate nearest-neighbour (ANN) vector search as a first-class SQL operator.
Queries matching the ORDER BY distance_fn(col, query) LIMIT k pattern are transparently rewritten by an optimizer rule into a native USearch index call — the caller writes plain SQL. WHERE clause filters are handled adaptively: high-selectivity filters use USearch's in-graph predicate API, while low-selectivity filters bypass graph traversal and compute exact distances over the rows that pass the filter.
Built against DataFusion 52.2 and USearch 2.24.
[dependencies]
datafusion-vector-search-ext = { git = "https://github.com/hotdata-dev/datafusion-vector-search-ext" }
usearch = "2.24.0"

USearchIndexConfig is the single source of truth for index parameters. Persist it alongside your .index file — USearch does not embed metadata inside the saved file.
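Since the saved index carries no metadata, one lightweight option is a plain-text sidecar next to the .index file. Everything below (the sidecar format and the save_sidecar/load_sidecar helpers) is an invented sketch for illustration, not part of the crate's API:

```rust
use std::fs;
use std::io;

// Hypothetical sidecar: "<dimensions>,<metric>" as plain text next to the
// .index file. Helper names and format are illustrative only.
fn save_sidecar(path: &str, dims: usize, metric: &str) -> io::Result<()> {
    fs::write(path, format!("{dims},{metric}"))
}

fn load_sidecar(path: &str) -> io::Result<(usize, String)> {
    let text = fs::read_to_string(path)?;
    let (dims, metric) = text
        .split_once(',')
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidData, "malformed sidecar"))?;
    let dims = dims
        .parse()
        .map_err(|_| io::Error::new(io::ErrorKind::InvalidData, "bad dimension count"))?;
    Ok((dims, metric.to_string()))
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join("my_table.index.meta");
    let path = path.to_string_lossy().into_owned();
    save_sidecar(&path, 768, "l2sq")?;
    // Reloading recovers the parameters needed to call cfg.load_index().
    assert_eq!(load_sidecar(&path)?, (768, "l2sq".to_string()));
    Ok(())
}
```

Serializing the whole USearchIndexConfig (e.g. as JSON) works the same way and scales better as fields are added.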
use datafusion_vector_search_ext::USearchIndexConfig;
use usearch::MetricKind;
let cfg = USearchIndexConfig::new(768, MetricKind::L2sq);
// Build a new index:
let index = cfg.build_index()?;
index.reserve(n_rows)?;
for (key, vec) in rows {
index.add(key, &vec)?; // key must be a unique u64
}
index.save("my_table.index")?;
// Reload from disk:
let index = cfg.load_index("my_table.index")?;

Key mapping: USearch stores u64 keys only. Each key must map 1:1 to a row in your providers. A simple monotonic row ID (0, 1, 2, ...) works well.
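When rows span multiple datasets or files, a common scheme is to pack two 32-bit ids into one u64. The layout below is a guess for illustration — keys.rs exposes pack_key/unpack_key, which may use a different encoding:

```rust
// Illustrative only: derive a unique u64 key from a dataset id and a
// per-dataset row id. Not necessarily the crate's actual pack_key layout.
fn pack_key(dataset_id: u32, row_id: u32) -> u64 {
    ((dataset_id as u64) << 32) | row_id as u64
}

fn unpack_key(key: u64) -> (u32, u32) {
    // High 32 bits: dataset id; low 32 bits: row id.
    ((key >> 32) as u32, key as u32)
}

fn main() {
    let key = pack_key(3, 1_000_000);
    assert_eq!(unpack_key(key), (3, 1_000_000));
    println!("{key}");
}
```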
Registration requires two providers:
- scan_provider(Arc<dyn TableProvider>) — used for WHERE evaluation during the pre-scan phase (scalar columns only).
- lookup_provider(Arc<dyn PointLookupProvider>) — used for O(k) key-based row fetch after HNSW search. Does not need the vector column.
PointLookupProvider extends DataFusion's TableProvider with a single method:
#[async_trait]
impl PointLookupProvider for MyTable {
async fn fetch_by_keys(
&self,
keys: &[u64],
key_col: &str,
projection: Option<&[usize]>,
) -> Result<Vec<RecordBatch>> {
// Look up each key via your primary-key index.
// Missing keys are silently omitted.
todo!()
}
}

For development and tests, use the bundled HashKeyProvider:
use datafusion_vector_search_ext::HashKeyProvider;
let provider = Arc::new(HashKeyProvider::try_new(schema, batches, "id")?);

Wire everything together:
use std::sync::Arc;
use datafusion::execution::context::SessionStateBuilder;
use datafusion::prelude::SessionContext;
use datafusion_vector_search_ext::{
USearchRegistry, USearchQueryPlanner, register_all,
};
use usearch::{MetricKind, ScalarKind};
let registry = USearchRegistry::new();
registry.add(
"my_table::vector", // "<table>::<vector_col>"
Arc::new(index),
scan_provider.clone(), // TableProvider (e.g. Parquet)
lookup_provider.clone(), // PointLookupProvider (e.g. SQLite)
"id", // key column name
MetricKind::L2sq,
ScalarKind::F32,
)?;
let registry = registry.into_arc();
let state = SessionStateBuilder::new()
.with_default_features()
.with_query_planner(Arc::new(USearchQueryPlanner::new(registry.clone())))
.build();
let ctx = SessionContext::new_with_state(state);
register_all(&ctx, registry)?;
ctx.register_table("my_table", scan_provider)?;

No special syntax is needed — the extension transparently rewrites standard SQL:
SELECT id, title, l2_distance(vector, ARRAY[0.1, 0.2, ...]) AS dist
FROM my_table
ORDER BY dist ASC
LIMIT 10

Recognised patterns:
| Distance function | Index metric | Order |
|---|---|---|
| l2_distance | MetricKind::L2sq | ASC |
| cosine_distance | MetricKind::Cos | ASC |
| negative_dot_product | MetricKind::IP | ASC |
All three are lower-is-closer. ORDER BY ... DESC falls back to exact computation. Metric mismatch (e.g. cosine UDF on an L2 index) also falls back silently.
Scalar WHERE conditions are absorbed and handled adaptively:
SELECT id, l2_distance(vector, ARRAY[...]) AS dist
FROM my_table
WHERE category = 'nlp'
ORDER BY dist ASC
LIMIT 10See Adaptive filtering for details on how the execution path is chosen.
For runtime query vectors, complex joins, or explicit over-fetch control:
SELECT vs.key, vs._distance, d.title
FROM vector_usearch('my_table', ARRAY[0.1, 0.2, ...], 20) vs
JOIN my_table d ON d.id = vs.key
ORDER BY vs._distance ASC
LIMIT 10

The UDTF always calls index.search() directly — no filter absorption. Apply WHERE on the outer query to post-filter.
Pass USearchTableConfig to add_with_config for per-table tuning:
use datafusion_vector_search_ext::USearchTableConfig;
registry.add_with_config(
"my_table::vector",
Arc::new(index),
scan_provider.clone(),
lookup_provider.clone(),
"id",
MetricKind::L2sq,
ScalarKind::F32,
USearchTableConfig {
expansion_search: 128, // ef_search (default: 64)
brute_force_selectivity_threshold: 0.03, // low-sel cutoff (default: 0.05)
},
)?;

| Parameter | Default | Notes |
|---|---|---|
| expansion_search | 64 | HNSW beam width at query time. Higher = better recall, slower. Set once at registration. |
| brute_force_selectivity_threshold | 0.05 | Below this fraction of passing rows, bypass HNSW and compute distances directly. |
USearchIndexConfig build parameters:
| Parameter | Default | Notes |
|---|---|---|
| connectivity (M) | 16 | Graph degree. Higher = better recall, more memory. |
| expansion_add (ef_construction) | 128 | Must be >= 2×M. Higher = better graph, slower build. |
| quantization | ScalarKind::F32 | F16 halves memory with ~1% recall loss at high dims. |
src/
lib.rs — public API, register_all()
registry.rs — USearchRegistry, RegisteredTable, USearchIndexConfig, USearchTableConfig
node.rs — USearchNode: custom logical plan leaf
rule.rs — USearchRule: optimizer rewrite rule
planner.rs — USearchExecPlanner, USearchExec: physical execution
udf.rs — l2_distance, cosine_distance, negative_dot_product scalar UDFs
udtf.rs — vector_usearch table function
lookup.rs — PointLookupProvider trait + HashKeyProvider
keys.rs — DatasetLayout, pack_key/unpack_key key encoding
tests/
optimizer_rule.rs — rewrite rule matching/rejection tests
execution.rs — end-to-end execution tests (HNSW + index-get paths)
The rule (rule.rs) matches two logical plan shapes:
Sort(fetch=k, ORDER BY dist ASC)
  Projection([..., distance_fn(col, lit) AS dist, ...])
    TableScan(name)

Sort(fetch=k, ORDER BY dist ASC)
  Projection([..., distance_fn(col, lit) AS dist, ...])
    Filter(predicate)
      TableScan(name)
Preconditions: sort is ASC, distance UDF matches index metric, table is registered, query vector is a literal. When the rule fires, it replaces the inner nodes with a USearchNode leaf carrying: table name, vector column, query vector, k, distance type, and absorbed filter predicates. The Sort node is preserved above for final ordering.
Physical planning (planner.rs) translates USearchNode into USearchExec, a physical plan node that executes the actual search.
When WHERE filters are present, the execution follows three possible paths:
Query arrives
|
+-- No WHERE clause
| -> USearch HNSW search -> lookup_provider fetch(k) -> result
|
+-- Has WHERE clause
|
+-- Pre-scan: scan_provider (_key + filter cols only, predicate pushdown)
| -> collect valid_keys, compute selectivity
|
+-- Low selectivity (<= threshold, default 5%)
| -> index.get(key) for each valid_key -> compute distances -> top-k
| -> lookup_provider fetch(k) -> result
|
+-- High selectivity (> threshold)
-> HNSW filtered_search(valid_keys predicate)
-> lookup_provider fetch(k) -> result
Pre-scan phase: Projects only _key and columns referenced by the WHERE clause (excludes all other columns for efficiency). Filter expressions are pushed down to the scan provider for Parquet-level pruning (row group statistics, bloom filters, page indexes). Collects the set of valid keys and computes selectivity = valid_keys.len() / index.size().
Low-selectivity path (index-get): When few rows pass the filter, HNSW graph traversal becomes expensive (it must explore ~k/selectivity nodes to find k passing candidates). Instead, vectors are retrieved directly from the USearch index via index.get(key) for each valid key, exact distances are computed, and a top-k heap selects the closest matches. Result rows are fetched from the lookup provider.
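The index-get path reduces to a bounded heap over exact distances. A self-contained sketch — plain in-memory vectors stand in for index.get() results, and the real code works over Arrow batches rather than tuples:

```rust
use std::collections::BinaryHeap;

fn l2sq(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

// Bounded max-heap: keeps only the k smallest distances seen so far.
// Non-negative f32 distances compare correctly via their bit patterns,
// which sidesteps f32 not implementing Ord.
fn top_k(query: &[f32], candidates: &[(u64, Vec<f32>)], k: usize) -> Vec<(u64, f32)> {
    let mut heap: BinaryHeap<(u32, u64)> = BinaryHeap::new();
    for (key, vec) in candidates {
        let d = l2sq(query, vec);
        heap.push((d.to_bits(), *key));
        if heap.len() > k {
            heap.pop(); // evict the current worst candidate
        }
    }
    let mut out: Vec<(u64, f32)> = heap
        .into_iter()
        .map(|(bits, key)| (key, f32::from_bits(bits)))
        .collect();
    out.sort_by(|a, b| a.1.partial_cmp(&b.1).unwrap());
    out
}

fn main() {
    let candidates = vec![
        (1, vec![0.0, 0.0]),
        (2, vec![1.0, 0.0]),
        (3, vec![3.0, 4.0]),
    ];
    let hits = top_k(&[0.0, 0.0], &candidates, 2);
    assert_eq!(hits, vec![(1, 0.0), (2, 1.0)]);
    println!("{hits:?}");
}
```

This is O(n log k) over the valid keys, which beats graph traversal precisely when n (filter survivors) is small.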
High-selectivity path (HNSW filtered): Passes valid keys as a predicate to index.filtered_search() — HNSW skips non-passing nodes during traversal. Result keys are fetched from the lookup provider.
Why 5%: The crossover is approximately sqrt(k * M / n). For k=10, M=16, n=240k this is ~2.6%. The 5% default provides a safety margin.
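The arithmetic behind that default, spelled out:

```rust
// Heuristic crossover from the section above: switch to the index-get
// path when selectivity falls below roughly sqrt(k * M / n).
fn crossover(k: f64, m: f64, n: f64) -> f64 {
    (k * m / n).sqrt()
}

fn main() {
    let s = crossover(10.0, 16.0, 240_000.0);
    // ~0.026, i.e. ~2.6%; the 0.05 default leaves a safety margin.
    println!("{s:.4}");
    assert!((s - 0.0258).abs() < 0.001);
}
```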
All three distance functions are lower-is-closer:
| SQL function | Index metric | Kernel |
|---|---|---|
| l2_distance(a, b) | L2sq | sum((a_i - b_i)^2) |
| cosine_distance(a, b) | Cos | 1 - dot(a, b) / (norm(a) * norm(b)) |
| negative_dot_product(a, b) | IP | -(a . b) |
l2_distance returns squared L2 (no sqrt), matching USearch's MetricKind::L2sq. This ensures numeric consistency between the UDF, the rewritten index path, and the brute-force path.
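For reference, the three kernels written out as plain slice functions. The crate's UDFs presumably operate on Arrow arrays rather than slices, but the arithmetic is the same:

```rust
// Squared L2 (no sqrt), matching MetricKind::L2sq.
fn l2_distance(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

// 1 - cosine similarity: 0 for parallel, 1 for orthogonal vectors.
fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    1.0 - dot / (na * nb)
}

// Negated inner product, so that lower still means closer.
fn negative_dot_product(a: &[f32], b: &[f32]) -> f32 {
    -a.iter().zip(b).map(|(x, y)| x * y).sum::<f32>()
}

fn main() {
    let a = [1.0_f32, 0.0];
    let b = [0.0_f32, 1.0];
    assert_eq!(l2_distance(&a, &b), 2.0); // squared L2: no sqrt
    assert_eq!(cosine_distance(&a, &b), 1.0); // orthogonal
    assert_eq!(negative_dot_product(&a, &a), -1.0);
    println!("ok");
}
```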
cargo test

Tests cover optimizer rule matching/rejection, end-to-end execution through both HNSW and index-get paths, registration validation, and provider error handling.
| Limitation | Notes |
|---|---|
| Stacked Filter nodes | Only one Filter -> TableScan layer is absorbed. Filter -> Filter -> TableScan falls back to exact execution. DataFusion typically combines multiple WHERE conditions into a single Filter, so this rarely occurs. |
| Runtime query vectors | The query vector must be a compile-time literal (ARRAY[0.1, ...]). Column references or subquery results are not rewritten. Use the UDTF path for runtime vectors. |
| Per-query ef_search | expansion_search is global to the index instance. Per-query adjustment is not supported. |
| No DELETE / compaction | USearch soft-deletes entries but requires a full rebuild to reclaim space. |