Commit 83fa0c4

docs: update README for index-get low-selectivity path (#13)
Reflect current architecture: pre-scan projects _key + filter cols, low-selectivity path uses index.get() instead of full Parquet scan, scan_provider is only used for WHERE evaluation during pre-scan.
1 parent 4397342 commit 83fa0c4

1 file changed

Lines changed: 8 additions & 9 deletions

README.md

@@ -46,7 +46,7 @@ let index = cfg.load_index("my_table.index")?;

 Registration requires two providers:

-- **`scan_provider`** (`Arc<dyn TableProvider>`) — used for WHERE evaluation and the low-selectivity Parquet-native path. Should contain all columns including the vector column.
+- **`scan_provider`** (`Arc<dyn TableProvider>`) — used for WHERE evaluation during the pre-scan phase (scalar columns only).
 - **`lookup_provider`** (`Arc<dyn PointLookupProvider>`) — used for O(k) key-based row fetch after HNSW search. Does not need the vector column.

 `PointLookupProvider` extends DataFusion's `TableProvider` with a single method:
@@ -211,7 +211,7 @@ src/

 tests/
   optimizer_rule.rs — rewrite rule matching/rejection tests
-  execution.rs — end-to-end execution tests (HNSW + Parquet-native paths)
+  execution.rs — end-to-end execution tests (HNSW + index-get paths)
 ```

 ### Optimizer rewrite
@@ -245,22 +245,21 @@ Query arrives
   |
   +-- Has WHERE clause
         |
-        +-- Pre-scan: scan_provider (scalar + _key cols only, filter pushdown)
+        +-- Pre-scan: scan_provider (_key + filter cols only, predicate pushdown)
         |     -> collect valid_keys, compute selectivity
         |
         +-- Low selectivity (<= threshold, default 5%)
-        |     -> Full scan from scan_provider (all cols including vector)
-        |     -> evaluate WHERE, compute distances, top-k heap
-        |     -> return directly -- NO USearch, NO lookup_provider
+        |     -> index.get(key) for each valid_key -> compute distances -> top-k
+        |     -> lookup_provider fetch(k) -> result
         |
         +-- High selectivity (> threshold)
               -> HNSW filtered_search(valid_keys predicate)
               -> lookup_provider fetch(k) -> result
 ```

-**Pre-scan phase:** Projects only scalar columns and the key column (excludes the vector column for efficiency). Filter expressions are pushed down to the scan provider. Collects the set of valid keys and computes `selectivity = valid_keys.len() / index.size()`.
+**Pre-scan phase:** Projects only `_key` and columns referenced by the WHERE clause (excludes all other columns for efficiency). Filter expressions are pushed down to the scan provider for Parquet-level pruning (row group statistics, bloom filters, page indexes). Collects the set of valid keys and computes `selectivity = valid_keys.len() / index.size()`.

-**Low-selectivity path (Parquet-native):** When few rows pass the filter, HNSW graph traversal becomes expensive (it must explore ~`k/selectivity` nodes to find k passing candidates). Instead, the full scan streams all columns including the vector, evaluates filters per batch, computes exact distances for passing rows, and maintains a top-k heap (`ScoredRow`). Returns results directly without touching USearch or the lookup provider.
+**Low-selectivity path (index-get):** When few rows pass the filter, HNSW graph traversal becomes expensive (it must explore ~`k/selectivity` nodes to find k passing candidates). Instead, vectors are retrieved directly from the USearch index via `index.get(key)` for each valid key, exact distances are computed, and a top-k heap selects the closest matches. Result rows are fetched from the lookup provider.

 **High-selectivity path (HNSW filtered):** Passes valid keys as a predicate to `index.filtered_search()` — HNSW skips non-passing nodes during traversal. Result keys are fetched from the lookup provider.
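Reviewer note: the routing this hunk documents (compute selectivity, compare it with the threshold, and on the low-selectivity side rank the valid keys by exact distance) can be sketched roughly as below. This is an illustrative sketch only, not the crate's code. A `HashMap<u64, Vec<f32>>` stands in for the USearch index, squared L2 is an arbitrary choice of lower-is-closer distance, and `Path`, `Scored`, `choose_path`, and `exact_top_k` are hypothetical names introduced for the example.

```rust
use std::cmp::Ordering;
use std::collections::{BinaryHeap, HashMap};

/// Hypothetical label for the two execution paths described in the README.
#[derive(Debug, PartialEq)]
enum Path {
    IndexGet,     // low selectivity: exact distances over the valid keys
    HnswFiltered, // high selectivity: filtered graph search
}

/// selectivity = valid_keys.len() / index.size(), as stated in the README.
fn selectivity(valid_keys: usize, index_size: usize) -> f64 {
    if index_size == 0 {
        return 0.0;
    }
    valid_keys as f64 / index_size as f64
}

/// Low selectivity is `<= threshold` (the README's default is 5%).
fn choose_path(sel: f64, threshold: f64) -> Path {
    if sel <= threshold {
        Path::IndexGet
    } else {
        Path::HnswFiltered
    }
}

/// Heap entry ordered by distance, so the farthest candidate sits on top.
struct Scored {
    dist: f32,
    key: u64,
}

impl Ord for Scored {
    fn cmp(&self, other: &Self) -> Ordering {
        self.dist
            .total_cmp(&other.dist)
            .then(self.key.cmp(&other.key))
    }
}
impl PartialOrd for Scored {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl PartialEq for Scored {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}
impl Eq for Scored {}

/// Squared Euclidean distance (lower is closer).
fn l2_sq(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Exact top-k over the valid keys using a bounded max-heap.
/// The HashMap lookup is a stand-in for `index.get(key)`.
fn exact_top_k(
    index: &HashMap<u64, Vec<f32>>,
    valid_keys: &[u64],
    query: &[f32],
    k: usize,
) -> Vec<(u64, f32)> {
    let mut heap: BinaryHeap<Scored> = BinaryHeap::with_capacity(k + 1);
    for &key in valid_keys {
        // Keys that are not present in the index are skipped.
        if let Some(vector) = index.get(&key) {
            heap.push(Scored { dist: l2_sq(query, vector), key });
            if heap.len() > k {
                heap.pop(); // evict the current farthest candidate
            }
        }
    }
    // Ascending order: nearest match first.
    heap.into_sorted_vec()
        .into_iter()
        .map(|s| (s.key, s.dist))
        .collect()
}
```

In this sketch the heap never holds more than `k + 1` entries, so the work is proportional to the number of valid keys rather than to the index size. The keys it returns would then go to the lookup provider to fetch the result rows.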

@@ -284,7 +283,7 @@ All three distance functions are **lower-is-closer**:
 cargo test
 ```

-Tests cover optimizer rule matching/rejection, end-to-end execution through both HNSW and Parquet-native paths, registration validation, and provider error handling.
+Tests cover optimizer rule matching/rejection, end-to-end execution through both HNSW and index-get paths, registration validation, and provider error handling.

 ---
