docs: update README for index-get low-selectivity path (#13)
Reflect current architecture: pre-scan projects _key + filter cols,
low-selectivity path uses index.get() instead of full Parquet scan,
scan_provider is only used for WHERE evaluation during pre-scan.
README.md: 8 additions & 9 deletions
@@ -46,7 +46,7 @@ let index = cfg.load_index("my_table.index")?;
 
 Registration requires two providers:
 
-- **`scan_provider`** (`Arc<dyn TableProvider>`) — used for WHERE evaluation and the low-selectivity Parquet-native path. Should contain all columns including the vector column.
+- **`scan_provider`** (`Arc<dyn TableProvider>`) — used for WHERE evaluation during the pre-scan phase (scalar columns only).
 - **`lookup_provider`** (`Arc<dyn PointLookupProvider>`) — used for O(k) key-based row fetch after HNSW search. Does not need the vector column.
 
 `PointLookupProvider` extends DataFusion's `TableProvider` with a single method:
-| -> return directly -- NO USearch, NO lookup_provider
+| -> index.get(key) for each valid_key -> compute distances -> top-k
+| -> lookup_provider fetch(k) -> result
 |
 +-- High selectivity (> threshold)
 -> HNSW filtered_search(valid_keys predicate)
 -> lookup_provider fetch(k) -> result
 ```
 
-**Pre-scan phase:** Projects only scalar columns and the key column (excludes the vector column for efficiency). Filter expressions are pushed down to the scan provider. Collects the set of valid keys and computes `selectivity = valid_keys.len() / index.size()`.
+**Pre-scan phase:** Projects only `_key` and columns referenced by the WHERE clause (excludes all other columns for efficiency). Filter expressions are pushed down to the scan provider for Parquet-level pruning (row group statistics, bloom filters, page indexes). Collects the set of valid keys and computes `selectivity = valid_keys.len() / index.size()`.
 
-**Low-selectivity path (Parquet-native):** When few rows pass the filter, HNSW graph traversal becomes expensive (it must explore ~`k/selectivity` nodes to find k passing candidates). Instead, the full scan streams all columns including the vector, evaluates filters per batch, computes exact distances for passing rows, and maintains a top-k heap (`ScoredRow`). Returns results directly without touching USearch or the lookup provider.
+**Low-selectivity path (index-get):** When few rows pass the filter, HNSW graph traversal becomes expensive (it must explore ~`k/selectivity` nodes to find k passing candidates). Instead, vectors are retrieved directly from the USearch index via `index.get(key)` for each valid key, exact distances are computed, and a top-k heap selects the closest matches. Result rows are fetched from the lookup provider.
 
 **High-selectivity path (HNSW filtered):** Passes valid keys as a predicate to `index.filtered_search()` — HNSW skips non-passing nodes during traversal. Result keys are fetched from the lookup provider.
 
@@ -284,7 +283,7 @@ All three distance functions are **lower-is-closer**:
 cargo test
 ```
 
-Tests cover optimizer rule matching/rejection, end-to-end execution through both HNSW and Parquet-native paths, registration validation, and provider error handling.
+Tests cover optimizer rule matching/rejection, end-to-end execution through both HNSW and index-get paths, registration validation, and provider error handling.
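
Note on the documented behavior: the path selection and the index-get path described in the updated README can be sketched roughly as below. This is an illustration only, not the crate's implementation. A `HashMap<u64, Vec<f32>>` stands in for the USearch index (its `get` plays the role of `index.get(key)`), and `Path`, `choose_path`, `Scored`, `l2_sq`, and `top_k_exact` are hypothetical names. Squared L2 is assumed as the distance, which matches the README's lower-is-closer convention.

```rust
use std::cmp::Ordering;
use std::collections::{BinaryHeap, HashMap};

/// Execution path chosen after the pre-scan (hypothetical enum).
#[derive(Debug, PartialEq)]
enum Path {
    IndexGet,     // low selectivity: exact distances over the valid keys
    HnswFiltered, // high selectivity: HNSW search with a key predicate
}

/// selectivity = valid_keys / index_size; strictly above the threshold -> HNSW.
fn choose_path(valid_keys: usize, index_size: usize, threshold: f64) -> Path {
    if index_size == 0 {
        return Path::IndexGet;
    }
    let selectivity = valid_keys as f64 / index_size as f64;
    if selectivity > threshold {
        Path::HnswFiltered
    } else {
        Path::IndexGet
    }
}

/// Heap entry ordered by distance, then key, so the max-heap root is the worst candidate.
#[derive(Debug)]
struct Scored {
    dist: f32,
    key: u64,
}

impl Ord for Scored {
    fn cmp(&self, other: &Self) -> Ordering {
        self.dist
            .total_cmp(&other.dist)
            .then(self.key.cmp(&other.key))
    }
}
impl PartialOrd for Scored {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl PartialEq for Scored {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}
impl Eq for Scored {}

/// Squared Euclidean distance (lower is closer).
fn l2_sq(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// For each valid key: fetch the vector, compute the exact distance, and keep
/// the k closest in a bounded max-heap. Keys missing from the store are skipped.
/// Returns (key, distance) pairs, closest first.
fn top_k_exact(
    store: &HashMap<u64, Vec<f32>>,
    valid_keys: &[u64],
    query: &[f32],
    k: usize,
) -> Vec<(u64, f32)> {
    let mut heap: BinaryHeap<Scored> = BinaryHeap::with_capacity(k + 1);
    for &key in valid_keys {
        let Some(vector) = store.get(&key) else { continue };
        heap.push(Scored { dist: l2_sq(query, vector), key });
        if heap.len() > k {
            heap.pop(); // evict the current farthest candidate
        }
    }
    heap.into_sorted_vec()
        .into_iter()
        .map(|s| (s.key, s.dist))
        .collect()
}
```

The work here is proportional to the number of valid keys rather than the table size, which is presumably why the README reserves this path for filters that pass few rows. The keys it returns would then go to the lookup provider for the final row fetch.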