feat: GPU-accelerated indexing with Collection API integration #176
Open
cluster2600 wants to merge 34 commits into alibaba:main
Conversation
- backends/detect.py: Hardware detection
- backends/gpu.py: FAISS GPU integration
- backends/quantization.py: Product Quantization
- backends/opq.py: OPQ + Scalar Quantization
- backends/search.py: Search optimization
- backends/hnsw.py: HNSW implementation
- backends/apple_silicon.py: Apple Silicon optimization
- backends/benchmark.py: Benchmarks

Internal sprint work, not for upstream PR.
- ShardManager for vector sharding
- DistributedIndex with scatter-gather queries
- QueryRouter for routing strategies
- ResultMerger for merging results from shards
- Support for hash, range, and random sharding
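As a rough illustration of the scatter-gather pattern these classes describe, here is a minimal sketch. The class and function names (HashShardRouter, merge_topk) are hypothetical, not the PR's actual API; only the hash-sharding and result-merging ideas come from the commit above.

```python
# Hypothetical sketch of hash sharding + scatter-gather merging.

class HashShardRouter:
    """Route document ids to shards by hash, one of the three
    strategies (hash, range, random) the commit mentions."""

    def __init__(self, num_shards: int):
        self.num_shards = num_shards

    def route(self, doc_id: int) -> int:
        # Stable mapping from id to shard.
        return hash(doc_id) % self.num_shards


def merge_topk(shard_results, k):
    """Scatter-gather merge: concatenate (distance, id) pairs
    returned by each shard and keep the global k nearest."""
    merged = sorted(
        (d, i)
        for dists, ids in shard_results
        for d, i in zip(dists, ids)
    )
    return merged[:k]


# Example: 2 shards each return their local top-2; the merger
# keeps the global top-2 across both.
results = [([0.1, 0.7], [3, 9]), ([0.2, 0.5], [4, 8])]
print(merge_topk(results, 2))  # [(0.1, 3), (0.2, 4)]
```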
- Add README.md with full API documentation
- Add BENCHMARK_README.md with benchmark results
- Add test_backends.py with comprehensive tests
- Adjust k to avoid sampling errors
- Simplify k-means implementation
- Fix codebooks shape
Based on cuVS documentation:
- Support for CAGRA, IVF-PQ, HNSW algorithms
- 12x faster builds, 8x lower latency target
- Dynamic batching for CAGRA
Based on cuVS documentation:
- IVF-PQ: 12x faster builds, 8x lower latency
- CAGRA: 10x lower latency with dynamic batching, 8x throughput
- Both fall back gracefully when cuVS is not available
- 9x speedup target vs CPU
- Compatible with DiskANN
Based on arXiv:2401.11324:
- Synthetic clustered data generation
- FAISS CPU/GPU/IVF-PQ benchmarks
- cuVS placeholder benchmarks
- Results output to markdown
S3: GPU-PIM collaboration research
S4: Memory coalescing kernel (2-8x speedup)
S5: Apple ANE optimization guide
S6: ANE vs MPS benchmark
S7: Graph reordering (15% QPS gain)
S8: PIM evaluation framework

All based on scientific papers.
1. cuVS C++ bindings (zvec_cuvs.h)
   - IVFPQ, CAGRA, HNSW index classes
   - Template-based for float/uint8_t/int8_t
2. CUDA coalesced kernels (coalesce.cuh, coalesce.cu)
   - Coalesced L2 distance (2-8x speedup)
   - Warp-level reductions
   - FP16 support
   - Tiled shared-memory version
3. Metal MPS kernels (distance.metal)
   - L2 distance with SIMD/NEON
   - FP16 support for Apple Silicon
   - Batch processing
   - Matrix multiplication

All based on scientific papers.
1. SIMD CPU optimization (simd_distance.h)
   - SSE2, AVX2 for x86
   - NEON for ARM/Apple Silicon
   - 4-16x speedup expected
2. CMake build system (CMakeLists.txt)
   - CUDA coalesced kernels
   - Metal shaders
   - SIMD CPU
   - Optional cuVS integration
3. Graph-based ANN (graph_ann.h)
   - CAGRA-like implementation
   - NN-Descent graph construction
   - Hierarchical search
1. FastScan (simd_distance.h)
   - SIMD-optimized Product Quantization
   - AVX2 distance computation
   - Bitonic sort for k-selection
2. Vamana Graph (vamana.h)
   - DiskANN algorithm
   - Robust to search parameters
   - Used in Azure AI Search
3. NUMA-aware (numa.h)
   - Per-NUMA-node allocation
   - Work-stealing thread pool
   - 6-20x speedup on multi-socket

Based on papers:
- Quake (OSDI 2025): NUMA-aware partitioning
- FAISS (2024): FastScan SIMD optimization
- DiskANN: Vamana graph
1. Lock-free concurrent structures (lockfree.h)
   - LockFreeVector (Stroustrup design)
   - AtomicIndex for HNSW
   - Hazard pointer reclamation
2. Memory pool optimizations (memory_pool.h)
   - Aligned allocator (cache-line, huge pages)
   - Object pool
   - Slab allocator
   - SoA layout
3. Batch processing (batch.h)
   - Transposed matrix for PQ (30-50% faster)
   - Loop unrolling
   - AVX-512 support
   - PQ distance tables

Based on:
- FAISS optimization guide
- Stroustrup lock-free vector
- OptiTrust paper (2024)
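The "PQ distance tables" and "transposed matrix" items above can be illustrated with a few lines of numpy. This is a simplified sketch of the standard PQ lookup-table idea, not the batch.h implementation; all variable names are illustrative, and the speed claim is the commit's, not verified here.

```python
# Sketch: product-quantization distance tables with a transposed
# code layout. Scanning one subquantizer's codes contiguously
# (codes_T[m]) is the cache-friendly layout the commit refers to.
import numpy as np

M, K, d = 4, 16, 32        # subquantizers, centroids each, total dim
ds = d // M                # dims per subquantizer
rng = np.random.default_rng(0)
centroids = rng.random((M, K, ds))
query = rng.random(d)

# table[m, k] = squared distance from the query's m-th sub-vector
# to centroid k of subquantizer m.
table = ((query.reshape(M, 1, ds) - centroids) ** 2).sum(-1)

codes = rng.integers(0, K, size=(1000, M))  # one PQ code row per vector
codes_T = codes.T                           # transposed (M, N) layout

# Approximate distances: sum table lookups across subquantizers.
dists = sum(table[m, codes_T[m]] for m in range(M))
print(dists.shape)  # (1000,)
```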
Implements GPU-accelerated indexing for zvec (issues alibaba#100, alibaba#147).

Architecture (C++ first, Python fallback):
1. C++ cuVS (via _zvec pybind11 — zero-copy, preferred path)
2. Python cuVS CAGRA / IVF-PQ (NVIDIA GPU)
3. FAISS GPU (NVIDIA GPU, general purpose)
4. Apple MPS (Apple Silicon)
5. FAISS CPU (fallback)

New files:
- backends/unified.py: UnifiedGpuIndex ABC + 6 adapters (CppCuvs, CuvsCAGRA, CuvsIvfPq, FaissGpu, FaissCpu, AppleMps) + factory
- gpu_index.py: GpuIndex bridge — build(vectors, ids), search(), query() returning Doc objects via Collection.fetch()
- tests/test_gpu_index.py: 20 unit tests (all passing)

Modified:
- backends/detect.py: cuVS + C++ cuVS detection
- model/collection.py: Collection.gpu_index() convenience method
- backends/__init__.py, __init__.py: export GpuIndex, select_backend

Usage:
gpu = collection.gpu_index("embedding")
gpu.build(vectors, doc_ids)
docs = gpu.query(query_vec, topk=10)

Signed-off-by: Maxime Grenu <maxime.grenu@gmail.com>
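The five-level fallback chain above amounts to walking a priority list and taking the first backend whose dependencies are present. A minimal sketch of that selection logic follows; the probe callable and function shape are illustrative assumptions, while the backend names and their order come from the commit message (the real select_backend wires the probes to cuVS/FAISS/MPS availability checks).

```python
# Hypothetical sketch of priority-ordered backend selection.
from typing import Callable, List

PRIORITY = [
    "cpp_cuvs",      # C++ cuVS via _zvec pybind11 (zero-copy)
    "cuvs_cagra",    # Python cuVS CAGRA
    "cuvs_ivf_pq",   # Python cuVS IVF-PQ
    "faiss_gpu",     # FAISS GPU
    "apple_mps",     # Apple Silicon MPS
    "faiss_cpu",     # CPU fallback
]


def select_backend(available: Callable[[str], bool],
                   priority: List[str] = PRIORITY) -> str:
    """Return the first backend in the chain whose probe succeeds."""
    for name in priority:
        if available(name):
            return name
    raise RuntimeError("no usable backend")


# Example: on a machine where only FAISS CPU imports cleanly,
# the chain bottoms out at the CPU fallback.
print(select_backend(lambda name: name == "faiss_cpu"))  # faiss_cpu
```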
- cuvs_cagra.py: use cagra.build(IndexParams, dataset) and cagra.search(SearchParams, index, queries, k) instead of the non-existent Index().build() / Index().search() methods
- cuvs_ivf_pq.py: same pattern fix, plus correct import path (cuvs.neighbors.ivf_pq instead of cuvs.ivf_pq)
- Both backends now convert numpy queries to cupy device arrays before search (cuVS requires CUDA-compatible memory)

Tested on RTX 4090:
- cuVS CAGRA: 43K QPS (50K vectors, dim=128)
- cuVS IVF-PQ: 45K QPS (50K vectors, dim=128)
- FAISS GPU: 529K QPS (50K vectors, dim=128, flat)

Signed-off-by: Maxime Grenu <maxime.grenu@gmail.com>
This was referenced Feb 25, 2026
Discussion issue opened: #180 — feedback welcome before review.
Address feedback from issue alibaba#180:
1. PyTorch-style device= API: collection.index("embedding", device="gpu") replaces the older gpu_index() method (kept for backward compat with deprecation warning).
2. build_from_collection(batch_size=) method: streams vectors directly from the collection in batches, avoiding manual extraction.
3. ZVEC_GPU_BACKEND_PRIORITY env var: comma-separated list of backend names that overrides the default auto-selection priority chain.
4. Hybrid CPU/GPU auto-selector: collections below gpu_threshold (default 50k, configurable via ZVEC_GPU_AUTO_THRESHOLD) automatically use CPU to avoid GPU transfer overhead on small datasets.

Tests expanded from 14 to 44 covering all new features.

Signed-off-by: Maxime Grenu <maxime.grenu@gmail.com>
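Items 3 and 4 above are simple enough to sketch. The env-var names and the 50k default come from the commit message; the parsing and threshold logic below are illustrative guesses at how such a selector might read them, not the PR's actual code.

```python
# Hedged sketch of the env-var override and size-threshold selector.
import os

DEFAULT_PRIORITY = ["cuvs_cagra", "faiss_gpu", "faiss_cpu"]


def backend_priority() -> list:
    """ZVEC_GPU_BACKEND_PRIORITY (comma-separated names) overrides
    the default auto-selection chain when set and non-empty."""
    raw = os.environ.get("ZVEC_GPU_BACKEND_PRIORITY", "")
    names = [n.strip() for n in raw.split(",") if n.strip()]
    return names or DEFAULT_PRIORITY


def use_gpu(num_vectors: int) -> bool:
    """Collections below the threshold stay on CPU, avoiding GPU
    transfer overhead on small datasets."""
    threshold = int(os.environ.get("ZVEC_GPU_AUTO_THRESHOLD", "50000"))
    return num_vectors >= threshold


os.environ["ZVEC_GPU_BACKEND_PRIORITY"] = "faiss_gpu,faiss_cpu"
print(backend_priority())            # ['faiss_gpu', 'faiss_cpu']
print(use_gpu(10_000), use_gpu(100_000))  # False True
```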
- Refactor select_backend into smaller functions to satisfy PLR0911/PLR0915
- Replace _try_create_backend if-chain with dict-based dispatch
- Add noqa comments for intentional deferred imports (PLC0415)
- Fix else/if → elif (PLR5501), zip strict= (B905), dict() (C416)
- Sort import block (I001), mark unused arg (ARG002)

Signed-off-by: Maxime Grenu <maxime.grenu@gmail.com>
Fix NPY002 (numpy random API), PLC0415 (deferred imports), G004 (f-string logging), ARG001/ARG002 (unused args), F821 (undefined names), PTH123 (pathlib), F401 (unused imports), and PGH003 (blanket type-ignore) violations flagged by ruff 0.14.4.
Signed-off-by: Maxime Kawawa-Beaudan <maxkb@meta.com>
Signed-off-by: Maxime Grenu <maxime.grenu@gmail.com>

Notebooks (*.ipynb) are interactive benchmarks/demos where print() statements and loose imports are expected. Exclude them from ruff to match the existing test/bench exclusion pattern.
Signed-off-by: Maxime Kawawa-Beaudan <maxkb@meta.com>
Signed-off-by: Maxime Grenu <maxime.grenu@gmail.com>

Run `ruff format .` to ensure consistent code formatting across all Python files and notebooks, matching the CI formatter check.
Signed-off-by: Maxime Kawawa-Beaudan <maxkb@meta.com>
Signed-off-by: Maxime Grenu <maxime.grenu@gmail.com>

Format all C++ headers in src/ailego/ to match the project's Google-based .clang-format style, fixing the CI clang-format check.
Signed-off-by: Maxime Kawawa-Beaudan <maxkb@meta.com>
Signed-off-by: Maxime Grenu <maxime.grenu@gmail.com>

Revert src/CMakeLists.txt to the upstream version, which does not require CUDA. The GPU-specific CMake config with CUDA/Metal support was breaking the CI build on runners without the CUDA toolkit. The GPU C++ headers remain header-only and don't require CUDA to compile.
Signed-off-by: Maxime Kawawa-Beaudan <maxkb@meta.com>
Signed-off-by: Maxime Grenu <maxime.grenu@gmail.com>
- Remove spurious .T in asymmetric_distance_computation() that transposed the (Q, N) lookup result into (N, Q), causing a broadcast shape mismatch
- Fix off-by-one in test_distributed_index: assert shard count == 4 instead of checking for non-existent shard index 4

Signed-off-by: Maxime Kawawa-Beaudan <maxkb@meta.com>
Signed-off-by: Maxime Grenu <maxime.grenu@gmail.com>
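The shape bug fixed above is easy to see in a few lines of numpy. This is a simplified sketch of asymmetric distance computation (per-query lookup tables indexed by database PQ codes), not the module's actual code; all sizes and names are illustrative.

```python
# Sketch: the ADC lookup naturally yields a (Q, N) matrix,
# queries along rows, database vectors along columns. The
# spurious .T the commit removes turned it into (N, Q), which
# then failed to broadcast against per-query arrays downstream.
import numpy as np

Q, N, M, K = 3, 5, 4, 16   # queries, db vectors, subquantizers, centroids
rng = np.random.default_rng(0)
tables = rng.random((Q, M, K))              # per-query distance tables
codes = rng.integers(0, K, size=(N, M))     # PQ codes per db vector

# dists[q, n] = sum over m of tables[q, m, codes[n, m]]
dists = tables[:, np.arange(M), codes].sum(axis=-1)
print(dists.shape)  # (3, 5) -- i.e. (Q, N), no transpose needed
```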
Summary
Implements GPU-accelerated indexing for zvec, addressing #100 and #147.
Incorporates community feedback from #180 (@xXMrNidaXx).
Core architecture
- UnifiedGpuIndex ABC with 6 backend adapters behind a single interface
- GpuIndex bridge connecting GPU search to Collection.fetch() for full Doc results
- _zvec cuVS bindings (zero-copy via GpuBufferLoader), falling back to Python cuVS → FAISS GPU → Apple MPS → FAISS CPU

New in this update (from #180 feedback)
- PyTorch-style device= API — collection.index("embedding", device="gpu") replaces the older gpu_index() method (kept with deprecation warning for backward compat)
- build_from_collection(batch_size=) — streams vectors directly from the collection in batches, no manual extraction needed
- ZVEC_GPU_BACKEND_PRIORITY env var — comma-separated backend names that override the default auto-selection priority: export ZVEC_GPU_BACKEND_PRIORITY=faiss_gpu,cuvs_cagra,faiss_cpu
- Hybrid CPU/GPU auto-selector — collections below gpu_threshold (default 50K, configurable via ZVEC_GPU_AUTO_THRESHOLD env var) automatically route to CPU to avoid GPU transfer overhead on small datasets
Backend priority
1. _zvec pybind11 — zero-copy
2. cuvs.neighbors.cagra
3. cuvs.neighbors.ivf_pq
4. faiss.index_cpu_to_gpu
5. torch.backends.mps

Usage
Test results
Unit tests (44 tests passing)
Apple MPS (M-series Mac)
NVIDIA RTX 4090 (CUDA 12, sm_89)
Merge order
Test plan
- _zvec is built with CUDA support (requires custom build)