The diskann-providers crate has become a dumping ground for architectural cruft. Containing it in one place is useful, but it blurs what actually belongs here. Here's my opinionated take on how to tackle this.
I think four things need to happen before we can cleanly extract the inmem providers into their own crate (and similarly split out bf-tree), ordered from easy/mechanical to architecturally involved:
1. Cruft removal.
2. Test migration from diskann_async.rs to diskann.
3. PQ cleanup, moving what belongs there into diskann-quantization.
4. A save/load overhaul.
Together, these unblock a diskann-inmem crate and others.
Importantly, I don't think much outside the following justifies creating an auxiliary crate — at least until several rounds of cleansing and extraction have been performed. (I reserve the right to be wrong.)
Cruft Removal
There's a bunch of utilities that have either been superseded or are straight-up dead code (though not caught as such, due to public exports). These should be audited and either deleted or upgraded to their new counterparts. A few quick examples:
- utils/kmeans.rs: KMeans has been replaced for PQ-related purposes by the faster implementation in diskann-quantization. The last user of this flow is the disk-index partitioning. It's possible that the BLAS-based implementation in diskann-providers is faster than the approach taken in diskann-quantization for full-dimensional data. To that end, we should either switch to diskann-quantization and get rid of the diskann-providers implementation, improve diskann-quantization to be on par for full-dimensional data, or move the implementation into diskann-disk where it's actually used.
- utils/math_util.rs: See the above kmeans use. Also, vector generation with norm can be replaced by the implementations in diskann-utils.
- Several utilities in diskann-providers/utils derive their staying power from working in-place on raw files. Is this still needed and, if so, is there a cleaner way of decoupling the implementation from files?
- model/graph/traits/graph_data_types.rs: This type should no longer be needed by non-diskann-disk code (and if it is, changing it to a raw T: VectorRepr should be straightforward). We should remove this type from diskann-providers and update remaining uses within diskann-providers to use a vector type, ID type, and associated data type independently as needed. The problem with using this large list of associated types to instantiate generic functions that only use a portion of them is that it can lead to redundant monomorphizations, and makes it less clear at each use site which associated data types are actually needed.
Test Migration
diskann_async.rs is pulling double-duty as tests for the inmem index and tests for diskann/graph/index.rs. The test infrastructure in diskann has gotten much better for running such unit tests. All unit tests should be migrated to diskann and use appropriate baselines.
Decoupling the graph invariant tests from the inmem-specific code might not be straightforward, but using the test provider in diskann will provide much better checks on the indexing code.
If we want to keep inmem tests, we should look carefully at wrapping the code under test in a trait object to avoid the insane number of test monomorphizations currently present in diskann_async.rs.
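To make the trait-object idea concrete, here's a minimal sketch. Everything here is hypothetical — DynProvider, InmemProvider, and run_invariant_checks are illustrative stand-ins, not real diskann types:

```rust
// Minimal sketch only: `DynProvider`, `InmemProvider`, and
// `run_invariant_checks` are hypothetical stand-ins, not real diskann types.

/// A narrow, object-safe view of the provider under test.
trait DynProvider {
    fn len(&self) -> usize;
    fn distance(&self, a: u32, b: u32) -> f32;
}

/// One concrete in-memory provider for the example.
struct InmemProvider {
    vectors: Vec<Vec<f32>>,
}

impl DynProvider for InmemProvider {
    fn len(&self) -> usize {
        self.vectors.len()
    }

    // Squared L2 distance between two stored vectors.
    fn distance(&self, a: u32, b: u32) -> f32 {
        self.vectors[a as usize]
            .iter()
            .zip(&self.vectors[b as usize])
            .map(|(x, y)| (x - y) * (x - y))
            .sum()
    }
}

/// Because this takes `&dyn DynProvider` instead of a generic `P: Provider`,
/// the test body is compiled exactly once, no matter how many provider
/// types the suite exercises.
fn run_invariant_checks(provider: &dyn DynProvider) -> bool {
    (0..provider.len() as u32).all(|i| provider.distance(i, i) == 0.0)
}

fn main() {
    let p = InmemProvider {
        vectors: vec![vec![0.0, 3.0], vec![4.0, 0.0]],
    };
    assert!(run_invariant_checks(&p));
    println!("invariants hold");
}
```

The trade: each provider still needs a small object-safe adapter, but the (large) test bodies stop being re-instantiated per provider type.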
I do question the usefulness of these tests when it comes to providing long-term guarantees. To really protect the integrity of the inmem providers in an end-to-end context, I think we'll want the upcoming native A/B test infrastructure in diskann-benchmark, using checked-in baselines as a basis for regression testing.
PQ
Much of the PQ code in diskann-providers predates the diskann-quantization crate. Some items are candidates for moving to diskann-quantization or elsewhere.
What to do with FixedChunkPQTable
I really do not like this data structure. The way it stores the PQ pivot data is close to the worst possible layout for any kind of efficient processing.
Ideally, we'd use something more like the TransposedTable in diskann-quantization for distance table population (though this comes at the potential cost of slower quant-quant distances).
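To make the layout difference concrete, here's a toy sketch of distance-table population under the two layouts. K, D, and the function names are made up for the sketch and are not the actual FixedChunkPQTable or TransposedTable APIs:

```rust
// Toy illustration of the two pivot layouts; K, D, and the function names
// are made up for the sketch and are not the real diskann-quantization API.

const K: usize = 256; // centers per chunk
const D: usize = 4; // dimensions in this chunk

/// Center-major layout (FixedChunkPQTable-style): pivots[center][dim].
/// The inner loop processes one short row per center.
fn table_center_major(query: &[f32; D], pivots: &[[f32; D]; K]) -> [f32; K] {
    let mut out = [0.0f32; K];
    for c in 0..K {
        for d in 0..D {
            let diff = query[d] - pivots[c][d];
            out[c] += diff * diff;
        }
    }
    out
}

/// Transposed layout (TransposedTable-style): pivots[dim][center].
/// The inner loop walks K contiguous floats, which auto-vectorizes well.
fn table_transposed(query: &[f32; D], pivots: &[[f32; K]; D]) -> [f32; K] {
    let mut out = [0.0f32; K];
    for d in 0..D {
        for c in 0..K {
            let diff = query[d] - pivots[d][c];
            out[c] += diff * diff;
        }
    }
    out
}

fn main() {
    let q = [1.0f32; D];
    // Same pivots expressed in both layouts.
    let center_major = [[0.25f32; D]; K];
    let transposed = [[0.25f32; K]; D];
    assert_eq!(
        table_center_major(&q, &center_major),
        table_transposed(&q, &transposed)
    );
    println!("layouts agree");
}
```

Both produce identical tables; the transposed inner loop just touches contiguous memory across all centers, which is what makes table population fast and which the quant-quant path gives up.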
However, there are a few issues that keep us from being able to just do this:
- For historical reasons, a centroids vector has been used to center PQ data. For vanilla PQ based on L2 distances between centers, an affine shift is not really useful, but we need to support it for backwards compatibility with some implementations. I'd be comfortable moving application of this shift elsewhere in the pre-processing stack, but I do not think diskann-quantization should be obligated to support it.
- The state of OPQ is pretty nebulous at this point. As far as I know, no one is using it and I'd be surprised if it actually worked correctly. We should make a decision on whether we actually want to support this or if we should simply ditch it. The latter would greatly simplify a lot of code.
- The FixedChunkPQTable supports quant-quant distances. The implementation is efficient-ish but could be much better. We may consider building a distance table specifically to accelerate quant-quant distances, but it would be large (at least 64 kB per chunk).
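As a sanity check on that size estimate: with the standard 256 centers per chunk (8-bit PQ codes), a full center-to-center table is 256 × 256 entries, so 64 KiB even at one byte per entry:

```rust
// Back-of-envelope for a per-chunk center-to-center distance table,
// assuming the standard 256 centers per chunk (8-bit PQ codes).
const CENTERS: usize = 256;

/// Size of a full centers-by-centers table at a given entry width.
fn table_bytes(bytes_per_entry: usize) -> usize {
    CENTERS * CENTERS * bytes_per_entry
}

fn main() {
    // 64 KiB even at one byte per entry; 256 KiB with f32 entries.
    assert_eq!(table_bytes(1), 64 * 1024);
    assert_eq!(table_bytes(4), 256 * 1024);
    println!("ok");
}
```

That cost is per chunk, so a 32-chunk index would pay 2 MiB even at the one-byte floor — which is why this is a "may consider" rather than an obvious win.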
Candidates for moving to diskann-quantization
Candidates for moving out
Open Questions
The code in pq_construction.rs is too opinionated about its file operations to justify moving it to a lower-level crate.
Save/Load Overhaul
Most of the code in storage/* implements the ad-hoc storage mechanisms used by the various index implementations.
Suhas has been working on a proposal for a structured and unified way of saving and loading.
Hopefully, we can use this to dump the problematic SaveWith/LoadWith traits and get rid of the various ad-hoc methods strewn about.
I do not think it's worth trying to clean this up until the replacement is ready.
Extracting and Fortifying Inmem
There are currently several issues with the in-memory providers, not least of which is being buried 8 layers deep in a directory.
- Deep-rooted unsafety in AlignedMemoryVectorStore (this is a hard problem to tackle efficiently).
- Tricky generic bounds on some quantizers (this is largely diskann-quantization's fault, I'm sorry).
- Sketchy APIs around start point creation/constructors.
These do not necessarily need to be fixed to create a diskann-inmem, but there are some considerations to address first.
Because the inmem providers rely on the PQ code, we either need to fix the PQ story before isolating them, or create a diskann-inmem that depends on diskann-providers until the other issues are sorted out.
Similarly, the save/load story keeps these implementations tightly coupled to the storage-related code in diskann-providers. Thus, until these underlying issues are addressed, a crate split wouldn't really create a semantic boundary.
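If we do spin up diskann-inmem before the PQ and save/load stories are fixed, the transitional dependency could look something like this (a sketch only — crate layout and paths are assumptions):

```toml
# Hypothetical transitional manifest for a new diskann-inmem crate.
[package]
name = "diskann-inmem"
version = "0.1.0"
edition = "2021"

[dependencies]
# Temporary: keep leaning on diskann-providers for PQ and storage
# until those pieces find a real home; drop this once they do.
diskann-providers = { path = "../diskann-providers" }
```

The point of the sketch is the direction of the arrow: diskann-inmem may depend on diskann-providers for a while, but nothing in diskann-providers should come to depend on diskann-inmem.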