Overview of the DiskANN Project (2018–present)

Research Ideas

DiskANN started as a research project in 2018–2019 to address the large gap between vector search algorithms in the literature and the rapidly expanding scale and feature needs in industry.

Our research, with co-authors from MSR, Microsoft product groups, CMU, UMD, MIT, IITH, and UCI, addresses the following problems—many of which push the state of the art by an order of magnitude in one or more directions:

  1. The first practical, high-performance SSD-based index that could index 10× more vectors per machine than previous in-memory systems [1].
  2. The first papers on updating graph-structured vector indices with stable recall, either via merges [2] or via in-place edits [4].
  3. The first paper on predicate pushdown for vector-plus-predicate queries, delivering high recall with two or more orders of magnitude higher query performance [3].
  4. Deterministic parallel updates to the index (experiments on 192 cores) [5].
  5. A single logical, distributed 50-billion-point index across 1,000 machines with 6× higher efficiency than sharded indices [8].
  6. Investigation of out-of-distribution (OOD) queries [16].
  7. Indices for diverse recommendations [17].
  8. Adaptations of large indices for GPUs [21].
  9. A theoretical analysis of beam search for graph-structured vector indices [25]; the search routine itself is sketched after this list.
  10. Adaptive distances for large vector search with predicates [26].
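
Most of these results build on a single primitive: beam search over a graph-structured index. As a point of reference, here is a minimal sketch of that routine in Rust, assuming a simple in-memory adjacency-list layout; the names, layout, and parameters are illustrative, not the library's actual API.

```rust
use std::cmp::Ordering;
use std::collections::HashSet;

/// Squared Euclidean distance (illustrative; real indices support several metrics).
fn dist(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Beam search over a graph index. `points[i]` holds the vector of node i and
/// `neighbors[i]` its out-edges (hypothetical in-memory layout). Returns the
/// ids of the `k` best candidates found, using beam width `l` (with l >= k).
fn beam_search(
    points: &[Vec<f32>],
    neighbors: &[Vec<usize>],
    start: usize,
    query: &[f32],
    k: usize,
    l: usize,
) -> Vec<usize> {
    // The beam: (distance, id) pairs kept sorted by distance, truncated to l.
    let mut beam = vec![(dist(&points[start], query), start)];
    let mut expanded: HashSet<usize> = HashSet::new();

    // Repeatedly expand the closest not-yet-expanded candidate in the beam.
    while let Some(&(_, id)) = beam.iter().find(|(_, id)| !expanded.contains(id)) {
        expanded.insert(id);
        for &nbr in &neighbors[id] {
            if !beam.iter().any(|&(_, b)| b == nbr) {
                beam.push((dist(&points[nbr], query), nbr));
            }
        }
        beam.sort_by(|a, b| a.0.partial_cmp(&b.0).unwrap_or(Ordering::Equal));
        beam.truncate(l); // keep only the l closest candidates
    }
    beam.into_iter().take(k).map(|(_, id)| id).collect()
}
```

The beam width l is the knob that trades latency for recall: a wider beam visits more of the graph and typically returns closer neighbors at higher cost.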

Some of the ideas are surveyed in a recent bulletin [6].

Adoption

Many of these ideas are implemented in an open-source project [12]; they are used widely within Microsoft and across industry, and have inspired hardware adaptations. A few examples include:

  1. Our code supports at-scale vector indices across Microsoft, including Bing, Ads, Microsoft 365, Windows, and Azure databases.
  2. In the PostgreSQL ecosystem, they are implemented by TimescaleDB as pgvectorscale [14].
  3. In the Cassandra ecosystem, DataStax (now part of IBM) implemented them as JVector [15].
  4. Milvus, Pinecone, Weaviate, and other vector databases have implemented or adapted these ideas.
  5. Storage-only vector search by Kioxia [19].
  6. Intel's adaptations for Optane PMem [20].
  7. NVIDIA's adaptations for the cuVS library [18], [22].

Benchmarks

Along the way, we realized there were few public datasets or benchmarks, so we partnered with other companies and universities to:

  1. Create new datasets for large-scale vector search and its variants [13].
  2. Publish open-source baseline algorithms [12].
  3. Run two competitions at NeurIPS 2021 and NeurIPS 2023 [9], [10]. The resulting datasets and baselines have been used in many theses and research papers, including ones at database and ML conferences.

Current and Future

The code for this research [12] was forked many times internally and reimplemented externally, which made it hard to maintain and to extend with new algorithms. Further, since the 2023 version of DiskANN [12] was tied to specific points in the storage hierarchy and managed its own index terms, it was hard to integrate into databases, which prevented it from being hardened into a highly available and durable vector database.

With this in mind, since 2023 we have rewritten DiskANN in Rust with the following goals:

  1. DiskANN delegates storage of index terms to a host database (or key-value store or file system), which it accesses and mutates via a Provider API.
  2. DiskANN is a stateless orchestrator of vector requests between users, indexers, query engines, and the storage backend.
  3. DiskANN provides a minimal API (updates with or without minibatches, paginated search) and integrates into the query planner for predicate evaluation. A sketch of what these interfaces might look like follows this list.
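
To make the first and third goals concrete, the sketch below shows one possible shape for the Provider trait and the minimal index API, assuming the simplest granularity (one vector plus one adjacency list per key). All names and signatures here are hypothetical, not the actual Rust API.

```rust
use std::io;

/// Hypothetical Provider trait: the host database (or key-value store, or file
/// system) owns the index terms; DiskANN reads and mutates them through it.
pub trait Provider {
    /// Fetch the stored vector and adjacency list of a node.
    fn read_node(&self, id: u64) -> io::Result<(Vec<f32>, Vec<u64>)>;
    /// Persist a node's vector and adjacency list.
    fn write_node(&mut self, id: u64, vector: &[f32], neighbors: &[u64]) -> io::Result<()>;
    /// Remove a node from the backing store.
    fn delete_node(&mut self, id: u64) -> io::Result<()>;
}

/// Hypothetical minimal API: updates (optionally minibatched) and paginated
/// search. The index holds no state beyond a handle to the Provider, so
/// DiskANN stays a stateless orchestrator over the host's storage.
pub struct Index<P: Provider> {
    provider: P,
}

impl<P: Provider> Index<P> {
    pub fn new(provider: P) -> Self {
        Self { provider }
    }

    /// Insert or update a minibatch of vectors (a batch of size 1 is a plain update).
    pub fn upsert(&mut self, batch: &[(u64, Vec<f32>)]) -> io::Result<()> {
        for (id, vector) in batch {
            // A real implementation would also rewire graph edges here;
            // this sketch just persists the node through the Provider.
            self.provider.write_node(*id, vector, &[])?;
        }
        Ok(())
    }

    /// Paginated search: up to `page_size` neighbor ids starting at `offset`.
    pub fn search(&self, _query: &[f32], offset: usize, page_size: usize) -> io::Result<Vec<u64>> {
        // A real implementation would run beam search through `self.provider`
        // and return one page of results; elided in this sketch.
        let _ = (offset, page_size);
        Ok(Vec::new())
    }
}
```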

This allows DiskANN to be plugged into different databases or systems and to inherit the availability and durability of the host database. The host database can choose to operate DiskANN at different memory tiers suited to target cost-performance points. Our new version has been integrated with five (and counting) backends. It can also be connected to memory buffers to compete with FAISS, hnswlib, or the older "monolithic" in-memory DiskANN.
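
As a hypothetical illustration of that pluggability, the same Index from the sketch above could be backed by a plain in-memory Provider standing in for a host database or memory buffer:

```rust
use std::collections::HashMap;
use std::io;

/// Illustrative in-memory backend implementing the hypothetical Provider trait.
#[derive(Default)]
struct MemProvider {
    nodes: HashMap<u64, (Vec<f32>, Vec<u64>)>,
}

impl Provider for MemProvider {
    fn read_node(&self, id: u64) -> io::Result<(Vec<f32>, Vec<u64>)> {
        self.nodes
            .get(&id)
            .cloned()
            .ok_or_else(|| io::Error::new(io::ErrorKind::NotFound, "no such node"))
    }
    fn write_node(&mut self, id: u64, vector: &[f32], neighbors: &[u64]) -> io::Result<()> {
        self.nodes.insert(id, (vector.to_vec(), neighbors.to_vec()));
        Ok(())
    }
    fn delete_node(&mut self, id: u64) -> io::Result<()> {
        self.nodes.remove(&id);
        Ok(())
    }
}

fn main() -> io::Result<()> {
    let mut index = Index::new(MemProvider::default());
    index.upsert(&[(1, vec![0.1, 0.2]), (2, vec![0.3, 0.4])])?; // a minibatch of two
    let first_page = index.search(&[0.1, 0.2], 0, 10)?; // first page of results
    println!("{first_page:?}");
    Ok(())
}
```

Swapping MemProvider for an adapter over a database's key-value layer is, in this sketch, the only change needed to move the index between storage tiers.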

Integrated with Azure Cosmos DB for NoSQL, Microsoft's highly available geo-distributed database, DiskANN brings vector indexing into operational databases and is competitive with specialized serverless vector databases [7]. See the slides from our VLDB 2025 talk on this integration [23].

For a 25-minute tour of the project, see the slides from our overview talk at VLDB 2025 [24].

References

  1. Fast Accurate Billion-point Nearest Neighbor Search on a Single Node
  2. FreshDiskANN: A Fast and Accurate Graph-Based ANN Index for Streaming Similarity Search
  3. FilteredDiskANN: Graph Algorithms for Approximate Nearest Neighbor Search with Filters
  4. In-Place Updates of a Graph Index for Streaming Approximate Nearest Neighbor Search
  5. ParlayANN: Scalable and Deterministic Parallel Graph-Based Approximate Nearest Neighbor Search Algorithms
  6. The DiskANN library: Graph-Based Indices for Fast, Fresh and Filtered Vector Search
  7. Cost-Effective, Low Latency Vector Search with Azure Cosmos DB
  8. DistributedANN: Efficient Scaling of a Single DiskANN Graph Across Thousands of Computers
  9. Results of the NeurIPS'21 Challenge on Billion-Scale Approximate Nearest Neighbor Search
  10. Results of the Big ANN: NeurIPS'23 competition
  11. https://big-ann-benchmarks.com
  12. https://github.com/microsoft/DiskANN
  13. Big ANN Benchmarks dataset list
  14. TimescaleDB's pgvectorscale
  15. DataStax (IBM) JVector
  16. OOD-DiskANN: Efficient and Scalable Graph ANNS for Out-of-Distribution Queries
  17. Graph-Based Algorithms for Diverse Similarity Search
  18. https://www.nvidia.com/en-us/on-demand/session/gtc25-s72905/
  19. AiSAQ: All-in-Storage ANNS with Product Quantization for DRAM-free Information Retrieval
  20. Intel: Winning the NeurIPS Billion-Scale Approximate Nearest Neighbor Search Challenge
  21. BANG: Billion-Scale Approximate Nearest Neighbor Search using a Single GPU
  22. NVIDIA cuVS and DiskANN
  23. Cosmos DB Vector Search VLDB 2025 slides
  24. DiskANN overview slides
  25. Sort Before You Prune: Improved Worst-Case Guarantees of the DiskANN Family of Graphs
  26. Learning Filter-Aware Distance Metrics for Nearest Neighbor Search with Multiple Filters