Overview of the DiskANN Project (2018–present)

Research Ideas

DiskANN started as a research project in 2018–2019 to address the large gap between vector search algorithms in the literature and the rapidly expanding scale and feature needs in industry.

Our research, with co-authors from MSR, Microsoft product groups, CMU, UMD, MIT, IITH, and UCI, addresses the following problems—many of which push the state of the art by an order of magnitude in one or more directions:

  1. The first practical, high-performance SSD-based index that could index 10× more vectors per machine than previous in-memory systems [1].
  2. The first papers on updating graph-structured vector indices with stable recall, either via merges [2] or via in-place edits [4].
  3. The first paper on predicate pushdown for vector-plus-predicate queries, delivering high recall with two or more orders of magnitude higher query performance [3].
  4. Deterministic parallel updates to the index (experiments on 192 cores) [5].
  5. A single logical, distributed 50-billion-point index across 1,000 machines with 6× higher efficiency than sharded indices [8].
  6. Investigation of out-of-distribution (OOD) queries [16].
  7. Indices for diverse recommendations [17].
  8. Adaptations of large indices for GPUs [21].
  9. A theoretical analysis of beam search for graph-structured vector indices [25]; the search routine itself is sketched after this list.
  10. Adaptive distances for large vector search with predicates [26].
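
Most of these results build on a single primitive: beam search over a graph-structured index. As a point of reference, here is a minimal sketch of that routine in Rust, assuming a simple in-memory adjacency-list layout; the names, layout, and parameters are illustrative, not the library's actual API.

```rust
use std::cmp::Ordering;
use std::collections::HashSet;

/// Squared Euclidean distance (illustrative; real indices support several metrics).
fn dist(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Beam search over a graph index. `points[i]` holds the vector of node i and
/// `neighbors[i]` its out-edges (hypothetical in-memory layout). Returns the
/// ids of the `k` best candidates found, using beam width `l` (with l >= k).
fn beam_search(
    points: &[Vec<f32>],
    neighbors: &[Vec<usize>],
    start: usize,
    query: &[f32],
    k: usize,
    l: usize,
) -> Vec<usize> {
    // The beam: (distance, id) pairs kept sorted by distance, truncated to l.
    let mut beam = vec![(dist(&points[start], query), start)];
    let mut expanded: HashSet<usize> = HashSet::new();

    // Repeatedly expand the closest not-yet-expanded candidate in the beam.
    while let Some(&(_, id)) = beam.iter().find(|(_, id)| !expanded.contains(id)) {
        expanded.insert(id);
        for &nbr in &neighbors[id] {
            if !beam.iter().any(|&(_, b)| b == nbr) {
                beam.push((dist(&points[nbr], query), nbr));
            }
        }
        beam.sort_by(|a, b| a.0.partial_cmp(&b.0).unwrap_or(Ordering::Equal));
        beam.truncate(l); // keep only the l closest candidates
    }
    beam.into_iter().take(k).map(|(_, id)| id).collect()
}
```

The beam width l is the knob that trades latency for recall: a wider beam visits more of the graph and typically returns closer neighbors at higher cost.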

Some of the ideas are surveyed in a recent bulletin [6].

Adoption

Many of these ideas are implemented in an open-source project [12]; they are used widely within Microsoft and across industry, and have inspired hardware adaptations. A few examples include:

  1. Our code supports at-scale vector indices across Microsoft, including Bing, Ads, Microsoft 365, Windows, and Azure databases.
  2. In the PostgreSQL ecosystem, they are implemented by TimescaleDB as pgvectorscale [14].
  3. In the Cassandra ecosystem, DataStax (now part of IBM) implemented them as JVector [15].
  4. Milvus, Pinecone, Weaviate, and other vector databases have implemented or adapted these ideas.
  5. Storage-only vector search by Kioxia [19].
  6. Intel's adaptations for Optane PMem [20].
  7. NVIDIA's adaptations for the cuVS library [18], [22].

Benchmarks

Along the way, we realized there were few public datasets or benchmarks, so we partnered with other companies and universities to:

  1. Create new datasets for large-scale vector search and its variants [13].
  2. Publish open-source baseline algorithms [12].
  3. Run two competitions at NeurIPS 2021 and NeurIPS 2023 [9], [10]. The resulting datasets and baselines have been used in many theses and research papers, including ones at database and ML conferences.

Current and Future

The code for this research [12] was forked many times internally and reimplemented externally, which made it hard to maintain and to extend with new algorithms. Further, since the 2023 version of DiskANN [12] was tied to specific points in the storage hierarchy and managed its own index terms, it was hard to integrate into databases, which prevented it from being hardened into a highly available and durable vector database.

With this in mind, since 2023 we have rewritten DiskANN in Rust with the following goals:

  1. DiskANN delegates storage of index terms to a host database (or key-value store or file system), which it accesses and mutates via a Provider API.
  2. DiskANN is a stateless orchestrator of vector requests between users, indexers, query engines, and the storage backend.
  3. DiskANN provides a minimal API (updates with or without minibatches, paginated search) and integrates into the query planner for predicate evaluation. A sketch of what these interfaces might look like follows this list.
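
To make the first and third goals concrete, the sketch below shows one possible shape for the Provider trait and the minimal index API, assuming the simplest granularity (one vector plus one adjacency list per key). All names and signatures here are hypothetical, not the actual Rust API.

```rust
use std::io;

/// Hypothetical Provider trait: the host database (or key-value store, or file
/// system) owns the index terms; DiskANN reads and mutates them through it.
pub trait Provider {
    /// Fetch the stored vector and adjacency list of a node.
    fn read_node(&self, id: u64) -> io::Result<(Vec<f32>, Vec<u64>)>;
    /// Persist a node's vector and adjacency list.
    fn write_node(&mut self, id: u64, vector: &[f32], neighbors: &[u64]) -> io::Result<()>;
    /// Remove a node from the backing store.
    fn delete_node(&mut self, id: u64) -> io::Result<()>;
}

/// Hypothetical minimal API: updates (optionally minibatched) and paginated
/// search. The index holds no state beyond a handle to the Provider, so
/// DiskANN stays a stateless orchestrator over the host's storage.
pub struct Index<P: Provider> {
    provider: P,
}

impl<P: Provider> Index<P> {
    pub fn new(provider: P) -> Self {
        Self { provider }
    }

    /// Insert or update a minibatch of vectors (a batch of size 1 is a plain update).
    pub fn upsert(&mut self, batch: &[(u64, Vec<f32>)]) -> io::Result<()> {
        for (id, vector) in batch {
            // A real implementation would also rewire graph edges here;
            // this sketch just persists the node through the Provider.
            self.provider.write_node(*id, vector, &[])?;
        }
        Ok(())
    }

    /// Paginated search: up to `page_size` neighbor ids starting at `offset`.
    pub fn search(&self, _query: &[f32], offset: usize, page_size: usize) -> io::Result<Vec<u64>> {
        // A real implementation would run beam search through `self.provider`
        // and return one page of results; elided in this sketch.
        let _ = (offset, page_size);
        Ok(Vec::new())
    }
}
```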

This allows DiskANN to be plugged into different databases or systems and to inherit the availability and durability of the host database. The host database can choose to operate DiskANN at different memory tiers suited to target cost-performance points. Our new version has been integrated with five (and counting) backends. It can also be connected to memory buffers to compete with FAISS, hnswlib, or the older "monolithic" in-memory DiskANN.
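
As a hypothetical illustration of that pluggability, the same Index from the sketch above could be backed by a plain in-memory Provider standing in for a host database or memory buffer:

```rust
use std::collections::HashMap;
use std::io;

/// Illustrative in-memory backend implementing the hypothetical Provider trait.
#[derive(Default)]
struct MemProvider {
    nodes: HashMap<u64, (Vec<f32>, Vec<u64>)>,
}

impl Provider for MemProvider {
    fn read_node(&self, id: u64) -> io::Result<(Vec<f32>, Vec<u64>)> {
        self.nodes
            .get(&id)
            .cloned()
            .ok_or_else(|| io::Error::new(io::ErrorKind::NotFound, "no such node"))
    }
    fn write_node(&mut self, id: u64, vector: &[f32], neighbors: &[u64]) -> io::Result<()> {
        self.nodes.insert(id, (vector.to_vec(), neighbors.to_vec()));
        Ok(())
    }
    fn delete_node(&mut self, id: u64) -> io::Result<()> {
        self.nodes.remove(&id);
        Ok(())
    }
}

fn main() -> io::Result<()> {
    let mut index = Index::new(MemProvider::default());
    index.upsert(&[(1, vec![0.1, 0.2]), (2, vec![0.3, 0.4])])?; // a minibatch of two
    let first_page = index.search(&[0.1, 0.2], 0, 10)?; // first page of results
    println!("{first_page:?}");
    Ok(())
}
```

Swapping MemProvider for an adapter over a database's key-value layer is, in this sketch, the only change needed to move the index between storage tiers.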

Integrated with Azure Cosmos DB for NoSQL, Microsoft's highly available geo-distributed database, DiskANN brings vector indexing into operational databases and is competitive with specialized serverless vector databases [7]. See the slides from our VLDB 2025 talk on this integration [23].

For a 25-minute tour of the project, see the slides from our overview talk at VLDB 2025 [24].

References

  1. Fast Accurate Billion-point Nearest Neighbor Search on a Single Node
  2. FreshDiskANN: A Fast and Accurate Graph-Based ANN Index for Streaming Similarity Search
  3. FilteredDiskANN: Graph Algorithms for Approximate Nearest Neighbor Search with Filters
  4. In-Place Updates of a Graph Index for Streaming Approximate Nearest Neighbor Search
  5. ParlayANN: Scalable and Deterministic Parallel Graph-Based Approximate Nearest Neighbor Search Algorithms
  6. The DiskANN library: Graph-Based Indices for Fast, Fresh and Filtered Vector Search
  7. Cost-Effective, Low Latency Vector Search with Azure Cosmos DB
  8. DistributedANN: Efficient Scaling of a Single DiskANN Graph Across Thousands of Computers
  9. Results of the NeurIPS'21 Challenge on Billion-Scale Approximate Nearest Neighbor Search
  10. Results of the Big ANN: NeurIPS'23 competition
  11. https://big-ann-benchmarks.com
  12. https://github.com/microsoft/DiskANN
  13. Big ANN Benchmarks dataset list
  14. TimescaleDB's pgvectorscale
  15. DataStax (IBM) JVector
  16. OOD-DiskANN: Efficient and Scalable Graph ANNS for Out-of-Distribution Queries
  17. Graph-Based Algorithms for Diverse Similarity Search
  18. https://www.nvidia.com/en-us/on-demand/session/gtc25-s72905/
  19. AiSAQ: All-in-Storage ANNS with Product Quantization for DRAM-free Information Retrieval
  20. Intel: Winning the NeurIPS Billion-Scale Approximate Nearest Neighbor Search Challenge
  21. BANG: Billion-Scale Approximate Nearest Neighbor Search using a Single GPU
  22. NVIDIA cuVS and DiskANN
  23. Cosmos DB Vector Search VLDB 2025 slides
  24. DiskANN overview slides
  25. Sort Before You Prune: Improved Worst-Case Guarantees of the DiskANN Family of Graphs
  26. Learning Filter-Aware Distance Metrics for Nearest Neighbor Search with Multiple Filters