How NVIDIA GPU acceleration improved the efficiency of the Milvus vector database

Updated June 25, 2024

Unstructured data plays a critical role in cutting-edge artificial intelligence applications, from generative AI and similarity search to recommender systems and virtual drug discovery. To serve these applications, vector databases, which store and search unstructured data as high-dimensional vectors, must satisfy demanding requirements:

  • Real-time indexing: Vector databases must ingest and index newly arriving vector data continuously and quickly. Real-time indexing capability keeps the database up to date without creating bottlenecks or backlogs.
  • High throughput: Many applications using vector databases, such as recommendation engines, semantic search engines, and anomaly detection systems, require real-time or near real-time query processing. High throughput allows vector databases to handle a large volume of incoming queries simultaneously, providing low-latency responses to end users or services.

Vector databases are based on a core set of vector operations, such as similarity computation and matrix operations, which lend themselves well to parallelization and are computationally intensive. Because of their massively parallel architecture, consisting of thousands of cores capable of executing many threads simultaneously, GPUs are the ideal computational engine to accelerate these operations.
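
To make the parallelism concrete, consider the brute-force case: computing L2 distances between a batch of queries and every stored vector reduces to dense matrix algebra. The sketch below uses NumPy on the CPU purely for illustration; a GPU array library such as CuPy can run the same code across thousands of cores.

```python
import numpy as np

def batch_l2_topk(queries: np.ndarray, dataset: np.ndarray, k: int):
    # ||q - x||^2 = ||q||^2 - 2 q.x + ||x||^2, computed for all pairs at once.
    q_norms = (queries ** 2).sum(axis=1, keepdims=True)      # shape (nq, 1)
    x_norms = (dataset ** 2).sum(axis=1)                     # shape (n,)
    # The dominant cost is one (nq x d) @ (d x n) matrix multiplication,
    # exactly the kind of operation GPUs excel at.
    dists = q_norms - 2.0 * (queries @ dataset.T) + x_norms  # shape (nq, n)
    # Unordered indices of the k nearest vectors per query.
    idx = np.argpartition(dists, k, axis=1)[:, :k]
    return idx, np.take_along_axis(dists, idx, axis=1)

rng = np.random.default_rng(0)
data = rng.standard_normal((100_000, 768), dtype=np.float32)
queries = rng.standard_normal((32, 768), dtype=np.float32)
neighbors, distances = batch_l2_topk(queries, data, k=10)
print(neighbors.shape)  # (32, 10)
```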

Using GPU acceleration

To address these challenges, NVIDIA has developed CUDA-Accelerated Graph Index for Vector Retrieval (CAGRA), a GPU-accelerated framework that leverages the high-performance capabilities of the GPU to deliver exceptional performance when working with vector databases.
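
CAGRA is also usable outside Milvus through NVIDIA's RAFT/cuVS libraries. The sketch below shows what a standalone build-and-search round trip looks like; the module, class, and parameter names follow the cuVS Python bindings as we understand them and should be treated as assumptions to verify against your installed version.

```python
# Standalone CAGRA via NVIDIA's cuVS Python bindings (assumed API; requires
# an NVIDIA GPU and, e.g., `pip install cuvs-cu12 cupy-cuda12x`).
import cupy as cp
from cuvs.neighbors import cagra

dataset = cp.random.random((100_000, 768), dtype=cp.float32)  # vectors on the GPU
queries = cp.random.random((32, 768), dtype=cp.float32)

# Build the search graph on the GPU. intermediate_graph_degree sets the
# out-degree of the initial k-NN graph; graph_degree that of the pruned
# graph actually used for search.
index = cagra.build(
    cagra.IndexParams(intermediate_graph_degree=128, graph_degree=64),
    dataset,
)

# itopk_size bounds the internal candidate list kept during graph traversal.
distances, neighbors = cagra.search(
    cagra.SearchParams(itopk_size=64), index, queries, 10,
)
```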

Zilliz and NVIDIA unveiled Milvus 2.4 at NVIDIA GTC 2024. Milvus is an open source vector database system designed for large-scale vector similarity search and AI workloads. Originally created by Zilliz, an innovator in unstructured data management and vector database technology, Milvus debuted in 2019. To encourage broad community engagement and adoption, it has been hosted by the Linux Foundation since 2020.

Exploring the Milvus 2.4 architecture

Milvus is designed for cloud environments and adheres to a modular design philosophy. It divides the system into distinct components and layers for handling client requests, processing data, and managing the storage and retrieval of vector data. This design allows Milvus to update or upgrade the implementation of specific modules without changing their interfaces, which made it relatively straightforward to incorporate GPU acceleration support into Milvus.

The Milvus 2.4 architecture

The modular architecture includes components such as the coordinator, access layer, message queue, worker nodes, and storage layer. Worker nodes are further subdivided into data nodes, query nodes, and index nodes: index nodes are responsible for building indexes, while query nodes execute queries.

The index nodes incorporate CAGRA support into the index building algorithms, which enables efficient construction and management of high-dimensional vector indexes on GPU hardware. This acceleration significantly reduces the time and resources required to index large vector datasets.
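
From the client's perspective, this capability surfaces as a new GPU_CAGRA index type. Below is a minimal pymilvus sketch of creating such an index on an existing collection; it assumes a GPU-enabled Milvus 2.4 deployment, the collection and field names are hypothetical, and the parameter values are illustrative rather than tuned.

```python
# Creating a GPU_CAGRA index with pymilvus against a GPU-enabled Milvus 2.4
# deployment. Collection and field names are hypothetical placeholders.
from pymilvus import connections, Collection

connections.connect(host="localhost", port="19530")
collection = Collection("documents")  # hypothetical existing collection

collection.create_index(
    field_name="embedding",           # hypothetical vector field
    index_params={
        "index_type": "GPU_CAGRA",
        "metric_type": "L2",
        "params": {
            # Out-degree of the intermediate k-NN graph built first ...
            "intermediate_graph_degree": 64,
            # ... then pruned to this degree for the final search graph.
            "graph_degree": 32,
        },
    },
)
collection.load()  # load the indexed collection so it can serve queries
```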

Similarly, CAGRA is used in query nodes to speed up the execution of complex vector similarity searches. By utilizing GPU computing power, Milvus can perform high-dimensional distance calculations and similarity searches at unprecedented speeds, resulting in faster query response times and higher overall throughput.
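
On the query side, the same index is searched through the ordinary search API, with CAGRA-specific knobs passed in the search parameters. A minimal sketch, continuing the collection from the previous example; the itopk_size and search_width values are illustrative:

```python
# Searching the GPU_CAGRA index created above. itopk_size bounds the
# candidate list kept during graph traversal; search_width is the number
# of graph entry points explored per iteration (values are illustrative).
import numpy as np

query_vectors = np.random.rand(5, 768).astype("float32").tolist()

results = collection.search(
    data=query_vectors,
    anns_field="embedding",
    param={
        "metric_type": "L2",
        "params": {"itopk_size": 128, "search_width": 4},
    },
    limit=10,  # top-10 neighbors per query
)
for hits in results:
    for hit in hits:
        print(hit.id, hit.distance)
```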

Performance evaluation

For this evaluation, we used three publicly available instance types on AWS:

  • m6id.2xlarge: This instance type is powered by an Intel Xeon 8375C processor.
  • g4dn.2xlarge: This GPU-accelerated instance is equipped with an NVIDIA T4 GPU.
  • g5.2xlarge: This instance type is equipped with an NVIDIA A10G GPU.

Using these different instance types, we aimed to evaluate the performance and efficiency of Milvus with CAGRA integration on different hardware configurations. The m6id.2xlarge instance served as a baseline for CPU-based performance, while the g4dn.2xlarge and g5.2xlarge instances allowed us to evaluate the benefits of GPU acceleration using NVIDIA T4 and A10G GPUs, respectively.

Evaluation environments on AWS

We used two publicly available vector datasets from VectorDBBench:

  • OpenAI-500K-1536-dim: This dataset consists of 500,000 vectors, each with dimensionality 1,536. It is derived from the OpenAI language model.
  • Cohere-1M-768-dim: This dataset contains 1 million vectors, each with dimensionality 768. It is generated based on the Cohere language model.
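
To give a sense of the setup behind the numbers that follow, the sketch below creates a collection matching the Cohere-1M-768-dim shape and ingests vectors in batches, roughly what VectorDBBench automates. Random vectors stand in for the real embeddings, and all names are hypothetical.

```python
# Rough shape of the benchmark setup: a 768-dim collection filled in batches.
# Random vectors stand in for the real Cohere embeddings; VectorDBBench
# handles the actual dataset download, ingestion, and recall checks.
import numpy as np
from pymilvus import (
    connections, Collection, CollectionSchema, FieldSchema, DataType,
)

connections.connect(host="localhost", port="19530")

schema = CollectionSchema([
    FieldSchema("id", DataType.INT64, is_primary=True),
    FieldSchema("embedding", DataType.FLOAT_VECTOR, dim=768),
])
collection = Collection("cohere_1m", schema)  # hypothetical name

batch, total = 10_000, 1_000_000
for start in range(0, total, batch):
    ids = list(range(start, start + batch))
    embeddings = np.random.rand(batch, 768).astype("float32")
    collection.insert([ids, embeddings.tolist()])
collection.flush()  # seal segments so they become eligible for indexing
```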

Index construction time

We compared the index construction time between Milvus with the CAGRA GPU accelerator and the standard Milvus implementation using the HNSW index on CPUs.

Milvus CAGRA vs HNSW

For the Cohere-1M-768-dim dataset, the index construction time is:

  • CPU (HNSW): 454 seconds
  • T4 GPU (CAGRA): 66 seconds
  • A10G GPU (CAGRA): 42 seconds

For the OpenAI-500K-1536-dim dataset, the index construction time is:

  • CPU (HNSW): 359 seconds
  • T4 GPU (CAGRA): 45 seconds
  • A10G GPU (CAGRA): 22 seconds

The results clearly show that CAGRA, a GPU-accelerated framework, significantly outperforms CPU-based HNSW index construction, with the A10G GPU proving the fastest on both datasets (454 s / 42 s ≈ 10.8x for Cohere-1M-768-dim; 359 s / 22 s ≈ 16.3x for OpenAI-500K-1536-dim). The GPU acceleration provided by CAGRA reduces index construction time by roughly an order of magnitude compared to the CPU-based implementation, demonstrating the benefit of GPU parallelism for computationally intensive vector operations such as index construction.

Throughput

We also compared the query performance of Milvus with the CAGRA GPU accelerator against the standard Milvus implementation using the HNSW index on CPUs. The metric we evaluated is queries per second (QPS), which measures the throughput of query execution.

During the evaluation, we varied the batch size, that is, the number of queries processed simultaneously, from 1 to 100. This wide range of batch sizes allowed for a realistic and thorough evaluation of performance under different query load scenarios.
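
The sketch below shows the bare idea of that measurement with pymilvus: issue repeated searches at each batch size and report queries per second. VectorDBBench adds warm-up, concurrency, and recall verification on top of this.

```python
# Bare-bones throughput measurement: repeated searches at each batch size,
# reported as queries per second. Assumes the loaded, indexed collection
# from the earlier sketches; VectorDBBench does this far more rigorously.
import time
import numpy as np

def measure_qps(collection, dim=768, batch_sizes=(1, 10, 100), rounds=50):
    for batch in batch_sizes:
        queries = np.random.rand(batch, dim).astype("float32").tolist()
        start = time.perf_counter()
        for _ in range(rounds):
            collection.search(
                data=queries,
                anns_field="embedding",
                param={"metric_type": "L2", "params": {"itopk_size": 128}},
                limit=10,
            )
        elapsed = time.perf_counter() - start
        print(f"batch={batch:>3}  QPS={rounds * batch / elapsed:,.0f}")

measure_qps(collection)
```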

Throughput evaluation

The graphs show that:

  • At a batch size of 1, the T4 is 6.4 to 6.7 times faster than the CPU, and the A10G is 8.3 to 9 times faster.
  • When the batch size increases to 10, the improvement becomes more pronounced: the T4 is 16.8 to 18.7 times faster and the A10G is 25.8 to 29.9 times faster.
  • At a batch size of 100, the gains continue to grow, with the T4 faster by a factor of 21.9 to 23.3 and the A10G faster by a factor of 48.9 to 49.2.

The results demonstrate significant performance gains by utilizing GPU acceleration for vector database queries, especially for large batch sizes and high-dimensional data. Milvus with CAGRA unlocks the power of GPU parallel processing, providing significant performance gains and making it well suited for demanding vector database workloads.

Blazing new trails

The Milvus 2.4 collaboration between Zilliz and NVIDIA demonstrates the power of open innovation and community-led development, delivering GPU acceleration for vector databases.

The open source Milvus 2.4 is available now, and enterprises looking for a fully managed vector database service can look forward to GPU acceleration coming to Zilliz Cloud later this year. Zilliz Cloud enables seamless deployment and scaling of Milvus on major cloud providers such as AWS, Google Cloud Platform, and Azure, with no operational overhead.
