Enhancing HNSW Index for Real-Time Updates: Addressing Unreachable Points and Performance Degradation

1st Wentao Xiao SCSE
UESTC
Chengdu, China
WentaoXiao1234@gmail.com
   2nd Yueyang Zhan SCSE
UESTC
Chengdu, China
yueyang.zhan@std.uestc.edu.cn
   3rd Rui Xi* SCSE
UESTC
Chengdu, China
ruix.ryan@gmail.com
*Corresponding author
   4th Mengshu Hou SCSE
UESTC
Chengdu, China
mshou@uestc.edu.cn
   5th Jianming Liao SCSE
UESTC
Chengdu, China
liaojm@uestc.edu.cn
Abstract

The approximate nearest neighbor search (ANNS) is a fundamental and essential component in data mining and information retrieval, with graph-based methodologies demonstrating superior performance compared to alternative approaches. Extensive research efforts have been dedicated to improving search efficiency by developing various graph-based indices, such as HNSW (Hierarchical Navigable Small World). However, the performance of HNSW and most graph-based indices becomes unacceptable when faced with a large number of real-time deletions, insertions, and updates. Furthermore, during update operations, HNSW can leave some data points unreachable, a situation we refer to as the ‘unreachable points phenomenon’. This phenomenon could significantly affect the search accuracy of the graph in certain situations.

To address these issues, we present efficient measures to overcome the shortcomings of HNSW, specifically addressing poor performance over long periods of delete and update operations and resolving the issues caused by the unreachable points phenomenon. Our proposed MN-RU algorithm effectively improves update efficiency and suppresses the growth rate of unreachable points, ensuring better overall performance and maintaining the integrity of the graph. Our results demonstrate that our methods outperform existing approaches. Furthermore, since our methods are based on HNSW, they can be easily integrated with existing indices widely used in the industrial field, making them practical for future real-world applications. Code is available at https://github.com/xwt1/MN-RU.git

Index Terms:
Data Mining, Information Retrieval, Incremental Update, Freshness, Graph-Based Index

I Introduction

The approximate nearest neighbor search (ANNS) has emerged as a focal point in diverse application areas encompassing data mining, information retrieval, and recommendation systems ([1, 2]). Notably, deep learning models, such as large language models (LLMs), possess the capability to encode various data types, ranging from textual documents to visual and auditory inputs, into vector representations. Innovative systems like Retrieval-Augmented Generation (RAG) ([3]) leverage ANNS to efficiently retrieve relevant documents or information, integrating them with generative model outputs to augment the precision and relevance of the generated content. The basic ANNS can be defined as follows: Given a data set $P \subseteq \mathbb{R}^{d}$ with $n$ points in some metric space, a query data point $q \in \mathbb{R}^{d}$, and $k \in \mathbb{N}$, we seek to efficiently find a set $L$ that approximates the $k$ nearest neighbors of $q$ in $P$, while maximizing $\text{recall@}k = \frac{|G \cap L|}{|G|}$, where $|G| = |L| = k$. Here, $G$ is the ground truth set of $q$’s $k$ nearest neighbors in $P$. Effective indexing is crucial in this context, as it significantly enhances the efficiency and accuracy of the search process, enabling the rapid and reliable retrieval of nearest neighbors from large datasets.
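The recall@k measure defined above can be illustrated with a minimal Python sketch. The function name `recall_at_k` and the example labels are hypothetical and chosen only for illustration; the computation follows the definition with the ground truth set G and the returned set L.

```python
def recall_at_k(ground_truth, retrieved, k):
    """recall@k = |G ∩ L| / |G|, with |G| = |L| = k."""
    g = set(ground_truth[:k])  # G: exact k nearest neighbors of the query
    l = set(retrieved[:k])     # L: k labels returned by the approximate index
    return len(g & l) / len(g)


# Illustrative labels only: three of the four true neighbors are returned.
example_recall = recall_at_k([3, 7, 9, 12], [7, 3, 5, 12], k=4)
```

The order of the returned labels does not matter for this measure; only set membership is counted.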

Based on the theory and concepts of ANNS, a large number of indexing methods have been designed and can generally be classified into four types: tree structure-based index [4], hashing-based index [5], quantization-based index [6], and graph-based index ([7, 8, 9, 10]). Among these, graph-based methods demonstrate superior empirical search performance compared to others ([11]).

Nevertheless, despite the commendable search performance exhibited by certain early graph-based techniques, a notable limitation lies in their static nature ([7, 8]), rendering them incapable of accommodating the real-time modifications requisite in practical applications. For instance, consider the RAG system, which relies on kNN-based methodologies to transform documents into vector representations. In scenarios where users seek to alter the content within the system, such modifications entail the insertion of new data points and the removal of existing ones. Consequently, a static indexing mechanism is inadequate for real-time updates, and alternative approaches are needed. Section II reviews the strategies devised to mitigate the constraints associated with static graph-based indices.

HNSW (Hierarchical Navigable Small World) is a widely used and well-established graph-based index within the industry ([12]), lauded for its exceptional performance in both search precision and computational efficiency. However, according to the findings and experimental analysis in our preliminary study (see details in section III), it has come to light that the current iteration of the HNSW index manifests two notable shortcomings following a substantial sequence of delete, insert, and query operations. The first and primary concern arises from the utilization of the markDelete algorithm and the replaced_update insertion algorithm (hereafter referred to as replaced_update for brevity), leading to a scenario where specific data points may become inaccessible, a phenomenon we identify as the ‘unreachable points phenomenon.’

This phenomenon can pose problems in practical applications. Imagine two scenarios in typical systems: If certain data points representing online stores in an index become inaccessible after a series of modifications, such as insertions and deletions, it would be highly frustrating for the owners of those online stores. This is especially critical for store owners who rely on recommendation systems; if their stores can never be recommended due to the unreachable points phenomenon, despite being the most relevant to client requests, it would significantly impact their business. Similarly, in a RAG system, if a user seeks knowledge on a specific topic, they would include relevant keywords in their query. However, due to the unreachable points phenomenon, if the data points most relevant to those keywords become unsearchable, the accuracy of the retrieved results will be compromised, subsequently affecting the output of the generation model. Therefore, the problem of some points becoming unreachable after insertions and deletions is a real-world concern. The second issue is that we observed the efficiency of native HNSW operations involving mixed deletions and insertions is lower compared to query operations. This inefficiency leads to query delays in scenarios involving mixed deletions, insertions, and queries, indirectly reducing overall query efficiency.

To address the first issue, we propose practical solutions that mitigate the problem from the upper-level application. To address the second issue, we redesigned the deletion and update strategies for HNSW. Our new method improves the efficiency of mixed operations compared to the original approach and also reduces the number of unreachable points and their growth rate after repeated deletions and insertions.

Our contributions can be summarized as follows:

  • Revealing the unreachable point phenomenon in the HNSW index:

    • Through analysis and experimentation, we have identified that HNSW can create unreachable points in the graph after deletion and insertion operations, leading to adverse effects, as previously discussed. We proposed suitable practical solutions and mitigated the impact of this phenomenon in our newly designed deletion and update algorithm.

  • Improved Replaced_Update Strategy:

    • Building on existing methods, we proposed an improved replaced_update algorithm called MN-RU. This algorithm enhances the speed of deletion and insertion operations while also alleviating the unreachable points phenomenon to a significant extent.

  • Comprehensive Performance Evaluations:

    • We implemented a comprehensive strategy and integrated it into a single platform for comparative evaluations. The results validate the superiority of our approach over the native HNSW strategy, offering valuable insights for future practical applications.

II Related Work

II-A Approximate Nearest Neighbor Search Algorithms

Approximate nearest neighbor search (ANNS) has been a central focus of scholarly research for the past two decades, and many studies have developed effective methodologies to address this complex problem.

In space partitioning methods, such as tree structures, data space is initially recursively divided into multiple regions to facilitate the construction of tree or forest-based indices. These methods are commonly categorized into hierarchical structures and reference point-based structures. While effective for low-dimensional datasets, their performance significantly deteriorates with increasing data dimensionality. Beyond 10 dimensions, tree-based space partitioning methods exhibit reduced efficiency, often slower than brute-force linear-scan methods. Strategies like the pyramid technique [13] and iDistance [14] have been proposed to combat the ‘curse of dimensionality.’

Initially developed for addressing ANNS in the Hamming space [15], hash-based methods were later extended to the Euclidean space [16]. These methods project high-dimensional data points into lower-dimensional spaces using hash functions in order to identify nearest neighbors efficiently. Despite their efficacy, hash-based methods often require numerous hash tables for satisfactory search outcomes, leading to the emergence of variants like VHP [17], LCCS-LSH [18], PMLSH [19], and R2LSH [20].

Product quantization (PQ) is a widely used quantization-based method for accelerating search processing by compressing input vectors into compact codes for memory-efficient dataset processing. It also provides effective techniques for estimating distances between raw vectors and compressed codes, improving distance estimation accuracy. Diverse quantization-based approaches [21, 22] have been devised to mitigate quantization errors and to improve query precision, focusing on refining quantization techniques.

Graph-based methods [23, 24, 25, 26] have emerged as potent solutions for high-dimensional ANNS, constructing efficient indexing structures. However, constructing exact KNN graphs becomes prohibitively expensive as the number of nodes grows. Approximate KNN graph construction has been explored as an alternative, offering efficacy similar to exact graphs in specific applications. Despite this, memory constraints in graph construction necessitate clustering algorithms for simplified subgraph creation and more efficient query processing. Integration of machine learning and deep learning into graph-based nearest neighbor search methods has advanced search capabilities.

Dynamic update methods have also advanced considerably in recent years. Noteworthy examples include FreshDiskANN ([27]), which introduces a StreamingMerge protocol to handle node deletions from the DeleteList and the subsequent insertion of $N$ new points. Additionally, SPANN [28] stands out as the pioneering on-disk vector index, leveraging balanced clustering to achieve minimal tail search latency and deliver exceptional performance. Furthermore, SPFresh ([29]) offers heightened throughput and reduced latency for both search and update operations, ensuring the efficient retrieval of new highly reliable vectors.

II-B HNSW

Previous studies have investigated various strategies aimed at maintaining the freshness of graph-based or hybrid indexes. Notably, HNSW (https://github.com/nmslib/hnswlib.git) serves as a graph-based index in memory and has developed a robust framework for executing delete and update operations. In the case of a delete operation concerning a label $x_d$, HNSW employs the markDelete algorithm to flag it and transfer it to a designated deleted set. On the other hand, for an insertion operation, the choice can be made to either directly insert a new point into the index if the index size remains within the predefined limit, or to replace an old deleted label $x_d$ with the new entry. The whole insertion procedure, referred to as replaced_update in this study, can be represented as follows.

  1. Check for Deleted Points: Initially, HNSW checks for any points flagged as deleted. If a deleted point is identified, it is designated for replacement.

  2. Collect Neighbors: The algorithm gathers the one-hop and two-hop neighbors of the deleted point. For each one-hop neighbor $v_j$, the one-hop and two-hop neighbors and the newly inserted point are considered candidate new neighbors for $v_j$.

  3. Prune Candidate Neighbors: Employing a pruning strategy, an optimal neighbor set $N(v_j)$ is chosen for each one-hop neighbor $v_j$ from the pool of candidates.

  4. Update Connections: Directed edges are established from each one-hop neighbor $v_j$ to the points in $N(v_j)$, ensuring graph connectivity.

  5. Insert the New Point: The new point is integrated into the index, with connections established based on the rectified neighbor sets.

The above-described processes are implemented within the HNSW source code, particularly encompassing the functions addPoint, updatePoint, and repairConnectionsForUpdate. The significance of the replaced_update method lies in its ability to manage the index efficiently. Without it, inserting new data points would leave deleted entries in the index, wasting storage and causing unnecessary expansion. By using replaced_update, deleted points’ storage is reused for new insertions, preventing the index from growing excessively and maintaining efficient space utilization.
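The five steps above can be sketched on a toy single-layer graph. This is only an illustration under stated assumptions, not the hnswlib implementation: the graph is a plain dictionary of directed out-edges, the pruning strategy is replaced by "keep the M closest candidates", the new point is connected by brute force instead of a graph search, and the function name `replaced_update_sketch` is hypothetical.

```python
import math


def replaced_update_sketch(graph, vectors, deleted, new_vector, M):
    """Reuse the slot of `deleted` for `new_vector` on a single-layer toy graph.

    graph:   dict mapping a node to its list of out-neighbors (directed edges)
    vectors: dict mapping a node to its coordinate tuple
    """
    # Step 2: collect one-hop and two-hop neighbors of the deleted point.
    one_hop = list(graph[deleted])
    two_hop = {w for v in one_hop for w in graph[v]}

    # The slot now stores the new point, so `deleted` doubles as its id.
    vectors[deleted] = new_vector
    candidates = set(one_hop) | two_hop | {deleted}

    # Steps 3-4: re-select up to M out-neighbors for every one-hop neighbor.
    for v in one_hop:
        pool = sorted(candidates - {v},
                      key=lambda c: math.dist(vectors[v], vectors[c]))
        graph[v] = pool[:M]  # assumption: "M closest" stands in for the pruning strategy

    # Step 5: connect the new point (assumption: brute force, not a graph search).
    others = sorted((n for n in graph if n != deleted),
                    key=lambda n: math.dist(new_vector, vectors[n]))
    graph[deleted] = others[:M]
    return graph
```

Note that every one-hop neighbor has its whole out-edge list rewritten from the candidate pool, so edges it held before the update can disappear.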

Our approach builds on the foundational framework of HNSW, integrating enhancements that improve the efficiency of dynamic updates and mitigate the issue of unreachable points. By optimizing deletions and insertions, we aim to reduce latency and improve graph connectivity.

III Issues in Update Efficiency and Unreachable Points

Figure 1: Comparison of query efficiency and replaced_update efficiency at a given recall level on three public datasets.

Although HNSW integrates the replaced_update method to substitute deleted points with new entries, thus facilitating real-time point deletion and space reclamation, its operational efficiency remains subpar. As illustrated in Figure 1, at a recall rate of approximately 90%, the efficiency of replaced_update lags notably behind the search efficiency of queries on three public datasets. In datasets like GIST and ImageNet, insertion using the replaced_update method is slower than query processing by a factor of 5 to 10. By analyzing the replaced_update process, we also found that this method may give rise to the ‘unreachable points phenomenon’ after repeated deletions and insertions; that is, some specific points become inaccessible. Furthermore, this operational inefficiency compounds the issue of the unreachable points phenomenon. In the following, we examine the underlying reasons for the unreachable points phenomenon and conduct an empirical experiment to validate our hypothesis.

III-A Causes of Unreachable Points

Definition 1

An unreachable point is defined as a point that possesses outgoing edges but lacks incoming edges in all layers of the navigable small-world graph.

In Figure 2, node $v$ has only one incoming edge from node $d$. After marking node $d$ for deletion and applying the HNSW replaced_update strategy to remove $d$, node $v$ lacks any incoming edges. In other words, unless the node $v$ serves as the entry point in the HNSW structure, it will remain unvisited in subsequent search operations.

Figure 2: Example Of Unreachable Points Phenomenon

It is evident that a point with an in-degree of zero cannot be found during the search process. During the construction of the HNSW index, the likelihood of forming unreachable points is exceedingly small. However, if the replaced_update algorithm used by HNSW is applied to replace old points with newly inserted points, unreachable points emerge.

In HNSW, the occurrence of some points having an in-degree of zero after the replaced_update operation is attributable to the specific repair connection strategy employed by the algorithm. During a replaced_update in HNSW, the algorithm first gathers the neighbors of the deleted point, referred to as $T$. It then combines $T$ with their own neighbors and the newly inserted point to form a new candidate neighbor set for each member of $T$. Finally, new neighbors are selected from the candidate set for each member of $T$. The edges from members of $T$ to their original neighbors might be deleted during these operations. As a result, these original neighbors may end up with no in-edges after the replaced_update operation. Consequently, these points will not be found during future search processes unless they are the entry points of the HNSW.
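As an illustration of Definition 1, the following minimal sketch lists the nodes that have outgoing edges but no incoming edges. It assumes a single layer stored as a dictionary of out-edges, whereas the definition ranges over all layers; the function name `unreachable_points`, the node labels, and the exclusion of the entry point are assumptions made for this example.

```python
def unreachable_points(graph, entry_point):
    """Nodes with outgoing edges but no incoming edges, excluding the entry point."""
    has_in_edge = {w for nbrs in graph.values() for w in nbrs}
    return sorted(node for node, nbrs in graph.items()
                  if nbrs and node not in has_in_edge and node != entry_point)


# Toy version of the situation in Figure 2: d holds the only edge into v.
before = {"e": ["d"], "d": ["v", "e"], "v": ["e"]}
# After the slot of d is reused, its new out-edges no longer include v.
after = {"e": ["d"], "d": ["e"], "v": ["e"]}

unreachable_before = unreachable_points(before, entry_point="e")
unreachable_after = unreachable_points(after, entry_point="e")
```

In this toy graph no point is unreachable before the update, while `v` loses its only in-edge afterwards.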

Figure 3: The demonstration of the unreachable points phenomenon on the Sift dataset: (a) unreachable points number over iterations; (b) RECALL over iterations. In each iteration, 5% of the data points are randomly deleted from the Sift dataset, while ensuring that none of the deleted points were previously unreachable. These points are then reinserted.

III-B Demonstration on Specific Datasets

Figure 3 illustrates the presence of unreachable points in HNSW following replaced_update operations. The experiment uses the Sift dataset, where 5% of the data is randomly deleted in each iteration, ensuring that these points do not already belong to the set of unreachable points and are subsequently reinserted. The experiment consists of 3000 iterations to track the evolution of unreachable points after each cycle of deletion and insertion.

Figure 3 (a) illustrates that the number of unreachable points in HNSW consistently increases with continuous deletion and reinsertion. After approximately 3000 iterations, the number of unreachable points reaches between 3% and 4% of the original dataset. This number rises with additional iterations, indicating that points already in the unreachable points set will not be searched in future search processes.

Figure 3 (b) shows that recall decreases as the number of iterations increases. With the same configuration, recall declines by about 3% due to the increasing number of unreachable points over iterations. This decline becomes more severe with more iterations, leading to a gradual reduction in the accuracy of the final search. Furthermore, this decrease in accuracy caused by the unreachable points phenomenon cannot be mitigated by adjusting HNSW search parameters, such as increasing the ef_ parameter.

Figure 4: The architecture of our work, both in upper-level application and MN-RU

IV Methodology

As shown in Figure 1, the HNSW replaced_update algorithm exhibits poor update performance. It also gives rise to unreachable points, which adversely affect search performance, as illustrated in Figure 3. To address these issues, we propose the back_up_index_construction and dual_search methods to mitigate the growth of unreachable points from the upper-level application. More importantly, we introduce the MN-RU algorithm, designed to improve update efficiency and significantly reduce the incidence of unreachable points within the index. The architecture of our approach is shown in Figure 4, which illustrates the integration of the HNSW index in the upper-level application and the operation of MN-RU within the HNSW index.

IV-A Dual Indexes Design

In an HNSW index, points becoming unreachable after replaced_update operations can be attributed to the graph’s inherent connectivity and the algorithm’s maintenance processes. Although reconstruction and reconnection can effectively restore the connectivity of the HNSW graph, they also present several drawbacks, such as significant computational resources and time, service interruption, and system complexity. In practical applications involving high-frequency update scenarios, it is imperative to weigh these disadvantages carefully. This article considers strategies to address the point-disconnected problem by maintaining graph connectivity to balance performance and availability.

Diverging from conventional methodologies reliant on a single HNSW index, we introduce an additional index designed explicitly for managing unreachable points within the HNSW indexing system, referred to as the HNSW Backup Index, as illustrated in Figure 4. The motivation for this design is to minimize the computational burden and service disruptions associated with recurrent reindexing. The procedure for creating the Backup Index involves several steps: First, the HNSW K-NN search function is applied to identify the point set $F$ that is closest to the query point $q$ from the current data set $P$. We set $K$ to $|P|$ to retrieve all the points available in the index. Next, by removing $F$ from the set $P$, the remaining set of points $U = P \setminus F$, which contains the points not found, is identified and extracted. Subsequently, a new HNSW backup index is constructed over the unfound point set $U$. This process leverages the multi-level proximity search feature of HNSW to efficiently manage unreachable points, ensuring that the structure and query performance of the original index remain unaffected while reducing the computational overhead and service interruptions typically associated with reconstruction.
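The selection of the points that go into the backup index can be sketched as follows. This is a simplified illustration, not the authors' implementation: a breadth-first traversal from the entry point of a single-layer toy graph is used as a stand-in for the HNSW K-NN search with K set to |P|, and the helper names `found_labels` and `unfound_points` are hypothetical.

```python
from collections import deque


def found_labels(graph, entry_point):
    """F: labels a traversal from the entry point can visit (stand-in for K-NN with K = |P|)."""
    seen, queue = {entry_point}, deque([entry_point])
    while queue:
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return seen


def unfound_points(graph, entry_point):
    """U = P minus F: labels stored in the index that the search never returns."""
    P = set(graph)
    F = found_labels(graph, entry_point)
    return P - F
```

A second index would then be built over the vectors of the labels in `unfound_points(...)`; that construction step is omitted here.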

Meanwhile, we introduce a threshold $\tau$ to regulate replaced_update operations. When the number of replaced_update operations exceeds $\tau$, it triggers the reconstruction of the HNSW backup index. The upper application layer can adjust the value of $\tau$ according to specific needs. In our implementation, the value of $\tau$ was empirically configured to 40000.
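The threshold rule can be expressed as a small counter. The class name `BackupRebuildTrigger` is hypothetical, and resetting the counter after each rebuild is an assumption of this sketch rather than something stated above.

```python
class BackupRebuildTrigger:
    """Counts replaced_update operations and signals when the backup index is due."""

    def __init__(self, tau=40000):
        self.tau = tau
        self.updates_since_rebuild = 0

    def record_update(self):
        """Register one replaced_update; return True once the count exceeds tau."""
        self.updates_since_rebuild += 1
        if self.updates_since_rebuild > self.tau:
            self.updates_since_rebuild = 0  # assumption: counting restarts after a rebuild
            return True
        return False
```

The upper application layer would call `record_update()` after every replaced_update and rebuild the backup index whenever it returns `True`.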

Based on this novel index structure, it is imperative to propose new query and maintenance strategies to maintain the accuracy of the results, optimize query performance, and prevent any potential degradation in system efficiency. Next, we will describe these strategies comprehensively.

IV-B Dual Index Search

In order to ensure the query efficiency and the accuracy of the results, we propose a query algorithm called dualSearch, which queries a primary index (primary HNSW index) and a backup index (HNSW index dedicated to managing unreachable points) simultaneously to ensure that the query can cover all possible data points, even if some points no longer appear in the primary index after the replaced_update operations.

Algorithm 1 outlines the search process for dualSearch. Initially, it performs a K-NN search on the primary index ($HNSW_{main}$) using the query point $q$ to retrieve $k$ nearest neighbors in the set of results $R_{main}$. Concurrently, a similar K-NN search is executed on the backup index ($HNSW_{backup}$), dedicated to managing unreachable points, yielding the results set $R_{backup}$. Subsequently, the results of both searches are merged into a unified set $R_{combined}$, which is then sorted according to the distance from point to query to prioritize closer points. Ultimately, the algorithm delivers the top $k$ points from this ordered set, ensuring a precise and effective query result that takes into account both the accessible and the unreachable points.

Algorithm 1 dualSearch($q$, $k$, $ef$)
Input: Query point: $q$, number of neighbors: $k$, size of dynamic candidate list: $ef$
Output: Top $k$ points from combined search results
Variables: $R_{main}$: Result set from main index, $R_{backup}$: Result set from backup index, $R_{combined}$: Combined result set, $R_{sorted}$: Sorted result set
$R_{main} \leftarrow$ K-NN-SEARCH($HNSW_{main}, q, k, ef$) // K-NN-SEARCH is Algorithm 5 from HNSW
$R_{backup} \leftarrow$ K-NN-SEARCH($HNSW_{backup}, q, k, ef$) // K-NN-SEARCH is Algorithm 5 from HNSW
$R_{combined} \leftarrow R_{main} \cup R_{backup}$ // Combine results
$R_{sorted} \leftarrow$ sort $R_{combined}$ by distance to $q$
return Top $k$ points from $R_{sorted}$

This approach improves search accuracy and maintains robust query performance, even with unreachable points. By using a dedicated backup index, the system avoids the pitfalls of traditional single-index structures.
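Algorithm 1 can be sketched in Python as follows. The two indices are passed in as search callables that return `(label, distance)` pairs; this interface and the brute-force helper `brute_force_index` are assumptions made so that the example is self-contained, and any real HNSW binding would take their place.

```python
import heapq
import math


def brute_force_index(vectors):
    """Hypothetical stand-in for an HNSW index: exact search over a small dict."""
    def search(q, k, ef):
        # `ef` is accepted for interface parity and ignored by this exact search.
        scored = [(label, math.dist(q, v)) for label, v in vectors.items()]
        return sorted(scored, key=lambda item: item[1])[:k]
    return search


def dual_search(main_search, backup_search, q, k, ef):
    """Query both indices, merge the results, and return the k closest pairs."""
    r_main = main_search(q, k, ef)
    r_backup = backup_search(q, k, ef)
    combined = {}
    for label, d in list(r_main) + list(r_backup):
        # Union by label, keeping the smaller distance if a label appears twice.
        combined[label] = min(d, combined.get(label, d))
    return heapq.nsmallest(k, combined.items(), key=lambda item: item[1])
```

Because each index returns at most `k` results, the merge step handles at most `2k` entries, so its cost is small compared with the two searches.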

IV-C Index Maintenance

When confronted with extensive datasets that undergo frequent updates through insertions and deletions, ensuring the precision of query outcomes requires an efficient strategy to maintain the index that encompasses update, delete, and other operations. This challenging task is pivotal for maintaining the integrity and reliability of the query results.

As mentioned earlier, the original HNSW replaced_update algorithm needs to consider the neighbors of the deleted point (namely, one-hop neighbors) and the neighbors of these neighbors (namely, two-hop neighbors) as potential new neighbors of the one-hop neighbors during the update process. For each one-hop neighbor, the algorithm has to reselect its neighbors from a candidate set of size $M^{2}$, including the one-hop neighbors and the two-hop neighbors. Here, $M$ is a parameter in HNSW that defines the maximum out-degree of a data point within a specific layer. The time complexity of this operation is $O(M^{3})$ per layer, which is significantly time-consuming.
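As a rough worked count behind the cubic cost, assume one distance evaluation per pair of one-hop neighbor and candidate, and take $M = 16$ purely as an illustrative value:

```latex
\[
\underbrace{M}_{\text{one-hop neighbors}} \times
\underbrace{M^{2}}_{\text{candidates per neighbor}} = M^{3},
\qquad \text{e.g. } M = 16 \;\Rightarrow\; 16 \times 256 = 4096 .
\]
```

This count applies to a single layer and ignores the additional work done inside the pruning strategy itself.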

Figure 5: Example of repairing neighbors of points u and v after deleting point d.

When selecting new neighbors for the one-hop neighbors, considering all two-hop neighbors as candidates is inefficient, for the following reason. Before the update, based on HNSW’s edge selection strategy, if a one-hop neighbor does not have an edge to a particular two-hop neighbor, it means HNSW deemed this edge suboptimal. Since the number of two-hop neighbors is up to $M^{2}$ and each one-hop neighbor originally has at most $M$ neighbors, many two-hop neighbors are not part of its original neighbor set. Therefore, an edge from a one-hop neighbor to such a two-hop neighbor was deemed suboptimal by HNSW before the update and is likely to remain so afterwards, as the index structure does not undergo significant changes.

To reduce this cost, we propose a novel algorithm (Algorithm 2) which focuses on restoring the connectivity of the one-hop neighbors of the deleted point. In contrast to the original method, Algorithm 2 exclusively updates the neighbor lists of those one-hop neighbors that include the deleted point in their original neighbor list. For every such point, the candidate new neighbors comprise its original neighbors along with the neighbors of the deleted point. For the reasons described above, two-hop neighbors are not necessarily included in the candidate set, ensuring a more efficient and focused neighbor update process. This operation exhibits a time complexity of $O(M^{2})$ per layer, since for each one-hop neighbor the algorithm has to reselect its neighbors from a candidate set of size $O(M)$. Therefore, our algorithm is more efficient than the original approach. Figure 5 serves as an illustrative example of Algorithm 2. The complete process of Index Maintenance is detailed below:
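The candidate selection described in this paragraph can be sketched on a single-layer toy graph. The sketch is not the full Algorithm 2: the pruning function is replaced by "keep the M closest candidates", the deleted point itself is dropped from the pool (an assumption of this example), and the function name `mn_repair_sketch` is hypothetical.

```python
import math


def mn_repair_sketch(graph, vectors, deleted, M):
    """Repair only the one-hop neighbors that hold an edge to `deleted`."""
    n_deleted = list(graph[deleted])
    repaired = []
    for v in n_deleted:
        if deleted not in graph[v]:
            continue  # v has no edge to the deleted point; its list stays untouched
        # Candidates: v's original neighbors plus the neighbors of the deleted point.
        pool = (set(graph[v]) | set(n_deleted)) - {v, deleted}
        ranked = sorted(pool, key=lambda c: math.dist(vectors[v], vectors[c]))
        graph[v] = ranked[:M]  # assumption: "M closest" replaces the pruning function
        repaired.append(v)
    return repaired
```

Each candidate pool holds at most 2M points, compared with up to M² in the original strategy, and one-hop neighbors without an edge to the deleted point keep their existing out-edges.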

Deletion. As illustrated in Figure 4, when a point in the index is deleted, our method appends it to a list, deletedList, as in the original HNSW. The deletedList lets the index manage deleted points efficiently without removing them immediately, which preserves the graph’s structural integrity during updates.
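The deferred-removal idea can be sketched as follows. This is a toy illustration, not the hnswlib or MN-RU implementation: the class `MarkDeleteIndex` and its method names are hypothetical, and only the bookkeeping of deletedList is modeled (no graph, no search).

```python
from collections import deque


class MarkDeleteIndex:
    """Toy stand-in for an index that defers physical removal of points."""

    def __init__(self):
        self.vectors = {}              # label -> vector
        self.deleted_list = deque()    # labels marked deleted, oldest first
        self._deleted = set()          # same labels, for O(1) membership tests

    def add(self, label, vector):
        self.vectors[label] = vector

    def mark_deleted(self, label):
        """Mark a point as deleted; its vector (and edges) stay in place."""
        if label not in self.vectors:
            raise KeyError(label)
        if label not in self._deleted:
            self._deleted.add(label)
            self.deleted_list.append(label)

    def is_deleted(self, label):
        return label in self._deleted

    def get_deleted_point(self):
        """Pop a label awaiting replacement, or return None if there is none."""
        if not self.deleted_list:
            return None
        label = self.deleted_list.popleft()
        self._deleted.discard(label)
        return label


index = MarkDeleteIndex()
index.add(7, (0.0, 1.0))
index.add(8, (1.0, 0.0))
index.mark_deleted(7)
slot = index.get_deleted_point()   # the slot a later insertion may reuse
```

A search routine built on such an index would filter out labels for which `is_deleted` is true, while still traversing their edges.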

Algorithm 2 MNRepairNeighborConnection($data$, $label$, $\alpha$, $hnsw$)
Input: new point data $data$, label of the new point $label$, multilayer graph $hnsw$, parameter for the pruning function $\alpha$
Variables:
    $deletedPoint$: point marked as deleted
    $L_{max}$: maximum layer of $deletedPoint$ in $hnsw$
    $N_1$: neighbor set 1
    $N_2$: neighbor set 2
    $C$: combined candidate set
$deletedPoint \leftarrow$ getDeletedPoint($hnsw$) // Get a point marked as deleted
if $deletedPoint$ is NULL then
    insert($data$, $label$, $hnsw$) // Perform normal insertion
    return
end if
$L_{max} \leftarrow$ getMaxLayer($deletedPoint$, $hnsw$) // Get the maximum layer of the deleted point in $hnsw$
for $layer \leftarrow 0$ to $L_{max}$ do
    $N_1 \leftarrow$ getNeighborhood($deletedPoint$, $layer$) // Get the neighbors of $deletedPoint$
    $N_2 \leftarrow \emptyset$ // Initialize empty set for $N_2$
    for each point $v \in N_1$ do
        if edge($v$, $deletedPoint$) exists in $layer$ then
            $N_2 \leftarrow N_2 \cup \{v\}$ // Add $v$ to $N_2$
        end if
    end for
    for each point $u \in N_2$ do
        $C \leftarrow$ getNeighborhood($u$, $layer$) // Get neighbors of $u$
        $C \leftarrow C \cup N_1 \cup \{label\}$ // Combine candidate sets and add the newly inserted point to the candidates
        $C \leftarrow$ pruneCandidates($C$, $layer$, $\alpha$) // Prune the candidates using $\alpha$-RNG
        setNeighbors($u$, $C$, $layer$) // Set $u$’s neighbors to $C$
    end for
end for
update($deletedPoint$, $data$, $label$) // Update the data of the deleted point
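The repair loop of Algorithm 2 can be sketched on a single layer as follows. This is a minimal illustration under stated assumptions, not the authors’ C++ code: the graph is a plain dictionary of out-edge lists, distances are Euclidean, the names `prune_candidates` and `mn_repair_layer` are hypothetical, and the point $u$ itself and the deleted point are dropped from the candidate set (the pseudocode leaves this implicit). The final `update` call (Algorithm 3) is omitted.

```python
import math


def prune_candidates(u, candidates, vectors, m, alpha):
    """alpha-RNG style pruning: scan candidates nearest-first and keep c unless
    some already kept neighbor r satisfies alpha * d(r, c) <= d(u, c)."""
    ordered = sorted(candidates, key=lambda c: (math.dist(vectors[u], vectors[c]), c))
    kept = []
    for c in ordered:
        if len(kept) == m:
            break
        d_uc = math.dist(vectors[u], vectors[c])
        if all(alpha * math.dist(vectors[r], vectors[c]) > d_uc for r in kept):
            kept.append(c)
    return kept


def mn_repair_layer(graph, vectors, deleted, new_label, m, alpha):
    """Repair one layer: reselect neighbors only for mutual neighbors of `deleted`."""
    n1 = list(graph[deleted])                      # N1: neighbors of the deleted point
    n2 = [v for v in n1 if deleted in graph[v]]    # N2: those that link back to it
    for u in n2:
        # C = neighbors(u) ∪ N1 ∪ {new point}; assumption: exclude u and `deleted`
        candidates = (set(graph[u]) | set(n1) | {new_label}) - {u, deleted}
        graph[u] = prune_candidates(u, candidates, vectors, m, alpha)
    return n2


# Toy layer: point 0 is deleted and will be replaced by the new point 5.
vectors = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (0.0, 1.0),
           3: (-1.0, 0.0), 4: (2.0, 0.0), 5: (0.5, 0.5)}
graph = {0: [1, 2, 3], 1: [0, 4], 2: [0], 3: [4], 4: [1]}
repaired = mn_repair_layer(graph, vectors, deleted=0, new_label=5, m=2, alpha=1.0)
print(repaired, graph)
```

In this toy layer only points 1 and 2 link back to the deleted point, so only their neighbor lists are rebuilt; point 3, which the deleted point links to but which does not link back, is left untouched.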

Insertion. Algorithms 2 and 3, collectively called Mutual Neighbor Replaced Update (MN-RU), update the HNSW index when a new point replaces a deleted one. First, Algorithm 2 repairs the graph’s connectivity; then Algorithm 3 inserts the new point into the index. Unlike the conventional approach to adding new points, Algorithm 3 lets the new point inherit the layer level of the deleted point, after which the new point is inserted using the standard HNSW insert process. Algorithm 3 starts by identifying the top layer of the HNSW index and the maximum layer of the deleted point, and takes the index’s entry point as the initial entry point for the search. From the top layer down to the layer just above the deleted point’s maximum layer, it searches each layer for the point nearest to the new data point and uses it as the entry point for the next layer. Then, from the maximum layer of the deleted point down to the lowest layer, it searches for candidates, selects neighbors for the new point, and establishes connections. In this way, MN-RU integrates the new element into the existing HNSW structure.

V Experiments

In this section, we evaluate our algorithm and highlight key experimental observations. Our algorithm and the baseline methods were implemented in C++ and run on a system with an Intel Xeon CPU E5-2678 v3 @ 2.50GHz and 110GB of memory, running Ubuntu 22.04. All operations in this section, including data insertion, deletion, and retrieval, used 40 threads.

Algorithm 3 update($deletedPoint$, $data$, $label$)
Input: point marked as deleted $deletedPoint$, data of the new element $data$, label of the new element $label$
Variables:
    $L_{max}$: maximum layer of the deleted point
    $L$: maximum layer of the current HNSW
    $ep$: entry point for search
    $W$: list of currently found nearest elements
    $neighbors$: neighbor points of the current point
$L_{max} \leftarrow$ getMaxLayer($deletedPoint$) // Get maximum layer of the deleted point
$L \leftarrow$ getMaxLayer($hnsw$) // Get maximum layer of HNSW
$ep \leftarrow$ getEnterPoint($L$) // Get entry point for the highest layer of HNSW
for $lc \leftarrow L$ down to $L_{max}+1$ do
    $W \leftarrow$ SEARCH-LAYER($data$, $ep$, 1, $lc$) // Search layer to find nearest elements
    $ep \leftarrow$ getNearestElement($W$, $data$) // Update entry point for next layer
end for
for $layer \leftarrow L_{max}$ down to $0$ do
    $W \leftarrow$ SEARCH-LAYER($data$, $ep$, $efConstruction$, $layer$) // Search layer to find nearest elements
    $neighbors \leftarrow$ SELECT-NEIGHBORS($data$, $W$, $M$, $layer$) // Select neighbors for the current layer
    addBidirectionalConnections($neighbors$, $data$, $layer$) // Add bidirectional connections using the same strategy as HNSW
    $ep \leftarrow W$ // Update entry point for next layer
end for
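The control flow of Algorithm 3 can be sketched in a few dozen lines. The sketch below is illustrative only and makes several simplifying assumptions that are not taken from the paper: layers are dictionaries of out-edge lists, the new point reuses the deleted point’s identifier (`slot`), SELECT-NEIGHBORS is replaced by a plain “closest $M$” rule instead of HNSW’s heuristic, and replacing the global entry point is not handled.

```python
import heapq
import math


def search_layer(query, entry_points, ef, graph, vectors, skip=None):
    """Best-first search in one layer; returns up to ef labels, nearest first."""
    visited = set(entry_points) | {skip}       # never visit the slot being replaced
    candidates = [(math.dist(query, vectors[e]), e) for e in entry_points]
    heapq.heapify(candidates)
    best = sorted(candidates)[:ef]
    while candidates:
        d, c = heapq.heappop(candidates)
        if len(best) >= ef and d > best[-1][0]:
            break
        for n in graph.get(c, []):
            if n in visited:
                continue
            visited.add(n)
            dn = math.dist(query, vectors[n])
            if len(best) < ef or dn < best[-1][0]:
                heapq.heappush(candidates, (dn, n))
                best.append((dn, n))
                best.sort()
                del best[ef:]
    return [label for _, label in best]


def update(layers, vectors, entry_point, slot, new_vector, m, ef_construction):
    """Insert new_vector into the slot of a deleted point, inheriting its level."""
    if slot == entry_point:
        raise ValueError("replacing the entry point is not handled in this sketch")
    l_max = max(i for i, g in enumerate(layers) if slot in g)   # inherited level
    top = len(layers) - 1
    vectors[slot] = new_vector
    for layer in range(l_max + 1):
        layers[layer][slot] = []               # drop the deleted point's out-edges
    ep = [entry_point]
    for lc in range(top, l_max, -1):           # layers L down to l_max + 1
        ep = search_layer(new_vector, ep, 1, layers[lc], vectors)[:1]
    for layer in range(l_max, -1, -1):         # layers l_max down to 0
        w = search_layer(new_vector, ep, ef_construction, layers[layer], vectors, skip=slot)
        neighbors = w[:m]                      # assumption: simple closest-M selection
        layers[layer][slot] = neighbors
        for n in neighbors:                    # add the reverse edges
            links = layers[layer][n]
            if slot not in links:
                links.append(slot)
            links.sort(key=lambda x: math.dist(vectors[n], vectors[x]))
            del links[m:]                      # shrink back to at most M edges
        ep = w
    return l_max


# Toy index on a line: layer 0 holds points 0..4, layer 1 holds points 0 and 4.
layers = [
    {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]},
    {0: [4], 4: [0]},
]
vectors = {i: (float(i),) for i in range(5)}
level = update(layers, vectors, entry_point=0, slot=2,
               new_vector=(3.4,), m=2, ef_construction=3)
print(level, layers[0])
```

Note that point 1 still holds a stale edge to the reused slot after this call; repairing such mutual neighbors is the job of Algorithm 2, which runs before this step.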

V-A Datasets

We used four datasets with different sizes and dimensions, as shown in Table I. These datasets are widely used to evaluate ANNS methods, and we accessed them through a public repository [30] (https://www.cse.cuhk.edu.hk/systems/hash/gqr/datasets.html). The Sift2M dataset consists of the first two million vectors of the Sift1B dataset [31], which contains 1 billion SIFT descriptors of dimensionality 128.

TABLE I: Data statistics
Dataset      Sift        Gist        ImageNet    Sift2M
Base size    1,000,000   1,000,000   2,340,373   2,000,000
Dimension    128         960         150         128

V-B Baselines

Given that most memory-based ANNS methods [7, 8, 32] do not support real-time update operations, our primary comparison centers on contrasting our methods with HNSW’s replaced_update algorithm. Our methods, based mainly on Algorithm 2, focus on reselecting neighbors only for points mutually connected with the deleted points, except for MN-THN-RU, which will be discussed later. These methods are collectively termed Mutual Neighbor Replaced Update (MN-RU) with distinctive suffixes. Hereafter, we introduce both the baseline methods and our strategy:

  • HNSW replaced_update algorithm (HNSW-RU): This method, derived from the original HNSW algorithm [12], replaces deleted nodes within the index to preserve the structure and efficacy of the HNSW graph. Its implementation is available at https://github.com/nmslib/hnswlib.git. The HNSW index is constructed on four datasets. For the Sift and Sift2M datasets [31], the parameters are $M=16$ and $ef\_construction=200$. For the Gist dataset [31], they are $M=32$ and $ef\_construction=600$. For the ImageNet dataset, they are $M=64$ and $ef\_construction=800$. The remaining methods use the same parameter settings.

  • Mutual Neighbor replaced_update $\alpha$ (MN-RU $\alpha$): This approach selects the point set $P$ with mutual connections to the deleted points for neighbor reselection. The candidate neighbor set $C$ for $P$ comprises the neighbors of the deleted points and their neighbors. To enhance efficiency, the method applies the $\alpha$-RNG pruning strategy [27] with $\alpha=1$.

  • Mutual Neighbor replaced_update $\beta$ (MN-RU $\beta$): Like the previous method, MN-RU $\beta$ selects the point set $P$ with mutual connections to the deleted points for neighbor reselection. For each point $v$ in $P$, the candidate new neighbor set $C$ consists of the neighbors of the deleted points and $v$’s original neighbors. The $\alpha$-RNG pruning strategy [27] is applied with $\alpha=1$.

  • Mutual Neighbor replaced_update $\gamma$ (MN-RU $\gamma$): MN-RU $\gamma$ is identical to MN-RU $\beta$ except that the $\alpha$ parameter of the $\alpha$-RNG pruning strategy is set to 1.1 ($\alpha=1.1$). This method directly implements Algorithm 2.

  • Mutual Neighbor And Two Hop Neighbor replaced_update (MN-THN-RU): A variant of MN-RU $\gamma$, MN-THN-RU reselects the neighbors of points that are mutually connected with the deleted points and of neighbors of neighbors of the deleted points that have a connection to the deleted point.
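The difference between $\alpha=1$ (MN-RU $\alpha$ and $\beta$) and $\alpha=1.1$ (MN-RU $\gamma$) can be illustrated with a small pruning sketch. It assumes the pruning rule commonly associated with $\alpha$-RNG in [27], namely that a candidate $c$ of point $p$ is dropped when some already kept neighbor $r$ satisfies $\alpha \cdot d(r, c) \le d(p, c)$; the function name and the toy points are hypothetical.

```python
import math


def alpha_rng_prune(p, candidates, points, m, alpha):
    """Keep at most m candidates of p, scanning nearest-first and dropping c
    when a kept neighbor r satisfies alpha * d(r, c) <= d(p, c)."""
    ordered = sorted(candidates, key=lambda c: (math.dist(points[p], points[c]), c))
    kept = []
    for c in ordered:
        if len(kept) == m:
            break
        d_pc = math.dist(points[p], points[c])
        if all(alpha * math.dist(points[r], points[c]) > d_pc for r in kept):
            kept.append(c)
    return kept


# Toy 2-D points; point 0 reselects its neighbors from candidates 1, 2 and 3.
points = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (2.0, 0.2), 3: (0.6, 3.0)}
strict = alpha_rng_prune(0, [1, 2, 3], points, m=16, alpha=1.0)
relaxed = alpha_rng_prune(0, [1, 2, 3], points, m=16, alpha=1.1)
print(strict, relaxed)
```

In this example point 3 is only marginally closer to the kept neighbor 1 than to point 0, so it is pruned with $\alpha=1$ but survives with $\alpha=1.1$. A slightly larger $\alpha$ therefore tends to keep more edges.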

To control query recall in our experiments, we use $ef\_$ to denote the priority queue size in the HNSW search process, which balances query accuracy and efficiency. A larger $ef\_$ value yields higher recall but longer running time. Additionally, $K$ denotes the number of returned points.

V-C Experimental Scenarios And Metrics

We compared our methods against the HNSW replaced_update algorithm in the following three scenarios.

  • Full_Coverage: In this scenario, which covers the Gist, Sift, and ImageNet datasets, each dataset is split into 100 parts and we run 100 iterations. Each iteration deletes and reinserts one part, so that the updates cover the entire dataset. This assesses the impact of complete coverage on the index structure and performance.

  • Random: For the Gist, Sift, and ImageNet datasets, we conduct 200 iterations. Within each iteration, 10,000 labels are randomly generated for deletion and reinsertion, facilitating the evaluation of method performance and robustness in the face of random data manipulations.

  • New_Data: Focusing on the Sift2M dataset, we initialize the index with the first million data points and conduct 10 iterations. Each iteration deletes 100,000 points from the first million and inserts 100,000 points from the second million. This process aims to assess the index’s performance and adaptability with continuous data introduction. The final index consists of the second million data points.
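The three update schedules above can be sketched as label generators. The code is an illustration under assumptions that the text does not state: labels are taken to be the integers 0 to n-1, the last full_coverage part absorbs any remainder, the random labels within one iteration are distinct, and new_data deletes the first million in sequential blocks. All function names are hypothetical.

```python
import random


def full_coverage_batches(n_points, n_parts=100):
    """Yield n_parts disjoint label ranges that together cover the dataset."""
    size = n_points // n_parts
    for i in range(n_parts):
        end = n_points if i == n_parts - 1 else (i + 1) * size
        yield range(i * size, end)


def random_batches(n_points, n_iters=200, batch=10_000, seed=0):
    """Yield n_iters lists of `batch` distinct random labels."""
    rng = random.Random(seed)
    for _ in range(n_iters):
        yield rng.sample(range(n_points), batch)


def new_data_batches(n_iters=10, batch=100_000, first_new=1_000_000):
    """Yield (labels_to_delete, labels_to_insert) pairs for the Sift2M setup."""
    for i in range(n_iters):
        old = range(i * batch, (i + 1) * batch)
        new = range(first_new + i * batch, first_new + (i + 1) * batch)
        yield old, new


parts = list(full_coverage_batches(2_340_373))        # ImageNet base size
print(len(parts), sum(len(p) for p in parts))
```

Each yielded batch would be deleted and then reinserted (or, for new_data, replaced by the paired new labels) before the next iteration starts.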

Figure 6: Update time of different methods across various datasets in full_coverage scenarios. Panels: (a) Gist, (b) ImageNet, (c) Sift.
Figure 7: Update time of different methods across various datasets in random scenarios. Panels: (a) Gist, (b) ImageNet, (c) Sift.

This comparison relies on two key metrics: update time and the growth of unreachable points. A more efficient method has lower update times, and a more robust method shows smaller growth in unreachable points per iteration.

V-D Update Time Efficiency

This section compares our methods with the native HNSW-RU method in terms of update time. The experiments were conducted in the three scenarios described in Section V-C: full_coverage, random, and new_data. The results are shown in Figures 6, 7, and 13(a).

In the figures, a lower curve corresponds to more efficient update operations. The MN-RU $\alpha$, MN-RU $\beta$, MN-RU $\gamma$, and MN-THN-RU methods consistently achieve lower update times than HNSW-RU and are 2-4 times faster in all scenarios. This advantage stems from their lower time complexity, as detailed in Section IV-C.

V-E Unreachable Points Growth

In this section, we compare our methods with the native HNSW-RU method in terms of the growth of unreachable points. The experiments were conducted in the three scenarios described in Section V-C: full_coverage, random, and new_data. The results are shown in Figures 8, 9, and 13(b).

In the figures, a method with fewer unreachable points after the same number of iterations is considered superior. Over a prolonged period of update operations, the number of unreachable points increases for every method. For HNSW-RU, after the 200 iterations of the random scenario, unreachable points grow to approximately 3% to 4% of the Gist dataset and 2% to 3% of the ImageNet dataset. The MN-RU $\gamma$ and MN-THN-RU methods maintain fewer unreachable points than the other methods, leading us to conclude that they are more effective.
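One way to count unreachable points is a breadth-first traversal from the entry point. The sketch below is a simplification and rests on assumptions: it treats a single layer as a directed graph, regards a point as unreachable when no directed path leads to it from the entry point, and ignores points still marked as deleted. The paper’s own definition is given in an earlier section and may differ in detail; the function name is hypothetical.

```python
from collections import deque


def count_unreachable(graph, entry_point, deleted=()):
    """Count live points that a BFS over directed edges never visits."""
    seen = {entry_point}
    queue = deque([entry_point])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    live = set(graph) - set(deleted)
    return len(live - seen)


# Point 3 has an out-edge but no in-edge, so no search can ever arrive at it.
graph = {0: [1], 1: [0, 2], 2: [1], 3: [2]}
unreachable = count_unreachable(graph, entry_point=0)
print(unreachable, f"{unreachable / len(graph):.0%}")
```

The percentages reported for Gist and ImageNet correspond to this count divided by the dataset size.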

Figure 8: Growth of unreachable points across various datasets in full_coverage scenarios. Panels: (a) Gist, (b) ImageNet, (c) Sift.
Figure 9: Growth of unreachable points across various datasets in random scenarios. Panels: (a) Gist, (b) ImageNet, (c) Sift.

To understand the impact of unreachable points on search performance after updates, we conducted experiments on the Gist and ImageNet datasets. Figures 8 and 9 show that, after several updates, unreachable points can occupy a significant portion of these datasets, making the impact more observable.

Therefore, we investigated the search performance of indices built on the Gist and ImageNet datasets under the full_coverage and random scenarios after updates, to assess how unreachable points affect recall. The original Gist query set has only 1,000 queries and the ImageNet query set has 200, making it difficult to cover the entire dataset, especially for small values of $K$ ($K \leq 100$). This limits our ability to observe the impact of unreachable point growth on search accuracy.

To address this limitation, we adopted a more straightforward approach to test the impact of unreachable points on final recall: we used the entire original dataset as the query set and performed nearest neighbor searches with $K=1$. Varying the parameter $ef\_$ to trade off recall against search time, we obtained the results shown in Figures 10 and 11. The results indicate that MN-RU $\gamma$ and MN-THN-RU outperform HNSW-RU, achieving higher recall within the same search time. This demonstrates that our methods effectively limit the number of unreachable points and thereby improve search accuracy, especially after many update operations.
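The reasoning behind this test setup can be sketched as follows. Assuming the dataset contains no duplicate vectors, each point’s true nearest neighbor is the point itself when the dataset is used as the query set, so an unreachable point can never be returned and directly lowers recall at $K=1$. The function `recall_at_k` and the toy results are hypothetical and only illustrate the metric.

```python
def recall_at_k(results, ground_truth, k):
    """Average fraction of the true top-k labels found in the returned top-k."""
    hits = 0
    for found, truth in zip(results, ground_truth):
        hits += len(set(found[:k]) & set(truth[:k]))
    return hits / (k * len(ground_truth))


# Five points queried against their own index; the true answer for query i is i.
ground_truth = [[0], [1], [2], [3], [4]]
# Suppose point 3 is unreachable: the search returns some other point instead.
results = [[0], [1], [2], [2], [4]]
recall = recall_at_k(results, ground_truth, k=1)
print(recall)
```

Under the no-duplicates assumption, the recall at $K=1$ in this setup is bounded above by one minus the fraction of unreachable points, which makes the effect of unreachable points directly visible.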

Figure 10: Search performance following update operations in the full_coverage scenario using the Gist and ImageNet datasets (panels (a) and (b)).
Figure 11: Search performance following update operations in the random scenario using the Gist and ImageNet datasets (panels (c) and (d)).
Figure 12: Growth of unreachable points for MN-RU $\gamma$ with the backup index and for HNSW-RU in the full_coverage scenario using the Gist dataset.
Figure 13: Update time (a) and growth of unreachable points (b) using the Sift2M dataset in the new_data scenario.

To further validate our backup index construction method and Algorithm 1, illustrated in Figure 4, we ran an experiment on the Gist dataset under the full_coverage scenario. It compares the growth of unreachable points between MN-RU $\gamma$ with the backup index and the original HNSW-RU. We set the parameter $\tau$ to 40,000, which means that the backup index was rebuilt after every four iterations. As shown in Figure 12, MN-RU $\gamma$ with the backup index significantly reduces the number of unreachable points compared to the original HNSW-RU, and thus outperforms this baseline.
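The relation between $\tau$ and the rebuild interval can be checked with a short calculation. The sketch assumes that $\tau$ counts the points replaced since the last rebuild and that each full_coverage iteration on Gist replaces 1,000,000 / 100 = 10,000 points; the function name is hypothetical.

```python
def rebuild_iterations(tau, points_per_iteration, n_iterations):
    """Return the iterations after which the backup index would be rebuilt."""
    rebuilds, pending = [], 0
    for iteration in range(1, n_iterations + 1):
        pending += points_per_iteration     # points replaced in this iteration
        if pending >= tau:                  # threshold reached: rebuild and reset
            rebuilds.append(iteration)
            pending = 0
    return rebuilds


schedule = rebuild_iterations(tau=40_000, points_per_iteration=10_000, n_iterations=100)
print(schedule[:3], len(schedule))
```

With these numbers the threshold is reached after iterations 4, 8, 12 and so on, which matches the stated rebuild after every four iterations.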

VI Conclusion

This paper addresses the limitations of HNSW’s replaced_update under real-time deletions and insertions, which lead to the ‘unreachable points phenomenon’ and reduced efficiency. Our proposed MN-THN-RU and MN-RU $\gamma$ algorithms, combined with the backup index method, mitigate these issues by making mixed operations more efficient and maintaining better graph connectivity. Extensive experiments demonstrate that our methods outperform the native HNSW strategy, significantly reducing the number of unreachable points and improving update speed across various datasets. These improvements make our approach practical for real-world applications that require dynamic real-time updates and high search accuracy.

References

  • [1] M. Wang, X. Xu, Q. Yue, and Y. Wang, “A comprehensive survey and experimental comparison of graph-based approximate nearest neighbor search,” Proc. VLDB Endow., vol. 14, no. 11, pp. 1964–1978, Jul. 2021. [Online]. Available: https://doi.org/10.14778/3476249.3476255
  • [2] R. Chen, B. Liu, H. Zhu, Y. Wang, Q. Li, B. Ma, Q. Hua, J. Jiang, Y. Xu, H. Deng, and B. Zheng, “Approximate nearest neighbor search under neural similarity metric for large-scale recommendation,” in Proceedings of the 31st ACM International Conference on Information & Knowledge Management, ser. CIKM ’22. New York, NY, USA: Association for Computing Machinery, 2022, pp. 3013–3022. [Online]. Available: https://doi.org/10.1145/3511808.3557098
  • [3] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, S. Riedel, and D. Kiela, “Retrieval-augmented generation for knowledge-intensive NLP tasks,” in Proceedings of the 34th International Conference on Neural Information Processing Systems, ser. NIPS ’20. Red Hook, NY, USA: Curran Associates Inc., 2020.
  • [4] M. Muja and D. G. Lowe, “Scalable nearest neighbor algorithms for high dimensional data,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, pp. 2227–2240, 2014. [Online]. Available: https://api.semanticscholar.org/CorpusID:206765442
  • [5] Q.-Y. Jiang and W.-J. Li, “Scalable graph hashing with feature transformation.” in IJCAI, vol. 15, 2015, pp. 2248–2254.
  • [6] J. Johnson, M. Douze, and H. Jégou, “Billion-scale similarity search with GPUs,” IEEE Transactions on Big Data, vol. 7, no. 3, pp. 535–547, 2019.
  • [7] C. Fu, C. Xiang, C. Wang, and D. Cai, “Fast approximate nearest neighbor search with the navigating spreading-out graph,” Proc. VLDB Endow., vol. 12, no. 5, pp. 461–474, Jan. 2019. [Online]. Available: https://doi.org/10.14778/3303753.3303754
  • [8] K. Lu, M. Kudo, C. Xiao, and Y. Ishikawa, “HVS: Hierarchical graph structure based on Voronoi diagrams for solving approximate nearest neighbor search,” Proc. VLDB Endow., vol. 15, no. 2, pp. 246–258, Oct. 2021. [Online]. Available: https://doi.org/10.14778/3489496.3489506
  • [9] S. Gollapudi, N. Karia, V. Sivashankar, R. Krishnaswamy, N. Begwani, S. Raz, Y. Lin, Y. Zhang, N. Mahapatro, P. Srinivasan, A. Singh, and H. V. Simhadri, “Filtered-DiskANN: Graph algorithms for approximate nearest neighbor search with filters,” in Proceedings of the ACM Web Conference 2023, ser. WWW ’23. New York, NY, USA: Association for Computing Machinery, 2023, pp. 3406–3416. [Online]. Available: https://doi.org/10.1145/3543507.3583552
  • [10] M. Wang, W. Xu, X. Yi, S. Wu, Z. Peng, X. Ke, Y. Gao, X. Xu, R. Guo, and C. Xie, “Starling: An I/O-efficient disk-resident graph index framework for high-dimensional vector similarity search on data segment,” Proc. ACM Manag. Data, vol. 2, no. 1, Mar. 2024. [Online]. Available: https://doi.org/10.1145/3639269
  • [11] M. Aumüller, E. Bernhardsson, and A. Faithfull, “Ann-benchmarks: A benchmarking tool for approximate nearest neighbor algorithms,” Information Systems, vol. 87, p. 101374, 2020. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0306437918303685
  • [12] Y. A. Malkov and D. A. Yashunin, “Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 4, pp. 824–836, 2020.
  • [13] L. Chen, Y. Gao, X. Li, C. S. Jensen, and G. Chen, “Efficient metric indexing for similarity search,” in 2015 IEEE 31st International Conference on Data Engineering, 2015, pp. 591–602.
  • [14] H. V. Jagadish, B. C. Ooi, K.-L. Tan, C. Yu, and R. Zhang, “idistance: An adaptive b+-tree based indexing method for nearest neighbor search,” ACM Transactions on Database Systems (TODS), vol. 30, no. 2, pp. 364–397, 2005.
  • [15] A. Gionis, P. Indyk, R. Motwani et al., “Similarity search in high dimensions via hashing,” in Vldb, vol. 99, no. 6, 1999, pp. 518–529.
  • [16] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni, “Locality-sensitive hashing scheme based on p-stable distributions,” in Proceedings of the twentieth annual symposium on Computational geometry, 2004, pp. 253–262.
  • [17] K. Lu, H. Wang, W. Wang, and M. Kudo, “Vhp: approximate nearest neighbor search via virtual hypersphere partitioning,” Proceedings of the VLDB Endowment, vol. 13, no. 9, pp. 1443–1455, 2020.
  • [18] Y. Lei, Q. Huang, M. Kankanhalli, and A. K. Tung, “Locality-sensitive hashing scheme based on longest circular co-substring,” in Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, 2020, pp. 2589–2599.
  • [19] B. Zheng, Z. Xi, L. Weng, N. Q. V. Hung, H. Liu, and C. S. Jensen, “Pm-lsh: A fast and accurate lsh framework for high-dimensional approximate nn search,” Proceedings of the VLDB Endowment, vol. 13, no. 5, pp. 643–655, 2020.
  • [20] K. Lu and M. Kudo, “R2lsh: A nearest neighbor search scheme based on two-dimensional projected spaces,” in 2020 IEEE 36th International Conference on Data Engineering (ICDE).   IEEE, 2020, pp. 1045–1056.
  • [21] A. Babenko and V. Lempitsky, “Additive quantization for extreme vector compression,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 931–938.
  • [22] ——, “Tree quantization for large-scale similarity search and classification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4240–4248.
  • [23] W. Li, Y. Zhang, Y. Sun, W. Wang, M. Li, W. Zhang, and X. Lin, “Approximate nearest neighbor search on high dimensional data—experiments, analyses, and improvement,” IEEE Transactions on Knowledge and Data Engineering, vol. 32, no. 8, pp. 1475–1488, 2019.
  • [24] N. Lee, J. Lee, and C. Park, “Augmentation-free self-supervised learning on graphs,” in Proceedings of the AAAI conference on artificial intelligence, vol. 36, no. 7, 2022, pp. 7372–7380.
  • [25] R. S. Oyamada, L. C. Shimomura, S. Barbon Jr, and D. S. Kaster, “A meta-learning configuration framework for graph-based similarity search indexes,” Information Systems, vol. 112, p. 102123, 2023.
  • [26] F. Groh, L. Ruppert, P. Wieschollek, and H. P. Lensch, “Ggnn: Graph-based gpu nearest neighbor search,” IEEE Transactions on Big Data, vol. 9, no. 1, pp. 267–279, 2022.
  • [27] A. Singh, S. J. Subramanya, R. Krishnaswamy, and H. V. Simhadri, “Freshdiskann: A fast and accurate graph-based ANN index for streaming similarity search,” CoRR, vol. abs/2105.09613, 2021. [Online]. Available: https://arxiv.org/abs/2105.09613
  • [28] Q. Chen, B. Zhao, H. Wang, M. Li, C. Liu, Z. Li, M. Yang, and J. Wang, “Spann: Highly-efficient billion-scale approximate nearest neighborhood search,” in Advances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, Eds., vol. 34.   Curran Associates, Inc., 2021, pp. 5199–5212. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2021/file/299dc35e747eb77177d9cea10a802da2-Paper.pdf
  • [29] Y. Xu, H. Liang, J. Li, S. Xu, Q. Chen, Q. Zhang, C. Li, Z. Yang, F. Yang, Y. Yang, P. Cheng, and M. Yang, “SPFresh: Incremental in-place update for billion-scale vector search,” in Proceedings of the 29th Symposium on Operating Systems Principles, ser. SOSP ’23. New York, NY, USA: Association for Computing Machinery, 2023, pp. 545–561. [Online]. Available: https://doi.org/10.1145/3600006.3613166
  • [30] J. Li, X. Yan, J. Zhang, A. Xu, J. Cheng, J. Liu, K. K. Ng, and T.-c. Cheng, “A general and efficient querying method for learning to hash,” in Proceedings of the 2018 International Conference on Management of Data, 2018, pp. 1333–1347.
  • [31] L. Amsaleg and H. Jegou, “Datasets for approximate nearest neighbor search,” http://corpus-texmex.irisa.fr/, 2010.
  • [32] C. Fu, C. Wang, and D. Cai, “High dimensional similarity search with satellite system graph: Efficiency, scalability, and unindexed query compatibility,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 8, pp. 4139–4150, 2022.