Enhancing HNSW Index for Real-Time Updates: Addressing Unreachable Points and Performance Degradation

1st Wentao Xiao, SCSE, UESTC, Chengdu, China, WentaoXiao1234@gmail.com
2nd Yueyang Zhan, SCSE, UESTC, Chengdu, China, yueyang.zhan@std.uestc.edu.cn
3rd Rui Xi*, SCSE, UESTC, Chengdu, China, ruix.ryan@gmail.com
4th Mengshu Hou, SCSE, UESTC, Chengdu, China, mshou@uestc.edu.cn
5th Jianming Liao, SCSE, UESTC, Chengdu, China, liaojm@uestc.edu.cn
*Corresponding author

Abstract

The approximate nearest neighbor search (ANNS) is a fundamental and essential component in data mining and information retrieval, with graph-based methodologies demonstrating superior performance compared to alternative approaches. Extensive research efforts have been dedicated to improving search efficiency by developing various graph-based indices, such as HNSW (Hierarchical Navigable Small World). However, the performance of HNSW and most graph-based indices becomes unacceptable when faced with a large number of real-time deletions, insertions, and updates. Furthermore, during update operations, HNSW can leave some data points unreachable, a situation we refer to as the ‘unreachable points phenomenon’. This phenomenon could significantly affect the search accuracy of the graph in certain situations.

To address these issues, we present efficient measures to overcome the shortcomings of HNSW, specifically addressing poor performance over long periods of delete and update operations and resolving the issues caused by the unreachable points phenomenon. Our proposed MN-RU algorithm effectively improves update efficiency and suppresses the growth rate of unreachable points, ensuring better overall performance and maintaining the integrity of the graph. Our results demonstrate that our methods outperform existing approaches. Furthermore, since our methods are based on HNSW, they can be easily integrated with existing indices widely used in the industrial field, making them practical for future real-world applications. Code is available at https://github.com/xwt1/MN-RU.git

Index Terms:

Data Mining, Information Retrieval, Incremental Update, Freshness, Graph-Based Index

I Introduction

The approximate nearest neighbor search (ANNS) has emerged as a focal point in diverse application areas encompassing data mining, information retrieval, and recommendation systems ([1, 2]). Notably, deep learning models, such as large language models (LLMs), possess the capability to encode various data types, ranging from textual documents to visual and auditory inputs, into vector representations. Innovative systems like Retrieval-Augmented Generation (RAG) ([3]) leverage ANNS to efficiently retrieve relevant documents or information, integrating them with generative model outputs to augment the precision and relevance of the generated content. The basic ANNS can be defined as follows: Given a data set $P\subseteq\mathbb{R}^{d}$ with $n$ points in some metric space, a query data point $q\in\mathbb{R}^{d}$ , and $k\in\mathbb{N}$ , we seek to efficiently find a set $L$ that approximates the $k$ nearest neighbors of $q$ in $P$ , while maximizing $\text{recall@}k=\frac{|G\cap L|}{|G|}$ , where $|G|=|L|=k$ . Here, $G$ is the ground truth set of $q$ ’s $k$ nearest neighbors in $P$ . Effective indexing is crucial in this context, as it significantly enhances the efficiency and accuracy of the search process, enabling the rapid and reliable retrieval of nearest neighbors from large datasets.
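The recall@$k$ measure defined above can be illustrated with a minimal Python sketch. The function name `recall_at_k` and the toy label lists are illustrative assumptions, not part of any library or of our implementation.

```python
def recall_at_k(ground_truth, retrieved, k):
    """recall@k = |G ∩ L| / |G|, with |G| = |L| = k."""
    g = set(ground_truth[:k])   # G: true k nearest neighbors
    l = set(retrieved[:k])      # L: neighbors returned by the index
    return len(g & l) / len(g)


# Toy labels (hypothetical): three of the five true neighbors are returned.
truth = [3, 7, 1, 9, 4]
found = [3, 1, 8, 9, 2]
score = recall_at_k(truth, found, k=5)
```

Here `score` is the fraction of the true neighbors that appear in the returned set, regardless of their order.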

Based on the theory and concepts of ANNS, a large number of indexing methods have been designed and can generally be classified into four types: tree structure-based index [4], hashing-based index [5], quantization-based index [6], and graph-based index ([7, 8, 9, 10]). Among these, graph-based methods demonstrate superior empirical search performance compared to others ([11]).

Nevertheless, despite the commendable search performance exhibited by certain early graph-based techniques, a notable limitation lies in their static nature ([7, 8]), rendering them incapable of accommodating the real-time modifications requisite in practical applications. For instance, consider the RAG system, which relies on kNN-based methodologies to transform documents into vector representations. In scenarios where users seek to alter the content within the system, such modifications entail the insertion of new data points and the removal of existing ones. Consequently, a static indexing mechanism cannot handle real-time updates, and alternative approaches are needed. In Section II, we review the strategies devised to mitigate the constraints associated with static graph-based indices.

HNSW (Hierarchical Navigable Small World) is a widely used and well-established graph-based index within the industry ([12]), lauded for its exceptional performance in both search precision and computational efficiency. However, according to the findings and experimental analysis in our preliminary study (see details in section III), it has come to light that the current iteration of the HNSW index manifests two notable shortcomings following a substantial sequence of delete, insert, and query operations. The first and primary concern arises from the utilization of the markDelete algorithm and the replaced_update insertion algorithm (hereafter referred to as replaced_update for brevity), leading to a scenario where specific data points may become inaccessible, a phenomenon we identify as the ‘unreachable points phenomenon.’

This phenomenon can pose problems in practical applications. Imagine two scenarios in typical systems: If certain data points representing online stores in an index become inaccessible after a series of modifications, such as insertions and deletions, it would be highly frustrating for the owners of those online stores. This is especially critical for store owners who rely on recommendation systems; if their stores can never be recommended due to the unreachable points phenomenon, despite being the most relevant to client requests, it would significantly impact their business. Similarly, in a RAG system, if a user seeks knowledge on a specific topic, they would include relevant keywords in their query. However, due to the unreachable points phenomenon, if the data points most relevant to those keywords become unsearchable, the accuracy of the retrieved results will be compromised, subsequently affecting the output of the generation model. Therefore, the problem of some points becoming unreachable after insertions and deletions is a real-world concern. The second issue is that we observed the efficiency of native HNSW operations involving mixed deletions and insertions is lower compared to query operations. This inefficiency leads to query delays in scenarios involving mixed deletions, insertions, and queries, indirectly reducing overall query efficiency.

To address the first issue, we propose practical solutions to mitigate this problem. Additionally, the deletion and update strategies we designed reduce the number of unreachable points and their growth rate after repeated deletions and insertions. To address the second issue, we redesigned the deletion and update strategies for HNSW. Our new method is more efficient for mixed operations than the original approach and also mitigates the impact of the phenomenon described in the first issue.

Our contributions can be summarized as follows:

• Revealing the unreachable points phenomenon in the HNSW index: Through analysis and experimentation, we have identified that HNSW can create unreachable points in the graph after deletion and insertion operations, leading to adverse effects, as previously discussed. We propose suitable practical solutions and mitigate the impact of this phenomenon in our newly designed deletion and update algorithm.
• Improved replaced_update strategy: Building on existing methods, we propose an improved replaced_update algorithm called MN-RU. This algorithm speeds up deletion and insertion operations while also alleviating the unreachable points phenomenon to a significant extent.
• Comprehensive performance evaluations: We implemented a comprehensive strategy and integrated it into a single platform for comparative evaluations. The results validate the superiority of our approach over the native HNSW strategy, offering valuable insights for future practical applications.

II Related Work

II-A Approximate Nearest Neighbor Search Algorithms

Approximate nearest neighbor search (ANNS) has been a central focus of scholarly research for the past two decades, and many studies have developed effective methodologies to address this complex problem.

In space partitioning methods, such as tree structures, data space is initially recursively divided into multiple regions to facilitate the construction of tree or forest-based indices. These methods are commonly categorized into hierarchical structures and reference point-based structures. While effective for low-dimensional datasets, their performance significantly deteriorates with increasing data dimensionality. Beyond 10 dimensions, tree-based space partitioning methods exhibit reduced efficiency, often slower than brute-force linear-scan methods. Strategies like the pyramid technique [13] and iDistance [14] have been proposed to combat the ‘curse of dimensionality’.

Initially developed for addressing ANNS in the Hamming space [15], hash-based methods were later extended to the Euclidean space [16]. These methods project high-dimensional data points into lower-dimensional spaces using hash functions, so that nearest neighbors can be identified efficiently. Despite their efficacy, hash-based methods often require numerous hash tables for satisfactory search outcomes, leading to the emergence of variants like VHP [17], LCCS-LSH [18], PMLSH [19], and R2LSH [20].

Product quantization (PQ) is a widely used quantization-based method for accelerating search processing by compressing input vectors into compact codes for memory-efficient dataset processing. It also provides effective techniques for estimating distances between raw vectors and compressed codes, improving distance estimation accuracy. Diverse quantization-based approaches [21, 22] have been devised to mitigate quantization errors and to improve query precision, focusing on refining quantization techniques.

Graph-based methods [23, 24, 25, 26] have emerged as potent solutions for high-dimensional ANNS, constructing efficient indexing structures. However, constructing exact KNN graphs becomes exponentially complex with increasing nodes. Approximated KNN graph construction has been explored as an alternative, offering similar efficacy as exact graphs in specific applications. Despite this, memory constraints in graph construction necessitate cluster algorithms for simplified subgraph creation and more efficient query processing. Integration of machine learning and deep learning into graph-based nearest neighbor search methods has advanced search capabilities.

Dynamic update methods have also advanced considerably in recent years. FreshDiskANN ([27]) introduces a StreamingMerge protocol to handle node deletions from the DeleteList and the subsequent insertion of $N$ new points. SPANN [28] is an on-disk vector index that leverages balanced clustering to achieve low tail search latency. Furthermore, SPFresh ([29]) offers higher throughput and lower latency for both search and update operations, ensuring that newly added vectors can be retrieved efficiently and reliably.

II-B HNSW

Previous studies have investigated various strategies aimed at maintaining the freshness of graph-based or hybrid indexes. Notably, HNSW (https://github.com/nmslib/hnswlib.git) serves as an in-memory graph-based index and has developed a robust framework for executing delete and update operations. In the case of a delete operation concerning a label $x_{d}$ , HNSW employs the markDelete algorithm to flag it and transfer it to a designated deleted set. For an insertion operation, on the other hand, the new point can either be inserted directly into the index if the index size remains within the predefined limit, or replace an old deleted label $x_{d}$ . The whole insertion procedure, referred to as replaced_update in this study, can be represented as follows.

1. Check for Deleted Points: Initially, HNSW checks for any points flagged as deleted. If a deleted point is identified, it is designated for replacement.
2. Collect Neighbors: The algorithm gathers the one-hop and two-hop neighbors of the deleted point. For each one-hop neighbor $v_{j}$ , the one-hop and two-hop neighbors and the newly inserted point are considered candidate new neighbors for $v_{j}$ .
3. Prune Candidate Neighbors: Employing a pruning strategy, an optimal neighbor set $N(v_{j})$ is chosen for each one-hop neighbor $v_{j}$ from the pool of candidates.
4. Update Connections: Directed edges are established from each one-hop neighbor $v_{j}$ to the points in $N(v_{j})$ , ensuring graph connectivity.
5. Insert the New Point: The new point is integrated into the index, with connections established based on the rectified neighbor sets.
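The neighbor-collection step (step 2) can be sketched on a toy single-layer graph. This is a simplified illustration under stated assumptions, not the hnswlib code: the graph is a plain dictionary from label to out-neighbors, `collect_repair_candidates` is a hypothetical helper name, and the deleted label is dropped from the pool as a simplification.

```python
def collect_repair_candidates(graph, deleted, new_label):
    """Return {one-hop neighbor v: candidate new neighbors for v}."""
    one_hop = set(graph[deleted])
    two_hop = set()
    for v in one_hop:
        two_hop.update(graph[v])
    # Shared pool: one-hop and two-hop neighbors plus the newly inserted point.
    # Dropping the deleted label itself is an assumption of this sketch.
    pool = (one_hop | two_hop | {new_label}) - {deleted}
    # A point is never its own candidate.
    return {v: pool - {v} for v in one_hop}


# Toy graph (hypothetical labels): point 0 is deleted and replaced by point 9.
toy_graph = {0: [1, 2], 1: [0, 3], 2: [0, 4], 3: [1], 4: [2]}
candidates = collect_repair_candidates(toy_graph, deleted=0, new_label=9)
```

Each one-hop neighbor then reselects its out-edges from its candidate set, which is the pruning step (step 3).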

The above-described processes are implemented within the HNSW source code, particularly encompassing the functions addPoint, updatePoint, and repairConnectionsForUpdate. The significance of the replaced_update method lies in its ability to manage the index efficiently. Without it, inserting new data points would leave deleted entries in the index, wasting storage and causing unnecessary expansion. By using replaced_update, deleted points’ storage is reused for new insertions, preventing the index from growing excessively and maintaining efficient space utilization.

Our approach builds on the foundational framework of HNSW, adding enhancements that improve the efficiency of dynamic updates and mitigate the issue of unreachable points. By optimizing deletions and insertions, we aim to reduce latency and promote better graph connectivity.

III Issues in Update Efficiency and Unreachable Points

Figure 1: Comparison of query efficiency and replaced_update efficiency at a given recall level on three public datasets.

Although the replaced_update method lets HNSW substitute deleted points with new entries, thus facilitating real-time point deletion and space reclamation, its operational efficiency remains subpar. As illustrated in Figure 1, at a recall of approximately 90%, the efficiency of replaced_update lags notably behind the search efficiency of queries on three public datasets. On datasets such as GIST and ImageNet, insertion with the replaced_update method is slower by a factor of 5 to 10. By analyzing the replaced_update procedure, we also found that it may give rise to the ‘unreachable points phenomenon’ after repeated deletions and insertions; that is, some points become inaccessible. Furthermore, this operational inefficiency compounds the issue of the unreachable points phenomenon. Next, we examine the underlying causes of the unreachable points phenomenon and conduct an empirical experiment to validate our hypothesis.

III-A Causes of Unreachable Points

Definition 1

An unreachable point is defined as a point that possesses outgoing edges but lacks incoming edges in all layers of the navigable small-world graph.

In Figure 2, node $v$ has only one incoming edge from node $d$ . After marking node $d$ for deletion and applying the HNSW replaced_update strategy to remove $d$ , node $v$ lacks any incoming edges. In other words, unless the node $v$ serves as the entry point in the HNSW structure, it will remain unvisited in subsequent search operations.

It is evident that a point with an in-degree of zero cannot be found during the search process. During the construction of the HNSW index, the likelihood of forming unreachable points is exceedingly small. However, if the replaced_update algorithm used by HNSW is applied to replace old points with newly inserted points, unreachable points emerge.

In HNSW, the occurrence of some points having an in-degree of zero after the replaced_update operation is attributable to the specific connection repair strategy employed by the algorithm. During a replaced_update in HNSW, the algorithm first gathers the neighbors of the deleted point, referred to as $T$ . It then combines $T$ with their own neighbors and the newly inserted point to form a new candidate neighbor set for each member of $T$ . Finally, new neighbors are selected from the candidate set for each member of $T$ . The edges from members of $T$ to their original neighbors might be deleted during these operations. As a result, these original neighbors may end up with no in-edges after the replaced_update operation. Consequently, these points will not be found during future search processes unless they are the entry point of the HNSW.
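Detecting such points amounts to an in-degree check. The sketch below assumes a single layer stored as a dictionary from label to out-neighbors (Definition 1 requires the condition to hold in every layer); the function name and the toy graphs are illustrative, loosely mirroring the $d$ and $v$ example above.

```python
def unreachable_points(graph, entry_point):
    """Labels with no incoming edge, excluding the entry point."""
    has_in_edge = set()
    for src, neighbors in graph.items():
        for dst in neighbors:
            if dst != src:          # ignore self-loops
                has_in_edge.add(dst)
    return {n for n in graph if n not in has_in_edge and n != entry_point}


# Before the update, v is reachable only through d.
before = {"e": ["d", "a"], "d": ["v", "a"], "a": ["d"], "v": ["a"]}
# After d is replaced and its former edges are gone, nothing points to v.
after = {"e": ["a"], "a": ["e"], "v": ["a"]}

lost_before = unreachable_points(before, entry_point="e")
lost_after = unreachable_points(after, entry_point="e")
```

Note that `v` keeps its outgoing edge in `after`, so it still occupies index space while being invisible to any search that starts from the entry point.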

III-B Demonstration on Specific Datasets

Figure 3 illustrates the presence of unreachable points in HNSW following replaced_update operations. The experiment uses the Sift dataset. In each iteration, 5% of the data is randomly deleted, with the deleted points chosen so that they do not already belong to the set of unreachable points, and the deleted points are then reinserted. The experiment consists of 3000 iterations to track the evolution of unreachable points after each cycle of deletion and insertion.

Figure 3 (a) illustrates that the number of unreachable points in HNSW consistently increases with continuous deletion and reinsertion. After approximately 3000 iterations, the number of unreachable points reaches between 3% and 4% of the original dataset. This number rises with additional iterations, indicating that points already in the unreachable points set will not be searched in future search processes.

Figure 3 (b) shows that recall decreases as the number of iterations increases. With the same configuration, recall declines by about 3% due to the increasing number of unreachable points over iterations. This decline becomes more severe with more iterations, leading to a gradual reduction in the accuracy of the final search. Furthermore, this decrease in accuracy caused by the unreachable points phenomenon cannot be mitigated by adjusting HNSW search parameters, such as increasing the ef_ parameter.

IV Methodology

As shown in Figure 1, the HNSW replaced_update algorithm exhibits poor update performance, leading to the phenomenon of unreachable points, which adversely affects search performance, as illustrated in Figure 3. To address these issues, we propose the back_up_index_construction and dual_search methods to mitigate the growth of unreachable points from the upper-level application. More importantly, we introduce the MN-RU algorithm, designed to improve update efficiency and significantly reduce the incidence of unreachable points within the index. The architecture of our approach is shown in Figure 4, which illustrates the integration of the HNSW index in the upper-level application and the operation of the MN-RU within the HNSW index.

IV-A Dual Indexes Design

In an HNSW index, points becoming unreachable after replaced_update operations can be attributed to the graph’s inherent connectivity and the algorithm’s maintenance processes. Although reconstruction and reconnection can effectively restore the connectivity of the HNSW graph, they also have several drawbacks, such as significant computational cost and time, service interruption, and added system complexity. In practical applications involving high-frequency update scenarios, it is imperative to weigh these disadvantages carefully. This paper considers strategies that address the disconnected-point problem by maintaining graph connectivity while balancing performance and availability.

Diverging from conventional methodologies reliant on a single HNSW index, we introduce an additional index designed explicitly for managing unreachable points within the HNSW indexing system, referred to as the HNSW Backup Index, as illustrated in Figure 4. The motivation is to minimize the computational burden and service disruptions associated with recurrent reindexing. The procedure for creating the Backup Index involves several steps: First, the HNSW K-NN search function is applied to identify the point set $F$ that is closest to the query point $q$ from the current data set $P$ . We set $K$ to $|P|$ to retrieve all the points available in the index. Next, by removing $F$ from the set $P$ , the remaining set of points $U$ , which contains the points not found, is identified and extracted. Subsequently, a new HNSW backup index is constructed for the unfound point set $U$ . This process leverages the multi-level proximity search feature of HNSW to efficiently manage unreachable points, ensuring that the structure and query performance of the original index remain unaffected while reducing the computational overhead and service interruptions typically associated with reconstruction.

Meanwhile, we introduce a threshold $\tau$ to regulate replaced_update operations. When the number of replaced_update operations exceeds $\tau$ , it triggers the reconstruction of the HNSW backup index. The upper application layer can adjust the value of $\tau$ according to specific needs. In our implementation, the value of $\tau$ was empirically configured to 40000.
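The backup construction and the threshold rule can be sketched as follows. This is a minimal illustration under assumptions: `knn_search_all` stands for any callable that runs a K-NN search with the given $K$ and returns the labels it found, `BackupRebuildTrigger` is a hypothetical helper, and resetting the counter after a rebuild is our simplification.

```python
def unfound_points(all_labels, knn_search_all):
    """U = P minus F, where F is what a search with K = |P| returns."""
    found = set(knn_search_all(len(all_labels)))
    return set(all_labels) - found


class BackupRebuildTrigger:
    """Counts replaced_update operations and signals when tau is exceeded."""

    def __init__(self, tau):
        self.tau = tau
        self.count = 0

    def record_replaced_update(self):
        self.count += 1
        if self.count > self.tau:
            self.count = 0  # resetting after a rebuild is an assumption
            return True
        return False


labels = list(range(6))
fake_search = lambda k: [0, 1, 2, 4][:k]  # hypothetical search that misses 3 and 5
backup_members = unfound_points(labels, fake_search)

trigger = BackupRebuildTrigger(tau=2)
signals = [trigger.record_replaced_update() for _ in range(4)]
```

In this toy run the backup index would be built over the two missed labels, and the third replaced_update is the first one to exceed $\tau=2$.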

Based on this novel index structure, it is imperative to propose new query and maintenance strategies to maintain the accuracy of the results, optimize query performance, and prevent any potential degradation in system efficiency. Next, we will describe these strategies comprehensively.

IV-B Dual Index Search

In order to ensure the query efficiency and the accuracy of the results, we propose a query algorithm called dualSearch, which queries a primary index (primary HNSW index) and a backup index (HNSW index dedicated to managing unreachable points) simultaneously to ensure that the query can cover all possible data points, even if some points no longer appear in the primary index after the replaced_update operations.

Algorithm 1 outlines the search process for dualSearch. Initially, it performs a K-NN search on the primary index ( $HNSW_{main}$ ) using the query point $q$ to retrieve $k$ nearest neighbors in the set of results $R_{main}$ . Concurrently, a similar K-NN search is executed on the backup index ( $HNSW_{backup}$ ), dedicated to managing unreachable points, yielding the results set $R_{backup}$ . Subsequently, the results of both searches are merged into a unified set $R_{combined}$ , which is then sorted according to the distance from point to query to prioritize closer points. Ultimately, the algorithm delivers the top $k$ points from this ordered set, ensuring a precise and effective query result that takes into account both the accessible and the unreachable points.

Algorithm 1 dualSearch($q$, $k$, $ef$)

Input: query point $q$ , number of neighbors $k$ , size of dynamic candidate list $ef$
Output: top $k$ points from the combined search results
Variables: $R_{main}$ : result set from main index; $R_{backup}$ : result set from backup index; $R_{combined}$ : combined result set; $R_{sorted}$ : sorted result set

$R_{main}\leftarrow$ K-NN-SEARCH($HNSW_{main}$, $q$, $k$, $ef$) // K-NN-SEARCH is Algorithm 5 from HNSW
$R_{backup}\leftarrow$ K-NN-SEARCH($HNSW_{backup}$, $q$, $k$, $ef$) // K-NN-SEARCH is Algorithm 5 from HNSW
$R_{combined}\leftarrow R_{main}\cup R_{backup}$ // Combine results
$R_{sorted}\leftarrow$ sort $R_{combined}$ by distance to $q$
return top $k$ points from $R_{sorted}$

This approach improves search accuracy and maintains robust query performance, even with unreachable points. By using a dedicated backup index, the system avoids the pitfalls of traditional single-index structures.
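The merge logic of dualSearch can be sketched in Python. To keep the example self-contained, a brute-force scan stands in for K-NN-SEARCH on each index (so the $ef$ parameter does not appear), and both indexes are plain dictionaries from label to vector; these stand-ins are assumptions of the sketch.

```python
import heapq
import math


def brute_force_knn(points, q, k):
    """Stand-in for K-NN-SEARCH: returns up to k (distance, label) pairs."""
    scored = [(math.dist(vec, q), label) for label, vec in points.items()]
    return heapq.nsmallest(k, scored)


def dual_search(main_points, backup_points, q, k):
    r_main = brute_force_knn(main_points, q, k)
    r_backup = brute_force_knn(backup_points, q, k)
    # Union of both result sets; a label present in both keeps its smaller distance.
    combined = {}
    for dist, label in r_main + r_backup:
        combined[label] = min(dist, combined.get(label, math.inf))
    ranked = sorted((dist, label) for label, dist in combined.items())
    return [label for _, label in ranked[:k]]


main_index = {0: (0.0, 0.0), 1: (5.0, 5.0), 2: (9.0, 9.0)}
backup_index = {3: (1.0, 1.0)}   # a point that became unreachable in the main index
top = dual_search(main_index, backup_index, q=(0.9, 0.9), k=2)
```

Point 3 is the closest to the query and is returned only because the backup index is searched as well.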

IV-C Index Maintenance

When confronted with extensive datasets that undergo frequent updates through insertions and deletions, ensuring the precision of query outcomes requires an efficient strategy to maintain the index that encompasses update, delete, and other operations. This challenging task is pivotal for maintaining the integrity and reliability of the query results.

As mentioned earlier, the original HNSW replaced_update algorithm needs to consider the neighbors of the deleted point (namely, one-hop neighbors) and the neighbors of these neighbors (namely, two-hop neighbors) as potential new neighbors of the one-hop neighbors during the update process. For each one-hop neighbor, the algorithm has to reselect its neighbors from a candidate set of size $O(M^{2})$ , including the one-hop neighbors and the two-hop neighbors. Here, $M$ is a parameter in HNSW that defines the maximum out-degree of a data point within a specific layer. The time complexity of this operation is $O(M^{3})$ per layer, which is significantly time-consuming.

For selecting new neighbors for one-hop neighbors, considering all two-hop neighbors as candidates is inefficient. The reason is as follows: Before update, based on HNSW’s edge selection strategy, if a one-hop neighbor does not have an edge to a particular two-hop neighbor, it means HNSW deemed this edge suboptimal. Since the size of two-hop neighbors is $M^{2}$ and each one-hop neighbor originally has $M$ neighbors, many two-hop neighbors are not part of its original neighbor set. Therefore, establishing edges from one-hop neighbor to such two-hop neighbors was deemed suboptimal by HNSW before update and is likely to remain so after update, as the index structure does not undergo significant changes.
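As a rough worked example of this cost, take $M=16$ (the value used later for the Sift index). The counts below only follow the accounting given above ( $M$ one-hop neighbors, each choosing from about $M^{2}$ candidates); the second line shows, purely for comparison, what the same accounting gives if each candidate set held only about $M$ points. Constants and the cost of individual distance computations are ignored.

```latex
\begin{align*}
\text{two-hop candidates:}\quad & M \cdot M^{2} = M^{3} = 16^{3} = 4096 \\
\text{candidate sets of size } M\text{:}\quad & M \cdot M = M^{2} = 16^{2} = 256 \\
\text{ratio:}\quad & M^{3} / M^{2} = M = 16
\end{align*}
```

Under these assumptions, the gap between the two settings grows linearly with $M$, so it is larger for the indexes built with $M=32$ or $M=64$.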

To address this complexity, we propose a novel algorithm (Algorithm 2) which focuses on restoring the connectivity of one-hop neighbors of the deleted point. In contrast to the original method, Algorithm 2 exclusively updates the neighbors of those one-hop neighbors that include the deleted point in their original neighbor list. For every such point, the candidate new neighbors comprise its original neighbors along with the neighbors of the deleted point. For the reasons described above, two-hop neighbors are not necessarily included in the new neighbor candidate set, ensuring a more efficient and focused neighbor update process. This operation exhibits a time complexity of $O(M^{2})$ per layer, since for each such one-hop neighbor the algorithm has to reselect its neighbors from a candidate set of size $O(M)$ . Therefore, our algorithm is more efficient than the original approach. Figure 5 serves as an illustrative example of Algorithm 2. The complete process of Index Maintenance is detailed below:

Deletion. For deletions, as illustrated in Figure 4, when a point in the index is deleted, our method appends it to a list, deletedList, similar to the process in the original HNSW. The purpose of the deletedList is to manage deleted points efficiently without immediate removal, thus preserving the graph’s structural integrity during updates.

Algorithm 2 MNRepairNeighborConnection($data$, $label$, $\alpha$, $hnsw$)

Input: new point data $data$ , label of the new point $label$ , multilayer graph $hnsw$ , parameter for pruning function $\alpha$
Variables: $deletedPoint$ : point marked as deleted; $L_{max}$ : maximum layer of $deletedPoint$ in $hnsw$ ; $N_{1}$ : neighbor set 1; $N_{2}$ : neighbor set 2; $C$ : combined candidate set

$deletedPoint\leftarrow$ getDeletedPoint($hnsw$) // Get a point marked as deleted
if $deletedPoint$ is NULL then
    insert($data$, $label$, $hnsw$) // Perform normal insertion
    return
end if
$L_{max}\leftarrow$ getMaxLayer($deletedPoint$, $hnsw$) // Get the maximum layer of the deleted point in $hnsw$
for $layer\leftarrow 0$ to $L_{max}$ do
    $N_{1}\leftarrow$ getNeighborhood($deletedPoint$, $layer$) // Get the neighbors of $deletedPoint$
    $N_{2}\leftarrow\emptyset$ // Initialize empty set for $N_{2}$
    for each point $v\in N_{1}$ do
        if edge($v$, $deletedPoint$) exists in $layer$ then
            $N_{2}\leftarrow N_{2}\cup\{v\}$ // Add $v$ to $N_{2}$
        end if
    end for
    for each point $u\in N_{2}$ do
        $C\leftarrow$ getNeighborhood($u$, $layer$) // Get neighbors of $u$
        $C\leftarrow C\cup N_{1}\cup\{label\}$ // Combine candidate sets and add the newly inserted point
        $C\leftarrow$ pruneCandidates($C$, $layer$, $\alpha$) // Prune the candidates using $\alpha$-RNG
        setNeighbors($u$, $C$, $layer$) // Set $u$ ’s neighbors to $C$
    end for
end for
update($deletedPoint$, $data$, $label$) // Update the data of the deleted point
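The per-layer repair loop of Algorithm 2 can be sketched on a toy single-layer graph. This is an illustration under assumptions rather than our C++ implementation: the graph and vectors are dictionaries, the pruning rule is one common form of $\alpha$-RNG pruning (a candidate is kept only if no already-kept neighbor is closer to it, up to the factor $\alpha$, than the base point is), and removing $u$ itself and the deleted label from the candidate set is a simplification of this sketch.

```python
import math


def prune_candidates(base, candidates, vectors, m, alpha=1.0):
    """Assumed alpha-RNG style pruning, keeping at most m neighbors."""
    ordered = sorted(candidates, key=lambda c: math.dist(vectors[base], vectors[c]))
    kept = []
    for c in ordered:
        if len(kept) >= m:
            break
        d_base = math.dist(vectors[base], vectors[c])
        # Keep c only if every kept neighbor s satisfies alpha * d(s, c) > d(base, c).
        if all(alpha * math.dist(vectors[s], vectors[c]) > d_base for s in kept):
            kept.append(c)
    return kept


def mn_repair(graph, vectors, deleted, new_label, m, alpha=1.0):
    """Repair only neighbors that are mutually connected with `deleted`."""
    n1 = list(graph[deleted])                     # N1: out-neighbors of the deleted point
    n2 = [v for v in n1 if deleted in graph[v]]   # N2: those that also point back to it
    for u in n2:
        cand = (set(graph[u]) | set(n1) | {new_label}) - {u, deleted}
        graph[u] = prune_candidates(u, cand, vectors, m, alpha)
    return n2


vectors = {"d": (0, 0), "a": (1, 0), "b": (0, 1), "c": (2, 0), "n": (0.1, 0.1)}
graph = {"d": ["a", "b"], "a": ["d", "c"], "b": ["c"], "c": ["a"]}
repaired = mn_repair(graph, vectors, deleted="d", new_label="n", m=2)
```

In this toy graph only `a` points back to the deleted point `d`, so `a` is the only node whose edges are reselected, while `b` keeps its original neighbor list.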

Insertion. Algorithms 2 and 3, collectively called Mutual Neighbor Replaced Update (MN-RU), update the HNSW index when a new point replaces a deleted one. First, Algorithm 2 repairs the graph’s connectivity; then Algorithm 3 inserts the new point into the index. Unlike the conventional approach of adding new points, Algorithm 3 lets the new point inherit the layer level of the deleted point. After this inheritance, the new point is inserted using the standard HNSW insert process. Algorithm 3 starts by identifying the top layer of the HNSW index and the maximum layer of the deleted point, establishing an initial entry point for the search. From the top layer down to the layer above the deleted point’s maximum layer, it iteratively searches for the point nearest to the new data point and uses it as the entry point for the next layer. Then, from the maximum layer of the deleted point down to the lowest layer, it searches for neighbors, selects neighbors for the new point, and establishes connections. In this way, MN-RU integrates the new element into the HNSW structure.

V Experiments

In this section, we evaluate our algorithm and highlight key experimental observations. Our algorithm and the baseline methods were implemented in C++ and run on a system equipped with an Intel Xeon CPU E5-2678 v3 @ 2.50GHz and 110GB of memory, running Ubuntu 22.04. All computational tasks in this section, including data insertion, deletion, and retrieval, used 40 threads.

Algorithm 3 update(deletedPoint, data, label)

Input: the point marked as deleted $deletedPoint$, the data of the new element $data$, the label of the new element $label$
Variables:
    $L_{max}$: Maximum layer of the deleted point
    $L$: Maximum layer of the current HNSW
    $ep$: Entry point for search
    $W$: List of currently found nearest elements
    $neighbors$: Neighbor points of current point
$L_{max}\leftarrow$ getMaxLayer($deletedPoint$) // Get maximum layer of the deleted point
$L\leftarrow$ getMaxLayer($hnsw$) // Get maximum layer of HNSW
$ep\leftarrow$ getEnterPoint($L$) // Get entry point for the highest layer of HNSW
for $lc\leftarrow L$ down to $L_{max}+1$ do
    $W\leftarrow$ SEARCH-LAYER($data$, $ep$, 1, $lc$) // Search layer to find nearest elements
    $ep\leftarrow$ getNearestElement($W$, $data$) // Update entry point for next layer
end for
for $layer\leftarrow L_{max}$ down to 0 do
    $W\leftarrow$ SEARCH-LAYER($data$, $ep$, $efConstruction$, $layer$) // Search layer to find nearest elements
    $neighbors\leftarrow$ SELECT-NEIGHBORS($data$, $W$, $M$, $layer$) // Select neighbors for the current layer
    addBidirectionalConnections($neighbors$, $data$, $layer$) // Add bidirectional connections using the same strategy as HNSW
    $ep\leftarrow W$ // Update entry point for next layer
end for
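As an illustration of the level-inheriting insertion in Algorithm 3, the following Python sketch runs the two loops on a toy in-memory graph. It is a simplified sketch under stated assumptions, not the authors' C++ implementation: the names `search_layer` and `replaced_insert`, the dictionary-based layer layout, and the plain nearest-m neighbor selection are all hypothetical, the deleted point is simply dropped (the repair of Algorithm 2 is not modeled), and neighbor lists are not capped.

```python
import heapq
import math


def search_layer(vectors, graph, query, entry, ef):
    """Best-first search on one layer; returns up to ef node ids, nearest first."""
    d0 = math.dist(vectors[entry], query)
    visited = {entry}
    candidates = [(d0, entry)]   # min-heap of nodes still to expand
    best = [(-d0, entry)]        # max-heap (negated distances) of results
    while candidates:
        d, node = heapq.heappop(candidates)
        if len(best) >= ef and d > -best[0][0]:
            break                # closest candidate is worse than the worst result
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = math.dist(vectors[nb], query)
            if len(best) < ef or dn < -best[0][0]:
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(best, (-dn, nb))
                if len(best) > ef:
                    heapq.heappop(best)
    return [n for _, n in sorted((-neg, n) for neg, n in best)]


def replaced_insert(vectors, layers, levels, entry, deleted, new_id, data,
                    m=2, ef_construction=4):
    """Insert new_id at the level inherited from `deleted` (toy version)."""
    l_max = levels.pop(deleted)                    # inherited level
    # Toy removal: drop the deleted point and every edge pointing to it.
    # The connectivity repair of Algorithm 2 is NOT modeled here.
    del vectors[deleted]
    for graph in layers:
        graph.pop(deleted, None)
        for nbrs in graph.values():
            nbrs.discard(deleted)
    vectors[new_id], levels[new_id] = data, l_max
    ep = entry
    for lc in range(len(layers) - 1, l_max, -1):   # top layer ... l_max + 1
        ep = search_layer(vectors, layers[lc], data, ep, 1)[0]
    for lc in range(l_max, -1, -1):                # l_max ... 0
        w = search_layer(vectors, layers[lc], data, ep, ef_construction)
        neighbors = w[:m]                          # assumption: plain nearest-m selection
        layers[lc][new_id] = set(neighbors)
        for nb in neighbors:
            layers[lc][nb].add(new_id)             # no degree cap or shrinking here
        ep = w[0]                                  # nearest element seeds the next layer


# Toy index: five 1-D points, point 3 lives on layers 0 and 1.
vectors = {i: (float(i),) for i in range(5)}
levels = {0: 1, 1: 0, 2: 0, 3: 1, 4: 0}
layers = [
    {0: {1}, 1: {0, 2}, 2: {1, 3, 4}, 3: {2, 4}, 4: {2, 3}},  # layer 0
    {0: {3}, 3: {0}},                                          # layer 1
]
replaced_insert(vectors, layers, levels, entry=0, deleted=3, new_id=5, data=(2.6,))
```

In this toy run the new point 5 takes over level 1 from the deleted point 3 and is linked on both layers. A single nearest element is used as the next entry point here, whereas the pseudocode passes the whole set $W$.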

V-A Datasets

We used four datasets with different sizes and dimensions, as shown in Table I. These datasets are widely used for the evaluation of various ANNS methods, and we accessed them through a public repository [30] (https://www.cse.cuhk.edu.hk/systems/hash/gqr/datasets.html). The Sift2M dataset is a subset comprising the first two million vectors extracted from the Sift1B dataset [31], which contains 1 billion SIFT descriptors with a dimensionality of 128.

TABLE I: Data statistics

Dataset	Sift	Gist	ImageNet	Sift2M
Base Size	1,000,000	1,000,000	2,340,373	2,000,000
Dim	128	960	150	128

V-B Baselines

Given that most memory-based ANNS methods [7, 8, 32] do not support real-time update operations, our primary comparison is against HNSW’s replaced_update algorithm. Our methods, based mainly on Algorithm 2, reselect neighbors only for points mutually connected with the deleted points, except for MN-THN-RU, which is discussed later. These methods are collectively termed Mutual Neighbor Replaced Update (MN-RU) with distinctive suffixes. We introduce the baseline method and our variants below:

• HNSW replaced_update algorithm (HNSW-RU): This method, derived from the original HNSW algorithm [12], replaces deleted nodes within the index to maintain the structure and efficacy of the HNSW graph. Its implementation is available at https://github.com/nmslib/hnswlib.git. The HNSW index is constructed on the four datasets. For the Sift and Sift2M datasets [31], the parameters are $M=16$ and $ef\_construction=200$. For the Gist dataset [31], the parameters are $M=32$ and $ef\_construction=600$. For the ImageNet dataset, the parameters are $M=64$ and $ef\_construction=800$. The remaining methods use the same parameter settings.
• Mutual Neighbor replaced_update $\alpha$ (MN-RU $\alpha$): This approach selects the point set $P$ with mutual connections to the deleted points for neighbor reselection. The candidate neighbor set $C$ for $P$ comprises the neighbors of the deleted points and their neighbors. To enhance efficiency, the method applies the $\alpha$-RNG pruning strategy [27] with $\alpha=1$.
• Mutual Neighbor replaced_update $\beta$ (MN-RU $\beta$): Like the previous method, MN-RU $\beta$ selects the point set $P$ with mutual connections to the deleted points for neighbor reselection. For each point $v$ in $P$, the candidate new neighbor set $C$ for $v$ comprises the neighbors of the deleted points and $v$’s original neighbors. The $\alpha$-RNG pruning strategy [27] is applied with $\alpha=1$.
• Mutual Neighbor replaced_update $\gamma$ (MN-RU $\gamma$): MN-RU $\gamma$ is the same as MN-RU $\beta$ except that the $\alpha$ parameter of the $\alpha$-RNG pruning strategy is set to 1.1 ($\alpha=1.1$). This method directly implements Algorithm 2.
• Mutual Neighbor And Two Hop Neighbor replaced_update (MN-THN-RU): A variant of MN-RU $\gamma$, MN-THN-RU reselects neighbors both for points that are mutually connected with the deleted points and for neighbors of neighbors of the deleted points that have a connection to the deleted point.
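The candidate sets and the pruning rule used by the MN-RU variants can be sketched in Python as follows. This is a minimal sketch under assumptions rather than the authors' code: the function names are hypothetical, the graph is a dictionary of directed out-neighbor sets for a single layer and a single deleted point `d`, the deleted point and $v$ itself are excluded from the candidates, and the pruning follows the commonly stated $\alpha$-RNG rule in which a candidate $c$ is dropped once an already kept neighbor $p$ satisfies $\alpha\cdot dist(p,c)\leq dist(v,c)$.

```python
import math


def mutual_neighbors(graph, d):
    """The set P: points that d links to and that link back to d."""
    return {v for v in graph[d] if d in graph[v]}


def candidates_alpha(graph, d, v):
    """MN-RU alpha: neighbors of d plus the neighbors of those neighbors."""
    c = set(graph[d])
    for u in graph[d]:
        c |= graph[u]
    return c - {d, v}      # assumption: d and v are not candidates for v


def candidates_beta(graph, d, v):
    """MN-RU beta / gamma: neighbors of d plus v's original neighbors."""
    return (set(graph[d]) | set(graph[v])) - {d, v}


def alpha_rng_prune(vectors, v, candidates, m, alpha=1.0):
    """Keep at most m candidates, scanning from nearest to farthest from v."""
    def dv(c):
        return math.dist(vectors[v], vectors[c])

    kept = []
    for c in sorted(candidates, key=dv):
        if len(kept) == m:
            break
        # c survives only if no kept neighbor p "covers" it.
        if all(alpha * math.dist(vectors[p], vectors[c]) > dv(c) for p in kept):
            kept.append(c)
    return kept


# Single-layer toy graph; point 0 is the deleted point.
graph = {0: {1, 2}, 1: {0, 3}, 2: {4}, 3: {1}, 4: set()}
P = mutual_neighbors(graph, 0)
c_alpha = candidates_alpha(graph, 0, 1)
c_beta = candidates_beta(graph, 0, 1)

# Effect of alpha: the far point 2 is pruned with alpha=1 but kept with alpha=1.1.
points = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (1.0, 3.0)}
pruned_1 = alpha_rng_prune(points, 0, {1, 2}, m=2, alpha=1.0)
pruned_11 = alpha_rng_prune(points, 0, {1, 2}, m=2, alpha=1.1)
```

In the toy graph only point 1 is mutually connected with the deleted point, and the $\alpha$ variant yields a larger candidate set than the $\beta$ variant. Raising $\alpha$ from 1 to 1.1 relaxes the pruning, so more long-range edges survive.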

To control query recall in our experiments, we use $ef\_$ to denote the priority queue size in the HNSW search process, which trades off query accuracy against efficiency. A larger $ef\_$ value yields higher recall but a longer running time. Additionally, $K$ denotes the number of returned points.

V-C Experimental Scenarios And Metrics

We compared our methods against the HNSW replaced_update algorithm in the following three scenarios.

• Full_Coverage: For the Gist, Sift, and ImageNet datasets, each dataset is split into 100 parts and we run 100 iterations. Each iteration deletes and reinserts one part, so the iterations together cover the entire dataset. This scenario assesses the impact of complete coverage on the index structure and performance.
• Random: For the Gist, Sift, and ImageNet datasets, we run 200 iterations. In each iteration, 10,000 labels are randomly generated for deletion and reinsertion. This scenario evaluates the performance and robustness of the methods under random updates.
• New_Data: For the Sift2M dataset, we initialize the index with the first million data points and run 10 iterations. Each iteration deletes 100,000 points from the first million and inserts 100,000 points from the second million. This scenario assesses the index’s performance and adaptability under the continuous introduction of new data. The final index consists of the second million data points.
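The three update schedules can be written down as small Python generators. This is only a sketch of the scenario descriptions above: the function names are hypothetical, and the choices of contiguous label ranges, sampling without replacement within an iteration, and a fixed seed are assumptions that the text does not specify.

```python
import random


def full_coverage(n, parts=100):
    """Yield one block of labels per iteration; the blocks partition 0..n-1."""
    size = n // parts
    for i in range(parts):
        end = n if i == parts - 1 else (i + 1) * size   # last block takes the remainder
        yield range(i * size, end)


def random_batches(n, iterations=200, batch=10_000, seed=0):
    """Yield `batch` randomly chosen labels per iteration."""
    rng = random.Random(seed)
    for _ in range(iterations):
        yield rng.sample(range(n), batch)


def new_data(first=1_000_000, iterations=10, batch=100_000):
    """Yield (labels deleted from the first block, labels inserted from the second)."""
    for i in range(iterations):
        lo = i * batch
        yield range(lo, lo + batch), range(first + lo, first + lo + batch)


# Small-scale check of the three schedules.
covered = [label for block in full_coverage(1003, parts=100) for label in block]
batches = list(random_batches(1000, iterations=3, batch=10))
steps = list(new_data(first=100, iterations=10, batch=10))
```

Each yielded batch would then be deleted and reinserted (or replaced by the paired new labels) through the update method under test.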

This comparison relies on two key metrics: update time and the growth of unreachable points. A more efficient method has a shorter update time, and a more robust method shows a smaller growth in unreachable points per iteration.
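For intuition about the second metric, the sketch below counts unreachable points on a single layer with a breadth-first traversal. It assumes that a point counts as unreachable when no directed path leads to it from the entry point; this is a simplification for illustration, and the name `count_unreachable` and the single-layer dictionary layout are hypothetical.

```python
from collections import deque


def count_unreachable(graph, entry):
    """Number of nodes with no directed path from `entry`."""
    seen = {entry}
    queue = deque([entry])
    while queue:
        node = queue.popleft()
        for nb in graph[node]:
            if nb not in seen:
                seen.add(nb)
                queue.append(nb)
    return len(graph) - len(seen)


# Point 2 still has an outgoing edge, but nothing points to it any more.
toy_graph = {0: [1], 1: [0], 2: [0]}
unreachable = count_unreachable(toy_graph, entry=0)
```

Tracking this count after every iteration gives the growth curve that the comparison is based on.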

V-D Update Time Efficiency

This section compares our methods with the native HNSW-RU method regarding update time efficiency. The experiments were conducted in the three scenarios described in the previous section: full_coverage, random, and new_data. The results are shown in Figures 6, 7, and 13 (a).

In the figures, a lower curve indicates more efficient update operations. The MN-RU $\alpha$, MN-RU $\beta$, MN-RU $\gamma$, and MN-THN-RU methods consistently achieve lower update times than the HNSW-RU method and are 2-4 times faster in all scenarios. This advantage stems from their lower time complexity, as detailed in Section IV-C.

V-E Unreachable Points Growth

In this section, we compare our methods with the native HNSW-RU method in terms of the growth of unreachable points. The experiments were conducted in the three scenarios described in the previous section: full_coverage, random, and new_data. The results are shown in Figures 8, 9, and 13 (b).

In the figures, a method with fewer unreachable points after the same number of iterations is considered superior. Over a prolonged period of update operations, the number of unreachable points increases for every method. For HNSW-RU, after the 200 iterations described in Section V-C, the number of unreachable points grows to approximately 3% to 4% in the Gist dataset and 2% to 3% in the ImageNet dataset. The MN-RU $\gamma$ and MN-THN-RU methods maintain fewer unreachable points than the other methods, leading us to conclude that they are more effective.

To understand the impact of unreachable points on search performance after updates, we conducted experiments on the Gist and ImageNet datasets. Figures 8 and 9 show that, after several updates, unreachable points can occupy a significant portion of these datasets, making the impact more observable.

Therefore, we investigated the search performance of indices built on the Gist and ImageNet datasets under the full_coverage and random scenarios after updates to assess how unreachable points affect recall. The original Gist query set has only 1,000 queries and the ImageNet query set has 200, making it difficult to cover the entire dataset, especially for small values of $K$ ($K\leq 100$). This limits our ability to observe the impact of unreachable point growth on search accuracy.

To address this limitation, we adopted a more straightforward approach to test the impact of unreachable points on final recall: we used the entire original dataset as the query set and performed nearest neighbor searches with $K=1$. Varying the parameter $ef\_$ to trade off recall against search time, we obtained the results shown in Figure 11. The results indicate that the MN-RU $\gamma$ and MN-THN-RU methods outperform HNSW-RU by achieving better accuracy within the same time frame. This demonstrates that our methods effectively reduce the number of unreachable points and thereby improve search accuracy, especially after many update operations.
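The self-query evaluation can be illustrated with a toy Python sketch. It assumes that every point is stored in the index, so the true nearest neighbor of a query at $K=1$ is the point itself, and it uses a plain greedy walk on one layer as a stand-in for the HNSW search; the names `greedy_search` and `self_recall_at_1` are hypothetical.

```python
import math


def greedy_search(vectors, graph, entry, query):
    """Walk to the neighbor closest to `query` until no neighbor improves."""
    current = entry
    best = math.dist(vectors[current], query)
    improved = True
    while improved:
        improved = False
        for nb in graph[current]:
            d = math.dist(vectors[nb], query)
            if d < best:
                current, best, improved = nb, d, True
    return current


def self_recall_at_1(vectors, graph, entry):
    """Fraction of stored points that are returned when queried with themselves."""
    hits = sum(greedy_search(vectors, graph, entry, vec) == label
               for label, vec in vectors.items())
    return hits / len(vectors)


# Point 3 has an outgoing edge but no incoming edge, so no search can return it.
toy_vectors = {0: (0.0,), 1: (1.0,), 2: (2.0,), 3: (3.0,)}
toy_graph = {0: [1], 1: [0, 2], 2: [1], 3: [2]}
recall = self_recall_at_1(toy_vectors, toy_graph, entry=0)
```

Because the query set covers every point, each unreachable point contributes one guaranteed miss, which is why this setup exposes the effect more clearly than a small external query set.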

To further validate the effectiveness of our backup index construction method and Algorithm 1 illustrated in Figure 4, we experimented on the Gist dataset under the full_coverage scenario. This experiment compared the growth of unreachable points between MN-RU $\gamma$ with the backup index and the original HNSW-RU. We set the parameter $\tau$ to 40,000; since each full_coverage iteration on Gist replaces 10,000 points (1,000,000 points split into 100 parts), the backup index was rebuilt after every four iterations. As shown in Figure 13, MN-RU $\gamma$ with the backup index significantly reduced the number of unreachable points compared to the original HNSW-RU, and therefore outperforms this baseline.
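The relation between $\tau$ and the rebuild interval can be checked with a few lines of Python. The sketch assumes that $\tau$ is a threshold on the number of points replaced since the last rebuild and that the counter resets after each rebuild; the function name `rebuild_iterations` is hypothetical and the sketch does not reproduce Algorithm 1.

```python
def rebuild_iterations(points_per_iteration, tau, iterations):
    """Return the 1-based iterations after which the backup index is rebuilt."""
    rebuilt, pending = [], 0
    for it in range(1, iterations + 1):
        pending += points_per_iteration
        if pending >= tau:        # threshold reached: rebuild and reset the counter
            rebuilt.append(it)
            pending = 0
    return rebuilt


# Gist under full_coverage: 1,000,000 points in 100 parts = 10,000 per iteration.
schedule = rebuild_iterations(1_000_000 // 100, tau=40_000, iterations=12)
```

With 10,000 replaced points per iteration and $\tau=40{,}000$, the schedule is iterations 4, 8, and 12, matching the rebuild after every four iterations stated above.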

VI Conclusion

This paper addresses the limitations of HNSW’s replaced_update under real-time deletions and insertions, which lead to the ‘unreachable points phenomenon’ and reduced efficiency. Our proposed MN-THN-RU and MN-RU $\gamma$ algorithms, combined with the backup index method, mitigate these issues by improving the efficiency of mixed operations and maintaining better graph connectivity. Extensive experiments demonstrate that our methods outperform the native HNSW strategy, significantly reducing the number of unreachable points and improving update speed across various datasets. These improvements make our approach practical for real-world applications requiring dynamic real-time updates and high search accuracy.

References

[1] M. Wang, X. Xu, Q. Yue, and Y. Wang, “A comprehensive survey and experimental comparison of graph-based approximate nearest neighbor search,” Proc. VLDB Endow., vol. 14, no. 11, pp. 1964–1978, Jul. 2021. [Online]. Available: https://doi.org/10.14778/3476249.3476255
[2] R. Chen, B. Liu, H. Zhu, Y. Wang, Q. Li, B. Ma, Q. Hua, J. Jiang, Y. Xu, H. Deng, and B. Zheng, “Approximate nearest neighbor search under neural similarity metric for large-scale recommendation,” in Proceedings of the 31st ACM International Conference on Information & Knowledge Management, ser. CIKM ’22. New York, NY, USA: Association for Computing Machinery, 2022, pp. 3013–3022. [Online]. Available: https://doi.org/10.1145/3511808.3557098
[3] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, S. Riedel, and D. Kiela, “Retrieval-augmented generation for knowledge-intensive nlp tasks,” in Proceedings of the 34th International Conference on Neural Information Processing Systems, ser. NIPS ’20. Red Hook, NY, USA: Curran Associates Inc., 2020.
[4] M. Muja and D. G. Lowe, “Scalable nearest neighbor algorithms for high dimensional data,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, pp. 2227–2240, 2014. [Online]. Available: https://api.semanticscholar.org/CorpusID:206765442
[5] Q.-Y. Jiang and W.-J. Li, “Scalable graph hashing with feature transformation.” in IJCAI, vol. 15, 2015, pp. 2248–2254.
[6] J. Johnson, M. Douze, and H. Jégou, “Billion-scale similarity search with GPUs,” IEEE Transactions on Big Data, vol. 7, no. 3, pp. 535–547, 2019.
[7] C. Fu, C. Xiang, C. Wang, and D. Cai, “Fast approximate nearest neighbor search with the navigating spreading-out graph,” Proc. VLDB Endow., vol. 12, no. 5, pp. 461–474, Jan. 2019. [Online]. Available: https://doi.org/10.14778/3303753.3303754
[8] K. Lu, M. Kudo, C. Xiao, and Y. Ishikawa, “Hvs: hierarchical graph structure based on voronoi diagrams for solving approximate nearest neighbor search,” Proc. VLDB Endow., vol. 15, no. 2, pp. 246–258, Oct. 2021. [Online]. Available: https://doi.org/10.14778/3489496.3489506
[9] S. Gollapudi, N. Karia, V. Sivashankar, R. Krishnaswamy, N. Begwani, S. Raz, Y. Lin, Y. Zhang, N. Mahapatro, P. Srinivasan, A. Singh, and H. V. Simhadri, “Filtered-diskann: Graph algorithms for approximate nearest neighbor search with filters,” in Proceedings of the ACM Web Conference 2023, ser. WWW ’23. New York, NY, USA: Association for Computing Machinery, 2023, pp. 3406–3416. [Online]. Available: https://doi.org/10.1145/3543507.3583552
[10] M. Wang, W. Xu, X. Yi, S. Wu, Z. Peng, X. Ke, Y. Gao, X. Xu, R. Guo, and C. Xie, “Starling: An i/o-efficient disk-resident graph index framework for high-dimensional vector similarity search on data segment,” Proc. ACM Manag. Data, vol. 2, no. 1, Mar. 2024. [Online]. Available: https://doi.org/10.1145/3639269
[11] M. Aumüller, E. Bernhardsson, and A. Faithfull, “Ann-benchmarks: A benchmarking tool for approximate nearest neighbor algorithms,” Information Systems, vol. 87, p. 101374, 2020. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0306437918303685
[12] Y. A. Malkov and D. A. Yashunin, “Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 4, pp. 824–836, 2020.
[13] L. Chen, Y. Gao, X. Li, C. S. Jensen, and G. Chen, “Efficient metric indexing for similarity search,” in 2015 IEEE 31st International Conference on Data Engineering, 2015, pp. 591–602.
[14] H. V. Jagadish, B. C. Ooi, K.-L. Tan, C. Yu, and R. Zhang, “idistance: An adaptive b+-tree based indexing method for nearest neighbor search,” ACM Transactions on Database Systems (TODS), vol. 30, no. 2, pp. 364–397, 2005.
[15] A. Gionis, P. Indyk, R. Motwani et al., “Similarity search in high dimensions via hashing,” in VLDB, vol. 99, no. 6, 1999, pp. 518–529.
[16] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni, “Locality-sensitive hashing scheme based on p-stable distributions,” in Proceedings of the twentieth annual symposium on Computational geometry, 2004, pp. 253–262.
[17] K. Lu, H. Wang, W. Wang, and M. Kudo, “Vhp: approximate nearest neighbor search via virtual hypersphere partitioning,” Proceedings of the VLDB Endowment, vol. 13, no. 9, pp. 1443–1455, 2020.
[18] Y. Lei, Q. Huang, M. Kankanhalli, and A. K. Tung, “Locality-sensitive hashing scheme based on longest circular co-substring,” in Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, 2020, pp. 2589–2599.
[19] B. Zheng, Z. Xi, L. Weng, N. Q. V. Hung, H. Liu, and C. S. Jensen, “Pm-lsh: A fast and accurate lsh framework for high-dimensional approximate nn search,” Proceedings of the VLDB Endowment, vol. 13, no. 5, pp. 643–655, 2020.
[20] K. Lu and M. Kudo, “R2lsh: A nearest neighbor search scheme based on two-dimensional projected spaces,” in 2020 IEEE 36th International Conference on Data Engineering (ICDE). IEEE, 2020, pp. 1045–1056.
[21] A. Babenko and V. Lempitsky, “Additive quantization for extreme vector compression,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 931–938.
[22] ——, “Tree quantization for large-scale similarity search and classification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4240–4248.
[23] W. Li, Y. Zhang, Y. Sun, W. Wang, M. Li, W. Zhang, and X. Lin, “Approximate nearest neighbor search on high dimensional data—experiments, analyses, and improvement,” IEEE Transactions on Knowledge and Data Engineering, vol. 32, no. 8, pp. 1475–1488, 2019.
[24] N. Lee, J. Lee, and C. Park, “Augmentation-free self-supervised learning on graphs,” in Proceedings of the AAAI conference on artificial intelligence, vol. 36, no. 7, 2022, pp. 7372–7380.
[25] R. S. Oyamada, L. C. Shimomura, S. Barbon Jr, and D. S. Kaster, “A meta-learning configuration framework for graph-based similarity search indexes,” Information Systems, vol. 112, p. 102123, 2023.
[26] F. Groh, L. Ruppert, P. Wieschollek, and H. P. Lensch, “Ggnn: Graph-based gpu nearest neighbor search,” IEEE Transactions on Big Data, vol. 9, no. 1, pp. 267–279, 2022.
[27] A. Singh, S. J. Subramanya, R. Krishnaswamy, and H. V. Simhadri, “Freshdiskann: A fast and accurate graph-based ANN index for streaming similarity search,” CoRR, vol. abs/2105.09613, 2021. [Online]. Available: https://arxiv.org/abs/2105.09613
[28] Q. Chen, B. Zhao, H. Wang, M. Li, C. Liu, Z. Li, M. Yang, and J. Wang, “Spann: Highly-efficient billion-scale approximate nearest neighborhood search,” in Advances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, Eds., vol. 34. Curran Associates, Inc., 2021, pp. 5199–5212. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2021/file/299dc35e747eb77177d9cea10a802da2-Paper.pdf
[29] Y. Xu, H. Liang, J. Li, S. Xu, Q. Chen, Q. Zhang, C. Li, Z. Yang, F. Yang, Y. Yang, P. Cheng, and M. Yang, “Spfresh: Incremental in-place update for billion-scale vector search,” in Proceedings of the 29th Symposium on Operating Systems Principles, ser. SOSP ’23. New York, NY, USA: Association for Computing Machinery, 2023, pp. 545–561. [Online]. Available: https://doi.org/10.1145/3600006.3613166
[30] J. Li, X. Yan, J. Zhang, A. Xu, J. Cheng, J. Liu, K. K. Ng, and T.-c. Cheng, “A general and efficient querying method for learning to hash,” in Proceedings of the 2018 International Conference on Management of Data, 2018, pp. 1333–1347.
[31] L. Amsaleg and H. Jegou, “Datasets for approximate nearest neighbor search,” http://corpus-texmex.irisa.fr/, 2010.
[32] C. Fu, C. Wang, and D. Cai, “High dimensional similarity search with satellite system graph: Efficiency, scalability, and unindexed query compatibility,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 8, pp. 4139–4150, 2022.