research-article

Starling: An I/O-Efficient Disk-Resident Graph Index Framework for High-Dimensional Vector Similarity Search on Data Segment

Authors:

Zhangyang Peng,

Rentong Guo, and

Charles Xie

Proceedings of the ACM on Management of Data, Volume 2, Issue 1

Article No.: 14, Pages 1 - 27

https://doi.org/10.1145/3639269

Published: 26 March 2024

Abstract

High-dimensional vector similarity search (HVSS) is gaining prominence as a powerful tool for various data science and AI applications. As vector data scales up, in-memory indexes pose a significant challenge due to the substantial increase in main memory requirements. A potential solution involves leveraging disk-based implementation, which stores and searches vector data on high-performance devices like NVMe SSDs. However, implementing HVSS for data segments proves to be intricate in vector databases where a single machine comprises multiple segments for system scalability. In this context, each segment operates with limited memory and disk space, necessitating a delicate balance between accuracy, efficiency, and space cost. Existing disk-based methods fall short as they do not holistically address all these requirements simultaneously. In this paper, we present Starling, an I/O-efficient disk-resident graph index framework that optimizes data layout and search strategy within the segment. It has two primary components: (1) a data layout incorporating an in-memory navigation graph and a reordered disk-based graph with enhanced locality, reducing the search path length and minimizing disk bandwidth wastage; and (2) a block search strategy designed to minimize costly disk I/O operations during vector query execution. Through extensive experiments, we validate the effectiveness, efficiency, and scalability of Starling. On a data segment with 2GB memory and 10GB disk capacity, Starling can accommodate up to 33 million vectors in 128 dimensions, offering HVSS with over 0.9 average precision and top-10 recall rate, and latency under 1 millisecond. The results showcase Starling's superior performance, exhibiting 43.9x higher throughput with 98% lower query latency compared to state-of-the-art methods while maintaining the same level of accuracy.
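The capacity figures in the abstract can be made concrete with a back-of-envelope calculation. The sketch below estimates the disk footprint of a graph index stored in fixed-size blocks, where each vertex record holds a vector plus its neighbor list. The abstract does not state the block size, vector component width, or maximum out-degree, so the values used here (4 KB blocks, 1-byte components, 31 neighbors, 4-byte IDs) and all function names are assumptions chosen for illustration only, not Starling's actual layout.

```python
# Back-of-envelope sizing for a disk-resident graph index stored in fixed-size blocks.
# Every parameter value below is an illustrative assumption, not a figure from the paper.

def record_bytes(dim: int, bytes_per_component: int, max_degree: int, id_bytes: int = 4) -> int:
    """Bytes for one vertex record: the vector, a neighbor count, and the neighbor IDs."""
    return dim * bytes_per_component + id_bytes + max_degree * id_bytes


def vertices_per_block(block_bytes: int, rec_bytes: int) -> int:
    """Whole vertex records that fit in one disk block (records never span blocks)."""
    return block_bytes // rec_bytes


def segment_disk_bytes(num_vectors: int, block_bytes: int, rec_bytes: int) -> int:
    """Total bytes on disk when records are packed block by block."""
    per_block = vertices_per_block(block_bytes, rec_bytes)
    if per_block == 0:
        raise ValueError("a single record does not fit in one block")
    num_blocks = -(-num_vectors // per_block)  # ceiling division
    return num_blocks * block_bytes


# Assumed configuration: 128-d vectors, 1 byte per component, degree 31, 4 KB blocks.
rec = record_bytes(dim=128, bytes_per_component=1, max_degree=31)
total = segment_disk_bytes(num_vectors=33_000_000, block_bytes=4096, rec_bytes=rec)
print(rec, vertices_per_block(4096, rec), total / 1e9)
```

Under these assumed values each record takes 256 bytes, 16 records fit in a block, and 33 million vectors occupy about 8.4 GB, which is within a 10GB disk budget. Changing the component width or degree shifts the result substantially, so the numbers should be read as a plausibility check rather than a description of the authors' configuration.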

Index Terms

1. Information systems
  1. Data management systems
    1. Data structures
      1. Data layout
        Record and block layout
    2. Database management system engines
      1. Database query processing
        Query optimization
  2. Information retrieval
    1. Evaluation of retrieval results
      1. Retrieval efficiency


Published In

Proceedings of the ACM on Management of Data Volume 2, Issue 1

SIGMOD

February 2024

1874 pages

EISSN:2836-6573

DOI:10.1145/3654807

Editor:
Divyakant Agrawal
UC Santa Barbara, United States

Copyright © 2024 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

