research-article

Survey of vector database management systems

Published: 15 July 2024

Abstract

There are now over 20 commercial vector database management systems (VDBMSs), all produced within the past five years. But embedding-based retrieval has been studied for over ten years, and similarity search for a staggering half century and more. Driving this shift from algorithms to systems are new data-intensive applications, notably large language models, that demand vast stores of unstructured data coupled with reliable, secure, fast, and scalable query processing capability. A variety of new data management techniques now exist for addressing these needs; however, there is no comprehensive survey that thoroughly reviews these techniques and systems. We start by identifying five main obstacles to vector data management, namely the ambiguity of semantic similarity, the large size of vectors, the high cost of similarity comparison, the lack of structural properties that can be used for indexing, and the difficulty of efficiently answering “hybrid” queries that jointly search both attributes and vectors. Overcoming these obstacles has led to new approaches to query processing, storage and indexing, and query optimization and execution. For query processing, a variety of similarity scores and query types are now well understood; for storage and indexing, techniques include vector compression, namely quantization, and partitioning techniques based on randomization, learned partitioning, and “navigable” partitioning; for query optimization and execution, we describe new operators for hybrid queries, as well as techniques for plan enumeration, plan selection, distributed query processing, data manipulation queries, and hardware-accelerated query execution. These techniques lead to a variety of VDBMSs across a spectrum of design and runtime characteristics, including “native” systems that are specialized for vectors and “extended” systems that incorporate vector capabilities into existing systems. We then discuss benchmarks, and finally outline research challenges and point the direction for future work.
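To make the notion of a “hybrid” query concrete, the following is a minimal sketch, not taken from the survey, of a brute-force query that first filters items by an attribute predicate and then ranks the survivors by cosine similarity. The function names (`cosine_similarity`, `hybrid_search`), the item layout, and the sample data are all hypothetical, and the linear scan stands in for the indexes a real VDBMS would use.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def hybrid_search(items, query_vector, predicate, k):
    """Hypothetical pre-filtering hybrid query.

    Applies the attribute predicate first, then returns the ids of the k
    remaining items most similar to query_vector. This is an exhaustive scan.
    """
    candidates = [item for item in items if predicate(item["attrs"])]
    candidates.sort(
        key=lambda item: cosine_similarity(item["vector"], query_vector),
        reverse=True,
    )
    return [item["id"] for item in candidates[:k]]


# Illustrative data only: each item pairs a vector with structured attributes.
ITEMS = [
    {"id": "a", "vector": [1.0, 0.0], "attrs": {"lang": "en"}},
    {"id": "b", "vector": [0.9, 0.1], "attrs": {"lang": "de"}},
    {"id": "c", "vector": [0.0, 1.0], "attrs": {"lang": "en"}},
    {"id": "d", "vector": [0.7, 0.7], "attrs": {"lang": "en"}},
]

filtered = hybrid_search(ITEMS, [1.0, 0.0], lambda attrs: attrs["lang"] == "en", k=2)
unfiltered = hybrid_search(ITEMS, [1.0, 0.0], lambda attrs: True, k=2)
```

In this toy data the attribute filter changes the answer: item `b` is the second-closest vector overall but is excluded by the `lang` predicate. The sketch ignores the question of when to filter before, after, or during the vector search, which is one of the optimization problems the abstract alludes to.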


    Published In

    The VLDB Journal — The International Journal on Very Large Data Bases  Volume 33, Issue 5
    Sep 2024
    533 pages

    Publisher

    Springer-Verlag

    Berlin, Heidelberg

    Publication History

    Published: 15 July 2024
    Accepted: 24 June 2024
    Revision received: 07 June 2024
    Received: 12 October 2023

    Author Tags

    1. Vector data management
    2. Similarity search
    3. k nearest neighbor
    4. Approximate nearest neighbor
    5. Nearest neighbor index
