GTS: GPU-based Tree Index for Fast Similarity Search

Published: 30 May 2024

Abstract

Similarity search, the task of identifying the objects most similar to a given query object under a specific metric, has garnered significant attention due to its practical applications. However, the absence of coordinate information to accelerate similarity search and the high computational cost of measuring object similarity hinder the efficiency of existing CPU-based methods. These methods also struggle to meet the demand for high-throughput data management. To address these challenges, we propose GTS, a GPU-based tree index designed for the parallel processing of similarity search in general metric spaces, where only the distance metric for measuring object similarity is known. The GTS index uses a pivot-based tree structure to prune objects efficiently and employs list tables to facilitate GPU computing. To manage concurrent similarity queries efficiently with limited GPU memory, we develop a two-stage search method that combines batch processing and sequential strategies to optimize memory usage. The paper also introduces an effective update strategy for the proposed GPU-based index, covering both streaming data updates and batch data updates. In addition, we present a cost model to evaluate search performance. Extensive experiments on five real-life datasets demonstrate that GTS achieves efficiency gains of up to two orders of magnitude over existing CPU baselines and up to 20x efficiency improvements compared to state-of-the-art GPU-based methods.
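
To make the idea of pivot-based pruning in a general metric space concrete, the following is a minimal sketch and not the GTS implementation. It assumes a single pivot and a flat table of precomputed distances instead of a tree, runs on the CPU in plain Python, and uses hypothetical names (`build_pivot_table`, `range_query`). It relies only on the triangle inequality: for a pivot p, |d(q, p) - d(o, p)| is a lower bound on d(q, o), so any object whose lower bound exceeds the search radius can be discarded without computing its distance to the query.

```python
from typing import Callable, List, Sequence, Tuple

Point = Sequence[float]
Metric = Callable[[Point, Point], float]


def euclidean(a: Point, b: Point) -> float:
    """Example metric; any function satisfying the metric axioms would do."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def build_pivot_table(objects: List[Point], pivot: Point, dist: Metric) -> List[float]:
    """Precompute d(o, pivot) for every object (done once, at index time)."""
    return [dist(o, pivot) for o in objects]


def range_query(
    objects: List[Point],
    pivot: Point,
    table: List[float],
    query: Point,
    radius: float,
    dist: Metric,
) -> Tuple[List[int], int]:
    """Return (indices of objects within radius, number of distance computations)."""
    d_qp = dist(query, pivot)
    computed = 1  # the query-to-pivot distance
    hits: List[int] = []
    for i, d_op in enumerate(table):
        # Triangle inequality: |d(q, p) - d(o, p)| <= d(q, o).
        # If this lower bound exceeds the radius, o cannot be a result.
        if abs(d_qp - d_op) > radius:
            continue
        computed += 1
        if dist(query, objects[i]) <= radius:
            hits.append(i)
    return hits, computed


# Hypothetical toy data, chosen only for illustration.
objects = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (10.0, 0.0)]
pivot = (0.0, 0.0)
table = build_pivot_table(objects, pivot, euclidean)
hits, computed = range_query(objects, pivot, table, (0.5, 0.0), 1.0, euclidean)
print(hits, computed)
```

In this toy run, the two far objects are pruned by the lower bound alone, so only three distances are computed instead of four. A tree index applies the same test per node to discard whole subtrees at once; how GTS lays this out for the GPU is described in the paper itself and is not reproduced here.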

    Published In

    Proceedings of the ACM on Management of Data  Volume 2, Issue 3
    SIGMOD
    June 2024
    1953 pages
    EISSN:2836-6573
    DOI:10.1145/3670010

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 30 May 2024
    Published in PACMMOD Volume 2, Issue 3

    Author Tags

    1. GPU-based index
    2. concurrent similarity search
    3. metric space

    Qualifiers

    • Research-article
