
Hyper: A High-Performance and Memory-Efficient Learned Index via Hybrid Construction

Published: 30 May 2024

Abstract

Learned indexes use machine learning techniques to improve index construction. However, they often face a fundamental trade-off between performance and memory consumption, especially in dynamic environments with frequent insert and delete operations. This trade-off stems from the construction approaches used in learned indexes: the top-down approach increases performance at the cost of significant memory overhead, while the bottom-up approach focuses on memory efficiency but introduces performance issues due to prediction errors. A unified solution that simultaneously optimizes performance and memory consumption in dynamic data management scenarios is therefore highly desirable.
We propose Hyper, a highly efficient learned index with a novel two-phase hybrid construction approach. Our approach combines bottom-up construction for leaf nodes with top-down construction for inner nodes to achieve an optimal balance between performance and memory consumption. Hyper effectively handles concurrent writes and structure adjustments without sacrificing query performance. We evaluated Hyper on both simple and complex real-world datasets and compared it to seven state-of-the-art learned indexes and several traditional data structures for dynamic workloads. In the single-thread evaluation, Hyper improves performance by up to 3.75× while reducing index memory consumption by up to 1610×. In high-concurrency scenarios, Hyper achieves improvements of up to 5.73×, 3.72×, and 3.99× in read-only, read-write, and write-only workloads, respectively.
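
The abstract does not spell out Hyper's algorithms, so the following is only a minimal sketch of the general two-phase idea, not the authors' implementation. It assumes sorted, distinct integer keys. The leaf phase uses a generic greedy error-bounded linear segmentation (bottom-up), and the inner phase uses a single root that splits the key range into equal-width slots (top-down). The names `build_leaves`, `build_root`, `lookup`, `eps`, and `fanout` are hypothetical and chosen for illustration.

```python
from bisect import bisect_left, bisect_right


def build_leaves(keys, eps):
    """Phase 1 (bottom-up sketch): greedily cut sorted, distinct keys into
    segments, each fitted by a line whose position error is at most eps."""
    leaves = []  # each leaf: (first_key, start_index, slope, end_index)
    start, lo, hi = 0, 0.0, float("inf")
    for i in range(1, len(keys) + 1):
        if i < len(keys):
            dx = keys[i] - keys[start]
            dy = i - start
            # Slopes that keep point i within eps of its true position.
            new_lo = max(lo, (dy - eps) / dx)
            new_hi = min(hi, (dy + eps) / dx)
            if new_lo <= new_hi:
                lo, hi = new_lo, new_hi
                continue
        # Point i does not fit (or the input ended): close segment [start, i).
        slope = (lo + hi) / 2 if hi != float("inf") else 0.0
        leaves.append((keys[start], start, slope, i))
        start, lo, hi = i, 0.0, float("inf")
    return leaves


def build_root(leaves, fanout):
    """Phase 2 (top-down sketch): split the key range into equal-width slots;
    each slot stores the leaf that covers the slot's lower bound."""
    first_keys = [leaf[0] for leaf in leaves]
    kmin, kmax = first_keys[0], first_keys[-1]
    width = (kmax - kmin) // fanout + 1  # integer width, assumes integer keys
    slots = [bisect_right(first_keys, kmin + j * width) - 1
             for j in range(fanout)]
    return kmin, width, slots


def lookup(keys, leaves, root, key, eps):
    """Return the position of key in keys, or None if it is absent."""
    kmin, width, slots = root
    j = min(max((key - kmin) // width, 0), len(slots) - 1)
    idx = slots[j]
    # The slot points at or before the right leaf; walk forward if needed.
    while idx + 1 < len(leaves) and leaves[idx + 1][0] <= key:
        idx += 1
    first_key, start, slope, end = leaves[idx]
    if key < first_key:
        return None
    # Predict a position, then search only a small window around it.
    pred = start + int(slope * (key - first_key))
    lo = max(start, pred - eps - 1)
    hi = min(end, pred + eps + 2)
    pos = bisect_left(keys, key, lo, hi)
    return pos if pos < hi and keys[pos] == key else None
```

In this sketch the leaf count depends only on how well straight lines fit the data within `eps`, which is where the memory saving of a bottom-up build would come from, while the root needs one arithmetic step to reach a slot, which is the lookup-speed argument for a top-down build. Concurrency, inserts, deletes, and structure adjustments, all of which the abstract attributes to Hyper, are deliberately left out.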


    Published In

    Proceedings of the ACM on Management of Data, Volume 2, Issue 3
    SIGMOD
    June 2024
    1953 pages
    EISSN: 2836-6573
    DOI: 10.1145/3670010
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 30 May 2024
    Published in PACMMOD Volume 2, Issue 3


    Author Tags

    1. distribution aware
    2. hybrid construction
    3. learned index
    4. memory efficient

    Qualifiers

    • Research-article

    Funding Sources

    • National Key R&D Program of China
