research-article

AirIndex: Versatile Index Tuning Through Data and Storage

Published: 13 November 2023

Abstract

The end-to-end lookup latency of a hierarchical index (such as a B-tree or a learned index) is determined by its structure: the number of layers, the kinds of branching functions used in each layer, the amount of data that must be fetched from each layer, and so on. Our primary observation is that by optimizing those structural parameters (or designs) specifically for a target system's I/O characteristics (e.g., latency, bandwidth), we can offer faster lookups than designs that are not optimized in this way. Can we develop a systematic method for finding those optimal design parameters? Ideally, the method should be able to generate almost any existing index, or a novel combination of them, for the fastest possible lookup.
In this work, we present a new data- and I/O-aware index builder (called AirIndex) that can find high-speed hierarchical index designs in a principled way. Specifically, AirIndex minimizes an objective function expressing the end-to-end latency in terms of various design choices (the number of layers, the types of layers, and more) for given data and a storage profile. It does so using a graph-based optimization method purpose-built to address the computational challenges arising from the inter-dependencies among index layers and the exponentially many candidate parameters in a large search space. Our empirical studies confirm that AirIndex can find optimal index designs, build optimal indexes in times comparable to existing methods, and deliver up to 4.1x faster lookup than a lightweight B-tree library (LMDB), 3.3x to 46.3x faster lookup than state-of-the-art learned indexes (RMI/CDFShop, PGM-index, ALEX/APEX, PLEX), and 2.0x faster lookup than Data Calculator's suggestion across various dataset and storage settings.
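To make the primary observation concrete, the following is a minimal sketch of a toy latency model, not AirIndex's actual objective function or its graph-based search. It assumes every read costs a fixed latency plus size divided by bandwidth, that the index is a B-tree-like structure with one uniform page size, and that each entry takes 16 bytes. All names (`StorageProfile`, `num_index_layers`, `lookup_latency`, `best_page_size`) and the two storage profiles are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class StorageProfile:
    """Hypothetical storage profile: fixed per-read latency plus bandwidth."""
    latency_s: float       # fixed cost of one read request, in seconds
    bandwidth_bps: float   # sustained throughput, in bytes per second

    def read_time(self, num_bytes: int) -> float:
        # Assumed affine cost model: latency + transfer time.
        return self.latency_s + num_bytes / self.bandwidth_bps


def num_index_layers(data_bytes: int, page_bytes: int, entry_bytes: int = 16) -> int:
    """Count index layers needed until the top layer fits in a single page."""
    fanout = page_bytes // entry_bytes
    if fanout < 2:
        raise ValueError("page must hold at least two entries")
    pages = -(-data_bytes // page_bytes)  # ceiling division: data pages
    layers = 0
    while pages > 1:
        pages = -(-pages // fanout)       # each layer indexes the one below
        layers += 1
    return layers


def lookup_latency(data_bytes: int, page_bytes: int, profile: StorageProfile) -> float:
    """One page read per index layer, plus one read of the data page."""
    reads = num_index_layers(data_bytes, page_bytes) + 1
    return reads * profile.read_time(page_bytes)


def best_page_size(data_bytes: int, profile: StorageProfile,
                   page_sizes: Iterable[int]) -> Tuple[int, float]:
    """Brute-force search over a single design parameter (the page size)."""
    return min(((p, lookup_latency(data_bytes, p, profile)) for p in page_sizes),
               key=lambda pair: pair[1])


# Two made-up profiles: a low-latency local device and a high-latency remote store.
local_ssd = StorageProfile(latency_s=100e-6, bandwidth_bps=2e9)
remote_store = StorageProfile(latency_s=50e-3, bandwidth_bps=100e6)

candidates = [4096, 2**20]
for name, profile in [("local_ssd", local_ssd), ("remote_store", remote_store)]:
    page, latency = best_page_size(2**30, profile, candidates)
    print(f"{name}: best page size = {page} bytes, modeled lookup = {latency:.6f} s")
```

Under these assumed numbers, the low-latency profile prefers small pages with more layers, while the high-latency profile prefers large pages with fewer layers. This sketch varies only one parameter exhaustively; the abstract describes a much larger design space (layer counts, layer types, and more) that calls for the purpose-built optimization method.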




Published In

Proceedings of the ACM on Management of Data, Volume 1, Issue 3
PACMMOD
September 2023
472 pages
EISSN:2836-6573
DOI:10.1145/3632968

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. constrained optimization
  2. external memory
  3. graph search
  4. hierarchical indexes
  5. index complexity
  6. indexing
  7. instance optimization
  8. learned indexes
  9. parallel tuning
  10. storage
  11. tuning

Qualifiers

  • Research-article
