Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
research-article

One Seed, Two Birds: A Unified Learned Structure for Exact and Approximate Counting

Published: 26 March 2024 Publication History

Abstract

The modern database has many precise and approximate counting requirements. Nevertheless, a solitary multidimensional index or cardinality estimator is insufficient to cater to the escalating demands across all counting scenarios. Such approaches are constrained either by query selectivity or by the compromise between query accuracy and efficiency.
We propose CardIndex, a unified learned structure to solve the above problems. CardIndex serves as a versatile solution that not only functions as a multidimensional learned index for accurate counting but also doubles as an adaptive cardinality estimator, catering to varying counting scenarios with diverse requirements for precision and efficiency. Rigorous experimentation has showcased its superiority. Compared to the state-of-the-art (SOTA) autoregressive data-driven cardinality estimation baselines, our structure achieves training and updating times that are two orders of magnitude faster. Additionally, our CPU-based query estimation latency surpasses GPU-based baselines by two to three times. Notably, the estimation accuracy of low-selectivity queries is up to 314 times better than the current SOTA estimator. In terms of indexing tasks, the construction speed of our structure is two orders of magnitude faster than RSMI and 1.9 times faster than R-tree. Furthermore, it exhibits a point query processing speed that is 3%-17% times faster than RSMI and 1.07 to 2.75 times faster than R-tree and KDB-tree. Range queries under specific loads are 20% times faster than the SOTA indexes.

Supplemental Material

MP4 File
Presentation video
MP4 File
Presentation video

References

[1]
2017. OpenStreetMap data set. https://www.openstreetmap.org.
[2]
2019. Vehicle, snowmobile, and boat registrations. https://catalog.data.gov/dataset/vehicle-snowmobile-and-boat-registrations.
[3]
2021. Individual household electric power consumption data set. http://archive.ics.uci.edu/ml/datasets/Individualhouseholdelectricpowerconsumption.
[4]
2021. PostgreSQL. https://www.postgresql.org/.
[5]
Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider, and Bernhard Seeger. 1990. The R*-tree: An efficient and robust access method for points and rectangles. In Proceedings of the 1990 ACM SIGMOD international conference on Management of data. 322--331.
[6]
HK Dai and Hung-Chi Su. 2003. On the locality properties of space-filling curves. In Algorithms and Computation: 14th International Symposium, ISAAC 2003, Kyoto, Japan, December 15--17, 2003. Proceedings 14. Springer, 385--394.
[7]
Jialin Ding, Vikram Nathan, Mohammad Alizadeh, and Tim Kraska. 2020. Tsunami: A Learned Multi-Dimensional Index for Correlated Data and Skewed Workloads. Proc. VLDB Endow. 14, 2 (oct 2020), 74--86. https://doi.org/10.14778/3425879.3425880
[8]
Alex Galakatos, Michael Markovitch, Carsten Binnig, Rodrigo Fonseca, and Tim Kraska. 2019. Fiting-tree: A data-aware index structure. In Proceedings of the 2019 International Conference on Management of Data. 1189--1206.
[9]
Edward Gan, Jialin Ding, Kai Sheng Tai, Vatsal Sharan, and Peter Bailis. 2018. Moment-Based Quantile Sketches for Efficient High Cardinality Aggregation Queries. Proc. VLDB Endow. 11, 11 (2018), 1647--1660. https://doi.org/10.14778/3236187.3236212
[10]
Jian Gao, Xin Cao, Xin Yao, Gong Zhang, and Wei Wang. 2023. LMSFC: A Novel Multidimensional Index Based on Learned Monotonic Space Filling Curves. Proc. VLDB Endow. 16, 10 (aug 2023), 2605--2617. https://doi.org/10.14778/3603581.3603598
[11]
Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. 2015. Made: Masked autoencoder for distribution estimation. In International conference on machine learning. PMLR, 881--889.
[12]
Benjamin Hilprecht, Andreas Schmidt, Moritz Kulessa, Alejandro Molina, Kristian Kersting, and Carsten Binnig. 2020. DeepDB: Learn from Data, Not from Queries! Proc. VLDB Endow. 13, 7 (mar 2020), 992--1005. https://doi.org/10.14778/3384345.3384349
[13]
Jeffrey Jestes, Ke Yi, and Feifei Li. 2011. Building Wavelet Histograms on Large Data in MapReduce. Proc. VLDB Endow. 5, 2 (2011), 109--120. https://doi.org/10.14778/2078324.2078327
[14]
Tim Kraska, Alex Beutel, Ed H Chi, Jeffrey Dean, and Neoklis Polyzotis. 2018. The case for learned index structures. In Proceedings of the 2018 international conference on management of data. 489--504.
[15]
Viktor Leis, Andrey Gubichev, Atanas Mirchev, Peter Boncz, Alfons Kemper, and Thomas Neumann. 2015. How good are query optimizers, really? Proceedings of the VLDB Endowment (Nov 2015), 204--215. https://doi.org/10.14778/2850583.2850594
[16]
Guoliang Li and Chao Zhang. 2022. HTAP Databases: What is New and What is Next. In Proceedings of the 2022 International Conference on Management of Data (Philadelphia, PA, USA) (SIGMOD '22). Association for Computing Machinery, New York, NY, USA, 2483--2488. https://doi.org/10.1145/3514221.3522565
[17]
Qiyu Liu, Yanyan Shen, and Lei Chen. 2021. LHist: Towards Learning Multi-dimensional Histogram for Massive Spatial Data. In 2021 IEEE 37th International Conference on Data Engineering (ICDE). 1188--1199. https://doi.org/10.1109/ICDE51399.2021.00107
[18]
Moin Hussain Moti, Panagiotis Simatis, and Dimitris Papadias. 2022. Waffle: A Workload-Aware and Query-Sensitive Framework for Disk-Based Spatial Indexing. Proceedings of the VLDB Endowment 16, 4 (2022), 670--683.
[19]
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Z. Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. Neural Information Processing Systems (2019).
[20]
Viswanath Poosala, Peter J. Haas, Yannis Ioannidis, and Eugene J. Shekita. 1996. Improved histograms for selectivity estimation of range predicates. International Conference on Management of Data (1996).
[21]
Jianzhong Qi, Guanli Liu, Christian S Jensen, and Lars Kulik. 2020. Effectively learning spatial indices. Proceedings of the VLDB Endowment 13, 12 (2020), 2341--2354.
[22]
Frank Ramsak, Volker Markl, Robert Fenk, Martin Zirkel, Klaus Elhardt, and Rudolf Bayer. 2000. Integrating the UB-tree into a database system kernel. In VLDB, Vol. 2000. 263--272.
[23]
John T Robinson. 1981. The KDB-tree: a search structure for large multidimensional dynamic indexes. In Proceedings of the 1981 ACM SIGMOD international conference on Management of data. 10--18.
[24]
Dennis G. Severance and Guy M. Lohman. 1976. Differential files: their application to the maintenance of large databases. ACM Trans. Database Syst. 1 (1976), 256--267.
[25]
Ji Sun, Jintao Zhang, Zhaoyan Sun, Guoliang Li, and Nan Tang. 2021. Learned Cardinality Estimation: A Design Space Exploration and a Comparative Evaluation. 15, 1 (sep 2021), 85--97. https://doi.org/10.14778/3485450.3485459
[26]
Herbert Tropf and Helmut Herzog. 1981. Multidimensional Range Search in Dynamically Balanced Trees. ANGE-WANDTE INFO. 2 (1981), 71--77.
[27]
Haixin Wang, Xiaoyi Fu, Jianliang Xu, and Hua Lu. 2019. Learned Index for Spatial Queries. In 2019 20th IEEE International Conference on Mobile Data Management (MDM). 569--574. https://doi.org/10.1109/MDM.2019.00121
[28]
Xiaoying Wang, Changbo Qu, Weiyuan Wu, Jiannan Wang, and Qingqing Zhou. 2021. Are We Ready for Learned Cardinality Estimation? Proc. VLDB Endow. 14, 9 (may 2021), 1640--1654. https://doi.org/10.14778/3461535.3461552
[29]
Jiacheng Wu, Yong Zhang, Shimin Chen, Jin Wang, Yu Chen, and Chunxiao Xing. 2021. Updatable Learned Index with Precise Positions. Proc. VLDB Endow. 14, 8 (apr 2021), 1276--1288. https://doi.org/10.14778/3457390.3457393
[30]
Zongheng Yang, Amog Kamsetty, Sifei Luan, Eric Liang, Yan Duan, Xi Chen, and Ion Stoica. 2020. NeuroCard: One Cardinality Estimator for All Tables. Proc. VLDB Endow. 14, 1 (sep 2020), 61--73. https://doi.org/10.14778/3421424.3421432
[31]
Zongheng Yang, Eric Liang, Amog Kamsetty, Chenggang Wu, Yan Duan, Xi Chen, Pieter Abbeel, Joseph M. Hellerstein, Sanjay Krishnan, and Ion Stoica. 2019. Deep Unsupervised Cardinality Estimation. Proc. VLDB Endow. 13, 3 (Nov 2019), 279--292. https://doi.org/10.14778/3368289.3368294
[32]
Zhuoyue Zhao, Robert Christensen, Feifei Li, Xiao Hu, and Ke Yi. 2018. Random Sampling over Joins Revisited. In Proceedings of the 2018 International Conference on Management of Data (Houston, TX, USA) (SIGMOD '18). Association for Computing Machinery, New York, NY, USA, 1525--1539. https://doi.org/10.1145/3183713.3183739

Recommendations

Comments

Information & Contributors

Information

Published In

cover image Proceedings of the ACM on Management of Data
Proceedings of the ACM on Management of Data  Volume 2, Issue 1
SIGMOD
February 2024
1874 pages
EISSN:2836-6573
DOI:10.1145/3654807
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 26 March 2024
Published in PACMMOD Volume 2, Issue 1

Permissions

Request permissions for this article.

Author Tags

  1. cardinality estimation
  2. learned index

Qualifiers

  • Research-article

Funding Sources

  • NSFC

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • 0
    Total Citations
  • 190
    Total Downloads
  • Downloads (Last 12 months)190
  • Downloads (Last 6 weeks)26
Reflects downloads up to 12 Sep 2024

Other Metrics

Citations

View Options

Get Access

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media