research-article

One Seed, Two Birds: A Unified Learned Structure for Exact and Approximate Counting

Authors:

Xianglong Liu

Proceedings of the ACM on Management of Data, Volume 2, Issue 1

Article No.: 15, Pages 1 - 26

https://doi.org/10.1145/3639270

Published: 26 March 2024

Abstract

Modern databases must serve many exact and approximate counting workloads. However, a single multidimensional index or cardinality estimator cannot meet the growing demands of all counting scenarios: such approaches are constrained either by query selectivity or by the trade-off between query accuracy and efficiency.

We propose CardIndex, a unified learned structure that addresses these problems. CardIndex functions both as a multidimensional learned index for exact counting and as an adaptive cardinality estimator, serving counting scenarios with differing requirements for precision and efficiency. Experiments demonstrate its advantages. Compared to the state-of-the-art (SOTA) autoregressive data-driven cardinality estimation baselines, our structure trains and updates two orders of magnitude faster. Additionally, our CPU-based query estimation is two to three times faster than the GPU-based baselines. Notably, the estimation accuracy on low-selectivity queries is up to 314 times better than that of the current SOTA estimator. For indexing tasks, our structure is constructed two orders of magnitude faster than RSMI and 1.9 times faster than R-tree. Furthermore, it processes point queries 3% to 17% faster than RSMI and 1.07 to 2.75 times faster than R-tree and KDB-tree. Range queries under specific loads are 20% faster than with the SOTA indexes.

Index Terms

1. Information systems
  1. Data management systems
    1. Data structures
      1. Data access methods
        Multidimensional range search
    2. Database management system engines
      1. Database query processing
        Query optimization

Published In

Proceedings of the ACM on Management of Data Volume 2, Issue 1

SIGMOD

February 2024

1874 pages

EISSN:2836-6573

DOI:10.1145/3654807

Editor:
Divyakant Agrawal
UC Santa Barbara, United States

Copyright © 2024 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

NSFC
