Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
research-article
Open access

Determining Exact Quantiles with Randomized Summaries

Published: 26 March 2024 Publication History
  • Get Citation Alerts
  • Abstract

    Quantiles are fundamental statistics in various data science tasks, but costly to compute, e.g., by loading the entire data in memory for ranking. With limited memory space, prevalent in end devices or databases with heavy loads, it needs to scan the data in multiple passes. The idea is to gradually shrink the range of the queried quantile till it is small enough to fit in memory for ranking the result. Existing methods use deterministic sketches to determine the exact range of quantile, known as deterministic filter, which could be inefficient in range shrinking. In this study, we propose to shrink the ranges more aggressively, using randomized summaries such as KLL sketch. That is, with a high probability the quantile lies in a smaller range, namely probabilistic filter, determined by the randomized sketch. Specifically, we estimate the expected passes for determining the exact quantiles with probabilistic filters, and select a proper probability that can minimize the expected passes. Analyses show that our exact quantile determination method can terminate in P passes with 1-δ confidence, storing O(N 1/P logP-1/2P (1/δ)) items, close to the lower bound Ømega(N1/P) for a fixed δ. The approach has been deployed as a function in an LSM-tree based time-series database Apache IoTDB. Remarkably, the randomized sketches can be pre-computed for the immutable SSTables in LSM-tree. Moreover, multiple quantile queries could share the data passes for probabilistic filters in range estimation. Extensive experiments on real and synthetic datasets demonstrate the superiority of our proposal compared to the existing methods with deterministic filters. On average, our method takes 0.48 fewer passes and 18% of the time compared with the state-of-the-art deterministic sketch (GK sketch).

    References

    [1]
    Ildar Absalyamov, Michael J. Carey, and Vassilis J. Tsotras. 2018. Lightweight Cardinality Estimation in LSM-based Systems. In Proceedings of the 2018 International Conference on Management of Data, SIGMOD Conference 2018, Houston, TX, USA, June 10--15, 2018, Gautam Das, Christopher M. Jermaine, and Philip A. Bernstein (Eds.). ACM, 841--855. https://doi.org/10.1145/3183713.3183761
    [2]
    Pankaj K. Agarwal, Graham Cormode, Zengfeng Huang, Jeff M. Phillips, Zhewei Wei, and Ke Yi. 2013. Mergeable summaries. ACM Trans. Database Syst. 38, 4 (2013), 26. https://doi.org/10.1145/2500128
    [3]
    Zhiwei Chen, Shaoxu Song, Ziheng Wei, Jingyun Fang, and Jiang Long. 2021. Approximating Median Absolute Deviation with Bounded Error. Proc. VLDB Endow. 14, 11 (2021), 2114--2126. https://doi.org/10.14778/3476249.3476266
    [4]
    Experiment Code and Data. 2023. https://github.com/thssdb/exact-quantile.
    [5]
    Graham Cormode, Zohar S. Karnin, Edo Liberty, Justin Thaler, and Pavel Veselý. 2021. Relative Error Streaming Quantiles. In PODS'21: Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Virtual Event, China, June 20--25, 2021, Leonid Libkin, Reinhard Pichler, and Paolo Guagliardo (Eds.). ACM, 96--108. https://doi.org/10.1145/3452021.3458323
    [6]
    Graham Cormode, Abhinav Mishra, Joseph Ross, and Pavel Veselý. 2021. Theory meets Practice at the Median: A Worst Case Comparison of Relative Error Quantile Algorithms. In KDD '21: The 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, Singapore, August 14--18, 2021, Feida Zhu, Beng Chin Ooi, and Chunyan Miao (Eds.). ACM, 2722--2731. https://doi.org/10.1145/3447548.3467152
    [7]
    Graham Cormode and Pavel Veselý. 2020. A Tight Lower Bound for Comparison-Based Quantile Summaries. In Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2020, Portland, OR, USA, June 14--19, 2020, Dan Suciu, Yufei Tao, and Zhewei Wei (Eds.). ACM, 81--93. https://doi.org/10.1145/3375395. 3387650
    [8]
    Bitcoin Dataset. 2018. https://www.kaggle.com/shiheyingzhe/bitcoin-transaction-data-from-2009-to-2018.
    [9]
    Binance Dataset. 2022. https://www.kaggle.com/datasets/jorijnsmit/binance-full-history.
    [10]
    Electric Dataset. 2022. https://www.kaggle.com/datasets/jeanmidev/smart-meters-in-london.
    [11]
    IBM Dataset. 2023. https://www.kaggle.com/datasets/ealtman2019/ibm-transactions-for-anti-money-laundering-aml.
    [12]
    Thruster Dataset. 2022. https://www.kaggle.com/datasets/patrickfleith/spacecraft-thruster-firing-tests-dataset.
    [13]
    Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data processing on large clusters. Commun. ACM 51, 1 (2008), 107--113. https://doi.org/10.1145/1327452.1327492
    [14]
    Ted Dunning. 2021. The t-digest: Efficient estimates of distributions. Softw. Impacts 7 (2021), 100049. https://doi.org/ 10.1016/j.simpa.2020.100049
    [15]
    Ted Dunning and Otmar Ertl. 2019. Computing Extremely Accurate Quantiles Using t-Digests. CoRR abs/1902.04023 (2019). arXiv:1902.04023 http://arxiv.org/abs/1902.04023
    [16]
    Michael Greenwald and Sanjeev Khanna. 2001. Space-Efficient Online Computation of Quantile Summaries. In Proceedings of the 2001 ACM SIGMOD international conference on Management of data, Santa Barbara, CA, USA, May 21--24, 2001, Sharad Mehrotra and Timos K. Sellis (Eds.). ACM, 58--66. https://doi.org/10.1145/375663.375670
    [17]
    Joseph M. Hellerstein, Christopher Ré, Florian Schoppmann, Daisy Zhe Wang, Eugene Fratkin, Aleksander Gorajek, Kee Siong Ng, Caleb Welton, Xixuan Feng, Kun Li, and Arun Kumar. 2012. The MADlib Analytics Library or MAD Skills, the SQL. Proc. VLDB Endow. 5, 12 (2012), 1700--1711. https://doi.org/10.14778/2367502.2367510
    [18]
    C. A. R. Hoare. 1962. Quicksort. Comput. J. 5, 1 (1962), 10--15. https://doi.org/10.1093/comjnl/5.1.10
    [19]
    Configuration in Apache IoTDB. 2023. https://iotdb.apache.org/UserGuide/Master/Reference/Common-Config-Manual. html.
    [20]
    Document in Apache IoTDB. 2023. https://iotdb.apache.org/UserGuide/Master/Operators-Functions/Data-Profiling. html#quantile.
    [21]
    Implementation in Apache IoTDB. 2023. https://github.com/apache/iotdb/tree/research/exact-quantile.
    [22]
    Yannis E. Ioannidis. 2003. The History of Histograms (abridged). In Proceedings of 29th International Conference on Very Large Data Bases, VLDB 2003, Berlin, Germany, September 9--12, 2003, Johann Christoph Freytag, Peter C. Lockemann, Serge Abiteboul, Michael J. Carey, Patricia G. Selinger, and Andreas Heuer (Eds.). Morgan Kaufmann, 19--30. https://doi.org/10.1016/B978-012722442--8/50011--2
    [23]
    Apache IoTDB. 2023. https://iotdb.apache.org/.
    [24]
    Nikita Ivkin, Edo Liberty, Kevin J. Lang, Zohar S. Karnin, and Vladimir Braverman. 2019. Streaming Quantiles Algorithms with Small Space and Update Time. CoRR abs/1907.00236 (2019). arXiv:1907.00236 http://arxiv.org/abs/ 1907.00236
    [25]
    Yuyuan Kang, Xiangdong Huang, Shaoxu Song, Lingzhe Zhang, Jialin Qiao, Chen Wang, Jianmin Wang, and Julian Feinauer. 2022. Separation or Not: On Handing Out-of-Order Time-Series Data in Leveled LSM-Tree. In 38th IEEE International Conference on Data Engineering, ICDE 2022, Kuala Lumpur, Malaysia, May 9--12, 2022. IEEE, 3340--3352. https://doi.org/10.1109/ICDE53745.2022.00315
    [26]
    Zohar S. Karnin, Kevin J. Lang, and Edo Liberty. 2016. Optimal Quantile Approximation in Streams. In IEEE 57th Annual Symposium on Foundations of Computer Science, FOCS 2016, 9--11 October 2016, Hyatt Regency, New Brunswick, New Jersey, USA, Irit Dinur (Ed.). IEEE Computer Society, 71--78. https://doi.org/10.1109/FOCS.2016.17
    [27]
    J. Kiefer. 1953. Sequential minimax search for a maximum. Proc. Amer. Math. Soc. 4 (01 1953), 502--505.
    [28]
    YongChul Kwon, Magdalena Balazinska, Bill Howe, and Jerome A. Rolia. 2011. A Study of Skew in MapReduce Applications. https://api.semanticscholar.org/CorpusID:134490
    [29]
    Yuanzhen Li, Shengjie Zheng, Zi-Xin Tan, Tuo Cao, Fei Luo, and Chunxia Xiao. 2023. Self-Supervised Monocular Depth Estimation by Digging into Uncertainty Quantification. J. Comput. Sci. Technol. 38, 3 (2023), 510--525. https: //doi.org/10.1007/S11390-023--3088-Y
    [30]
    Ji Lin, Wei-Ming Chen, Yujun Lin, John Cohn, Chuang Gan, and Song Han. 2020. MCUNet: Tiny Deep Learning on IoT Devices. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6--12, 2020, virtual, Hugo Larochelle, Marc'Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (Eds.). https://proceedings.neurips.cc/paper/2020/hash/ 86c51678350f656dcc7f490a43946ee5-Abstract.html
    [31]
    Chen Luo and Michael J. Carey. 2020. LSM-based storage techniques: a survey. VLDB J. 29, 1 (2020), 393--418. https://doi.org/10.1007/s00778-019-00555-y
    [32]
    Gurmeet Singh Manku, Sridhar Rajagopalan, and Bruce G. Lindsay. 1999. Random Sampling Techniques for Space Efficient Online Computation of Order Statistics of Large Datasets. In SIGMOD 1999, Proceedings ACM SIGMOD International Conference on Management of Data, June 1--3, 1999, Philadelphia, Pennsylvania, USA, Alex Delis, Christos Faloutsos, and Shahram Ghandeharizadeh (Eds.). ACM Press, 251--262. https://doi.org/10.1145/304182.304204
    [33]
    Charles Masson, Jee E. Rim, and Homin K. Lee. 2019. DDSketch: A Fast and Fully-Mergeable Quantile Sketch with Relative-Error Guarantees. Proc. VLDB Endow. 12, 12 (2019), 2195--2205. https://doi.org/10.14778/3352063.3352135
    [34]
    J. Ian Munro and Mike Paterson. 1980. Selection and Sorting with Limited Storage. Theor. Comput. Sci. 12 (1980), 315--323. https://doi.org/10.1016/0304--3975(80)90061--4
    [35]
    Karl-Dietrich Neubert. 1998. The Flashsort1 Algorithm. Dr. Dobb's Journal (02 1998), 123--125.
    [36]
    Patrick E. O'Neil, Edward Cheng, Dieter Gawlick, and Elizabeth J. O'Neil. 1996. The Log-Structured Merge-Tree (LSM-Tree). Acta Informatica 33, 4 (1996), 351--385. https://doi.org/10.1007/s002360050048
    [37]
    Helmut Prodinger. 1995. Multiple Quickselect - Hoare's Find Algorithm for Several Elements. Inf. Process. Lett. 56, 3 (1995), 123--129. https://doi.org/10.1016/0020-0190(95)00150-B
    [38]
    InfluxDB Quantile. 2023. https://github.com/influxdata/flux/blob/8ad36639ede4826242455389fff4810adfc4e884/stdlib/ universe/quantile.go#L404.
    [39]
    PostgreSQL Quantile. 2023. https://www.postgresql.org/docs/current/functions-aggregate.html.
    [40]
    Supplementary. 2023. https://github.com/thssdb/exact-quantile/blob/main/Appendix.pdf.
    [41]
    Chen Wang, Jialin Qiao, Xiangdong Huang, Shaoxu Song, Haonan Hou, Tian Jiang, Lei Rui, Jianmin Wang, and Jiaguang Sun. 2023. Apache IoTDB: A Time Series Database for IoT Applications. Proc. ACM Manag. Data 1, 2 (2023), 195:1--195:27. https://doi.org/10.1145/3589775
    [42]
    Wolfgang Weiss, Víctor Juan Expósito Jiménez, and Herwig Zeiner. 2020. Dynamic Buffer Sizing for Out-of-order Event Compensation for Time-sensitive Applications. ACM Trans. Sens. Networks 17, 1 (2020), 1:1--1:23. https://doi.org/10.1145/3410403
    [43]
    Haitao Yuan and Guoliang Li. 2021. A Survey of Traffic Prediction: from Spatio-Temporal Data to Intelligent Transportation. Data Sci. Eng. 6, 1 (2021), 63--85. https://doi.org/10.1007/s41019-020-00151-z
    [44]
    Kangfei Zhao, Zongyan He, Jeffrey Xu Yu, and Yu Rong. 2023. Learning with Small Data: Subgraph Counting Queries. Data Sci. Eng. 8, 3 (2023), 292--305. https://doi.org/10.1007/S41019-023-00223-W

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image Proceedings of the ACM on Management of Data
    Proceedings of the ACM on Management of Data  Volume 2, Issue 1
    SIGMOD
    February 2024
    1874 pages
    EISSN:2836-6573
    DOI:10.1145/3654807
    Issue’s Table of Contents
    This work is licensed under a Creative Commons Attribution International 4.0 License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 26 March 2024
    Published in PACMMOD Volume 2, Issue 1

    Author Tags

    1. data stream
    2. quantile
    3. sketches

    Qualifiers

    • Research-article

    Funding Sources

    • the National Natural Science Foundation of China
    • Beijing Key Laboratory of Industrial Big Data System and Application
    • the National Key Research and Development Plan

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • 0
      Total Citations
    • 162
      Total Downloads
    • Downloads (Last 12 months)162
    • Downloads (Last 6 weeks)29
    Reflects downloads up to 09 Aug 2024

    Other Metrics

    Citations

    View Options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Get Access

    Login options

    Full Access

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media