LAQy: Efficient and Reusable Query Approximations via Lazy Sampling

Authors:

Periklis Chrysogelos,

Anastasia Ailamaki

Proceedings of the ACM on Management of Data, Volume 1, Issue 2

Article No.: 174, Pages 1 - 26

https://doi.org/10.1145/3589319

Published: 20 June 2023

Abstract

Modern analytical engines rely on Approximate Query Processing (AQP) to provide faster response times than the hardware allows for exact query answering. However, existing AQP methods impose steep performance penalties as workload unpredictability increases. Specifically, offline AQP relies on predictable workloads to create samples that match the queries a priori, before query execution, reducing query response times when queries match the expected workload. As soon as workload predictability diminishes, existing online AQP methods create query-specific samples with little reuse across queries, producing significantly smaller gains in response times. As a result, existing approaches cannot fully exploit the benefits of sampling under increased unpredictability.

We analyze sample creation and propose LAQy, a framework for building, expanding, and merging samples to adapt to changes in workload predicates. We identify the main parameters that affect sample creation time and propose lazy sampling to overcome the unpredictability issues that cause fast-but-specialized samples to be query-specific. We evaluate LAQy by implementing it in an in-memory, code-generation-based, scale-up analytical engine to show the adaptivity and practicality of our framework in a modern system. LAQy speeds up online sampling as a function of sample reuse, with sampling time ranging from practically zero to the full online sampling time, and achieves speedups from 2.5x to 19.3x in a simulated exploratory workload.
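To make the notion of merging samples concrete, the following is a minimal sketch under stated assumptions. It is not LAQy's algorithm, which the abstract does not specify. It uses textbook reservoir sampling (Algorithm R) and a standard merge of two uniform samples drawn from disjoint partitions. The function names `reservoir_sample` and `merge_samples` are hypothetical and chosen only for illustration.

```python
import random


def reservoir_sample(stream, k, rng):
    """Algorithm R: uniform sample of at most k items from a stream.

    Returns (sample, n_seen). Hypothetical helper, not part of LAQy.
    """
    sample = []
    n_seen = 0
    for item in stream:
        n_seen += 1
        if len(sample) < k:
            sample.append(item)
        else:
            # Keep the new item with probability k / n_seen.
            j = rng.randrange(n_seen)
            if j < k:
                sample[j] = item
    return sample, n_seen


def merge_samples(a, n_a, b, n_b, k, rng):
    """Merge two uniform samples taken from disjoint partitions.

    a and b are assumed to be uniform samples of size min(k, n_a) and
    min(k, n_b) over populations of size n_a and n_b. Each output slot
    is drawn from a partition with probability proportional to that
    partition's remaining population, so the result is a uniform
    sample of the union.
    """
    a, b = list(a), list(b)  # copy so the inputs are left untouched
    rem_a, rem_b = n_a, n_b
    merged = []
    target = min(k, n_a + n_b)
    while len(merged) < target:
        if rng.randrange(rem_a + rem_b) < rem_a:
            src = a
            rem_a -= 1
        else:
            src = b
            rem_b -= 1
        # Take a random element from the chosen sample, without replacement.
        merged.append(src.pop(rng.randrange(len(src))))
    return merged, n_a + n_b


if __name__ == "__main__":
    rng = random.Random(0)
    left, n_left = reservoir_sample(range(0, 100), 10, rng)
    right, n_right = reservoir_sample(range(100, 300), 10, rng)
    merged, n_total = merge_samples(left, n_left, right, n_right, 10, rng)
    print(n_total, sorted(merged))
```

Under these assumptions, a sample built for one predicate range can be combined with a sample built for an adjacent, disjoint range without rescanning the data that either sample already covers, which is one way to picture the reuse the abstract describes.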

Supplemental Material

MP4 File

Presentation video for SIGMOD 2023.


Cited By

Hurst A, Lucani D, Zhang Q. (2024). PairwiseHist: Fast, Accurate and Space-Efficient Approximate Query Processing with Data Compression. Proceedings of the VLDB Endowment 17:6 (1432-1445). DOI: 10.14778/3648160.3648181. Online publication date: 3-May-2024.
https://dl.acm.org/doi/10.14778/3648160.3648181
Tang X, Zhang F, Zhang S, Liu Y, He B, He B, Du X, Du X. (2024). Enabling Adaptive Sampling for Intra-Window Join: Simultaneously Optimizing Quantity and Quality. Proceedings of the ACM on Management of Data 2:4 (1-31). DOI: 10.1145/3677134. Online publication date: 30-Sep-2024.
https://dl.acm.org/doi/10.1145/3677134

Index Terms

LAQy: Efficient and Reusable Query Approximations via Lazy Sampling
1. Information systems
  1. Data management systems
    1. Data structures
      1. Data access methods
    2. Database management system engines
2. Theory of computation
  1. Design and analysis of algorithms
    1. Streaming, sublinear and near linear time algorithms
      1. Sketching and sampling
  2. Theory and algorithms for application domains
    1. Database theory
      1. Data structures and algorithms for data management

Published In

Proceedings of the ACM on Management of Data Volume 1, Issue 2

PACMMOD

June 2023

2310 pages

EISSN:2836-6573

DOI:10.1145/3605748

Editor:
Divyakant Agrawal
UC Santa Barbara, United States

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Swiss National Science Foundation (SNSF)
