research-article

LAQy: Efficient and Reusable Query Approximations via Lazy Sampling

Published: 20 June 2023

Abstract

Modern analytical engines rely on Approximate Query Processing (AQP) to provide faster response times than the hardware allows for exact query answering. However, existing AQP methods impose steep performance penalties as workload unpredictability increases. Specifically, offline AQP relies on predictable workloads to create samples a priori, before query execution, which reduces query response times only when queries match the expected workload. As soon as workload predictability diminishes, existing online AQP methods create query-specific samples with little reuse across queries, producing significantly smaller gains in response times. As a result, existing approaches cannot fully exploit the benefits of sampling under increased unpredictability.
We analyze sample creation and propose LAQy, a framework for building, expanding, and merging samples to adapt to changes in workload predicates. We identify the main parameters that affect sample creation time and propose lazy sampling to overcome the unpredictability issues that make fast-but-specialized samples query-specific. We evaluate LAQy by implementing it in an in-memory, code-generation-based, scale-up analytical engine to show the adaptivity and practicality of our framework in a modern system. Depending on the degree of sample reuse, LAQy reduces online sampling time to anywhere between practically zero and the full online sampling time, and achieves speedups from 2.5x to 19.3x in a simulated exploratory workload.
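
To make the idea of expanding and merging samples concrete, the following is a minimal sketch in Python. It is not LAQy's implementation: it assumes a toy table of (key, value) rows, range predicates on the key, and plain uniform reservoir sampling. The function names (reservoir_sample, merge_samples, extend_sample) are hypothetical and chosen only for illustration. When a new query widens the predicate range, only the missing range is sampled and the result is merged with the cached sample.

```python
import random


def reservoir_sample(rows, k, rng):
    """Uniform sample (without replacement) of up to k rows, via Algorithm R.

    Returns (sample, n), where n is the number of rows scanned.
    """
    sample, n = [], 0
    for row in rows:
        n += 1
        if len(sample) < k:
            sample.append(row)
        else:
            j = rng.randrange(n)  # the new row is kept with probability k / n
            if j < k:
                sample[j] = row
    return sample, n


def merge_samples(a, n_a, b, n_b, k, rng):
    """Merge uniform samples drawn from two disjoint row sets.

    a and b must hold min(k, n_a) and min(k, n_b) rows respectively.
    Each output slot is filled from a with probability proportional to the
    number of rows behind a that are not yet accounted for, so the result
    is a uniform sample of the union. Returns (merged_sample, n_a + n_b).
    """
    a, b = list(a), list(b)
    rng.shuffle(a)
    rng.shuffle(b)
    total = n_a + n_b
    merged = []
    while len(merged) < k and n_a + n_b > 0:
        if rng.randrange(n_a + n_b) < n_a:
            merged.append(a.pop())
            n_a -= 1
        else:
            merged.append(b.pop())
            n_b -= 1
    return merged, total


def extend_sample(table, cached, lo, hi, new_hi, k, rng):
    """Grow a cached sample over keys [lo, hi) so it covers [lo, new_hi).

    Only the missing key range [hi, new_hi) is sampled (assumption: the
    predicate is a half-open range on row[0]); the cached sample is reused.
    """
    sample, n = cached
    delta_rows = (row for row in table if hi <= row[0] < new_hi)
    delta, n_delta = reservoir_sample(delta_rows, k, rng)
    return merge_samples(sample, n, delta, n_delta, k, rng)
```

In this toy version the filter for the missing range still iterates over the whole table; the point is only that rows already covered by the cached sample are not sampled again. How a real engine locates and samples the missing range is outside the scope of this sketch.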

Supplemental Material

MP4 File
Presentation video for SIGMOD 2023.

Cited By

  • (2024) PairwiseHist: Fast, Accurate and Space-Efficient Approximate Query Processing with Data Compression. Proceedings of the VLDB Endowment 17:6 (1432-1445). DOI: 10.14778/3648160.3648181. Online publication date: 3-May-2024

Published In

Proceedings of the ACM on Management of Data, Volume 1, Issue 2
PACMMOD
June 2023
2310 pages
EISSN:2836-6573
DOI:10.1145/3605748
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. AQP
  2. OLAP
  3. adaptive sampling
  4. approximate query processing
  5. hybrid AQP
  6. in-memory analytics
  7. modern hardware
  8. online AQP
  9. sampling
  10. scale-up
  11. workload adaptive

Qualifiers

  • Research-article

Funding Sources

  • Swiss National Science Foundation (SNSF)
