The analytical bootstrap: a new method for fast error estimation in approximate query processing

Authors:

Barzan Mozafari, and

Carlo Zaniolo

SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data

June 2014

Pages 277 - 288

https://doi.org/10.1145/2588555.2588579

Published: 18 June 2014

Abstract

Sampling is one of the most commonly used techniques in Approximate Query Processing (AQP), an area of research made more critical by the need for timely and cost-effective analytics over "Big Data". Assessing the quality of approximate answers (i.e., estimating their error) is essential for meaningful AQP, and the two main approaches used in the past to address this problem are based on either (i) analytic error quantification or (ii) the bootstrap method. The first approach is extremely efficient but lacks generality, whereas the second is quite general but suffers from high computational overhead. In this paper, we introduce a probabilistic relational model for the bootstrap process, along with rigorous semantics and a unified error model, which bridges the gap between these two traditional approaches. Based on our probabilistic framework, we develop efficient algorithms to predict the distribution of the approximation results. These enable the computation of any bootstrap-based quality measure for a large class of SQL queries via a single-round evaluation of a slightly modified query. Extensive experiments on both synthetic and real-world datasets show that our method has superior prediction accuracy for bootstrap-based quality measures, and is several orders of magnitude faster than bootstrap.
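To make the contrast between the two traditional approaches concrete, the following minimal Python sketch estimates the error of a sampled AVG in both ways: by Monte Carlo bootstrap (re-evaluating the aggregate on many resamples) and by the closed-form standard error of the mean. It is an illustration only and is not the paper's analytical bootstrap algorithm; the function names, the synthetic Gaussian data, and the parameter values are all assumptions made for this example.

```python
import random
import statistics


def bootstrap_std_error(sample, num_resamples=1000, seed=0):
    """Estimate the standard error of the sample mean by Monte Carlo bootstrap."""
    rng = random.Random(seed)
    n = len(sample)
    means = []
    for _ in range(num_resamples):
        # Draw n values with replacement and re-evaluate the aggregate (AVG).
        resample = rng.choices(sample, k=n)
        means.append(statistics.fmean(resample))
    # The spread of the resampled aggregates approximates the estimator's error.
    return statistics.stdev(means)


def analytic_std_error(sample):
    """Closed-form standard error of the mean: s / sqrt(n)."""
    return statistics.stdev(sample) / len(sample) ** 0.5


def make_sample(size=2000, seed=42):
    """Hypothetical stand-in for a sample drawn from a large table column."""
    rng = random.Random(seed)
    return [rng.gauss(100.0, 15.0) for _ in range(size)]


if __name__ == "__main__":
    sample = make_sample()
    print("bootstrap standard error:", bootstrap_std_error(sample))
    print("analytic standard error: ", analytic_std_error(sample))
```

For a simple mean the two estimates should roughly agree, but the bootstrap version re-evaluates the aggregate `num_resamples` times, which is the overhead the abstract refers to. The closed form costs a single pass but exists only for aggregates with a known variance formula.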

Index Terms

1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)

Information

Published In

SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data

June 2014

1645 pages

ISBN:9781450323765

DOI:10.1145/2588555

General Chairs: Curtis Dyreson (Utah State University, USA), Feifei Li (University of Utah, USA)

Program Chair: M. Tamer Özsu (University of Waterloo, Canada)

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SIGMOD/PODS'14

Sponsor:

SIGMOD

SIGMOD/PODS'14: International Conference on Management of Data

June 22 - 27, 2014

Snowbird, Utah, USA

Acceptance Rates

SIGMOD '14 Paper Acceptance Rate 107 of 421 submissions, 25%;

Overall Acceptance Rate 785 of 4,003 submissions, 20%
