Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
10.1145/2588555.2588579acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

The analytical bootstrap: a new method for fast error estimation in approximate query processing

Published: 18 June 2014 Publication History
  • Get Citation Alerts
  • Abstract

    Sampling is one of the most commonly used techniques in Approximate Query Processing (AQP)-an area of research that is now made more critical by the need for timely and cost-effective analytics over "Big Data". Assessing the quality (i.e., estimating the error) of approximate answers is essential for meaningful AQP, and the two main approaches used in the past to address this problem are based on either (i) analytic error quantification or (ii) the bootstrap method. The first approach is extremely efficient but lacks generality, whereas the second is quite general but suffers from its high computational overhead. In this paper, we introduce a probabilistic relational model for the bootstrap process, along with rigorous semantics and a unified error model, which bridges the gap between these two traditional approaches. Based on our probabilistic framework, we develop efficient algorithms to predict the distribution of the approximation results. These enable the computation of any bootstrap-based quality measure for a large class of SQL queries via a single-round evaluation of a slightly modified query. Extensive experiments on both synthetic and real-world datasets show that our method has superior prediction accuracy for bootstrap-based quality measures, and is several orders of magnitude faster than bootstrap.

    References

    [1]
    Conviva Inc. http://www.conviva.com/.
    [2]
    MonetDB. http://www.monetdb.org/Home.
    [3]
    The R Project. http://www.r-project.org/.
    [4]
    TPC-H Benchmark. http://www.tpc.org/tpch/.
    [5]
    Vertica Inc. http://www.vertica.com/.
    [6]
    S. Acharya, P. B. Gibbons, et al. Join Synopses for Approximate Query Answering. In SIGMOD, pages 275--286, 1999.
    [7]
    S. Acharya, P. B. Gibbons, et al. The Aqua Approximate Query Answering System. In SIGMOD, pages 574--576, 1999.
    [8]
    S. Agarwal, H. Milner, et al. Knowing When You're Wrong: Building Fast and Reliable Approximate Query Processing Systems. In SIGMOD, 2014.
    [9]
    S. Agarwal, B. Mozafari, et al. BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data. In EuroSys, pages 29--42, 2013.
    [10]
    L. Antova, T. Jansen, et al. Fast and Simple Relational Processing of Uncertain Data. In ICDE, pages 983--992, 2008.
    [11]
    B. Babcock, S. Chaudhuri, et al. Dynamic Sample Selection for Approximate Query Processing. In SIGMOD, pages 539--550, 2003.
    [12]
    B. Babcock, M. Datar, et al. Load Shedding for Aggregation Queries over Data Streams. In ICDE, page 350, 2004.
    [13]
    P. J. Bickel and D. A. Freedman. Some Asymptotic Theory for the Bootstrap. The Annals of Statistics, 9(6):1196--1217, 1981.
    [14]
    G. Box, J. S. Hunter, et al. Statistics for Experimenters: Design, Innovation, Discovery. Wiley-Interscience, 2005.
    [15]
    M. Charikar, S. Chaudhuri, et al. Towards Estimation Error Guarantees for Distinct Values. In PODS, pages 268--279, 2000.
    [16]
    S. Chaudhuri, G. Das, et al. Optimized stratified sampling for approximate query processing. TODS, 32(2):9, 2007.
    [17]
    Y. Cui, J. Widom, et al. Tracing the lineage of view data in a warehousing environment. TODS, 25(2):179--227, 2000.
    [18]
    N. N. Dalvi and D. Suciu. Efficient query evaluation on probabilistic databases. VLDBJ, 16:523--544, 2007.
    [19]
    B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall, New York, 1993.
    [20]
    H. Garcia-Molina, J. D. Ullman, et al. Database systems - the complete book (2. ed.). Pearson Education, 2009.
    [21]
    B. Glavic and G. Alonso. Perm: Processing Provenance and Data on the Same Data Model through Query Rewriting. In ICDE, pages 174--185, 2009.
    [22]
    J. M. Hellerstein, P. J. Haas, et al. Online Aggregation. In SIGMOD, pages 171--182, 1997.
    [23]
    Y. Hu, S. Sundara, et al. Estimating Aggregates in Time-Constrained Approximate Queries in Oracle. In EDBT, pages 1104--1107, 2009.
    [24]
    C. Jermaine, S. Arumugam, et al. Scalable Approximate Query Processing with DBO Engine. In SIGMOD, pages 1--54, 2007.
    [25]
    S. Joshi and C. Jermaine. Sampling-Based Estimators for Subset-Based Queries. VLDB J., 18(1):181--202, 2009.
    [26]
    G. Karvounarakis and T. J. Green. Semiring-Annotated Data: Queries and Provenance? SIGMOD Record, 41(3):5--14, 2012.
    [27]
    A. Kleiner, A. Talwalkar, et al. A General Bootstrap Performance Diagnostic. In KDD, pages 419--427, 2013.
    [28]
    A. Kleiner, A. Talwalkar, et al. The Big Data Bootstrap. In ICML, 2012.
    [29]
    P. G. Kolaitis and M. Y. Vardi. Conjunctive-Query Containment and Constraint Satisfaction. In PODS, pages 205--213, 1998.
    [30]
    N. Laptev, K. Zeng, et al. Early Accurate Results for Advanced Analytics on MapReduce. PVLDB, 5(10):1028--1039, 2012.
    [31]
    B. Mozafari and C. Zaniolo. Optimal Load Shedding with Aggregates and Mining Queries. In ICDE, pages 76--88, 2010.
    [32]
    C. Olston, E. Bortnikov, et al. Interactive Analysis of Web-Scale Data. In CIDR, 2009.
    [33]
    N. Pansare, V. R. Borkar, et al. Online Aggregation for Large MapReduce Jobs. PVLDB, 4(11):1135--1145, 2011.
    [34]
    A. Pol and C. Jermaine. Relational Confidence Bounds Are Easy With The Bootstrap. In SIGMOD, pages 587--598, 2005.
    [35]
    T. Rabl, M. Poess, et al. Variations of the Star Schema Benchmark to Test the Effects of Data Skew on Query Performance. In SPEC, pages 361--372, 2013.
    [36]
    C. Ré and D. Suciu. The Trichotomy of HAVING Queries on a Probabilistic Database. VLDBJ, 18(5):1091--1116, 2009.
    [37]
    P. Sen, A. Deshpande, et al. Read-Once Functions and Query Evaluation in Probabilistic Databases. PVLDB, 3(1):1068--1079, 2010.
    [38]
    M. M. Siddiqui and C. Butler. Asymptotic Joint Distribution of Linear Systematic Statistics from Multivariate Distributions. Journal of the American Statistical Association, 64(325):300--305, 1969.
    [39]
    T. T. L. Tran, Y. Diao, et al. Supporting User-Defined Functions on Uncertain Data. PVLDB, 6(6):469--480, 2013.
    [40]
    T. T. L. Tran, L. Peng, et al. CLARO: Modeling and Processing Uncertain Data Streams. VLDBJ, (5):651--676, 2012.
    [41]
    A. van der Vaart and J. Wellner. Weak Convergence and Empirical Processes. Springer, corrected edition, Nov. 2000.
    [42]
    D. Z. Wang, E. Michelakis, et al. Bayesstore: Managing large, uncertain data repositories with probabilistic graphical models. PVLDB, 1(1):340--351, 2008.
    [43]
    S. Wilhelm. tmvtnorm: A Package for the Truncated Multivariate Normal Distribution. sigma, 2:2.
    [44]
    S. Wu, B. C. Ooi, et al. Continuous Sampling for Online Aggregation over Multiple Queries. In SIGMOD, pages 651--662, 2010.
    [45]
    K. Zeng, S. Gao, et al. The Analytical Bootstrap: A New Method for Fast Error Estimation in Approximate Query Processing. Technical Report CSD #130028, UCLA, 2013.

    Cited By

    View all
    • (2024)PairwiseHist: Fast, Accurate and Space-Efficient Approximate Query Processing with Data CompressionProceedings of the VLDB Endowment10.14778/3648160.364818117:6(1432-1445)Online publication date: 1-Feb-2024
    • (2024)PECJ: Stream Window Join on Disorder Data Streams with Proactive Error CompensationProceedings of the ACM on Management of Data10.1145/36392682:1(1-24)Online publication date: 26-Mar-2024
    • (2023)An epistemic approach to model uncertainty in data-graphsInternational Journal of Approximate Reasoning10.1016/j.ijar.2023.108948160(108948)Online publication date: Sep-2023
    • Show More Cited By

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data
    June 2014
    1645 pages
    ISBN:9781450323765
    DOI:10.1145/2588555
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 18 June 2014

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. approximate query processing
    2. bootstrap
    3. error estimation

    Qualifiers

    • Research-article

    Conference

    SIGMOD/PODS'14
    Sponsor:

    Acceptance Rates

    SIGMOD '14 Paper Acceptance Rate 107 of 421 submissions, 25%;
    Overall Acceptance Rate 785 of 4,003 submissions, 20%

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)30
    • Downloads (Last 6 weeks)2

    Other Metrics

    Citations

    Cited By

    View all
    • (2024)PairwiseHist: Fast, Accurate and Space-Efficient Approximate Query Processing with Data CompressionProceedings of the VLDB Endowment10.14778/3648160.364818117:6(1432-1445)Online publication date: 1-Feb-2024
    • (2024)PECJ: Stream Window Join on Disorder Data Streams with Proactive Error CompensationProceedings of the ACM on Management of Data10.1145/36392682:1(1-24)Online publication date: 26-Mar-2024
    • (2023)An epistemic approach to model uncertainty in data-graphsInternational Journal of Approximate Reasoning10.1016/j.ijar.2023.108948160(108948)Online publication date: Sep-2023
    • (2022)One Size Does Not Fit All: A Bandit-Based Sampler Combination Framework with Theoretical GuaranteesProceedings of the 2022 International Conference on Management of Data10.1145/3514221.3517900(531-544)Online publication date: 10-Jun-2022
    • (2022)Controlled Intentional Degradation in Analytical Video SystemsProceedings of the 2022 International Conference on Management of Data10.1145/3514221.3517899(2105-2119)Online publication date: 10-Jun-2022
    • (2022)Road-aware Indexing for Trajectory Range QueriesIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2022.3220822(1-14)Online publication date: 2022
    • (2022)Evaluating Public Anxiety for Topic-Based Communities in Social NetworksIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2020.298975934:3(1191-1205)Online publication date: 1-Mar-2022
    • (2022)Workload Prediction for Adaptive Approximate Query Processing2022 IEEE International Conference on Big Data (Big Data)10.1109/BigData55660.2022.10020614(217-222)Online publication date: 17-Dec-2022
    • (2022)Revisiting Approximate Query Processing and Bootstrap Error Estimation on GPUDatabase Systems for Advanced Applications10.1007/978-3-031-00123-9_5(72-87)Online publication date: 8-Apr-2022
    • (2021)At-the-time and Back-in-time Persistent SketchesProceedings of the 2021 International Conference on Management of Data10.1145/3448016.3452802(1623-1636)Online publication date: 9-Jun-2021
    • Show More Cited By

    View Options

    Get Access

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media