Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
10.1145/2882903.2915249acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

Sample + Seek: Approximating Aggregates with Distribution Precision Guarantee

Published: 14 June 2016 Publication History

Abstract

Data volumes are growing exponentially for our decision-support systems making it challenging to ensure interactive response time for ad-hoc queries without increasing cost of hardware. Aggregation queries with Group By that produce an aggregate value for every combination of values in the grouping columns are the most important class of ad-hoc queries. As small errors are usually tolerable for such queries, approximate query processing (AQP) has the potential to answer them over very large datasets much faster.
In many cases analysts require the distribution of (group, aggvalue) pairs in the estimated answer to be guaranteed within a certain error threshold of the exact distribution. Existing AQP techniques are inadequate for two main reasons.
First, users cannot express such guarantees. Second, sampling techniques used in traditional AQP can produce arbitrarily large errors even for summ queries. To address those limitations, we first introduce a new precision metric, called distribution precision, to express such error guarantees.
We then study how to provide fast approximate answers to aggregation queries with distribution precision guaranteed within a user-specified error bound. The main challenges are to provide rigorous error guarantees and to handle arbitrary highly selective predicates without maintaining large-sized samples. We propose a novel sampling scheme called measure-biased sampling to address the former challenge.
For the latter, we propose two new indexes to augment in-memory samples. Like other sampling-based AQP techniques, our solution supports any aggregate that can be estimated from random samples.
In addition to deriving theoretical guarantees, we conduct experimental study to compare our system with state-of-the-art AQP techniques and a commercial column-store database system on both synthetic and real enterprise datasets. Our system provides a median speed-up of more than 100x with around 5% distribution error compared with the commercial database.

References

[1]
J. Acharya, I. Diakonikolas, C. Hegde, J. Z. Li, and L. Schmidt. Fast and near-optimal algorithms for approximating distributions by histograms. In PODS, 2015.
[2]
S. Acharya, P. B. Gibbons, and V. Poosala. Congressional samples for approximate answering of group-by queries. In SIGMOD, 2000.
[3]
S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy. The aqua approximate query answering system. In SIGMOD, 1999.
[4]
S. Agarwal, H. Milner, A. Kleiner, A. Talwalkar, M. Jordan, S. Madden, B. Mozafari, and I. Stoica. Knowing when you're wrong: building fast and reliable approximate query processing systems. In SIGMOD, 2014.
[5]
S. Agarwal, B. Mozafari, A. Panda, H. Milner, S. Madden, and I. Stoica. Blinkdb: queries with bounded errors and bounded response times on very large data. In Eurosys, 2013.
[6]
S. Agrawal, S. Chaudhuri, and V. R. Narasayya. Automated selection of materialized views and indexes in sql databases. In VLDB, 2000.
[7]
B. Babcock, S. Chaudhuri, and G. Das. Dynamic sample selection for approximate query processing. In SIGMOD, 2003.
[8]
P. Bille, A. Pagh, and R. Pagh. Fast evaluation of union-intersection expressions. In ISAAC, 2007.
[9]
K. Chakrabarti, M. Garofalakis, R. Rastogi, and K. Shim. Approximate query processing using wavelets. VLDBJ, 10(2--3), 2001.
[10]
M. Charikar, S. Chaudhuri, R. Motwani, and V. R. Narasayya. Towards estimation error guarantees for distinct values. In PODS, 2000.
[11]
S. Chaudhuri, G. Das, M. Datar, R. Motwani, and V. Narasayya. Overcoming limitations of sampling for aggregation queries. In ICDE, 2001.
[12]
S. Chaudhuri, G. Das, and V. Narasayya. A robust, optimization-based approach for approximate answering of aggregate queries. In SIGMOD, 2001.
[13]
S. Chaudhuri, G. Das, and V. Narasayya. Optimized stratified sampling for approximate query processing. TODS, 32(2), 2007.
[14]
T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy, and R. Sears. Mapreduce online. In NSDI, 2010.
[15]
V. Ganti, M.-L. Lee, and R. Ramakrishnan. Icicles: Self-tuning samples for approximate query answering. In VLDB, 2000.
[16]
M. N. Garofalakis and P. B. Gibbons. Approximate query processing: Taming the terabytes. In VLDB, 2001.
[17]
A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. J. Strauss. Optimal and approximate computation of summary statistics for range aggregates. In PODS, 2001.
[18]
J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. Data Min. Knowl. Discov., 1(1), 1997.
[19]
J. M. Hellerstein, P. J. Haas, and H. J. Wang. Online aggregation. SIGMOD, 1997.
[20]
A. Kim, E. Blais, A. G. Parameswaran, P. Indyk, S. Madden, and R. Rubinfeld. Rapid sampling for visualizations with ordering guarantees. PVLDB, 8(5), 2015.
[21]
A. Kleiner, A. Talwalkar, S. Agarwal, I. Stoica, and M. I. Jordan. A general bootstrap performance diagnostic. In KDD, 2013.
[22]
S. L. Lohr. Sampling: Design and Analysis. Thomson, 2009.
[23]
C. Olston, E. Bortnikov, K. Elmeleegy, F. Junqueira, and B. Reed. Interactive analysis of web-scale data. In CIDR, 2009.
[24]
N. Pansare, V. R. Borkar, C. Jermaine, and T. Condie. Online aggregation for large mapreduce jobs. PVLDB, 4(11), 2011.
[25]
N. Potti and J. M. Patel. Daq: a new paradigm for approximate query processing. PVLDB, 8(9), 2015.
[26]
B. Shneiderman. Response time and display rate in human performance with computers. ACM Computing Surveys, 1984.
[27]
L. Sidirourgos, M. L. Kersten, and P. A. Boncz. Sciborq: Scientific data management with bounds on runtime and quality. In CIDR, 2011.
[28]
J. S. Vitter and M. Wang. Approximate computation of multidimensional aggregates of sparse data using wavelets. In SIGMOD, 1999.
[29]
K. Zeng, S. Gao, B. Mozafari, and C. Zaniolo. The analytical bootstrap: a new method for fast error estimation in approximate query processing. In SIGMOD, 2014.

Cited By

View all
  • (2024)Guaranteeing an Exact Error Bound for Bounded Approximate Query ProcessingJournal of Information Processing10.2197/ipsjjip.32.90332(903-915)Online publication date: 2024
  • (2024)Biathlon: Harnessing Model Resilience for Accelerating ML Inference PipelinesProceedings of the VLDB Endowment10.14778/3675034.367505217:10(2631-2640)Online publication date: 1-Jun-2024
  • (2024)Visualization-Aware Time Series Min-Max Caching with Error Bound GuaranteesProceedings of the VLDB Endowment10.14778/3659437.365946017:8(2091-2103)Online publication date: 1-Apr-2024
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data
June 2016
2300 pages
ISBN:9781450335317
DOI:10.1145/2882903
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 14 June 2016

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. approximate query processing
  2. indexing
  3. precision guarantee
  4. sampling

Qualifiers

  • Research-article

Conference

SIGMOD/PODS'16
Sponsor:
SIGMOD/PODS'16: International Conference on Management of Data
June 26 - July 1, 2016
California, San Francisco, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)56
  • Downloads (Last 6 weeks)5
Reflects downloads up to 13 Jan 2025

Other Metrics

Citations

Cited By

View all
  • (2024)Guaranteeing an Exact Error Bound for Bounded Approximate Query ProcessingJournal of Information Processing10.2197/ipsjjip.32.90332(903-915)Online publication date: 2024
  • (2024)Biathlon: Harnessing Model Resilience for Accelerating ML Inference PipelinesProceedings of the VLDB Endowment10.14778/3675034.367505217:10(2631-2640)Online publication date: 1-Jun-2024
  • (2024)Visualization-Aware Time Series Min-Max Caching with Error Bound GuaranteesProceedings of the VLDB Endowment10.14778/3659437.365946017:8(2091-2103)Online publication date: 1-Apr-2024
  • (2024)Tailorable Sampling for Progressive Visual AnalyticsIEEE Transactions on Visualization and Computer Graphics10.1109/TVCG.2023.327808430:8(4809-4824)Online publication date: 1-Aug-2024
  • (2024)CheetahTraj: Efficient Visualization for Large Trajectory Dataset With Quality GuaranteeIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2024.338748036:11(5737-5752)Online publication date: Nov-2024
  • (2024)Learning-Based Sample Tuning for Approximate Query Processing in Interactive Data ExplorationIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2023.334145136:11(6532-6546)Online publication date: Nov-2024
  • (2024)Generalized Measure-Biased Sampling and Priority SamplingIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2023.334067336:11(6251-6265)Online publication date: Nov-2024
  • (2023)Efficient Complex Aggregate Queries with Accuracy Guarantee Based on Execution Cost Model over Knowledge GraphsMathematics10.3390/math1118390811:18(3908)Online publication date: 14-Sep-2023
  • (2023)Approximate Queries over Concurrent UpdatesProceedings of the VLDB Endowment10.14778/3611540.361160216:12(3986-3989)Online publication date: 1-Aug-2023
  • (2023)Practical Dynamic Extension for Sampling IndexesProceedings of the ACM on Management of Data10.1145/36267441:4(1-26)Online publication date: 12-Dec-2023
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media