Sample synopses for approximate answering of group-by queries

Authors:

Philipp Rösch,

Wolfgang Lehner

EDBT '09: Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology

Pages 403 - 414

https://doi.org/10.1145/1516360.1516408

Published: 24 March 2009

Abstract

As the amount of data in data warehouses grows steadily, random sampling continues to gain importance. In particular, interactive analyses of large datasets can benefit greatly from the significantly shorter response times of approximate query processing. Typically, such analytical queries partition the data into groups and aggregate the values within each group. Furthermore, the commonly used roll-up and drill-down operations pose a broad range of group-by queries to the system, which makes the construction of highly specialized synopses difficult.

In this paper, we propose a general-purpose sampling scheme that is biased in order to answer group-by queries with high accuracy. While existing techniques focus on the size of a group when computing its sample size, our technique is based on the group's standard deviation. The basic idea is that the more homogeneous a group is, the fewer representatives are required to give a good estimate. With an extensive set of experiments, we show that our approach reduces both the estimation error and the construction cost compared to existing techniques.
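The abstract does not spell out the allocation rule, so the following is only a minimal sketch of the basic idea, assuming that each group's share of a fixed sample budget is proportional to its standard deviation. The function names (`allocate_by_stddev`, `build_sample`, `estimate_avg`), the one-row minimum per group, and the rounding are illustrative assumptions, not the authors' algorithm.

```python
import random
import statistics


def allocate_by_stddev(groups, budget):
    """Split `budget` sample slots across non-empty groups in proportion
    to each group's population standard deviation (assumed rule)."""
    spread = {g: statistics.pstdev(vals) for g, vals in groups.items()}
    total = sum(spread.values())
    sizes = {}
    for g, vals in groups.items():
        # Fall back to an even split if every group is perfectly homogeneous.
        share = spread[g] / total if total > 0 else 1 / len(groups)
        # Keep at least one representative per group, and never ask for
        # more rows than the group holds. Because of this clamping and the
        # rounding, the sizes may not sum exactly to `budget`.
        sizes[g] = min(len(vals), max(1, round(budget * share)))
    return sizes


def build_sample(groups, budget, seed=0):
    """Draw a uniform random sample without replacement inside each group."""
    rng = random.Random(seed)
    sizes = allocate_by_stddev(groups, budget)
    return {g: rng.sample(vals, sizes[g]) for g, vals in groups.items()}


def estimate_avg(sample):
    """Estimate AVG per group from the sampled rows."""
    return {g: statistics.fmean(vals) for g, vals in sample.items()}
```

With a perfectly homogeneous group and a widely spread one, this sketch gives the homogeneous group a single representative and spends the rest of the budget on the heterogeneous group. A size-based allocation would instead treat two equally large groups alike.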

Index Terms
1. Information systems
  1. Data management systems
    1. Database management system engines
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory

Published In

EDBT '09: Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology

March 2009

1180 pages

ISBN:9781605584225

DOI:10.1145/1516360

Editors:
Martin Kersten (CWI, The Netherlands),
Boris Novikov (University of Saint Petersburg, Russia),
Jens Teubner (ETH Zurich, Switzerland),
Vladimir Polutin (HP Labs, Russia),
Stefan Manegold (CWI, The Netherlands)

Copyright © 2009 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

EDBT/ICDT '09

EDBT/ICDT '09: EDBT/ICDT '09 joint conference

March 24 - 26, 2009

Saint Petersburg, Russia

Acceptance Rates

Overall Acceptance Rate 7 of 10 submissions, 70%
