Approximate Query Processing: No Silver Bullet

Authors:

Surajit Chaudhuri,

Srikanth Kandula

SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data

Pages 511 - 519

https://doi.org/10.1145/3035918.3056097

Published: 09 May 2017

Abstract

In this paper, we reflect on the state of the art of Approximate Query Processing. Although much technical progress has been made in this area of research, we have yet to see its impact on products and services. We discuss two promising avenues to pursue towards integrating Approximate Query Processing into data platforms.

Published In

SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data

May 2017

1810 pages

ISBN:9781450341974

DOI:10.1145/3035918

General Chairs: Rada Chirkova (North Carolina State University, USA), Jun Yang (Duke University, USA)

Program Chair: Dan Suciu (University of Washington, USA)

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SIGMOD/PODS'17

Sponsor:

SIGMOD

SIGMOD/PODS'17: International Conference on Management of Data

May 14 - 19, 2017

Chicago, Illinois, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%
