DOI: 10.1145/3035918.3056097
research-article

Approximate Query Processing: No Silver Bullet

Published: 09 May 2017

Abstract

In this paper, we reflect on the state of the art of Approximate Query Processing. Although much technical progress has been made in this area of research, we have yet to see its impact on products and services. We discuss two promising avenues to pursue toward integrating Approximate Query Processing into data platforms.



Published In

SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data
May 2017
1810 pages
ISBN: 9781450341974
DOI: 10.1145/3035918

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. approximate query processing
  2. big data
  3. error guarantee
  4. olap
  5. pre-computation
  6. query optimization
  7. query processing
  8. sampling
  9. transformation rules

Qualifiers

  • Research-article

Conference

SIGMOD/PODS'17

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%
