Dynamically optimizing queries over large scale data platforms

Authors:

Konstantinos Karanasos,

Jesse Jackson

SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data

Pages 943 - 954

https://doi.org/10.1145/2588555.2610531

Published: 18 June 2014

Abstract

Enterprises are adapting large-scale data processing platforms, such as Hadoop, to gain actionable insights from their "big data". Query optimization is still an open challenge in this environment due to the volume and heterogeneity of data, comprising both structured and un/semi-structured datasets. Moreover, it has become common practice to push business logic close to the data via user-defined functions (UDFs), which are usually opaque to the optimizer, further complicating cost-based optimization. As a result, classical relational query optimization techniques do not fit well in this setting, while at the same time, suboptimal query plans can be disastrous with large datasets. In this paper, we propose new techniques that take into account UDFs and correlations between relations for optimizing queries running on large scale clusters. We introduce "pilot runs", which execute part of the query over a sample of the data to estimate selectivities, and employ a cost-based optimizer that uses these selectivities to choose an initial query plan. Then, we follow a dynamic optimization approach, in which plans evolve as parts of the queries get executed. Our experimental results show that our techniques produce plans that are at least as good as, and up to 2x (4x) better for Jaql (Hive) than, the best hand-written left-deep query plans.
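
The pilot-run idea described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's implementation: it assumes in-memory lists of rows, a uniform random sample, and hypothetical names (`pilot_run_selectivity`, `estimated_cardinality`, `is_large_order`, `in_home_region`). Each UDF is treated as a black box whose selectivity is measured on the sample rather than derived from its code.

```python
import random
from typing import Callable, Sequence

Row = dict


def pilot_run_selectivity(rows: Sequence[Row],
                          udf: Callable[[Row], bool],
                          sample_size: int = 1000,
                          seed: int = 0) -> float:
    """Estimate the fraction of rows that pass an opaque predicate.

    Hypothetical helper: evaluates the predicate on a uniform random
    sample instead of the full input, as a stand-in for a pilot run.
    """
    if not rows:
        return 0.0
    if len(rows) <= sample_size:
        sample = list(rows)  # small input: evaluate every row
    else:
        sample = random.Random(seed).sample(rows, sample_size)
    passed = sum(1 for row in sample if udf(row))
    return passed / len(sample)


def estimated_cardinality(rows: Sequence[Row],
                          udf: Callable[[Row], bool],
                          sample_size: int = 1000,
                          seed: int = 0) -> float:
    """Scale the sampled selectivity up to the full input size."""
    return len(rows) * pilot_run_selectivity(rows, udf, sample_size, seed)


# Illustrative data and UDFs (assumptions, not taken from the paper).
orders = [{"id": i, "amount": i % 100} for i in range(100_000)]
customers = [{"id": i, "region": i % 4} for i in range(20_000)]


def is_large_order(row: Row) -> bool:
    return row["amount"] >= 90  # true selectivity: 0.10


def in_home_region(row: Row) -> bool:
    return row["region"] == 0  # true selectivity: 0.25


estimates = {
    "orders": estimated_cardinality(orders, is_large_order),
    "customers": estimated_cardinality(customers, in_home_region),
}
# A cost-based optimizer could consume such estimates, for example to
# choose the smaller filtered input as the build side of a join.
build_side = min(estimates, key=estimates.get)
```

In a dynamic optimization setting like the one the abstract outlines, estimates of this kind would presumably be replaced by observed sizes as parts of the query finish executing, so that the remaining plan can be revised.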

Index Terms

Dynamically optimizing queries over large scale data platforms
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)

Published In

SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data

June 2014

1645 pages

ISBN:9781450323765

DOI:10.1145/2588555

General Chairs: Curtis Dyreson (Utah State University, USA), Feifei Li (University of Utah, USA)

Program Chair: M. Tamer Özsu (University of Waterloo, Canada)

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SIGMOD/PODS'14

Sponsor:

SIGMOD

SIGMOD/PODS'14: International Conference on Management of Data

June 22 - 27, 2014

Snowbird, Utah, USA

Acceptance Rates

SIGMOD '14 Paper Acceptance Rate 107 of 421 submissions, 25%;

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Bibliometrics

Total Citations: 22

Total Downloads: 854

Downloads (Last 12 months): 14

Downloads (Last 6 weeks): 5

Reflects downloads up to 15 Oct 2024

Cited By

Winter C, Giceva J, Neumann T, Kemper A (2022) On-demand state separation for cloud data warehousing. Proceedings of the VLDB Endowment 15:11 (2966-2979). Online publication date: 29-Sep-2022
https://dl.acm.org/doi/10.14778/3551793.3551845
Jankov D, Yuan B, Luo S, Jermaine C (2021) Distributed numerical and machine learning computations via two-phase execution of aggregated join trees. Proceedings of the VLDB Endowment 14:7 (1228-1240). Online publication date: 12-Apr-2021
https://dl.acm.org/doi/10.14778/3450980.3450991
Trummer I, Wang J, Wei Z, Maram D, Moseley S, Jo S, Antonakakis J, Rayabhari A (2021) SkinnerDB: Regret-bounded Query Evaluation via Reinforcement Learning. ACM Transactions on Database Systems 46:3 (1-45). Online publication date: 28-Sep-2021
https://dl.acm.org/doi/10.1145/3464389
Li Y, Interlandi M, Psallidas F, Wang W, Zaniolo C (2021) SEIZE: Runtime Inspection for Parallel Dataflow Systems. IEEE Transactions on Parallel and Distributed Systems 32:4 (842-854). Online publication date: 1-Apr-2021
https://doi.org/10.1109/TPDS.2020.3035170
Zhao Y, Chen R (2021) Spark SQL Query Optimization Based on Runtime Statistics Collection. 2021 IEEE 6th International Conference on Cloud Computing and Big Data Analytics (ICCCBDA) (250-255). Online publication date: 24-Apr-2021
https://doi.org/10.1109/ICCCBDA51879.2021.9442524
Chugh A, Sharma V, Bhatia M, Jain C (2021) A Big Data Query Optimization Framework for Telecom Customer Churn Analysis. International Conference on Innovative Computing and Communications (475-484). Online publication date: 1-Sep-2021
https://doi.org/10.1007/978-981-16-2597-8_40
Benkrid S, Bellatreche L, Mestoui Y, Ordonez C (2021) Towards an Adaptive Multidimensional Partitioning for Accelerating Spark SQL. Big Data Analytics and Knowledge Discovery (27-38). Online publication date: 5-Sep-2021
https://doi.org/10.1007/978-3-030-86534-4_3
Schiavio F, Bonetta D, Binder W (2020) Dynamic speculative optimizations for SQL compilation in Apache Spark. Proceedings of the VLDB Endowment 13:5 (754-767). Online publication date: 19-Feb-2020
https://dl.acm.org/doi/10.14778/3377369.3377382
Sikdar S, Jermaine C, Maier D, Pottinger R, Doan A, Tan W, Alawini A, Ngo H (2020) MONSOON: Multi-Step Optimization and Execution of Queries with Partially Obscured Predicates. Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (225-240). Online publication date: 11-Jun-2020
https://dl.acm.org/doi/10.1145/3318464.3389728
Chugh A, Sharma V, Jain C (2020) Big Data and Query Optimization Techniques. Advances in Computing and Intelligent Systems (337-345). Online publication date: 3-Jan-2020
https://doi.org/10.1007/978-981-15-0222-4_30
