Continuous cloud-scale query optimization and processing

Authors:

Jingren Zhou

Proceedings of the VLDB Endowment, Volume 6, Issue 11

Pages 961 - 972

https://doi.org/10.14778/2536222.2536223

Published: 01 August 2013

Abstract

Massive data analysis in cloud-scale data centers plays a crucial role in making critical business decisions. High-level scripting languages free developers from understanding various system trade-offs, but introduce new challenges for query optimization. One key optimization challenge is the lack of accurate data statistics, typically due to massive data volumes and their distributed nature, complex computation logic, and frequent use of user-defined functions. In this paper, we propose novel techniques to adapt query processing in the Scope system, the cloud-scale computation environment in Microsoft Online Services. We continuously monitor query execution, collect actual runtime statistics, and adapt parallel execution plans as the query executes. We discuss similarities and differences between our approach and alternatives proposed in the context of traditional centralized systems. Experiments on large-scale Scope production clusters show that the proposed techniques systematically solve the challenge of missing or inaccurate data statistics, detect and resolve problems with partition skew and plan structure, and improve query latency by a factor of a few for real workloads. Although we focus on optimizing high-level languages, the same ideas also apply to MapReduce systems.

Index Terms

Continuous cloud-scale query optimization and processing
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)

Index terms have been assigned to the content through auto-classification.

Published In

Proceedings of the VLDB Endowment Volume 6, Issue 11

August 2013

237 pages

ISSN:2150-8097

Publisher

VLDB Endowment
