
Continuous cloud-scale query optimization and processing

Published: 01 August 2013

Abstract

Massive data analysis in cloud-scale data centers plays a crucial role in making critical business decisions. High-level scripting languages free developers from understanding various system trade-offs, but introduce new challenges for query optimization. One key optimization challenge is the lack of accurate data statistics, typically due to massive data volumes and their distributed nature, complex computation logic, and frequent use of user-defined functions. In this paper, we propose novel techniques to adapt query processing in the Scope system, the cloud-scale computation environment in Microsoft Online Services. We continuously monitor query execution, collect actual runtime statistics, and adapt parallel execution plans as the query executes. We discuss similarities and differences between our approach and alternatives proposed in the context of traditional centralized systems. Experiments on large-scale Scope production clusters show that the proposed techniques systematically solve the challenge of missing or inaccurate data statistics, detect and resolve problems with partition skew and plan structure, and improve query latency severalfold for real workloads. Although we focus on optimizing high-level languages, the same ideas are also applicable to MapReduce systems.

Published In

Proceedings of the VLDB Endowment, Volume 6, Issue 11
August 2013
237 pages

Publisher

VLDB Endowment

Qualifiers

  • Article
