research-article

Cost Models for Big Data Query Processing: Learning, Retrofitting, and Our Findings

Authors:

Tarique Siddiqui,

Wangchao Le

SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data

Pages 99 - 113

https://doi.org/10.1145/3318464.3380584

Published: 31 May 2020

Abstract

Query processing over big data is ubiquitous in modern clouds, where the system takes care of picking both the physical query execution plans and the resources needed to run those plans, using a cost-based query optimizer. A good cost model, therefore, translates into better resource efficiency and lower operational costs. Unfortunately, the production workloads at Microsoft show that costs are very complex to model for big data systems. In this work, we investigate two key questions: (i) can we learn accurate cost models for big data systems, and (ii) can we integrate the learned models within the query optimizer? To answer these, we make three core contributions. First, we exploit workload patterns to learn a large number of individual cost models and combine them to achieve high accuracy and coverage over a long period. Second, we propose extensions to the Cascades framework to pick optimal resources, i.e., the number of containers, during query planning. Third, we integrate the learned cost models within the Cascades-style query optimizer of SCOPE at Microsoft. We evaluate the resulting system, Cleo, in a production environment using both production and TPC-H workloads. Our results show that the learned cost models are 2 to 3 orders of magnitude more accurate and 20X more correlated with the actual runtimes, with a large majority (70%) of the plan changes leading to substantial improvements in latency as well as resource usage.
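The first contribution, learning many individual cost models and combining them for accuracy and coverage, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the class name `CombinedCostModel`, the string "signature" that identifies a recurring plan pattern, the single input feature (row count), and the use of a simple least-squares line in place of whatever learners Cleo actually uses. It shows only the general idea of preferring a specialized model where enough history exists and falling back to a general model otherwise.

```python
from collections import defaultdict


def fit_line(points):
    """Least-squares fit of y = a*x + b over a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    if var_x == 0:
        return 0.0, mean_y  # all x equal: fall back to a constant predictor
    a = sum((x - mean_x) * (y - mean_y) for x, y in points) / var_x
    return a, mean_y - a * mean_x


class CombinedCostModel:
    """Hypothetical combiner: one small model per signature plus a general fallback."""

    def __init__(self, min_samples=3):
        self.min_samples = min_samples  # assumed threshold for trusting a specialized model
        self.specialized = {}
        self.fallback = None

    def fit(self, observations):
        # observations: list of (signature, input_rows, runtime_seconds)
        groups = defaultdict(list)
        for signature, rows, runtime in observations:
            groups[signature].append((rows, runtime))
        # Specialized models give accuracy on patterns seen often enough.
        self.specialized = {
            signature: fit_line(points)
            for signature, points in groups.items()
            if len(points) >= self.min_samples
        }
        # The general model gives coverage for everything else.
        self.fallback = fit_line([(rows, runtime) for _, rows, runtime in observations])
        return self

    def predict(self, signature, rows):
        a, b = self.specialized.get(signature, self.fallback)
        return a * rows + b


# Made-up history: (signature, input rows, observed runtime in seconds).
history = [
    ("scan", 1, 3.0), ("scan", 2, 5.0), ("scan", 3, 7.0),
    ("join", 1, 5.0), ("join", 2, 10.0), ("join", 3, 15.0),
]
model = CombinedCostModel().fit(history)
print(model.predict("scan", 10))  # uses the specialized "scan" model
print(model.predict("sort", 2))   # unseen signature: uses the general fallback
```

The design choice illustrated here is the trade-off named in the abstract: specialized models are accurate but cover only recurring patterns, while a general model covers everything at lower accuracy.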

Index Terms

Information systems → Data management systems → Database management system engines → Database query processing → Query optimization

Published In

SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data

June 2020

2925 pages

ISBN:9781450367356

DOI:10.1145/3318464

General Chairs: David Maier (Portland State University, USA), Rachel Pottinger (University of British Columbia, Canada)

Program Chairs: AnHai Doan (University of Wisconsin, USA), Wang-Chiew Tan (Megagon Labs, USA)

Publications Chairs: Abdussalam Alawini (University of Illinois at Urbana-Champaign, USA), Hung Q. Ngo (RelationalAI, USA)

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SIGMOD/PODS '20: International Conference on Management of Data

Sponsor: SIGMOD

June 14 - 19, 2020

Portland, OR, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%
