research-article

Cost Models for Big Data Query Processing: Learning, Retrofitting, and Our Findings

Published: 31 May 2020
    Abstract

    Query processing over big data is ubiquitous in modern clouds, where the system takes care of picking both the physical query execution plans and the resources needed to run those plans, using a cost-based query optimizer. A good cost model, therefore, translates into better resource efficiency and lower operational costs. Unfortunately, the production workloads at Microsoft show that costs are very complex to model for big data systems. In this work, we investigate two key questions: (i) can we learn accurate cost models for big data systems, and (ii) can we integrate the learned models within the query optimizer? To answer these, we make three core contributions. First, we exploit workload patterns to learn a large number of individual cost models and combine them to achieve high accuracy and coverage over a long period. Second, we propose extensions to the Cascades framework to pick optimal resources, i.e., the number of containers, during query planning. Third, we integrate the learned cost models within the Cascades-style query optimizer of SCOPE at Microsoft. We evaluate the resulting system, Cleo, in a production environment using both production and TPC-H workloads. Our results show that the learned cost models are 2 to 3 orders of magnitude more accurate, and 20× more correlated with the actual runtimes, with a large majority (70%) of the plan changes leading to substantial improvements in latency as well as resource usage.
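
    The first two contributions can be illustrated with a toy sketch. It is not Cleo's implementation: the abstract does not specify the model family, features, or search procedure, so everything below is an assumption made for illustration. The sketch fits one simple linear model per operator template (the hypothetical `CombinedCostModel`), falls back to a single global model for templates with too few samples (accuracy where history exists, coverage elsewhere), and then picks a container count by minimizing a made-up latency formula (`pick_containers`, with an assumed per-container overhead).

```python
from collections import defaultdict
from statistics import mean


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; constant model if x has no spread."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return my, 0.0
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


class CombinedCostModel:
    """Hypothetical combination: specialized per-template models plus a global fallback."""

    def __init__(self, min_samples=3):
        self.min_samples = min_samples  # assumed threshold for trusting a specialized model
        self.specialized = {}           # template -> (intercept, slope)
        self.fallback = (0.0, 0.0)      # global model used for unseen or rare templates

    def fit(self, history):
        # history: iterable of (template, input_rows, runtime_seconds)
        by_template = defaultdict(list)
        for template, rows, runtime in history:
            by_template[template].append((rows, runtime))
        all_points = [p for pts in by_template.values() for p in pts]
        self.fallback = fit_line(*zip(*all_points))
        for template, pts in by_template.items():
            if len(pts) >= self.min_samples:
                self.specialized[template] = fit_line(*zip(*pts))
        return self

    def predict(self, template, rows):
        a, b = self.specialized.get(template, self.fallback)
        return a + b * rows


def pick_containers(model, template, rows, candidates, per_container_overhead=0.5):
    """Choose the candidate container count with the lowest predicted latency.

    Toy assumption: work is split evenly across containers and each container
    adds a fixed scheduling overhead.
    """
    def latency(n):
        return model.predict(template, rows / n) + per_container_overhead * n
    return min(candidates, key=latency)


if __name__ == "__main__":
    history = [
        ("scan", 100, 10.0), ("scan", 200, 20.0), ("scan", 300, 30.0),
        ("join", 100, 50.0), ("join", 200, 100.0), ("join", 300, 150.0),
        ("sort", 100, 20.0),  # too few samples: served by the fallback model
    ]
    model = CombinedCostModel().fit(history)
    print(model.predict("scan", 400))
    print(pick_containers(model, "join", 1000, [1, 10, 32, 100]))
```

    The fallback is what provides coverage in this sketch: a template seen only once still gets a prediction, just a less specialized one.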

    Supplementary Material

    MP4 File (3318464.3380584.mp4)
    Presentation Video

      Published In

      SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data
      June 2020
      2925 pages
      ISBN: 9781450367356
      DOI: 10.1145/3318464
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. cost models
      2. machine learning
      3. query optimization
      4. resource optimization

      Qualifiers

      • Research-article

      Conference

      SIGMOD/PODS '20
      Acceptance Rates

      Overall Acceptance Rate 785 of 4,003 submissions, 20%

      Cited By

      • (2024) Cross-Feature Transfer Learning for Efficient Tensor Program Generation. Applied Sciences 14:2 (513). DOI: 10.3390/app14020513. Online publication date: 6-Jan-2024
      • (2024) PilotScope: Steering Databases with Machine Learning Drivers. Proceedings of the VLDB Endowment 17:5 (980-993). DOI: 10.14778/3641204.3641209. Online publication date: 1-Jan-2024
      • (2024) Wii: Dynamic Budget Reallocation In Index Tuning. Proceedings of the ACM on Management of Data 2:3 (1-26). DOI: 10.1145/3654985. Online publication date: 30-May-2024
      • (2024) ML-Powered Index Tuning: An Overview of Recent Progress and Open Challenges. ACM SIGMOD Record 52:4 (19-30). DOI: 10.1145/3641832.3641836. Online publication date: 19-Jan-2024
      • (2024) Sibyl: Forecasting Time-Evolving Query Workloads. Proceedings of the ACM on Management of Data 2:1 (1-27). DOI: 10.1145/3639308. Online publication date: 26-Mar-2024
      • (2024) Machine Unlearning in Learned Databases: An Experimental Analysis. Proceedings of the ACM on Management of Data 2:1 (1-26). DOI: 10.1145/3639304. Online publication date: 26-Mar-2024
      • (2024) Modeling Shifting Workloads for Learned Database Systems. Proceedings of the ACM on Management of Data 2:1 (1-27). DOI: 10.1145/3639293. Online publication date: 26-Mar-2024
      • (2024) Learned Query Optimizer: What is New and What is Next. Companion of the 2024 International Conference on Management of Data (561-569). DOI: 10.1145/3626246.3654692. Online publication date: 9-Jun-2024
      • (2024) Robust Query Optimization in the Era of Machine Learning: State-of-the-Art and Future Directions. 2024 IEEE 40th International Conference on Data Engineering (ICDE) (5371-5375). DOI: 10.1109/ICDE60146.2024.00408. Online publication date: 13-May-2024
      • (2024) Towards Exploratory Query Optimization for Template-Based SQL Workloads. 2024 IEEE 40th International Conference on Data Engineering (ICDE) (151-164). DOI: 10.1109/ICDE60146.2024.00019. Online publication date: 13-May-2024
