
SimCost: cost-effective resource provision prediction and recommendation for spark workloads

Published: 22 June 2023
    Abstract

    Spark is one of the most popular big data analytics platforms. To save time, achieve high resource utilization, and remain cost-effective for Spark jobs, it is challenging but imperative for data scientists to configure suitable resource portions. In this paper, we investigate the parameter values that meet workloads’ performance requirements while minimizing resource cost and resource utilization time. We propose SimCost, a simulation-based cost model, to predict the performance of jobs accurately. We achieve low-cost training by taking advantage of a simulation framework, i.e., Monte Carlo simulation, which uses a small amount of data and resources to make reliable predictions for larger datasets and clusters. The salient feature of our method is that it requires only low training costs while still yielding accurate predictions. Through empirical experiments with 12 benchmark workloads, we show that the cost model yields less than 5% prediction error on average, and that the recommendation achieves up to 6x savings in resource cost.
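
    The general idea of simulation-based prediction followed by a cost-aware recommendation can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions, not the SimCost model itself, whose details are not given on this page. It assumes that task durations measured in a small profiling run can be resampled to simulate a larger job, that tasks are scheduled greedily onto identical slots, and that cost is slots times runtime times a flat price. All names (`simulate_runtime`, `predict_runtime`, `recommend_slots`, `price_per_slot_second`) are hypothetical.

```python
import heapq
import random


def simulate_runtime(task_samples, num_tasks, num_slots, rng):
    """One Monte Carlo trial: draw task durations from the profiled samples
    and schedule each task on whichever slot becomes free first."""
    finish_times = [0.0] * num_slots  # a list of equal values is a valid heap
    for _ in range(num_tasks):
        duration = rng.choice(task_samples)
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + duration)
    return max(finish_times)  # the job ends when the last slot finishes


def predict_runtime(task_samples, num_tasks, num_slots, trials=200, seed=0):
    """Average the simulated runtime over many trials."""
    rng = random.Random(seed)
    runs = [simulate_runtime(task_samples, num_tasks, num_slots, rng)
            for _ in range(trials)]
    return sum(runs) / len(runs)


def recommend_slots(task_samples, num_tasks, deadline,
                    price_per_slot_second, candidates, trials=200):
    """Return (slots, predicted_runtime, cost) for the cheapest candidate
    whose predicted runtime meets the deadline, or None if none does."""
    best = None
    for n in candidates:
        runtime = predict_runtime(task_samples, num_tasks, n, trials)
        if runtime > deadline:
            continue  # violates the performance requirement
        cost = n * runtime * price_per_slot_second  # assumed flat pricing
        if best is None or cost < best[2]:
            best = (n, runtime, cost)
    return best


if __name__ == "__main__":
    # Hypothetical task durations (seconds) from a small profiling run.
    profiled = [1.8, 2.1, 2.0, 2.4, 1.9, 3.0]
    print(recommend_slots(profiled, num_tasks=400, deadline=120.0,
                          price_per_slot_second=0.001,
                          candidates=[4, 8, 16, 32]))
```

    A real cost model would also need to account for stages, shuffles, memory limits, and cluster pricing, none of which this sketch attempts.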


            Published In

            Distributed and Parallel Databases  Volume 42, Issue 1
            Mar 2024
            140 pages

            Publisher

            Kluwer Academic Publishers

            United States

            Publication History

            Published: 22 June 2023
            Accepted: 29 May 2023

            Author Tags

            1. Parameter tuning
            2. Cost modeling
            3. Spark
            4. Resource provisioning

            Qualifiers

            • Research-article

            Funding Sources

            • Academy of Finland
            • University of Helsinki including Helsinki University Central Hospital
