research-article

SimCost: cost-effective resource provision prediction and recommendation for spark workloads

Authors:

Mohammad A. Hoque,

Sasu Tarkoma

Distributed and Parallel Databases, Volume 42, Issue 1

Pages 73 - 102

https://doi.org/10.1007/s10619-023-07436-y

Published: 22 June 2023

Abstract

Spark is one of the most popular big data analytical platforms. To save time, achieve high resource utilization, and remain cost-effective for Spark jobs, it is challenging but imperative for data scientists to configure suitable resource portions. In this paper, we investigate the parameter values that meet workloads’ performance requirements while minimizing resource cost and resource utilization time. We propose SimCost, a simulation-based cost model, to predict the performance of jobs accurately. We achieve low-cost training by taking advantage of a simulation framework, namely Monte Carlo simulation, which uses a small amount of data and resources to make reliable predictions for larger datasets and clusters. The salient feature of our method is that it keeps training costs low while still producing accurate predictions. Through empirical experiments with 12 benchmark workloads, we show that the cost model yields an average prediction error of less than 5%, and that the recommendation achieves up to 6x savings in resource cost.


Published In

Distributed and Parallel Databases Volume 42, Issue 1

Mar 2024

140 pages

ISSN: 0926-8782

© The Author(s) 2023.

Publisher

Kluwer Academic Publishers

United States

Publication History

Published: 22 June 2023

Accepted: 29 May 2023

Funding Sources

Academy of Finland
University of Helsinki including Helsinki University Central Hospital
