
Using machine learning to optimize parallelism in big data applications

Published: 01 September 2018

Abstract

In-memory cluster computing platforms have gained momentum in recent years thanks to their ability to analyse large amounts of data in parallel. These platforms are, however, complex and difficult to manage, and there is a lack of tools for understanding and optimizing them, even though they form the backbone of big data infrastructure. This leads to underutilization of available resources and to application failures. One key lever for addressing this problem is optimizing the task parallelism of the applications running in such environments. In this paper, we propose a machine learning based method that recommends optimal parameters for task parallelization in big data workloads. By monitoring and gathering metrics at the system and application levels, we find statistical correlations that allow us to characterize and predict the effect of different parallelism settings on performance. These predictions are used to recommend an optimal configuration to users before they launch their workloads in the cluster, avoiding failures, performance degradation, and wasted resources. We evaluate our method with a benchmark of 15 Spark applications on the Grid'5000 testbed and observe performance gains of up to 51% when using the recommended parallelism settings. The model is also interpretable: it gives the user insight into how different metrics and parameters affect performance.
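As a minimal sketch of the predict-then-recommend idea described above, the snippet below trains a regressor on (workload metrics, parallelism setting) pairs and then picks the candidate setting with the lowest predicted duration. The feature set, training data, and model choice here are illustrative assumptions, not the paper's actual pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical training data: each row holds workload/system metrics
# (input size in GB, number of executors) plus the parallelism setting
# used for that run; the target is the observed duration in seconds.
X_train = np.array([
    [10, 4,  8], [10, 4, 16], [10, 4, 32],
    [50, 8, 16], [50, 8, 64], [50, 8, 128],
])
y_train = np.array([120.0, 90.0, 95.0, 400.0, 260.0, 300.0])

# Fit a regression model that predicts duration from metrics + settings.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

def recommend_parallelism(input_gb, executors, candidates):
    """Return the candidate parallelism level with the lowest
    predicted duration for the given workload description."""
    rows = [[input_gb, executors, p] for p in candidates]
    preds = model.predict(rows)
    return candidates[int(np.argmin(preds))]

print(recommend_parallelism(50, 8, candidates=[16, 32, 64, 128, 256]))
```

In the paper's setting, the feature vector would come from the monitored system- and application-level metrics rather than the two placeholder features used here.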

Highlights

We train a machine learning model to predict the duration of big data workloads.
We leverage these predictions to recommend an optimal task parallelism configuration.
We evaluate our method with an Apache Spark benchmark on the Grid'5000 testbed.
We observe performance gains of up to 51% with the recommended settings.
The model is also user-interpretable.
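To illustrate how such a recommendation would be applied in practice, the sketch below passes a chosen parallelism level to Spark through the standard spark.default.parallelism property; the value here is a placeholder for the model's output, not a setting taken from the paper.

```python
from pyspark import SparkConf, SparkContext

# Apply a recommended parallelism level (placeholder value) before
# launching the workload; spark.default.parallelism controls the
# default number of partitions/tasks for RDD operations.
recommended = 64  # would come from the trained model
conf = (SparkConf()
        .setAppName("tuned-workload")
        .set("spark.default.parallelism", str(recommended)))
sc = SparkContext(conf=conf)

rdd = sc.parallelize(range(1_000_000))  # uses the default parallelism
print(rdd.getNumPartitions())
```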





      Published In

      Future Generation Computer Systems  Volume 86, Issue C
      Sep 2018
      1535 pages

      Publisher

      Elsevier Science Publishers B. V.

      Netherlands


      Author Tags

      1. Machine learning
      2. Spark
      3. Parallelism
      4. Big data

      Qualifiers

      • Research-article


      Cited By

      • (2024) TIE: Fast Experiment-Driven ML-Based Configuration Tuning for In-Memory Data Analytics. IEEE Transactions on Computers 73(5):1233-1247. https://doi.org/10.1109/TC.2024.3365937
      • (2024) A scalable and flexible platform for service placement in multi-fog and multi-cloud environments. The Journal of Supercomputing 80(1):1109-1136. https://doi.org/10.1007/s11227-023-05520-9
      • (2024) SimCost: cost-effective resource provision prediction and recommendation for spark workloads. Distributed and Parallel Databases 42(1):73-102. https://doi.org/10.1007/s10619-023-07436-y
      • (2024) Artificial neural networks based predictions towards the auto-tuning and optimization of parallel IO bandwidth in HPC system. Cluster Computing 27(1):71-90. https://doi.org/10.1007/s10586-022-03814-w
      • (2023) Towards General and Efficient Online Tuning for Spark. Proceedings of the VLDB Endowment 16(12):3570-3583. https://doi.org/10.14778/3611540.3611548
      • (2023) EFTuner: A Bi-Objective Configuration Parameter Auto-Tuning Method Towards Energy-Efficient Big Data Processing. Proceedings of the 14th Asia-Pacific Symposium on Internetware, 292-301. https://doi.org/10.1145/3609437.3609443
      • (2023) Data Integration Revitalized: From Data Warehouse Through Data Lake to Data Mesh. Database and Expert Systems Applications, 3-18. https://doi.org/10.1007/978-3-031-39847-6_1
      • (2022) Understanding the Impact of Data Parallelism on Neural Network Classification. Optical Memory and Neural Networks 31(1):107-121. https://doi.org/10.3103/S1060992X22010106
      • (2022) Evaluating push-down on NoSQL data sources. Proceedings of the International Workshop on Big Data in Emergent Distributed Environments, 1-6. https://doi.org/10.1145/3530050.3532916
      • (2022) LOCAT: Low-Overhead Online Configuration Auto-Tuning of Spark SQL Applications. Proceedings of the 2022 International Conference on Management of Data, 674-684. https://doi.org/10.1145/3514221.3526157
