
Using machine learning to optimize parallelism in big data applications

Published: 01 September 2018

Abstract

In-memory cluster computing platforms have gained momentum in recent years thanks to their ability to analyse large amounts of data in parallel. These platforms are, however, complex and difficult to manage, and there is a lack of tools for understanding and optimizing them, even though they form the backbone of big data infrastructure. This leads to underutilization of available resources and to application failures. One key lever for addressing this problem is optimizing the task parallelism of the applications running in such environments. In this paper, we propose a machine learning based method that recommends optimal parameters for task parallelization in big data workloads. By monitoring and gathering metrics at the system and application levels, we find statistical correlations that allow us to characterize and predict the effect of different parallelism settings on performance. These predictions are used to recommend an optimal configuration to users before they launch their workloads in the cluster, avoiding failures, performance degradation, and wasted resources. We evaluate our method with a benchmark of 15 Spark applications on the Grid'5000 testbed and observe performance gains of up to 51% when using the recommended parallelism settings. The model is also interpretable: it gives the user insight into how different metrics and parameters affect performance.
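As a minimal sketch of the predict-then-recommend idea described above, the snippet below trains a regressor on (workload metrics, parallelism setting) pairs and then picks the candidate setting with the lowest predicted duration. The feature set, training data, and model choice here are illustrative assumptions, not the paper's actual pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical training data: each row holds workload/system metrics
# (input size in GB, number of executors) plus the parallelism setting
# used for that run; the target is the observed duration in seconds.
X_train = np.array([
    [10, 4,  8], [10, 4, 16], [10, 4, 32],
    [50, 8, 16], [50, 8, 64], [50, 8, 128],
])
y_train = np.array([120.0, 90.0, 95.0, 400.0, 260.0, 300.0])

# Fit a regression model that predicts duration from metrics + settings.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

def recommend_parallelism(input_gb, executors, candidates):
    """Return the candidate parallelism level with the lowest
    predicted duration for the given workload description."""
    rows = [[input_gb, executors, p] for p in candidates]
    preds = model.predict(rows)
    return candidates[int(np.argmin(preds))]

print(recommend_parallelism(50, 8, candidates=[16, 32, 64, 128, 256]))
```

In the paper's setting, the feature vector would come from the monitored system- and application-level metrics rather than the two placeholder features used here.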

Highlights

We train a machine learning model to predict the duration of big data workloads.
We leverage these predictions to recommend an optimal task parallelism configuration.
We evaluate our method with an Apache Spark benchmark on the Grid'5000 testbed.
We observe performance gains of up to 51% with the recommended settings.
The model is also user-interpretable.
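To illustrate how such a recommendation would be applied in practice, the sketch below passes a chosen parallelism level to Spark through the standard spark.default.parallelism property; the value here is a placeholder for the model's output, not a setting taken from the paper.

```python
from pyspark import SparkConf, SparkContext

# Apply a recommended parallelism level (placeholder value) before
# launching the workload; spark.default.parallelism controls the
# default number of partitions/tasks for RDD operations.
recommended = 64  # would come from the trained model
conf = (SparkConf()
        .setAppName("tuned-workload")
        .set("spark.default.parallelism", str(recommended)))
sc = SparkContext(conf=conf)

rdd = sc.parallelize(range(1_000_000))  # uses the default parallelism
print(rdd.getNumPartitions())
```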





      Published In

      Future Generation Computer Systems  Volume 86, Issue C
      Sep 2018
      1535 pages

      Publisher

      Elsevier Science Publishers B. V.

      Netherlands


      Author Tags

      1. Machine learning
      2. Spark
      3. Parallelism
      4. Big data

      Qualifiers

      • Research-article


      Cited By

      • (2024) TIE: Fast Experiment-Driven ML-Based Configuration Tuning for In-Memory Data Analytics. IEEE Transactions on Computers 73(5):1233-1247. https://doi.org/10.1109/TC.2024.3365937
      • (2024) A scalable and flexible platform for service placement in multi-fog and multi-cloud environments. The Journal of Supercomputing 80(1):1109-1136. https://doi.org/10.1007/s11227-023-05520-9
      • (2024) SimCost: cost-effective resource provision prediction and recommendation for spark workloads. Distributed and Parallel Databases 42(1):73-102. https://doi.org/10.1007/s10619-023-07436-y
      • (2024) Artificial neural networks based predictions towards the auto-tuning and optimization of parallel IO bandwidth in HPC system. Cluster Computing 27(1):71-90. https://doi.org/10.1007/s10586-022-03814-w
      • (2023) Towards General and Efficient Online Tuning for Spark. Proceedings of the VLDB Endowment 16(12):3570-3583. https://doi.org/10.14778/3611540.3611548
      • (2023) EFTuner: A Bi-Objective Configuration Parameter Auto-Tuning Method Towards Energy-Efficient Big Data Processing. Proceedings of the 14th Asia-Pacific Symposium on Internetware, 292-301. https://doi.org/10.1145/3609437.3609443
      • (2023) Data Integration Revitalized: From Data Warehouse Through Data Lake to Data Mesh. Database and Expert Systems Applications, 3-18. https://doi.org/10.1007/978-3-031-39847-6_1
      • (2022) Understanding the Impact of Data Parallelism on Neural Network Classification. Optical Memory and Neural Networks 31(1):107-121. https://doi.org/10.3103/S1060992X22010106
      • (2022) Evaluating push-down on NoSQL data sources. Proceedings of the International Workshop on Big Data in Emergent Distributed Environments, 1-6. https://doi.org/10.1145/3530050.3532916
      • (2022) LOCAT: Low-Overhead Online Configuration Auto-Tuning of Spark SQL Applications. Proceedings of the 2022 International Conference on Management of Data, 674-684. https://doi.org/10.1145/3514221.3526157
