Abstract
In the era of the Internet of Things (IoT), large numbers of sensor devices continuously collect and generate sensing data. Mining fresh information from this data is essential for predicting the future and making correct decisions. Consequently, a growing number of data-intensive computing frameworks have been proposed, such as Hadoop, Spark, and Flink. Rather than reading and writing files to disk, Spark processes data in memory to improve performance, which has attracted increasing attention from researchers. However, because Spark provides a wealth of operators, a given application can be implemented in various ways, and these implementations can differ greatly in performance. Tuning a Spark application is therefore an error-prone and time-consuming process that requires developers to have a deep understanding of Spark’s operating principles and characteristics. In this paper, we summarize a series of rules, such as operator reordering and operator replacement, and use them to design and implement a Spark program optimizer, called SPOAHA, based on the artificial hummingbird algorithm. Experimental results show that, without changing the semantics of the original program, the optimized program dramatically reduces the amount of data involved in the shuffle phase and speeds up execution by up to 2.7\(\times\).
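To make the idea of operator replacement concrete, the following is a minimal sketch in plain Python (not Spark code) of one commonly cited rewrite: replacing a group-then-sum pattern with a per-partition combine before the shuffle, in the spirit of swapping `groupByKey` for `reduceByKey`. Whether SPOAHA applies this particular rule is an assumption; the function names, the toy partitions, and the record-count metric are all hypothetical and serve only as an illustration.

```python
from collections import defaultdict


def shuffle_group_by_key(partitions):
    """Mimic group-then-sum: every (key, value) record crosses the shuffle."""
    shuffled = [(k, v) for part in partitions for (k, v) in part]
    totals = defaultdict(int)
    for k, v in shuffled:
        totals[k] += v
    return dict(totals), len(shuffled)


def shuffle_reduce_by_key(partitions):
    """Mimic a map-side combine: each partition sends one partial sum per key."""
    shuffled = []
    for part in partitions:
        local = defaultdict(int)
        for k, v in part:
            local[k] += v  # combine locally before the shuffle
        shuffled.extend(local.items())
    totals = defaultdict(int)
    for k, v in shuffled:
        totals[k] += v
    return dict(totals), len(shuffled)


# Hypothetical input: two partitions of (word, 1) pairs.
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("b", 1), ("a", 1), ("c", 1)],
]

grouped, grouped_records = shuffle_group_by_key(partitions)
reduced, reduced_records = shuffle_reduce_by_key(partitions)

print(grouped == reduced)                 # same result, so semantics are preserved
print(grouped_records, reduced_records)   # records shuffled by each variant
```

Both variants compute the same per-key totals, but the second shuffles one record per key per partition instead of one record per input pair. This is the kind of semantics-preserving reduction in shuffled data that the abstract describes.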
Acknowledgement
This work is supported by the Hebei Natural Science Foundation of China [No. F2019201361].
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Wang, M., Zhen, J., Ma, Y., Huang, X., Zhang, H. (2023). SPOAHA: Spark Program Optimizer Based on Artificial Hummingbird Algorithm. In: Jin, Z., Jiang, Y., Buchmann, R.A., Bi, Y., Ghiran, AM., Ma, W. (eds) Knowledge Science, Engineering and Management. KSEM 2023. Lecture Notes in Computer Science, vol 14119. Springer, Cham. https://doi.org/10.1007/978-3-031-40289-0_26
DOI: https://doi.org/10.1007/978-3-031-40289-0_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-40288-3
Online ISBN: 978-3-031-40289-0
eBook Packages: Computer Science (R0)