SPOAHA: Spark Program Optimizer Based on Artificial Hummingbird Algorithm

Wang, Miao; Zhen, Jiteng; Ma, Yupeng; Huang, Xu; Zhang, Hong

doi:10.1007/978-3-031-40289-0_26

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14119)

Included in the following conference series:

International Conference on Knowledge Science, Engineering and Management

Abstract

In the era of the Internet of Things (IoT), large numbers of sensor devices continuously collect and generate sensing data. Analyzing this data to extract new information, predict future behavior, and make sound decisions has become essential. As a result, a growing number of data-intensive computing frameworks have been proposed, such as Hadoop, Spark, and Flink. Rather than reading and writing intermediate files on disk, Spark processes data in memory to improve performance, which has attracted considerable attention from researchers. However, because Spark provides a wealth of operators, the same application can be implemented in many ways, and these implementations can differ greatly in performance. Tuning a Spark application is therefore an error-prone and time-consuming process that requires developers to have a deep understanding of Spark's operating principles and characteristics. In this paper, we summarize a series of rewrite rules, such as operator reordering and operator replacement, and use them to design and implement a Spark program optimizer, called SPOAHA, based on the artificial hummingbird algorithm. Experimental results show that, without changing the semantics of the original program, the optimized program dramatically reduces the amount of data involved in the shuffle phase and speeds up execution by up to 2.7×.
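The abstract names operator replacement and operator reordering but does not list the concrete rules. As a hedged illustration only, the sketch below models two rewrites that are widely recommended for Spark programs: replacing a groupByKey-style aggregation with a reduceByKey-style one (map-side combining), and moving a key-only filter ahead of the shuffle. It is plain Python rather than Spark code, and every name in it (`shuffle_group_then_sum`, `shuffle_combine_then_sum`, `filter_early`, the sample partitions) is hypothetical and not taken from the paper. It only counts how many records would cross a shuffle boundary under each plan.

```python
from collections import defaultdict


def shuffle_group_then_sum(partitions):
    """groupByKey-style plan: every (key, value) record crosses the shuffle."""
    shuffled = [rec for part in partitions for rec in part]
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


def shuffle_combine_then_sum(partitions):
    """reduceByKey-style plan: each partition pre-aggregates before the shuffle."""
    shuffled = []
    for part in partitions:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value
        # Only one record per distinct key leaves each partition.
        shuffled.extend(local.items())
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


def filter_early(partitions, keep):
    """Reordering: apply a key-only filter before the shuffle instead of after."""
    return [[rec for rec in part if keep(rec[0])] for part in partitions]


# Hypothetical input: two partitions of (key, value) records.
partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("a", 4), ("c", 5), ("c", 6)],
]


def keep(key):
    return key != "b"


# Original plan: shuffle everything, aggregate, then filter the result.
base_totals, base_records = shuffle_group_then_sum(partitions)
base_result = {k: v for k, v in base_totals.items() if keep(k)}

# Rewritten plan: filter first, then aggregate with map-side combining.
opt_result, opt_records = shuffle_combine_then_sum(filter_early(partitions, keep))

print(base_result == opt_result, base_records, opt_records)
```

Both plans produce the same per-key sums because the filter depends only on the key and addition is associative and commutative, which is the kind of semantics-preserving condition a rewrite rule has to check. On this toy input the rewritten plan moves 3 records across the shuffle instead of 6.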

Acknowledgement

This work is supported by the Hebei Natural Science Foundation of China [No. F2019201361].

Author information

Authors and Affiliations

School of Cyber Security and Computer, Hebei University, Baoding, 071002, China
Miao Wang, Jiteng Zhen, Yupeng Ma, Xu Huang & Hong Zhang

Corresponding author

Correspondence to Hong Zhang.

Editor information

Editors and Affiliations

Peking University, Beijing, China
Zhi Jin
South China Normal University, Guangzhou, China
Yuncheng Jiang
Babeș-Bolyai University, Cluj-Napoca, Romania
Robert Andrei Buchmann
Ulster University, Belfast, UK
Yaxin Bi
Babeș-Bolyai University, Cluj-Napoca, Romania
Ana-Maria Ghiran
South China Normal University, Guangzhou, China
Wenjun Ma

About this paper

Cite this paper

Wang, M., Zhen, J., Ma, Y., Huang, X., Zhang, H. (2023). SPOAHA: Spark Program Optimizer Based on Artificial Hummingbird Algorithm. In: Jin, Z., Jiang, Y., Buchmann, R.A., Bi, Y., Ghiran, AM., Ma, W. (eds) Knowledge Science, Engineering and Management. KSEM 2023. Lecture Notes in Computer Science, vol 14119. Springer, Cham. https://doi.org/10.1007/978-3-031-40289-0_26

DOI: https://doi.org/10.1007/978-3-031-40289-0_26
Published: 09 August 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-40288-3
Online ISBN: 978-3-031-40289-0
eBook Packages: Computer Science, Computer Science (R0)
