SPOAHA: Spark Program Optimizer Based on Artificial Hummingbird Algorithm

  • Conference paper
Knowledge Science, Engineering and Management (KSEM 2023)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14119)

Abstract

In the era of the Internet of Things (IoT), a large number of sensor devices continuously collect and generate sensing data. It is essential to analyze these large volumes of data to extract fresh information, predict future trends, and make sound decisions. As a result, a growing number of data-intensive computing frameworks have been proposed, such as Hadoop, Spark, and Flink. Rather than reading and writing files on disk, Spark processes data in memory to improve performance, which has attracted growing attention from researchers. However, because Spark provides a wealth of operators, the same application can be implemented in various ways that differ greatly in performance. Tuning a Spark application is therefore an error-prone and time-consuming process that requires developers to have a deep understanding of Spark's operating principles and characteristics. In this paper, we summarize a series of rules, such as operator reordering and operator replacement, and use them to design and implement a Spark program optimizer called SPOAHA, which is based on the artificial hummingbird algorithm. Experimental results show that, without changing the semantics of the original program, the optimized program dramatically reduces the amount of data involved in the shuffle phase and speeds up execution by up to 2.7×.
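The two rule families named in the abstract can be illustrated with a toy model. The sketch below is not SPOAHA's rule set or implementation, which the abstract does not describe; it is a plain-Python simulation, under the assumption that "operator replacement" resembles swapping a group-then-aggregate step for one that pre-aggregates inside each partition (in Spark terms, roughly `groupByKey` versus `reduceByKey`) and that "operator reordering" resembles moving a filter ahead of the shuffle-producing step. All function names and the sample data are hypothetical.

```python
from collections import defaultdict


def shuffle_group_then_sum(partitions):
    """Baseline: every (key, value) record crosses the shuffle, then is summed per key."""
    shuffled = [record for part in partitions for record in part]
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


def shuffle_combine_then_sum(partitions):
    """Operator replacement: pre-aggregate inside each partition before the shuffle."""
    shuffled = []
    for part in partitions:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value
        shuffled.extend(local.items())  # one record per key per partition
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


def filter_early(partitions, keep):
    """Operator reordering: apply a key filter before the shuffle-producing step."""
    return [[(k, v) for k, v in part if keep(k)] for part in partitions]


# Hypothetical input: two partitions of (key, value) records.
partitions = [
    [("a", 1), ("b", 2), ("a", 3), ("c", 4)],
    [("a", 5), ("b", 6), ("b", 7), ("c", 8)],
]


def keep(key):
    return key != "c"


# Original program: shuffle everything, aggregate, then filter the result.
baseline_totals, baseline_records = shuffle_group_then_sum(partitions)
baseline_result = {k: v for k, v in baseline_totals.items() if keep(k)}

# Rewritten program: filter first, then aggregate with a partition-local combine.
optimized_result, optimized_records = shuffle_combine_then_sum(
    filter_early(partitions, keep)
)

print(baseline_result == optimized_result)  # same semantics
print(baseline_records, optimized_records)  # records crossing the shuffle
```

In this toy input both programs return the same per-key sums, while the rewritten one sends fewer records across the simulated shuffle. The reordering is only safe here because the filter depends on the key alone, so it commutes with the per-key aggregation; a filter on the aggregated value could not be moved this way.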

Acknowledgement

This work is supported by the Hebei Natural Science Foundation of China [No. F2019201361].

Corresponding author

Correspondence to Hong Zhang.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Wang, M., Zhen, J., Ma, Y., Huang, X., Zhang, H. (2023). SPOAHA: Spark Program Optimizer Based on Artificial Hummingbird Algorithm. In: Jin, Z., Jiang, Y., Buchmann, R.A., Bi, Y., Ghiran, AM., Ma, W. (eds) Knowledge Science, Engineering and Management. KSEM 2023. Lecture Notes in Computer Science, vol 14119. Springer, Cham. https://doi.org/10.1007/978-3-031-40289-0_26

  • DOI: https://doi.org/10.1007/978-3-031-40289-0_26

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-40288-3

  • Online ISBN: 978-3-031-40289-0

  • eBook Packages: Computer Science, Computer Science (R0)
