DOI: 10.1145/3428757.3429121
research-article

Performance evaluation of Apache Hadoop and Apache Spark for parallelization of compute-intensive tasks

Published: 27 January 2021

Abstract

Numerous studies have examined the performance of distributed computing frameworks, but most of them focus on processing large amounts of data. This work instead compares two such frameworks, Apache Hadoop and Apache Spark, for their ability to run CPU-intensive distributed algorithms. As a case study we used a simple but computationally intensive puzzle: finding all solutions by brute-force search requires generating 15! permutations and testing each against the solution rules. The experimental application was implemented in Java using a simple algorithm, with two distributed variants based on the MapReduce paradigm (Apache Hadoop) and on RDDs (Apache Spark). Both implementations were benchmarked on Amazon EC2/EMR clusters for performance and scalability, and the processing time of both scaled approximately linearly. According to our experiments, however, the number of tasks, hardware utilization and other aspects should also be taken into account when assessing scalability. Under Amazon EMR, the processing time measured in CPU minutes was up to 30% lower with Spark than with Hadoop MapReduce, and Spark benefits especially from an increasing number of tasks. In terms of how efficiently the EC2 resources were used, the Spark implementation performed even better than a comparable multithreaded Java solution.
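To make the brute-force idea concrete, the following is a minimal sketch of how a permutation search space can be cut into independent index ranges, so that each range can be handed to a separate task. It is not the authors' implementation. The class name `PermutationPartitions`, the rule in `isSolution` (no two neighbouring values differ by exactly 1) and the use of Java parallel streams in place of Hadoop or Spark are all assumptions made for illustration; the abstract does not describe the actual puzzle rules or the partitioning scheme.

```java
import java.util.stream.LongStream;

class PermutationPartitions {

    // n! as a long; exact for n <= 20.
    static long factorial(int n) {
        long f = 1;
        for (int i = 2; i <= n; i++) f *= i;
        return f;
    }

    // Decode the index-th permutation of 0..n-1 in lexicographic order
    // (factorial number system), so a task can start anywhere in the space.
    static int[] unrank(long index, int n) {
        int[] pool = new int[n];
        for (int i = 0; i < n; i++) pool[i] = i;
        int[] perm = new int[n];
        for (int i = 0; i < n; i++) {
            long f = factorial(n - 1 - i);
            int pos = (int) (index / f);
            index %= f;
            perm[i] = pool[pos];
            // Remove the chosen element by shifting the remaining ones left.
            System.arraycopy(pool, pos + 1, pool, pos, n - i - 1 - pos);
        }
        return perm;
    }

    // Advance to the next lexicographic permutation in place;
    // returns false if the array already holds the last one.
    static boolean nextPermutation(int[] a) {
        int i = a.length - 2;
        while (i >= 0 && a[i] >= a[i + 1]) i--;
        if (i < 0) return false;
        int j = a.length - 1;
        while (a[j] <= a[i]) j--;
        int tmp = a[i]; a[i] = a[j]; a[j] = tmp;
        for (int l = i + 1, r = a.length - 1; l < r; l++, r--) {
            tmp = a[l]; a[l] = a[r]; a[r] = tmp;
        }
        return true;
    }

    // Hypothetical stand-in for the puzzle rules (assumption, not from the paper):
    // a permutation counts as a solution if no neighbours differ by exactly 1.
    static boolean isSolution(int[] p) {
        for (int i = 0; i + 1 < p.length; i++) {
            if (Math.abs(p[i] - p[i + 1]) == 1) return false;
        }
        return true;
    }

    // One unit of work: test the permutations with indices in [from, to).
    static long countInRange(int n, long from, long to) {
        if (from >= to) return 0;
        int[] p = unrank(from, n);
        long count = 0;
        for (long k = from; k < to; k++) {
            if (isSolution(p)) count++;
            nextPermutation(p);
        }
        return count;
    }

    // Split the n! indices into `tasks` contiguous ranges and process them in parallel.
    static long countParallel(int n, int tasks) {
        long total = factorial(n);
        return LongStream.range(0, tasks).parallel()
                .map(t -> countInRange(n, total * t / tasks, total * (t + 1) / tasks))
                .sum();
    }

    public static void main(String[] args) {
        System.out.println("15! = " + factorial(15)); // size of the search space in the paper
        System.out.println("solutions for n = 8: " + countParallel(8, 16));
    }
}
```

Because every range is decoded directly from its start index, the tasks share no state. In principle the same `countInRange` call could serve as the body of a map task or of a function applied to an RDD of range boundaries. The demo uses n = 8 so that it finishes quickly; the full 15! space holds 1,307,674,368,000 permutations.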


Cited By

  • (2021) Efficient Group K Nearest-Neighbor Spatial Query Processing in Apache Spark. ISPRS International Journal of Geo-Information 10:11 (763). DOI: 10.3390/ijgi10110763. Online publication date: 11-Nov-2021
  • (2021) Performance evaluation of GPU- and cluster-computing for parallelization of compute-intensive tasks. International Journal of Web Information Systems (ahead-of-print). DOI: 10.1108/IJWIS-03-2021-0032. Online publication date: 6-Aug-2021

Published In

iiWAS '20: Proceedings of the 22nd International Conference on Information Integration and Web-based Applications & Services
November 2020
492 pages

In-Cooperation

  • Johannes Kepler University, Linz, Austria

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Apache Hadoop
  2. Apache Spark
  3. Cloud Computing
  4. Cluster Computing
  5. Distributed Systems

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

iiWAS '20
