Performance evaluation of Apache Hadoop and Apache Spark for parallelization of compute-intensive tasks

Authors:

Alexander Döschl,

Max-Emanuel Keller,

Peter Mandl

iiWAS '20: Proceedings of the 22nd International Conference on Information Integration and Web-based Applications & Services

Pages 313 - 321

https://doi.org/10.1145/3428757.3429121

Published: 27 January 2021

Abstract

Numerous studies have examined the performance of distribution frameworks, but most of them deal with the processing of large amounts of data. This work instead compares two such frameworks, Apache Hadoop and Apache Spark, for their ability to run CPU-intensive distributed algorithms. As a case study we used a simple but computationally intensive puzzle: finding all solutions by brute-force search required generating 15! permutations and testing each against the solution rules. The experimental application was implemented in Java using a simple algorithm, with two distributed variants based on the MapReduce (Apache Hadoop) and RDD (Apache Spark) paradigms. Both implementations were benchmarked on Amazon EC2/EMR clusters for performance and scalability, and the processing time of both scaled approximately linearly. According to our experiments, however, the number of tasks, hardware utilization and other aspects should also be considered when assessing scalability. Under Amazon EMR, the processing time measured in CPU minutes was up to 30% lower with Spark than with Hadoop MapReduce, and Spark's performance benefited especially from an increasing number of tasks. In terms of how efficiently the EC2 resources were used, the Apache Spark implementation even outperformed a comparable multithreaded Java solution.
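The brute-force search described in the abstract can be illustrated with a minimal Java sketch. It is not the authors' implementation. The puzzle rules are not given on this page, so `isSolution` is a hypothetical placeholder predicate, and splitting the search space by the first element of the permutation is only one assumed way to form independent tasks. The class and method names are likewise illustrative, and a small n keeps the example fast.

```java
import java.util.function.Predicate;
import java.util.stream.IntStream;

class PermutationSearch {

    // Advances a[from..] to its next lexicographic permutation.
    // Returns false once that suffix is already in descending (last) order.
    static boolean nextPermutation(int[] a, int from) {
        int i = a.length - 2;
        while (i >= from && a[i] >= a[i + 1]) i--;
        if (i < from) return false;
        int j = a.length - 1;
        while (a[j] <= a[i]) j--;
        swap(a, i, j);
        for (int l = i + 1, r = a.length - 1; l < r; l++, r--) swap(a, l, r);
        return true;
    }

    static void swap(int[] a, int x, int y) {
        int t = a[x];
        a[x] = a[y];
        a[y] = t;
    }

    // n! as a long; 15! still fits comfortably.
    static long factorial(int n) {
        long f = 1;
        for (int k = 2; k <= n; k++) f *= k;
        return f;
    }

    // One independent task: all permutations of 1..n that start with 'first'.
    static long countTask(int n, int first, Predicate<int[]> isSolution) {
        int[] perm = new int[n];
        perm[0] = first;
        int idx = 1;
        for (int v = 1; v <= n; v++) {
            if (v != first) perm[idx++] = v;
        }
        long hits = 0;
        do {
            if (isSolution.test(perm)) hits++;
        } while (nextPermutation(perm, 1));
        return hits;
    }

    // Runs the n tasks in parallel on local threads and sums their counts.
    // isSolution is a hypothetical rule check and must be thread-safe.
    static long countSolutions(int n, Predicate<int[]> isSolution) {
        return IntStream.rangeClosed(1, n)
                .parallel()
                .mapToLong(first -> countTask(n, first, isSolution))
                .sum();
    }

    public static void main(String[] args) {
        System.out.println(factorial(15)); // 1307674368000
        // Placeholder rule: the first element is smaller than the last one.
        System.out.println(countSolutions(8, p -> p[0] < p[7])); // 20160
    }
}
```

Each task owns its own array and shares no state with the others. That independence is what lets such a search be handed to separate workers, whether local threads as in this sketch or cluster tasks.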

Cited By

Moutafis P, Mavrommatis G, Vassilakopoulos M, Corral A. (2021). Efficient Group K Nearest-Neighbor Spatial Query Processing in Apache Spark. ISPRS International Journal of Geo-Information 10:11 (763). Online publication date: 11-Nov-2021. https://doi.org/10.3390/ijgi10110763

Döschl A, Keller M, Mandl P. (2021). Performance evaluation of GPU- and cluster-computing for parallelization of compute-intensive tasks. International Journal of Web Information Systems, ahead-of-print. Online publication date: 6-Aug-2021. https://doi.org/10.1108/IJWIS-03-2021-0032

Index Terms

Performance evaluation of Apache Hadoop and Apache Spark for parallelization of compute-intensive tasks
1. Applied computing
  1. Enterprise computing
    1. Enterprise architectures
      1. Enterprise architecture frameworks
2. Computing methodologies
  1. Distributed computing methodologies
    1. Distributed algorithms
      1. MapReduce algorithms

Published In

iiWAS '20: Proceedings of the 22nd International Conference on Information Integration and Web-based Applications & Services

November 2020

492 pages

ISBN:9781450389228

DOI:10.1145/3428757

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

In-Cooperation

Johannes Kepler University, Linz, Austria

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

iiWAS '20

iiWAS '20: The 22nd International Conference on Information Integration and Web-based Applications & Services

November 30 - December 2, 2020

Chiang Mai, Thailand
