research-article

Improvement of job completion time in data-intensive cloud computing applications

Authors:

Ibrahim Adel Ibrahim,

Mostafa Bassiouni

Journal of Cloud Computing, Volume 9, Issue 1

https://doi.org/10.1186/s13677-019-0139-6

Published: 07 February 2020

Abstract

Task stragglers in MapReduce jobs dramatically impede the execution of data-intensive computing jobs in cloud data centers. The slowdown stems from uneven distribution of input data, heterogeneous data nodes, resource contention, and network configuration. Skew in the intermediate data of a MapReduce job causes delay failures, in which the job completion time is violated. Data-intensive computing frameworks such as MapReduce and Hadoop YARN employ HashPartitioner. This partitioner may produce intermediate data skew, which results in straggler reducers. In this paper, we strive to make Hadoop YARN more efficient in cloud environments. We present a new partitioning scheme, the balanced data clusters partitioner (BDCP), which handles straggler Reduce tasks using sampling of the input data and feedback information about the task currently being processed. Our extensive experimental results show that BDCP can outperform the default Hadoop HashPartitioner and the Range partitioner. BDCP can assist in straggler mitigation during the reduce phase and minimize the job completion time of MapReduce jobs in data-intensive cloud computing.
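To make the skew problem concrete, the following is a minimal sketch, not the BDCP algorithm itself, whose details are not given in this abstract. It contrasts hash-style partitioning, where the reducer is chosen from the key's hash alone, with a generic greedy assignment driven by sampled key frequencies. Everything in it is an illustrative assumption: the key counts are made up, `toy_hash` is a hand-rolled stand-in for a real hash function, the sample is taken to equal the full key distribution, and `balanced_assignment` is a textbook "heaviest key to least-loaded reducer" heuristic.

```python
def toy_hash(key):
    """Deterministic stand-in for a key hash (assumption, not Hadoop's hashCode)."""
    return sum(ord(ch) for ch in key)


def hash_partition(key, num_reducers):
    """Hash-style partitioning: the reducer depends only on the key's hash."""
    return toy_hash(key) % num_reducers


def reducer_loads(key_counts, assign, num_reducers):
    """Total number of records each reducer receives under an assignment function."""
    loads = [0] * num_reducers
    for key, count in key_counts.items():
        loads[assign(key)] += count
    return loads


def balanced_assignment(sample_counts, num_reducers):
    """Greedy heuristic: place the heaviest keys first on the least-loaded reducer."""
    loads = [0] * num_reducers
    mapping = {}
    ordered = sorted(sample_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for key, count in ordered:
        target = loads.index(min(loads))  # first reducer with the smallest load
        mapping[key] = target
        loads[target] += count
    return mapping


# Hypothetical intermediate key frequencies (records per key).
key_counts = {"a": 400, "d": 300, "g": 200, "b": 60, "c": 40}
num_reducers = 3

hash_loads = reducer_loads(
    key_counts, lambda k: hash_partition(k, num_reducers), num_reducers
)

# Assumption: the sample reflects the true key distribution exactly.
mapping = balanced_assignment(key_counts, num_reducers)
balanced_loads = reducer_loads(key_counts, lambda k: mapping[k], num_reducers)

print("hash loads:    ", hash_loads)
print("balanced loads:", balanced_loads)
```

With these made-up counts, the three heaviest keys happen to collide on one reducer under the toy hash, so that reducer becomes the straggler, while the frequency-aware assignment spreads the same records far more evenly. A real partitioner would also have to cope with sampling error and with keys that never appear in the sample.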

Published In

Journal of Cloud Computing: Advances, Systems and Applications Volume 9, Issue 1

Dec 2020

883 pages

ISSN: 2192-113X

EISSN: 2192-113X

© The Author(s) 2020.

Publisher

Hindawi Limited

London, United Kingdom

Publication History

Published: 07 February 2020

Accepted: 26 September 2019

Received: 21 May 2019
