research-article

Improvement of job completion time in data-intensive cloud computing applications

Published: 07 February 2020

Abstract

Task stragglers in MapReduce jobs dramatically impede the execution of data-intensive computing jobs in cloud data centers. Stragglers arise from uneven distribution of input data, heterogeneous data nodes, resource contention, and network configurations. Skew in the intermediate data of a MapReduce job causes delay failures, in which the job completion time is violated. Data-intensive computing frameworks, such as MapReduce or Hadoop YARN, employ HashPartitioner by default. This partitioner may produce skewed intermediate data, which results in straggler reducers. In this paper, we strive to make Hadoop YARN more efficient in cloud environments. We present a new partitioning scheme, called the balanced data clusters partitioner (BDCP), which handles straggler Reduce tasks based on sampling of the input data and feedback information about the task currently being processed. Our extensive experimental results show that BDCP can outperform the default Hadoop HashPartitioner and the Range partitioner. BDCP can assist in straggler mitigation during the reduce phase and minimize the job completion time of MapReduce jobs in data-intensive cloud computing.
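To make the skew problem concrete, the following is a minimal Python sketch, not the BDCP algorithm itself, whose details are not given in this abstract. It contrasts hash-style partitioning with a sampling-informed assignment that places the heaviest sampled keys first on the least-loaded reducer. The function names, the use of CRC32 as a stand-in for a key's hash code, and the sample word counts are all assumptions made for illustration.

```python
import heapq
import zlib
from collections import Counter


def hash_partition(key: str, num_reducers: int) -> int:
    """Hash-style partitioning: a stable hash of the key modulo the reducer count.

    CRC32 is an assumed stand-in for the key's hash code.
    """
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def balanced_assignment(sample_counts: Counter, num_reducers: int) -> dict:
    """Greedy largest-first plan built from sampled key frequencies.

    Keys are visited from heaviest to lightest, and each one goes to the
    reducer with the smallest load so far.
    """
    heap = [(0, r) for r in range(num_reducers)]  # (current load, reducer id)
    heapq.heapify(heap)
    assignment = {}
    ordered = sorted(sample_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for key, count in ordered:
        load, reducer = heapq.heappop(heap)
        assignment[key] = reducer
        heapq.heappush(heap, (load + count, reducer))
    return assignment


def reducer_loads(counts, num_reducers, partition_fn):
    """Total number of records each reducer receives under partition_fn."""
    loads = [0] * num_reducers
    for key, count in counts.items():
        loads[partition_fn(key)] += count
    return loads


if __name__ == "__main__":
    # Hypothetical sampled key frequencies for a word-count style job.
    sample = Counter({"the": 50, "of": 30, "data": 10, "skew": 5, "job": 5})
    n = 3
    plan = balanced_assignment(sample, n)
    print("hash loads:    ", reducer_loads(sample, n, lambda k: hash_partition(k, n)))
    print("balanced loads:", reducer_loads(sample, n, lambda k: plan[k]))
```

In this sketch the heaviest key alone sets a lower bound on the largest reducer load, so no key-level assignment can do better than that bound. Hash partitioning can exceed it whenever heavy keys collide on the same reducer.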

Cited By

  • (2023) PRISMA Archetype-Based Systematic Literature Review of Security Algorithms in the Cloud. Security and Communication Networks 2023. 10.1155/2023/9210803. Online publication date: 1-Jan-2023

Published In

Journal of Cloud Computing: Advances, Systems and Applications  Volume 9, Issue 1
Dec 2020
883 pages
ISSN:2192-113X
EISSN:2192-113X

Publisher

Hindawi Limited

London, United Kingdom

Publication History

Published: 07 February 2020
Accepted: 26 September 2019
Received: 21 May 2019

Author Tags

  1. Cloud computing
  2. MapReduce
  3. Data-intensive computing
  4. Parallel and distributed processing
  5. Straggler reduce task
  6. Sampling

Qualifiers

  • Research-article
