Historical data based approach to mitigate stragglers from the Reduce phase of MapReduce in a heterogeneous Hadoop cluster

Bawankule, Kamalakant Laxman; Dewang, Rupesh Kumar; Singh, Anil Kumar

doi:10.1007/s10586-021-03530-x

Published: 01 February 2022

Volume 25, pages 3193–3211, (2022)

Cluster Computing

Kamalakant Laxman Bawankule ORCID: orcid.org/0000-0003-2486-4949¹,
Rupesh Kumar Dewang¹ &
Anil Kumar Singh¹

Abstract

Hadoop MapReduce processes data on a cluster of commodity hardware (nodes) in two phases, using Map and Reduce tasks. Yet Another Resource Negotiator (YARN), a dynamic resource manager, allocates resources for Map tasks while preserving data locality. In contrast, it allocates resources for Reduce tasks on any node in the cluster. This policy performs better in a homogeneous environment, where the nodes’ computing capabilities are identical. However, its performance degrades in a heterogeneous environment: allocating containers for Reduce tasks on arbitrary nodes slows down Reduce task execution and leads to computational skew. To mitigate the computational skew in the Reduce phase of MapReduce, we propose the Historical data based Reduce tasks scheduling (HDRTS) technique. The technique has two algorithms. The first finds the node average response time (NART) of each node by interpreting the job history information. The second allocates resources on the faster processing node (FPN) to schedule the Reduce tasks. To evaluate the policy’s performance, we used the popular HiBench benchmark suite. Compared with Hadoop’s default policy and several other policies, the proposed HDRTS policy reduces the Reduce task execution time for reduce-input-heavy jobs by nearly 25% to 37%. It thereby mitigates the computational skew and the stragglers in the Reduce phase of MapReduce in heterogeneous environments.
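The two steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, which this preview does not show. It assumes a hypothetical history format of (node name, task duration in seconds) pairs, takes NART to be the plain mean of a node's past task durations, and treats the nodes with the lowest NART as the faster processing nodes. The function names and sample values are invented for illustration.

```python
from collections import defaultdict


def node_average_response_time(history):
    """Return {node: mean task duration} from (node, seconds) records.

    Assumption: NART is modelled here as a simple arithmetic mean; the
    paper may weight or filter the job history differently.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for node, duration in history:
        totals[node] += duration
        counts[node] += 1
    return {node: totals[node] / counts[node] for node in totals}


def faster_processing_nodes(nart, k):
    """Return up to k node names with the lowest NART, fastest first.

    Ties are broken by node name so the result is deterministic.
    """
    ranked = sorted(nart, key=lambda node: (nart[node], node))
    return ranked[:k]


# Hypothetical job history: node-b finished its past tasks fastest.
history = [
    ("node-a", 40.0),
    ("node-a", 60.0),
    ("node-b", 20.0),
    ("node-b", 30.0),
    ("node-c", 90.0),
]

nart = node_average_response_time(history)
candidates = faster_processing_nodes(nart, 2)
print(nart)
print(candidates)
```

In this sketch a scheduler would request Reduce containers on the nodes in `candidates` first. How HDRTS actually interacts with YARN's container allocation is described only in the full article.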

Data availability

Not applicable.

Code availability

Not applicable.

Funding

The authors would like to thank the Quality Improvement Program of the All India Council for Technical Education (AICTE), India, for supporting the research.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Motilal Nehru National Institute of Technology Allahabad, Prayagraj, Uttar Pradesh, India
Kamalakant Laxman Bawankule, Rupesh Kumar Dewang & Anil Kumar Singh

Corresponding author

Correspondence to Kamalakant Laxman Bawankule.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Bawankule, K.L., Dewang, R.K. & Singh, A.K. Historical data based approach to mitigate stragglers from the Reduce phase of MapReduce in a heterogeneous Hadoop cluster. Cluster Comput 25, 3193–3211 (2022). https://doi.org/10.1007/s10586-021-03530-x

Received: 10 February 2021
Revised: 12 November 2021
Accepted: 27 December 2021
Published: 01 February 2022
Issue Date: October 2022
DOI: https://doi.org/10.1007/s10586-021-03530-x
