Abstract
Hadoop MapReduce processes data on a cluster of commodity hardware (nodes) in two phases, using Map and Reduce tasks. Yet Another Resource Negotiator (YARN), a dynamic resource manager, allocates resources for Map tasks so that data locality is preserved. In contrast, it may allocate resources for Reduce tasks on any node in the cluster. This policy performs well in a homogeneous environment, where all nodes have identical computing capabilities. In a heterogeneous environment, however, its performance degrades: containers for Reduce tasks can be allocated on slower nodes, which slows down Reduce task execution and leads to computational skew. To mitigate computational skew in the Reduce phase of MapReduce, we propose the Historical Data based Reduce Tasks Scheduling (HDRTS) technique. The technique consists of two algorithms. The first algorithm computes the node average response time (NART) of each node by interpreting the job history information. The second algorithm allocates resources on the faster processing nodes (FPN) to schedule the Reduce tasks. We evaluate the policy's performance using the widely used HiBench benchmark suite. Compared with Hadoop's default policy and several other policies, the proposed HDRTS policy reduces the Reduce task execution time for reduce-input-heavy jobs by roughly 25% to 37%. It also mitigates computational skew and stragglers in the Reduce phase of MapReduce in heterogeneous environments.
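The two steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not define how NART is computed or how a faster processing node is chosen, so the sketch assumes that NART is the arithmetic mean of historical task response times per node, that a node counts as "faster" when its NART is at most the cluster-wide mean, and that Reduce tasks are spread round-robin over those nodes. The record layout and all function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical job-history records: (node name, task response time in seconds).
# This layout is an assumption, not the format of Hadoop's job history files.
history = [
    ("node1", 40.0), ("node1", 44.0),
    ("node2", 90.0), ("node2", 110.0),
    ("node3", 60.0), ("node3", 64.0),
]


def node_average_response_time(records):
    """Step 1 (assumed definition): mean historical response time per node."""
    times = defaultdict(list)
    for node, seconds in records:
        times[node].append(seconds)
    return {node: mean(values) for node, values in times.items()}


def faster_processing_nodes(nart):
    """Assumed rule: a node is 'faster' if its NART is at most the cluster mean.

    Returns the qualifying nodes ordered from fastest to slowest.
    """
    cluster_mean = mean(nart.values())
    fast = [node for node, t in nart.items() if t <= cluster_mean]
    return sorted(fast, key=lambda node: nart[node])


def assign_reduce_tasks(num_reduce_tasks, fpn):
    """Step 2 (assumed placement rule): round-robin Reduce tasks over the FPN."""
    if not fpn:
        raise ValueError("no faster processing node available")
    return {task: fpn[task % len(fpn)] for task in range(num_reduce_tasks)}


nart = node_average_response_time(history)
fpn = faster_processing_nodes(nart)
placement = assign_reduce_tasks(4, fpn)

print(nart)
print(fpn)
print(placement)
```

In this toy history, node2 has the highest average response time and is excluded from the candidate set, so no Reduce task is placed on it. A real scheduler would also have to respect container availability on each node, which the sketch ignores.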
Data availability
Not applicable.
Code availability
Not applicable.
Funding
The authors would like to thank the Quality Improvement Program of the All India Council for Technical Education (AICTE), India, for supporting this research.
Ethics declarations
Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Bawankule, K.L., Dewang, R.K. & Singh, A.K. Historical data based approach to mitigate stragglers from the Reduce phase of MapReduce in a heterogeneous Hadoop cluster. Cluster Comput 25, 3193–3211 (2022). https://doi.org/10.1007/s10586-021-03530-x
DOI: https://doi.org/10.1007/s10586-021-03530-x