A New Framework for Evaluating Straggler Detection Mechanisms in MapReduce

Published: 12 September 2019

    Abstract

    Big Data systems (e.g., Google MapReduce, Apache Hadoop, Apache Spark) increasingly rely on speculative execution to mask slow tasks, also known as stragglers, because a job’s execution time is dominated by its slowest task instance. These systems typically identify stragglers and speculatively run copies of those tasks, expecting that a copy may complete faster and thereby shorten job execution time. There is a rich body of recent results on straggler mitigation in MapReduce. However, the majority of these do not consider the problem of accurately detecting stragglers. Instead, they adopt a particular straggler detection approach and then study its effectiveness in terms of performance (e.g., reduction in job completion time) or efficiency (e.g., high resource utilization). In this article, we consider a complete framework for straggler detection and mitigation. We start with a set of metrics that can be used to characterize and detect stragglers, including Precision, Recall, Detection Latency, Undetected Time, and Fake Positive. We then develop an architectural model by which these metrics can be linked to measures of performance, including execution time and system energy overheads. We further conduct a series of experiments to demonstrate which metrics and approaches are more effective in detecting stragglers and are also predictive of effectiveness in terms of performance and energy efficiency. For example, our results indicate that the default Hadoop straggler detector could be made more effective. In one case, Precision is low, with only 55% of the detected tasks being actual stragglers, and Recall, i.e., the percentage of actual stragglers that are detected, is also relatively low at 56%. For the same case, the hierarchical approach (i.e., a green-driven detector based on the default one) achieves a Precision of 99% and a Recall of 29%. This increase in Precision translates into lower execution time and energy consumption, and thus higher performance and energy efficiency; compared to the default Hadoop mechanism, energy consumption is reduced by almost 31%. These results demonstrate how our framework can offer useful insights and be applied in practical settings to characterize and design new straggler detection mechanisms for MapReduce systems.
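
    As a rough illustration of the two headline metrics, the sketch below computes Precision and Recall from a set of detected tasks and a set of actual stragglers. It assumes the standard set-based definitions, which match the wording of the abstract; the article's exact formulas are not reproduced here. The function name and the task IDs are hypothetical.

```python
def precision_recall(detected, actual):
    """Return (precision, recall) for a straggler detector.

    detected: IDs of tasks the detector flagged as stragglers.
    actual:   IDs of tasks that really were stragglers.
    Assumes the standard definitions; the article may refine them.
    """
    detected, actual = set(detected), set(actual)
    true_positives = len(detected & actual)
    # Precision: share of flagged tasks that are real stragglers.
    precision = true_positives / len(detected) if detected else 0.0
    # Recall: share of real stragglers that were flagged.
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical task IDs, for illustration only.
p, r = precision_recall(["t1", "t2", "t5"], ["t1", "t2", "t3", "t4"])
print(f"{p:.2f} {r:.2f}")  # prints "0.67 0.50"
```

    A detector that flags very few tasks can reach high Precision with low Recall, which is the pattern the abstract reports for the hierarchical approach (99% and 29%).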

    Published In

    ACM Transactions on Modeling and Performance Evaluation of Computing Systems, Volume 4, Issue 3
    September 2019
    151 pages
    ISSN: 2376-3639
    EISSN: 2376-3647
    DOI: 10.1145/3343140
    Editors: Sem Borst, Carey Williamson

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 12 September 2019
    Accepted: 01 April 2019
    Revised: 01 November 2018
    Received: 01 July 2017
    Published in TOMPECS Volume 4, Issue 3

    Author Tags

    1. Hadoop
    2. MapReduce
    3. Modelisation
    4. energy efficiency
    5. performance evaluation
    6. speculation
    7. stragglers

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Cited By

    • (2022) FLeet: Online Federated Learning via Staleness Awareness and Performance Prediction. ACM Transactions on Intelligent Systems and Technology 13:5 (1-30). DOI: 10.1145/3527621. Online publication date: 22-Apr-2022
    • (2022) Stragglers' Detection in Big Data Analytic Systems: The Impact of Heartbeat Arrival. 2022 22nd IEEE International Symposium on Cluster, Cloud and Internet Computing (CCGrid) (747-751). DOI: 10.1109/CCGrid54584.2022.00084. Online publication date: May-2022
    • (2022) An Optimized Straggler Mitigation Framework for Large-Scale Distributed Computing Systems. IEEE Access 10 (97075-97088). DOI: 10.1109/ACCESS.2022.3205723. Online publication date: 2022
    • (2022) Impact of Resource Millibottlenecks on Large-Scale Time Fluctuations in Spark SQL. 2022 6th Asian Conference on Artificial Intelligence Technology (ACAIT) (1-6). DOI: 10.1109/ACAIT56212.2022.10137814. Online publication date: 9-Dec-2022
    • (2022) A Straggler Identification Model for Large-Scale Distributed Computing Systems Using Machine Learning. Proceedings of the 8th International Conference on Advanced Intelligent Systems and Informatics 2022 (123-132). DOI: 10.1007/978-3-031-20601-6_10. Online publication date: 18-Nov-2022
    • (2021) Latency-aware Straggler Mitigation Strategy in Hadoop MapReduce Framework: A Review. Systematic Literature Review and Meta-Analysis Journal 2:2 (53-60). DOI: 10.54480/slrm.v2i2.19. Online publication date: 19-Oct-2021
    • (2021) A Speculative Execution Framework for Big Data Processing Systems. 2021 International Conference on Information Technology (ICIT) (616-621). DOI: 10.1109/ICIT52682.2021.9491697. Online publication date: 14-Jul-2021
    • (2020) FLeet. Proceedings of the 21st International Middleware Conference (163-177). DOI: 10.1145/3423211.3425685. Online publication date: 7-Dec-2020
    • (2020) RETRACTED ARTICLE: Detecting straggler MapReduce tasks in big data processing infrastructure by neural network. The Journal of Supercomputing 76:9 (6969-6993). DOI: 10.1007/s11227-019-03136-6. Online publication date: 1-Sep-2020
    • (2019) PRISM. Proceedings of the 5th International Workshop on Container Technologies and Container Clouds (13-18). DOI: 10.1145/3366615.3368353. Online publication date: 9-Dec-2019
