DOI: 10.5555/1924943.1924962
Article

Reining in the outliers in map-reduce clusters using Mantri

Published: 04 October 2010

    Abstract

    Experience from an operational Map-Reduce cluster reveals that outliers significantly prolong job completion. The causes of outliers include run-time contention for processor, memory, and other resources; disk failures; varying bandwidth and congestion along network paths; and imbalance in task workload. We present Mantri, a system that monitors tasks and culls outliers using cause- and resource-aware techniques. Mantri's strategies include restarting outliers, network-aware placement of tasks, and protecting the outputs of valuable tasks. Using real-time progress reports, Mantri detects and acts on outliers early in their lifetime. Early action frees up resources that can be used by subsequent tasks and expedites the job overall. Acting based on the causes of outliers and on the resource and opportunity cost of each action lets Mantri improve over prior work that only duplicates the laggards. Deployment in Bing's production clusters and trace-driven simulations show that Mantri improves job completion times by 32%.
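
    The idea of acting on outliers early from progress reports can be illustrated with a minimal sketch. This is not Mantri's algorithm, which the abstract does not specify. The `ProgressReport` type, the constant-rate extrapolation, and the `slack` threshold are all assumptions made for illustration: a task is flagged when its extrapolated remaining time is well above what a fresh copy would typically need.

```python
from dataclasses import dataclass


@dataclass
class ProgressReport:
    """Hypothetical progress report; field names are illustrative."""
    task_id: str
    elapsed_s: float      # seconds since the task started
    fraction_done: float  # fraction of work completed, in (0, 1]


def estimate_remaining(report: ProgressReport) -> float:
    """Extrapolate remaining seconds, assuming a constant processing rate."""
    total = report.elapsed_s / report.fraction_done
    return total - report.elapsed_s


def pick_restart_candidates(reports, typical_runtime_s, slack=2.0):
    """Return IDs of tasks whose estimated remaining time exceeds
    slack * typical_runtime_s. Tasks that have reported no progress yet
    are skipped, since there is nothing to extrapolate from."""
    return [
        r.task_id
        for r in reports
        if r.fraction_done > 0
        and estimate_remaining(r) > slack * typical_runtime_s
    ]


reports = [
    ProgressReport("A", elapsed_s=50.0, fraction_done=0.5),   # on track
    ProgressReport("B", elapsed_s=100.0, fraction_done=0.1),  # lagging badly
    ProgressReport("C", elapsed_s=10.0, fraction_done=0.0),   # no data yet
]
candidates = pick_restart_candidates(reports, typical_runtime_s=60.0)
```

    In this sketch, requiring the remaining time to exceed a multiple of the typical runtime is a crude stand-in for weighing the resource cost of a restart: a duplicate is only worth its slot when it is likely to finish much sooner than the running copy.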


        Published In

        OSDI'10: Proceedings of the 9th USENIX conference on Operating systems design and implementation
        October 2010
        386 pages

        Sponsors

        • NSF: National Science Foundation
        • Google Inc.
        • Infosys
        • Microsoft Research
        • USENIX Assoc

        Publisher

        USENIX Association

        United States
