DOI: 10.1145/3302424.3303973
research-article
Public Access

Managing Tail Latency in Datacenter-Scale File Systems Under Production Constraints

Published: 25 March 2019
    Abstract

    Distributed file systems often exhibit high tail latencies, especially in large-scale datacenters and in the presence of competing (and possibly higher priority) workloads. This paper introduces techniques for managing tail latencies in these systems, while addressing the practical challenges inherent in production datacenters (e.g., hardware heterogeneity, interference from other workloads, the need to maximize simplicity and maintainability). We implement our techniques in a scalable distributed file system (an extension of HDFS) used in production at Microsoft. Our evaluation uses 70k servers in 3 datacenters, and shows that our techniques reduce tail latency significantly for production workloads.

    Published In

    EuroSys '19: Proceedings of the Fourteenth EuroSys Conference 2019
    March 2019
    714 pages
    ISBN:9781450362818
    DOI:10.1145/3302424
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    EuroSys '19
    EuroSys '19: Fourteenth EuroSys Conference 2019
    March 25 - 28, 2019
    Dresden, Germany

    Acceptance Rates

    Overall Acceptance Rate 241 of 1,308 submissions, 18%

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months): 157
    • Downloads (Last 6 weeks): 13

    Citations

    Cited By

    • (2024) A quantitative evaluation of persistent memory hash indexes. The VLDB Journal - The International Journal on Very Large Data Bases 33:2 (375-397). DOI: 10.1007/s00778-023-00812-1. Online publication date: 1-Mar-2024
    • (2023) λFS: A Scalable and Elastic Distributed File System Metadata Service using Serverless Functions. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 4 (394-411). DOI: 10.1145/3623278.3624765. Online publication date: 25-Mar-2023
    • (2023) Analyzing and Improving the Scalability of In-Memory Indices for Managed Search Engines. Proceedings of the 2023 ACM SIGPLAN International Symposium on Memory Management (15-29). DOI: 10.1145/3591195.3595272. Online publication date: 6-Jun-2023
    • (2023) F3: Serving Files Efficiently in Serverless Computing. Proceedings of the 16th ACM International Conference on Systems and Storage (8-21). DOI: 10.1145/3579370.3594771. Online publication date: 5-Jun-2023
    • (2023) CPU Scheduling in Data Centers Using Asynchronous Finite-Time Distributed Coordination Mechanisms. IEEE Transactions on Network Science and Engineering (1-15). DOI: 10.1109/TNSE.2023.3236214. Online publication date: 2023
    • (2023) TailGuard: Tail Latency SLO Guaranteed Task Scheduling for Data-Intensive User-Facing Applications. 2023 IEEE 43rd International Conference on Distributed Computing Systems (ICDCS) (898-909). DOI: 10.1109/ICDCS57875.2023.00042. Online publication date: Jul-2023
    • (2023) Efficient and accurate Lyapunov function-based truncation technique for multi-dimensional Markov chains with applications to discriminatory processor sharing and priority queues. Performance Evaluation 162 (102356). DOI: 10.1016/j.peva.2023.102356. Online publication date: Nov-2023
    • (2022) Transformers for tabular data representation. Proceedings of the VLDB Endowment 15:12 (3746-3749). DOI: 10.14778/3554821.3554890. Online publication date: 1-Aug-2022
    • (2022) Degraded Mode-benefited I/O Scheduling to Ensure I/O Responsiveness in RAID-enabled SSDs. ACM Transactions on Design Automation of Electronic Systems 27:6 (1-24). DOI: 10.1145/3522755. Online publication date: 22-Nov-2022
    • (2022) Holmes. Proceedings of the 31st International Symposium on High-Performance Parallel and Distributed Computing (110-121). DOI: 10.1145/3502181.3531464. Online publication date: 27-Jun-2022
