Managing Tail Latency in Datacenter-Scale File Systems Under Production Constraints

Authors: Pulkit A. Misra, María F. Borge, Alvin R. Lebeck, Willy Zwaenepoel, and Ricardo Bianchini

EuroSys '19: Proceedings of the Fourteenth EuroSys Conference 2019

March 2019

Article No.: 17, Pages 1 - 15

https://doi.org/10.1145/3302424.3303973

Published: 25 March 2019

Abstract

Distributed file systems often exhibit high tail latencies, especially in large-scale datacenters and in the presence of competing (and possibly higher-priority) workloads. This paper introduces techniques for managing tail latencies in these systems while addressing the practical challenges inherent in production datacenters (e.g., hardware heterogeneity, interference from other workloads, and the need to maximize simplicity and maintainability). We implement our techniques in a scalable distributed file system (an extension of HDFS) used in production at Microsoft. Our evaluation uses 70k servers in 3 datacenters and shows that our techniques reduce tail latency significantly for production workloads.

Published In

EuroSys '19: Proceedings of the Fourteenth EuroSys Conference 2019

March 2019

714 pages

ISBN:9781450362818

DOI:10.1145/3302424

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGOPS: ACM Special Interest Group on Operating Systems

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

National Science Foundation

Conference

EuroSys '19

Sponsor:

SIGOPS

EuroSys '19: Fourteenth EuroSys Conference 2019

March 25 - 28, 2019

Dresden, Germany

Acceptance Rates

Overall Acceptance Rate 241 of 1,308 submissions, 18%
