Design Considerations and Analysis of Multi-Level Erasure Coding in Large-Scale Data Centers

Authors:

Meng Wang, Jiajun Mao, Rajdeep Rana, John Bent, Serkay Olmez, Anjus George, Garrett Wilson Ransom, Jun Li, Haryadi S. Gunawi

SC '23: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

Article No.: 47, Pages 1 - 13

https://doi.org/10.1145/3581784.3607072

Published: 11 November 2023

Abstract

Multi-level erasure coding (MLEC) has seen large deployments in the field, but there is no in-depth study of design considerations for MLEC at scale. In this paper, we provide comprehensive design considerations and analysis of MLEC at scale. We introduce the design space of MLEC in multiple dimensions, including code parameter selections, chunk placement schemes, and repair methods. We quantify their performance and durability, and show which MLEC schemes and repair methods provide the best tolerance against independent and correlated failures and reduce repair network traffic by orders of magnitude. To achieve this, we use several evaluation strategies, including simulation, splitting, dynamic programming, and mathematical modeling. We also compare the performance and durability of MLEC with other EC schemes such as SLEC and LRC, and show that MLEC can provide high durability with higher encoding throughput and less repair network traffic than both SLEC and LRC.
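To make the "code parameter selection" dimension concrete, the following is a minimal sketch of how a two-level scheme's parameters translate into storage efficiency and a worst-case loss bound. It is not taken from the paper: the class name `MLECScheme`, the field names, and the example parameters (10+2 at the network level, 17+3 at the local level) are all hypothetical, and the model assumes MDS codes at both levels, with each chunk of the network-level stripe protected by its own local stripe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MLECScheme:
    """Hypothetical two-level scheme: a (k_net + p_net) code across
    racks, with each network-level chunk further protected by a
    (k_loc + p_loc) code inside its rack. Illustrative model only."""
    k_net: int
    p_net: int
    k_loc: int
    p_loc: int

    def storage_efficiency(self) -> float:
        # Fraction of raw capacity that holds user data: the product
        # of the per-level efficiencies.
        net = self.k_net / (self.k_net + self.p_net)
        loc = self.k_loc / (self.k_loc + self.p_loc)
        return net * loc

    def guaranteed_tolerable_losses(self) -> int:
        # A local stripe becomes unrecoverable locally only after
        # p_loc + 1 chunk losses, and data is lost only after
        # p_net + 1 local stripes of the same network stripe fail.
        # Any smaller number of chunk losses is therefore survivable.
        return (self.p_net + 1) * (self.p_loc + 1) - 1


# Assumed example parameters, chosen only for illustration.
scheme = MLECScheme(k_net=10, p_net=2, k_loc=17, p_loc=3)
print(f"storage efficiency: {scheme.storage_efficiency():.3f}")
print(f"guaranteed tolerable chunk losses: {scheme.guaranteed_tolerable_losses()}")
```

This bound covers only the adversarial worst case within one network-level stripe. It says nothing about chunk placement, repair time, or correlated failures, which are the dimensions the paper evaluates with simulation and modeling.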

Supplemental Material

MP4 File - SC23 paper presentation recording for "Design Considerations and Analysis of Multi-Level Erasure Coding in Large-Scale Data Centers", by Meng Wang, Jiajun Mao, Rajdeep Rana, John Bent, Serkay Olmez, Anjus George, Garrett Wilson Ransom, Jun Li, Haryadi S. Gunawi

Cited By

Shen Z, Cai Y, Cheng K, Lee P, Li X, Hu Y, Shu J (2025). A Survey of the Past, Present, and Future of Erasure Coding for Storage Systems. ACM Transactions on Storage 21:1 (1-39). DOI: 10.1145/3708994. Online publication date: 8-Jan-2025.
https://dl.acm.org/doi/10.1145/3708994
Kamath K, Brewer N, Malik T (2024). FAIR Assessment of Cloud-based Experiments. 2024 IEEE 20th International Conference on e-Science (e-Science) (1-6). DOI: 10.1109/e-Science62913.2024.10678673. Online publication date: 16-Sep-2024.
https://doi.org/10.1109/e-Science62913.2024.10678673
George A, Wang M, Hanley J, Ransom G, Bent J, Zimmer C (2024). From Failure to Insight: Analyzing Disk Breakdowns in Large-Scale HPC Environments. SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis (484-495). DOI: 10.1109/SCW63240.2024.00070. Online publication date: 17-Nov-2024.
https://doi.org/10.1109/SCW63240.2024.00070

Index Terms

Design Considerations and Analysis of Multi-Level Erasure Coding in Large-Scale Data Centers
1. Computer systems organization
  1. Architectures
    1. Distributed architectures
      1. n-tier architectures
  2. Dependable and fault-tolerant systems and networks
    1. Redundancy
    2. Reliability
2. Computing methodologies
  1. Modeling and simulation
    1. Simulation evaluation

Published In

SC '23: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

November 2023

1428 pages

ISBN:9798400701092

DOI:10.1145/3581784

Chair:
Dorian Arnold,
Program Chair:
Rosa M Badia,
Program Co-chair:
Kathryn Mohror

Copyright © 2023 ACM.

Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SC '23

Sponsor:

SIGHPC

SC '23: International Conference for High Performance Computing, Networking, Storage and Analysis

November 12 - 17, 2023

Denver, CO, USA

Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
