DOI: 10.1145/3581784.3607072
research-article

Design Considerations and Analysis of Multi-Level Erasure Coding in Large-Scale Data Centers

Published: 11 November 2023 Publication History

Abstract

Multi-level erasure coding (MLEC) has seen large deployments in the field, but there is no in-depth study of design considerations for MLEC at scale. In this paper, we provide comprehensive design considerations and analysis of MLEC at scale. We introduce the design space of MLEC in multiple dimensions, including code parameter selections, chunk placement schemes, and repair methods. We quantify their performance and durability, and show which MLEC schemes and repair methods provide the best tolerance against independent and correlated failures and reduce repair network traffic by orders of magnitude. To achieve this, we use several evaluation strategies, including simulation, splitting, dynamic programming, and mathematical modeling. We also compare the performance and durability of MLEC with other EC schemes such as SLEC and LRC, and show that MLEC can provide high durability with higher encoding throughput and less repair network traffic than both SLEC and LRC.
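
As a rough illustration of what "code parameter selection" means for a two-level scheme, the sketch below composes a network-level code with a local-level code. It is not taken from the paper: the names ECLevel, mlec_efficiency, and min_chunk_losses_for_data_loss are hypothetical, the parameter values are illustrative only, and the model assumes that every network-level chunk is independently re-encoded by the local code and that all losses fall within a single stripe. Chunk placement and repair methods, which the paper analyzes, are ignored here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ECLevel:
    """One erasure-coding level: k data chunks protected by p parity chunks."""
    k: int
    p: int

    @property
    def efficiency(self) -> float:
        # Fraction of the chunks at this level that hold data rather than parity.
        return self.k / (self.k + self.p)


def mlec_efficiency(network: ECLevel, local: ECLevel) -> float:
    # Each network-level chunk is encoded again locally, so the fractions multiply.
    return network.efficiency * local.efficiency


def min_chunk_losses_for_data_loss(network: ECLevel, local: ECLevel) -> int:
    # A local stripe becomes unrecoverable only after p_local + 1 chunk losses,
    # and the network stripe only after p_network + 1 unrecoverable local stripes.
    return (network.p + 1) * (local.p + 1)


if __name__ == "__main__":
    # Illustrative parameters only (assumed, not a recommendation from the paper).
    net, loc = ECLevel(k=10, p=2), ECLevel(k=17, p=3)
    print(f"usable fraction: {mlec_efficiency(net, loc):.3f}")
    print(f"smallest fatal loss pattern: {min_chunk_losses_for_data_loss(net, loc)} chunks")
```

Under these assumptions, adding parity at either level lowers the usable fraction multiplicatively while raising the size of the smallest fatal loss pattern multiplicatively, which is one reason the parameter choice is a tradeoff rather than a single best setting.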

Supplemental Material

MP4 File - SC23 paper presentation recording for "Design Considerations and Analysis of Multi-Level Erasure Coding in Large-Scale Data Centers", by Meng Wang, Jiajun Mao, Rajdeep Rana, John Bent, Serkay Olmez, Anjus George, Garrett Wilson Ransom, Jun Li, Haryadi S. Gunawi


Cited By

  • (2025) A Survey of the Past, Present, and Future of Erasure Coding for Storage Systems. ACM Transactions on Storage 21:1 (1-39). DOI: 10.1145/3708994. Online publication date: 8-Jan-2025
  • (2024) FAIR Assessment of Cloud-based Experiments. 2024 IEEE 20th International Conference on e-Science (e-Science) (1-6). DOI: 10.1109/e-Science62913.2024.10678673. Online publication date: 16-Sep-2024
  • (2024) From Failure to Insight: Analyzing Disk Breakdowns in Large-Scale HPC Environments. SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis (484-495). DOI: 10.1109/SCW63240.2024.00070. Online publication date: 17-Nov-2024

Published In

SC '23: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2023
1428 pages
ISBN:9798400701092
DOI:10.1145/3581784
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data centers
  2. HPC storage
  3. scalable storage
  4. reliability
  5. data protection
  6. erasure coding
  7. system-design tradeoffs

Qualifiers

  • Research-article

Conference

SC '23
Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 199
  • Downloads (Last 6 weeks): 5
Reflects downloads up to 26 Jan 2025.
