DOI: 10.1145/3603269.3604860
research-article
Open access

EBB: Reliable and Evolvable Express Backbone Network in Meta

Published: 01 September 2023

Abstract

We present the design, implementation, evaluation, deployment and production experiences of EBB (Express BackBone), a private WAN (Wide Area Network) connecting Meta's global data centers (DCs). Initiated in 2015, EBB now carries 100% of DC-DC traffic, witnessing remarkable growth over the years. A key design aspect of EBB is its multi-plane architecture, facilitating seamless deployment of a new control plane while ensuring operational simplicity. This architecture allows for efficient failure mitigation, standard maintenance, and capacity expansion by draining one or two planes without impacting service level objectives (SLOs). Another critical design decision is the hybrid model, combining distributed control agents and a central controller. EBB's centralized traffic engineering utilizes an MPLS-TE based solution to allocate paths periodically for different traffic classes based on service requirements, while its distributed control agents enable fast local failure recovery by pre-installing pre-computed backup paths in the data plane. We delve into our eight-year production experience, highlighting the successful deployment of multiple generations of EBB.
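The two mechanisms named in the abstract, plane draining and local failover onto pre-installed backup paths, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not state the number of planes, their capacity, or how traffic is divided among them, so the sketch uses four hypothetical planes with an equal split, and the names `Plane`, `split_demand`, `links_of`, and `select_path` are invented here rather than taken from EBB.

```python
from dataclasses import dataclass


@dataclass
class Plane:
    """One backbone plane; the capacity figure is hypothetical."""
    name: str
    capacity_gbps: float
    drained: bool = False


def split_demand(planes, demand_gbps):
    """Spread demand evenly over the planes that are not drained.

    The equal split is an assumption of this sketch, not a claim about EBB.
    """
    active = [p for p in planes if not p.drained]
    if not active:
        raise RuntimeError("every plane is drained")
    share = demand_gbps / len(active)
    overloaded = [p.name for p in active if share > p.capacity_gbps]
    if overloaded:
        raise RuntimeError(f"demand exceeds capacity on {overloaded}")
    return {p.name: share for p in active}


def links_of(path):
    """Return the undirected links that a node path traverses."""
    return {frozenset(hop) for hop in zip(path, path[1:])}


def select_path(primary, backup, failed_links):
    """Local decision: keep the primary unless it crosses a failed link."""
    if links_of(primary).isdisjoint(failed_links):
        return primary
    if links_of(backup).isdisjoint(failed_links):
        return backup
    return None  # neither path survives; a controller would have to recompute


if __name__ == "__main__":
    planes = [Plane(f"plane{i}", capacity_gbps=100.0) for i in range(4)]
    print(split_demand(planes, 120.0))   # all four planes carry traffic
    planes[0].drained = True             # drain one plane for maintenance
    print(split_demand(planes, 120.0))   # remaining three absorb the load

    failed = {frozenset(("A", "B"))}     # hypothetical link failure
    print(select_path(["A", "B", "C"], ["A", "D", "C"], failed))
```

In this sketch the failover decision needs only state that is already present at the node (the two paths and the set of failed links), which mirrors the division of labor the abstract describes: the central controller allocates paths periodically, while agents react to failures locally without waiting for it.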


        Published In

        ACM SIGCOMM '23: Proceedings of the ACM SIGCOMM 2023 Conference
        September 2023
        1217 pages
        ISBN: 9798400702365
        DOI: 10.1145/3603269
        Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Author Tags

        1. wide-area networks
        2. traffic engineering
        3. software-defined networking

        Qualifiers

        • Research-article

        Conference

        ACM SIGCOMM '23
        Sponsor:
        ACM SIGCOMM '23: ACM SIGCOMM 2023 Conference
        September 10, 2023
        New York, NY, USA

        Acceptance Rates

        Overall Acceptance Rate 462 of 3,389 submissions, 14%

        Article Metrics

        • Downloads (last 12 months): 1,705
        • Downloads (last 6 weeks): 268
        Reflects downloads up to 08 Feb 2025

        Cited By

        • (2025) Meta’s Hyperscale Infrastructure: Overview and Insights. Communications of the ACM 68:2 (52-63). DOI: 10.1145/3701296. Online publication date: 21-Jan-2025
        • (2025) Cooperative Graceful Degradation in Containerized Clouds. Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1 (214-232). DOI: 10.1145/3669940.3707244. Online publication date: 3-Feb-2025
        • (2024) A Decentralized SDN Architecture for the WAN. Proceedings of the ACM SIGCOMM 2024 Conference (938-953). DOI: 10.1145/3651890.3672257. Online publication date: 4-Aug-2024
        • (2024) MegaTE: Extending WAN Traffic Engineering to Millions of Endpoints in Virtualized Cloud. Proceedings of the ACM SIGCOMM 2024 Conference (103-116). DOI: 10.1145/3651890.3672242. Online publication date: 4-Aug-2024
        • (2024) RDMA over Ethernet for Distributed Training at Meta Scale. Proceedings of the ACM SIGCOMM 2024 Conference (57-70). DOI: 10.1145/3651890.3672233. Online publication date: 4-Aug-2024
        • (2024) Balancing SDN Control Plane Availability and Traffic Engineering Efficiency in Data Centers. 2024 IEEE 32nd International Conference on Network Protocols (ICNP) (1-12). DOI: 10.1109/ICNP61940.2024.10858573. Online publication date: 28-Oct-2024
