
HEPiX Network Functions Virtualisation Working Group Report

2020
Editors: Shawn McKee, Marian Babik

Authors: Attebury, G. (UNL), Babik, M. (CERN), Campanella, M. (GARR), Capone, V. (GEANT), Chen, Hsin-Yen (Academia Sinica), Chown, T. (Jisc), Dart, E. (ESnet), Guok, C. (ESnet), Hoeft, B. (KIT), Ibarra, J. (FIU), Lapacz, R. (PSNC), Martelli, E. (CERN), McKee, S. (Univ. of Michigan), Monga, I. (ESnet), Naegale-Jackson, S. (FAU), Newman, H. (Caltech), Seuster, R. (Univ. of Victoria), Sim, A. (ESnet), Suerink, T. (Nikhef), Zani, S. (INFN), Wu, Wenji (FNAL)

Version 2.0

Contents

HEPiX NFV WG
Executive Summary
Introduction
Cloud Native DC Networking
    Motivation
    Container Networking
    Rethinking Network Design
    Network Virtualisation
    Network Virtualisation Solutions for the DC
        Cumulus
        OpenVSwitch
        Tungsten Fabric
        Tigera Calico
        Cilium
        Contiv/VPP
        Flannel
        Summary
    Network Disaggregation
    Programmable Network Interfaces
    Linux Kernel Networking Models
    DC Edge
    Challenges and Outlook
Programmable Wide-Area Networks
    Motivation
    Programmable Networks for Data-Intensive Science
    Software-defined WAN (SD-WAN)
        What is a Software-Defined WAN?
        What is the SD-WAN Standard?
    Network Provisioning (multi-ONE)
    Network Service Interface (NSI)
    Software defined exchanges (SDX)
    Orchestrators
        SENSE
        NOTED
    Data Transfer Services
        BigData Express
    Research & Education Networks Programmable Services
        ESnet6
            ESnet6 Architecture for programmability
            "High-touch" services
            In-network caching/computing
        FABRIC
        GEANT Higher-level Services
            GÉANT community survey on orchestration, automation and virtualization
            GÉANT Testbeds Service
        AmLight Express and Protect (AmLight-ExP)
            Support for Network Virtualization and Programmability on AmLight-ExP
    Challenges and Outlook
Proposed Areas of Future Work
Conclusions
Terminology
Acknowledgments
Appendix
    StarLight SDX
    AtlanticWave-SDX
    Pacific Wave Expansion Supporting SDX & Experimentation
HEPiX NFV WG

Working group members: Ahmed Khouder; Andrea Manzi, CERN; Andreas Petzold; Balde Thierno; Ben Mack-Crane; Bruno Hoeft; Dale W. Carder, ESnet; Darcy Hodgson; Dennis van Dok; Duncan Rand, Imperial College London and Jisc; Edoardo Martelli, CERN; Enzo Capone; Eric Yen; Eygene Ryabinkin; Eyle Brinkhuis, SURFnet; Fabio Farina; Fazhi Qi; Felix Lee; Fernando Lopez, PIC; Garhan Attebury, University of Nebraska; Gerben van Malenstein; Giancarlo Viola; Gloria Vuagnin; Harvey Newman, Caltech; Hsin-Yen Chen; Inder Monga; Jan Erik Sundermann; Jin Kim; Joe Metzger; Josep Flix; Julio Ibarra, Florida International University; Kars Ohrenberg; Lars Fischer; Laurent Caillat-Vallet; Marco Marletta, GARR; Marian Babik, CERN; Mario Lassnig; Massimo Carboni, GARR; Matt Zekauskas; Mauro Campanella, GARR; Mian Usman; Michael Steder; Paolo Bolletta, GARR; Shawn McKee, University of Michigan; Stefano Zani, INFN; Rolf Seuster; Sylvain Garrigues; Tao Zhang; Thomas Hartmann; Tim Bell, CERN; Tim Chown, Jisc; Tomoaki Nakamura; Tony Cass, CERN; Tristan Suerink; Warrick Mitchell; Wenji Wu, Fermilab; Xavier Jeannin; Xin Zhao; Yury Gugel

Executive Summary

High Energy Physics (HEP) experiments rely on networks as one of the critical parts of their infrastructure, both within the participating laboratories and sites and globally to interconnect the sites, data centres and experiment instrumentation.
Network virtualisation and programmable networks are two key enablers that facilitate agile, fast and more economical network infrastructures as well as service development, deployment and provisioning. Adoption of these technologies by HEP sites and experiments will allow them to design more scalable and robust networks while decreasing the overall cost and improving the effectiveness of resource usage. The primary challenge we face is ensuring that WLCG and its constituent collaborations will have the networking capabilities required to most effectively exploit LHC data for the lifetime of the LHC.

Network virtualisation and programmable networks are nowadays quite common in commercial cloud and telco deployments and have also been deployed by some of the Research and Education (R&E) network providers to manage Wide-Area Networks (WAN), but only a few HEP sites are pursuing new models and technologies to build up their networks and data centres, and most of the existing efforts are currently focused on improvements within a single domain or organisation, usually motivated by organisation-specific factors. Therefore, most of the known work in this area is site- or domain-specific. In addition, there is a significant gap in our understanding of how these new technologies should be adopted, deployed and operated, and how the interplay between LAN and WAN will be organised in the future. While it is still unclear which technologies will become mainstream, it is already clear that software-defined and programmable networks will play a major role in the mid-term.

With the aim of better understanding these technologies and their use cases for HEP, a Network Functions Virtualisation Working Group (NFV WG) has been formed within the High Energy Physics Information Exchange (HEPiX). In the initial phase, the working group focused on identifying the work already done, looking at existing projects and their results, and better understanding the various approaches and technologies and how they can be helpful for HEP use cases. This report provides a summary of this phase and, in addition, attempts to identify key areas that require further work and that could be discussed within the community to find synergies and understand where existing efforts should be focused.

This document presents a technical overview of the existing approaches in network virtualisation and programmable networks. It starts by explaining how current paradigm shifts in computing and clouds are impacting networking and how this will fundamentally change the way networks are designed in data centres and sites. Cloud native networking approaches involving new topologies, network disaggregation and virtualisation have been identified as the primary drivers that will impact data centre networking, which will in turn impact how data centres are inter-connected in the future. The second chapter, devoted to the WAN, discusses capacity sharing, network provisioning and software-defined approaches and highlights key projects in the area. This is complemented by a brief survey of the mid-term plans of the major R&E providers. The report concludes with proposed project areas and milestones that should serve as input for discussing potential future work in this area.
While there are many technology choices that need discussion and exploration, the most important thing is ensuring that the experiments and sites collaborate with the RENs, network engineers and researchers to develop, prototype and implement a useful, agile network infrastructure that is well integrated with the computing and storage frameworks being evolved by the experiments, as well as with the technology choices being implemented at the sites and RENs.

Introduction

High Energy Physics (HEP) experiments have greatly benefited from a strong relationship with Research and Education (REN) network providers and, thanks to projects such as LHCOPN/LHCONE and REN contributions, have enjoyed significant capacities on globally distributed high performance networks for some time. RENs have been able to continually expand their capacities to over-provision their networks relative to the experiments' needs and were thus able to cope with the recent rapid growth of traffic between sites, both in terms of achievable peak transfer rates and in the total amount of data transferred. For some HEP experiments this has led to designs that enable remote direct data access, where the network is considered an appliance with almost infinite capacity. Networks are recognised as a critical enabling component for HEP and there is an implicit expectation that HEP will continue to enjoy over-provisioned use of the world's RENs for the foreseeable future.

However, there are reasons to believe that the network situation will change, due to both technological and non-technological reasons, starting in the next few years. Factors in play include the anticipated growth of non-HEP network usage as other large-data-volume sciences come online; the introduction of cloud native and commercial networking and their respective impact on usage policies and security; as well as technological limitations of optical interfaces and switching equipment.

As the scale and complexity of the current HEP network grows rapidly, new technologies and platforms are being introduced that greatly extend the capabilities of today's networks. With many of these technologies becoming available, it's important to understand how we can design, test and develop systems that could enter existing production workflows while at the same time changing something as fundamental as the network that all sites and experiments rely upon.

This report surveys specific approaches that could be used to build scalable, high-throughput and reliable network infrastructures within and between data centres. It is aimed at site managers, network engineers and HEP experiment infrastructure architects looking to understand the core concepts of network virtualisation and software-defined networking and how data centres can be inter-connected in the mid-term future. In particular, the report focuses on specific network capabilities related to cloud-native data centres (those focusing on providing significant resources in a virtualised manner) and on existing projects and technologies that can enable programmable wide-area networks to interconnect the data centres. In addition, potential use cases for both categories are listed and the trade-offs of the specific approaches are discussed.

This report is broken into three main chapters:
● Cloud Native DC Networking - focuses mainly on data centre network design, topologies and existing approaches in network virtualisation;
it also discusses network disaggregation and how it is integrated into the Linux kernel networking stack, and it concludes with a section on the data centre edge, an area that falls in between the two chapters, with a short overview of existing edge services and technologies.
● Programmable Wide-Area Networks - focuses on the existing use cases and novel technologies for data centre interconnects (DCIs) and SD-WAN (SENSE, etc.), as well as applications that integrate network and transfer technologies (BigData Express, NOTED). It also includes the plans of the R&E providers with respect to programmable services, focusing on network provisioning and operations (tracing/monitoring) of VPNs (such as LHCONE).
● Outlook and Areas of Future Work - this is perhaps the most important chapter, because it tries to identify and initiate near-term collaborations between the experiments, the sites and the research and education networks to deliver the capabilities needed by HEP for their future infrastructure, while enabling the sites and NRENs to most effectively support HEP with the resources they have.

Cloud Native DC Networking

Motivation

One of the main drivers for network evolution in the data centres is the changing nature of the applications; if applications did not change, there would be no need to change the underlying networks. With the dawn of virtualisation, applications have started to morph at an accelerating pace and have moved from mostly static deployments (on bare metal servers) through virtual machines to containers and, more recently, to clusters of containers sometimes referred to as microservices (or server-less) [MS]. This evolution, shown in the figure below, also highlights one of the main aspects that has changed for networking: the usual life-cycle of the application (develop, deploy, update, re-deploy) has decreased from hours to microseconds. Establishing a full-scale cluster of hundreds of containers can be done in less than a second, and upgrades or re-deployments of such a cluster can be done on a rolling basis, which means that the entire cluster can be replaced with new containers, all over again, in seconds. This increasingly dynamic environment can easily cause a lot of trouble for the networking infrastructure, which is trying to keep up with this rapid pace.

The rise of Linux and the economics of scale have led to the development and operation of clouds. HEP sites have been predominantly statically deployed, with allocations usually served by batch systems operating at job-level granularities and with high-capacity storage hosting the datasets locally. This is slowly changing as experiments are starting to deploy their job payloads in containers and services are moving to container-based or even Kubernetes (K8s) deployments. This creates an interesting shuffle where multiple technologies are starting to overlap and compete. Recently, OSG announced plans to optionally deploy all of its services on container platforms [GDB], and some of the NSF-funded projects such as SLATE, which investigate new infrastructures for sites, are entirely based on Kubernetes (K8s). Physics analysis running on K8s with a full dataset uploaded to the cloud has been demonstrated on the Google Cloud Platform [Kubecon]. It can be expected that this evolution will continue and accelerate, potentially even picking up the pace and causing major headaches in the networking departments at HEP labs and sites.
Network engineers are facing major challenges while trying to accommodate the new computing models in an environment where they often need to support not only the cutting-edge container technologies that are now popular, but at the same time legacy systems ("bare metal"), virtual machines and other equipment that needs to be connected to the network or even to multiple networks (e.g. experimental equipment, technical networks) with custom designs and protocols. As the situation varies greatly between sites, this document will focus on answering the most common questions related to cloud native network design, showing possible approaches to consider and discussing their trade-offs.

The fast-paced application life-cycle is not the only challenge: virtualisation is progressing into areas that were previously tied to hardware, such as GPUs or storage systems. With GPU and storage virtualisation, there is a need for improved latency and throughput within the data centre (so-called east-west traffic) in order to enable more efficient allocation and use of resources.

Network vendors have already recognised the cloud opportunity and have started to reprofile their revenue expectations from enterprises towards cloud providers [Arista, A2, Cisco]. This will likely have an impact on what the vendors will start pushing and how the network infrastructure and supporting software will evolve in the mid-term. As cloud has the potential to become a dominant way to run compute infrastructure, this section aims to explain what impact this has on network design and what approaches are slowly proliferating into the mainstream. This section is aimed at site managers who are responsible not just for the network but also for compute and storage and who are looking for a way to understand current trends in networking, as well as at network engineers looking for information that is usually scattered across many different sources. As cloud native DC networking is quite a broad topic, the main focus will be on the introduction of the core concepts and a brief explanation of the basic trade-offs and challenges, with references pointing to more in-depth coverage.

The section is organised as follows:
● Container networking - introduces basic concepts in container networking and explains its complexity and the main challenges in the implementation of the DC network design
● New topologies - introduces topologies and design principles used by the existing cloud and telco providers to build large-scale DCs hosting container and VM technologies
● Network Virtualisation - provides an in-depth overview of the existing approaches in network virtualisation and discusses the main trade-offs
● Network Disaggregation - highlights an important trend of network technologies moving to open source hardware and network operating systems based on Linux
● Programmable Network Interfaces - gives an overview of the existing approaches to programming network cards/interfaces, which is the main building block for many of the existing network virtualisation and design implementations

Container Networking

This section briefly introduces the basic concepts of container networking; it partly summarises a demo given at one of the first HEPiX NFV WG meetings (https://indico.cern.ch/event/715631/). Although current data centres are still mostly organised around bare metal and virtual machines, the focus here is on container networking because the same core concepts apply, while with containers it is the scale and performance that makes the difference.
Another reason is that the intention of this section is not to introduce containers from the application perspective, but rather to look at them from the perspective of networking, which is where application and network engineers in the DC meet to make critical design decisions (and it is very important that they do meet, as otherwise things are likely not going to work out). Both containers and virtual machines can be framed around the same concepts in Linux, which is the predominant operating system of choice, so this introduction explains the basic concepts in Linux networking and how they connect to the VM/container world, e.g. how a container interface gets an IP address, how it communicates with the external world and with the other containers on the same host, on multiple hosts, etc.

Containers are built around two core Linux kernel constructs, namespaces and cgroups (control groups). Simply put, cgroups ensure proper sharing of resources such as CPU, memory and network for a collection of processes, while namespaces provide a means of isolation. Namespaces are a feature of the Linux kernel that partitions kernel resources such that one set of processes sees one set of resources while another set of processes sees a different set of resources [NS], while at the same time resources can be shared across multiple namespaces. One such resource is the network, which virtualises network stacks and allows any process to create a number of network interfaces, physical or virtual, which are only seen by the process or its sub-processes and have their own routing table, firewall, listening sockets, connection tracking, etc. Since the release of the 4.1 Linux kernel there are five additional resources, apart from the network, that can be isolated: UTS (allows a single system to appear to have different hostnames and domains), mount (controlling mount points), process ID (this makes it possible to have multiple processes with PID=1, each in its own namespace), Interprocess Communication (IPC) and finally user IDs. As a network namespace encompasses the entire network stack all the way up to and including the transport layer, it is perfectly fine to have multiple HTTP daemons running on a single server, all listening on the same port 80. To network engineers this probably sounds similar to the concept of VRF (virtual routing and forwarding), which can be seen as a subset of the namespace concept, and some vendors have actually implemented VRF in the Linux kernel using network namespaces.

Containers can have four operation modes: no network, host network, single-host network and multi-host network. The no-network option is obvious; the host network is usually implemented by running a privileged container that can directly access the default network stack. Single- and multi-host networks are more complex, see the figure below.

In the single-host network there are two approaches, bridge or macvlan. In the first one, a standard Linux bridge is created and connects any number of new containers via virtual ethernet interfaces (veth). Veth pairs are a kind of tunnel that can connect interfaces between namespaces or directly to a physical interface (in this case to the bridge). This setting creates a small virtual (L2) switch that allows the different containers within a single host to communicate with each other and connect to the external world via NAT.
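A minimal sketch of this bridge approach, driving the standard ip tool from Python (interface names, addresses and the namespace standing in for a container are all illustrative, and the commands require root privileges), could look as follows:

```python
import subprocess

def sh(cmd: str) -> None:
    """Run one iproute2 command and fail loudly if it errors."""
    subprocess.run(cmd.split(), check=True)

# bridge acting as the small virtual L2 switch on the host
sh("ip link add name br0 type bridge")
sh("ip addr add 10.200.0.1/24 dev br0")
sh("ip link set br0 up")

# network namespace standing in for a container
sh("ip netns add c1")

# veth pair: one end stays on the host and is attached to the bridge,
# the other end is moved into the container namespace
sh("ip link add veth-c1 type veth peer name eth0-c1")
sh("ip link set veth-c1 master br0")
sh("ip link set veth-c1 up")
sh("ip link set eth0-c1 netns c1")

# address and default route for the container end, via the bridge
sh("ip -n c1 link set eth0-c1 up")
sh("ip -n c1 addr add 10.200.0.2/24 dev eth0-c1")
sh("ip -n c1 route add default via 10.200.0.1")

# reaching the outside world additionally needs an iptables/nftables
# masquerade rule on the host, omitted here
```

Container runtimes such as Docker perform essentially these steps automatically when a container is attached to the default bridge network.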
The other option is macvlan, which is a layer 2 virtual network interface associated with a physical interface. The kernel assigns each macvlan interface a unique MAC address and then uses it to change the MAC address of the outgoing packet, and vice versa, to match the correct container. In this way the container is exposed to the network as an additional host. While macvlan clearly outperforms the bridge solution, it also has some downsides: two containers on the same host don't have direct connectivity (the packet needs to go out to the switch/router and back), the assignment of IP addresses needs to be coordinated, and there is a limit on the number of MAC addresses for a single NIC.

In multi-host networking there are also two options, an overlay network or direct routing. An overlay network creates a layer 2 network between containers; the most common protocol used is VXLAN, where each VTEP (VXLAN tunnel endpoint) is bound to a server/hypervisor IP address and tunnelling is established between the bridges created inside each server, see the figure below. Coordination of IP address assignment and tunnelling management is then usually performed by an orchestrator, e.g. Swarm in the case of Docker, Weave, Flannel, etc. When overlay networking is used, Docker will set up two interfaces, one for communication between containers (a bridge as described in the single-host networking) and the other for communication with the outside world (docker_gwbridge, which has its own subnet). The routing table is then set up to use the appropriate interface for each form of communication.

Direct routing, on the other hand, is a layer 3 solution, which uses a single-host container network driver, typically bridge, to build multiple containers that are connected via routing (there is no NAT), see the figure above (part b). A routing daemon runs on the individual servers and announces the container's IP address or the bridge subnet. Examples of such models are Tigera's Calico, Cilium, Contiv/FD.io and others (all described in more detail in the next section). The Kubernetes orchestrator can use kube-router, which can be deployed to provision direct routing. Finally, this solution can also run with an independent daemon (like FRR), which would then advertise the bridge subnet route whenever it comes up (i.e. without writing additional code to integrate with the orchestrator).

In general, there are trade-offs between the approaches described; overall, direct approaches are likely to outperform the ones based on overlays. This is mainly due to the processing needed to operate the overlay tunnels, but also due to the complex setup that is usually needed on the end host (a chain of veth pairs, bridge and NIC). On the other hand, direct approaches increase the complexity of DHCP assignment due to the presence of multiple IPAM masters (server vs container) and require L3 networks, which in turn means deployment and operation of separate daemons that integrate with the rest of the networking (usually via BGP).

Kubernetes (K8s), which is the most popular container orchestrator, organises containers into pods (sets of containers spun up together, usually sharing the same namespace and cgroup) and provides pluggable networking models via CNI (Container Network Interface).
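CNI plugins are driven by small JSON configuration files placed on each node; as a rough illustration (plugin choice, names and subnet are hypothetical, not a recommendation of this report), a bridge-type pod network resembling the single-host setup described earlier could be declared like this, here generated from Python:

```python
import json

# Minimal CNI network configuration using the reference "bridge" plugin
# with host-local IP address management; all names and the subnet are
# illustrative only.
cni_conf = {
    "cniVersion": "0.4.0",
    "name": "podnet",
    "type": "bridge",            # reference bridge plugin
    "bridge": "cni0",            # Linux bridge created on the node
    "isGateway": True,           # bridge gets an address and acts as gateway
    "ipMasq": True,              # masquerade traffic leaving the pod subnet
    "ipam": {
        "type": "host-local",    # per-node address allocation
        "subnet": "10.244.1.0/24",
        "routes": [{"dst": "0.0.0.0/0"}],
    },
}

# kubelet/containerd typically pick such files up from /etc/cni/net.d/
with open("10-podnet.conf", "w") as f:
    json.dump(cni_conf, f, indent=2)
```

More capable plugins (Calico, Cilium and others discussed later in this chapter) are configured the same way, but implement routing and policy themselves.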
For a functional cluster one has to deploy a pod networking solution (via CNI) which is expected to use no NAT for communication between the pods; all nodes should be able to communicate with all containers without NAT (and vice versa), and the IP that a container sees for itself is the same IP that others see it as [CNI]. Any non-trivial containerised application will run multiple pods for different services as well as a layer-4 service proxy that provides load balancing and service discovery. In addition, there are network security policies that need to be implemented in order to secure the pods. All this makes it a non-trivial endeavour, considering that this is just one of the applications in the DC networking.

Virtual machine networking is quite similar to what was described; the same concepts apply, but the setup is usually more static as VM turn-around is not as frequent. The usual option is to connect the VMs via veth pairs and configure some form of network virtualisation as described in the next section.

Overall, with so many possible choices and such a wide range of networking functionality already available on the servers, there is no doubt that this is becoming a major challenge. Deploying individual solutions for each piece of functionality can easily end up with lots of moving parts, making it extremely difficult to operate and troubleshoot. Coming up with a single solution that would work well for the entire range is non-trivial and requires both application and network engineers to come together, which (among other things) makes container networking hard. The next section discusses some of the existing industry approaches and their trade-offs in more detail.

Rethinking Network Design

As was shown in the previous section, the current generation of applications are complex sets of services that run on a simple compute infrastructure with multiple levels of virtualisation and that rely on a simple networking model that scales easily and can support a significant amount of east-west traffic (traffic between nodes in the DC). For these reasons, most (if not all) of the cloud native data centres currently rely on the Clos network topology. The most basic example of such a topology is shown in the figure below: a network organised in two layers of switches, one called the spine and the other called the leaf (top-of-the-rack switch). This is a so-called 1-tier setup; a 2-tier setup can be created by adding an additional spine level called the super-spine.

Sample Tier-1 Clos topology: compute is attached to leaves/ToRs, and each leaf is connected to multiple spines.

Sample Tier-2 Clos topology: compute is attached to leaves/ToRs, each leaf is connected to multiple spines (altogether forming a pod) and each spine within a pod connects to a super-spine.

This topology produces a high-capacity network, as there are more than two paths between any two servers and adding more spines increases the available bandwidth between the leaves. It can easily grow in a consistent way, as one can just add more leaves and servers and use spines to scale the bandwidth between the edges. Clos also pushes all of the functionality to the edges of the network, the leaves and/or the servers, while spines are solely used to connect the leaves. This means that the control plane load on the spines doesn't increase (or increases only marginally) as more leaves are added to the network.
Some of the important aspects of this topology are as follows:
● Clos can be used to build very large networks with simple fixed form factor switches, allowing homogeneous equipment, which greatly simplifies inventory management and configuration.
● The interconnect model is usually based only on routing; bridging is supported only at the leaves (i.e. within a single rack). The rest of the inter-connectivity relies on some form of network virtualisation (as described in the next section).
● The usual oversubscription ratio used in Clos is 1:1, resulting in a non-blocking network (technically it is non-blocking only if the flows can be rearranged away from the congested links, but this is easy if routing is the only interconnect model). Theoretically, with a 64-port switch for both leaf and spine and 1:1 subscription, the total number of servers that can be connected is 2048 (n²/2 with n-port switches); the total number of switches for the full 2-tier topology is n + n/2 (96 switches are needed for a full build-up with 64-port switches). In practice there are also other considerations, such as the cooling and power available in the rack, that need to be taken into account and that usually limit the number of ports that can be used on the leaves.
● Interconnect link speeds - using higher speeds for the inter-switch links can decrease the number of spine switches, reducing the cost of cabling and the number of switches to manage. The other possibility is to use more switches at lower speeds, which increases the potential scale of the network and can provide better load balancing.
● Failure domains are more fine-grained, e.g. the loss of a link on a spine switch results in a loss of bandwidth that is fractional to the number of spines. In addition, the loss of the link only impacts the leaf that lost it; the rest of the network continues at full bandwidth.

While there can be many different options and configurations of the topology for different requirements, the common goal is to make sure that the equipment runs reliably and cost-efficiently. To ensure this, it is important to also mention some common principles that in a way overturn the usual principles on which DCs have been built in the past. Using standard simple building blocks with Clos, it is possible to build out the network with just one or two types of boxes, which is very different from using many different enterprise-specific appliances (or chassis switches) that support a wide range of functionality, many custom configurations and modules, and that can all be managed from a management console. However, in order to achieve this simplicity, some form of network automation needs to be adopted, as it is no longer possible to rely on vendor-specific configuration tools.

The other aspect is the need to rethink the failure model: instead of focusing on the reliability of the network through reliable nodes, focus on the reliability of the network itself (and not the nodes). This is primarily achieved by relying on routing (L3), which can gracefully degrade; if a router doesn't hear from a neighbor, it will simply try to route around it, which is inherently more stable than L2 failures. With small form factor switches, equipment is relatively cheap and it is easy to replace one box with another (allowing a faulty switch to be debugged offline), so failures, when they occur, can be relatively quickly fixed. Upgrades are also simpler, as it is possible to simply drain traffic gracefully from nodes that are scheduled for maintenance (again using layer 3 routing).
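The sizing numbers quoted earlier (n²/2 servers with n-port switches) can be reproduced with a short sketch, assuming a two-layer leaf-spine fabric built from identical n-port switches and a configurable leaf oversubscription ratio:

```python
def clos_fabric_size(ports_per_switch: int, oversubscription: float = 1.0):
    """Rough sizing of a two-layer leaf-spine fabric built from identical
    n-port switches with the given leaf oversubscription ratio."""
    n = ports_per_switch
    # each leaf splits its ports between servers (down) and spine uplinks (up)
    uplinks_per_leaf = int(n / (1 + oversubscription))
    server_ports_per_leaf = n - uplinks_per_leaf
    spines = uplinks_per_leaf          # one uplink from every leaf to each spine
    leaves = n                         # each spine port connects exactly one leaf
    servers = leaves * server_ports_per_leaf
    return {"servers": servers, "leaves": leaves,
            "spines": spines, "switches": leaves + spines}

# 64-port switches, non-blocking (1:1):
# {'servers': 2048, 'leaves': 64, 'spines': 32, 'switches': 96}
print(clos_fabric_size(64))
```

In practice, as noted above, rack power and cooling usually cap the number of leaf ports actually used for servers, so real deployments land below these theoretical maxima.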
Finally, it is very important to rely only on the features that are essentially needed. This, while obvious, can be quite challenging in practice, as traditional vendors will come up with many different features or package features together for a better offer. Sometimes this could be because a simpler model is simply not supported (e.g. unnumbered interfaces with OSPF/BGP). In practice the situation usually gets a lot more complex, since most of the existing HEP sites won't be able to start their network design from scratch. This increases the challenge, as progressive transitions to the new architecture must be found and there will likely be significant resistance to even starting it. As was shown in the previous section, postponing it will likely only escalate the networking issues in the DC. Other constraints come from applications that need multicast or broadcast for discovery, which makes it hard to eliminate layer 2 altogether.

In summary, a set of basic network topology concepts and principles has been reviewed in this section. There are other references that discuss the design of cloud-native networking in more depth and can be used as a potential starting point from here [CNDCN, FB, MPLSSDNEra]. The effective use of this topology relies on the ability to deploy a working network virtualisation solution, which is what's surveyed next.

Network Virtualisation

Network virtualisation is a way to organise a single physical network into multiple isolated virtual networks (this is comparable to the server virtualisation concept). In packet networks, each virtual network will assume it owns the interface or link, forwarding tables, policy and packet manipulation tables, packet buffers and link queues. At a higher level, each virtual network assumes it owns the entire address space and all its interfaces. And similarly to compute virtualisation, where a VM assumes it owns all the CPU resources assigned to it although these can in reality be oversubscribed, one can also have more virtual networks than there are physical interfaces. This is achieved by adding a virtual network identifier (VNID) to the packet header, which is then used to select the appropriate forwarding and flow tables.

While there are many network virtualisation solutions for the data centre, most of them rely on a small subset of existing technologies and concepts that can be mixed together in different ways to create the final solution. Some of those are:
● Virtual Local Area Network (VLAN) - provides a layer 2 abstraction; it works by applying tags to network frames and handling these tags in networking systems, creating the appearance and functionality of network traffic that is physically on a single network but acts as if it were split between separate networks. In this way, VLANs can keep network applications separate despite being connected to the same physical network. The number of VLANs on a given network is restricted to 4094, which for a usual cloud native DC requires finding ways to reuse VLANs or combine them with other protocols/technologies. In traditional bridging networks it is used together with the Spanning Tree Protocol (STP), which prevents loops and constructs a single tree across a bridged network to forward all packets.
● Virtual Routing and Forwarding (VRF) - a layer 3 abstraction which provides a separate routing table for each virtual private network (VPN); usually this is done by adding some sort of VRF ID to the routing table lookup.
● Virtual eXtensible Local Area Network (VXLAN) - a typical example of an overlay model, used to provide traffic isolation at layer 2 across a layer 3 infrastructure. Usually it is coupled with VRF to provide a full bridging and routing overlay solution. There are several other overlay protocols; some notable examples are Network Virtualisation over GRE (NVGRE), pursued mainly by Microsoft, which provides network virtualisation over the Generic Routing Encapsulation (GRE) protocol [RFC7673], STT (Stateless Transport Tunneling) and Generic Network Virtualisation Encapsulation (GENEVE), pursued by VMware, Intel and RedHat. While each offers different benefits, their adoption varies, and this makes VXLAN usually the most popular choice.
● Multiprotocol Label Switching (MPLS) - the grandfather of all virtual networks; it operates between layer 2 and layer 3, so it is often referred to as a layer 2.5 protocol. It assigns labels to packets and uses the labels to decide on subsequent routing and forwarding. It has been around since the nineteen nineties and one of its core use cases was to establish and manage layer-3 VPNs (L3VPN).

Another important aspect in implementing network virtualisation is the control plane, which needs to provide the means by which the virtual networks are provisioned. At a minimum, the control plane needs to provide ways to exchange the mapping between inner and outer addresses for the tunnels to be established, as well as a list of the virtual networks supported at each endpoint. There are primarily three different directions:
● Ethernet Virtual Private Network (EVPN) is an example of a popular control plane for network virtualisation, as it can closely integrate with the network equipment. It uses the Border Gateway Protocol (BGP) as its control protocol to exchange reachability for virtual networks and is usually coupled with VXLAN for packet encapsulation, which can have native routing implemented directly in the network equipment (this is called RIOT, routing in and out of tunnels).
● Another option is to use a centralised controller; this follows the Software Defined Network (SDN) model (more on this later) and can use different control protocols (such as OpenFlow or P4) to directly program the virtual networks on the compute endpoints. In many such cases the network equipment is unaware it is running a network virtualisation solution. This direction is mainly pursued by solutions that prefer to run a deployment without controlling the underlay; some notable examples are VMware NSX, Nuage Networks, Midokura, etc.
● There are also hybrid approaches where a centralised controller is used but integrates with the network equipment by using control protocols such as BGP (by running BGP reflectors or routing daemons that can integrate directly with physical equipment).

Network tunnels have become a popular choice for virtual networking in the cloud native data centre due to their scalability, flexibility and simple management. Scalability is the primary benefit, as usually only the leaves need to be aware of the network virtualisation and sometimes, for very big cloud providers, the tunnelling layer can be pushed all the way to the compute nodes. There are also some notable downsides to tunnelling; one is offloading, which is usually performed directly by the NIC. Adding encapsulation diminishes the usual offloading mechanisms unless there is support for VXLAN offloading on the card or encapsulation/decapsulation is left to the upper layers of the network (EVPN).
Another potential issue is with the maximum transmission unit (MTU), which needs to be configured correctly on the endpoints to account for the additional headers needed for tunnelling, or it can cause serious issues when debugging connectivity; for example, VXLAN adds roughly 50 bytes of outer headers, so endpoints behind a standard 1500-byte underlay typically need an inner MTU of 1450 (or a jumbo-frame underlay). Finally, debugging and tracing tools need to be aware of the tunnels, otherwise the information they show is misleading or not relevant.

Network Virtualisation Solutions for the DC

There are a number of existing solutions, usually offering a mixture of all of the above-mentioned approaches, that are currently being tracked by the Cloud Native Computing Foundation, as shown in the figure below. Both open-source (white) and commercial (grey) approaches exist, but it should be mentioned that some of the commercial approaches actually rely on open-source software (like Cumulus) while others use their own protocols, controllers or even customised standards (e.g. commercial providers). Our focus will be on open source projects which are mature enough to be deployed in production. They can be essentially divided into three large areas, or schools of thought:
● Network Operating Systems running on hardware/bare metal switches: these are usually based on the ideas of network disaggregation and can implement a full range of different networking solutions (a typical example of such an approach is Cumulus); their integration with the compute orchestrator is however not always straightforward. One particular approach is to use eBGP as the only routing protocol in the DC and EVPN/VXLAN for the network virtualisation.
● Software switch approaches are based on the idea of running a generic software switch directly on the compute node, which is usually configured from a centralised controller (e.g. OVS, Tungsten). Such approaches can usually support some form of integration with the network equipment, or can run without any integration as just another application on the network.
● Linux kernel network stack extensions, a set of projects that usually implement or extend some specific area of the Linux kernel networking to provide network and security services for containers (Calico, Contiv and Cilium would fall into this category). Such approaches are usually tightly integrated with the compute orchestrator and usually use some form of BGP peering to integrate with the rest of the DC networking.

Cumulus

The Cumulus approach is based largely on a few important advances:
● The rise of open source network operating systems based on open source routing suites such as Quagga and FRR (backed by Cumulus Networks, BigSwitch Networks (acquired by Arista Networks in February 2020) and 6WIND). Free Range Routing (FRR) is an open source network routing suite, developed to work with the Linux kernel using standard kernel interfaces, that implements most of the popular routing protocols (OSPF, BGP, IS-IS, etc.). (The reliance on standard kernel interfaces is specific to Cumulus Linux; other network operating system/ASIC integrations might require specific code to program the forwarding plane. It should also be noted that while Cumulus builds on top of open source, it does contain proprietary extensions, such as switchd.)
● The ability to run open source network operating systems on open commodity switches (white boxes), which were in turn made possible by the significant rise in good-quality merchant silicon vendors and the subsequent rise of original design manufacturers (ODMs).
● With the adoption of the Clos topology and a layer-3-only design, the exterior Border Gateway Protocol (eBGP) is used as the only routing protocol in the DC [RFC7938].
● The rise of EVPN/VXLAN adoption for network virtualisation.

A sample configuration following this approach is shown in the figure below, with sample assignments of ASNs (ASN-per-rack, with spines sharing the same ASN) as well as IP address assignments for the underlay. The configuration can be simplified by using a number of different BGP extensions, such as unnumbered interfaces, peer groups, routing policies, route maps, etc.; more details in [CNDCN]. The integration with the orchestrator is performed by configuring the leaves to peer (via BGP dynamic neighbors) with the BGP daemon running on the compute nodes (such as kube-router or BIRD). Network virtualisation is then built up with EVPN, which uses BGP as its control protocol and VXLAN for packet encapsulation. There are two options for where the edge of the VXLAN can start: either directly at the compute nodes, which can be done with FRR to start EVPN on the host, or alternatively on the leaves (which can be considered the most common deployment model of EVPN in the DCs). In the latter case, it is possible that the eBGP session is used both for building the routed underlay and for the exchange of the virtual network information.

OpenVSwitch

OpenVSwitch is a prominent open source implementation of the OpenFlow switch that was developed by RedHat and VMware. OpenFlow is a protocol that was proposed in an influential paper that framed the basics of the Software Defined Networking (SDN) model; it is based on the idea that packet processing can be determined by flow tables, available on most packet switching silicon, that can be programmed remotely, usually from a centralised controller (the so-called SDN controller). Following this model, OVS implements ovs-vswitchd, a Linux user space application that implements an OpenFlow switch (running directly on the compute nodes) and that can be programmed remotely from a centralised controller called Open Virtual Network (OVN). Additional OVS components are the kernel module that handles the datapath packet processing and OVSDB, which persists the state of ovs-vswitchd. The datapath module has a pluggable offloading mechanism for VXLAN and GRE offloading (if the NIC supports it) or DPDK. OVSDB has an IETF specification [RFC7047] and was implemented as a component in some physical switches, thus making it possible to integrate easily with OVN or other controllers (VMware NSX is using this approach). Multiple encapsulation methods are supported, including GRE, VXLAN and GENEVE, to create virtual networks. OVN as the central controller is used to integrate with the orchestrators (K8s and OpenStack). Apart from OVN, there are a number of other controllers that can be used to program OpenFlow switches (both physical and software-based), but it is unclear to what extent those are actually used outside of test labs (in production setups).

In general, flow tables have proven to be very useful for certain functions (policy and security control) and are used in many other approaches. SDN/OpenFlow approaches have been very successful in the SD-WAN area and are quite successfully used in existing production systems. There are also successful production deployments of SDN/OpenFlow in DCs, the most notable being Google's GCP (it should be mentioned that this was achieved using proprietary protocols, tools and equipment).
There are, however, also a number of obstacles to OpenFlow adoption and its production deployment:
● OpenFlow as such suffers from some fundamental problems: silicon implementations of flow tables are difficult to scale (inexpensively); the proposed abstraction models are too loose (or too strict), allowing for many different implementations that don't work together; and the proposed model tries to encompass a wide range of functionality, addressing not only what doesn't work well in classic networks but also what has worked well for many years. This probably explains why OpenFlow hasn't seen broader adoption in the DCs.
● OVS re-implements a significant part of the code already available in the Linux kernel (ARP, IP, etc.), as it doesn't use the kernel lookup tables or packet forwarding logic (using OpenFlow instead). This will be difficult to maintain in the long run and will require commitment going forward.
● Integration with physical equipment is complex due to different versions of OVSDB and different levels of support from the network vendors.

Tungsten Fabric

Tungsten Fabric is an open source project (branched from Juniper's Contrail) that manages and implements virtual networking in cloud environments using OpenStack, OpenShift, Kubernetes and VMware as orchestrators, including cross-stack networking. As a basis it uses EVPN/L3VPN in combination with BGP [RFC4684] on top of VXLAN or MPLS running over MPLSoUDP/GRE (and can also mix them). It can either integrate with the underlay via BGP peering, or directly manage/configure the equipment using NETCONF (right now only available for Juniper equipment). The Tungsten core components are the Tungsten Fabric Controller, a set of services that maintain a model of networks and network policies, and the Tungsten Fabric vRouter, a software switch running on each compute node that performs forwarding and enforces the network and security policies (for a good comparison between OVS and vRouter see the OVS talk by Y. Yang). The vRouter is implemented with two components (similar to OVS): a Linux user space agent called the vrouter agent and a kernel datapath module (the vrouter forwarder). Virtual networks have VRFs with routes and ACLs in the vrouter agent tables, which are synchronised from the controller (via the XMPP messaging protocol). The datapath module (vrouter forwarder) has pluggable support for different off-loading modes (kernel, DPDK, SR-IOV and, recently, smartNICs).

The agent establishes virtual network interfaces that are connected to the VRF forwarding tables, as well as a forwarding information base (FIB) in each VRF, which is configured with the forwarding entries. Flow tables are used (per VRF) to implement access control lists and other security policies. In addition, VXLAN and MPLS tables are kept to map VXLAN VNIs to VRFs and MPLS labels to virtual network interfaces. The vRouter acts as a default gateway for all the virtual interfaces (responding with its MAC to all ARP queries outside of its subnet). A sample packet flow between VMs in the same subnet is shown in the figure below. Using the forwarding mechanism described above, Tungsten can easily implement service chains (forcing traffic through specific interfaces/nodes).

Tigera Calico

Calico is a network solution primarily for Kubernetes, but it also supports Docker, Mesos, OpenStack and Rkt. It is an IP-routed fabric that can work on top of both layer-2 and layer-3-only networks and uses BGP peering to integrate. Calico is made up of four basic components: an orchestrator plugin, a data store (etcd), Felix, an agent that runs on each compute node, and BIRD, a BGP client that distributes routing information (which also runs on each compute node), plus optionally a BGP route reflector (also BIRD), which can be used at higher scales to aggregate the BGP clients running on the nodes (avoiding a full BGP mesh in which each BGP client connects to all the others). Calico uses the native Linux kernel FIB to insert the routes [Calico] and then uses BIRD to distribute those routes to the other nodes in the infrastructure. It also uses the Linux kernel iptables to enforce security policies and restrict visibility across multiple tenants. By default it does not use any overlay, but it can also work over VXLAN.

Cilium

Cilium is somewhat similar to Calico, but it focuses mainly on securing the connectivity between the containers; it uses the Berkeley Packet Filter (BPF, best known from tcpdump). The Linux network stack supports a set of hooks that can be used to execute BPF programs, and Cilium uses those to directly manipulate packets and enforce security policies. Two networking modes are supported: overlay, with VXLAN or Geneve, and native routing, which uses the routing table of the Linux host. Cilium supports integration with Kubernetes, Istio, Docker, Mesos and Envoy, so it is primarily a container networking option.

Contiv/VPP

Contiv is a networking solution that directly integrates with the Cisco Application Centric Infrastructure; it supports Kubernetes and OpenShift only. Contiv uses VLAN networks with subnet pools and gateways, directly configured on the switches, to interconnect the containers (with integration at the container-level networking - docker network/linux bridge). It can also use eBGP peering for a layer-3 mode, in which communication between containers on different hosts runs natively using VLANs and communication between containers and non-containers is done by BGP peering between the compute nodes and the leaves. Another project related to Contiv is ContiVPP.io, which is a Kubernetes CNI plugin that uses a combination of Contiv, Ligato libraries and FD.io/VPP. Unlike Contiv, it uses FD.io VPP for the dataplane, which is an open source implementation of Cisco's Vector Packet Processing (VPP) technology. The main claimed benefit is its ability to provide high-performance packet processing that can run directly on the compute nodes. As this is a container-only solution, the software switch runs as a separate container in Kubernetes (Contiv vSwitch), which runs as an agent in user space and integrates with Intel's DPDK.

Flannel

Flannel is a simple way to configure a layer 3 network fabric for Kubernetes. It provides a layer 3 IPv4-only network between multiple nodes using several different backends. The recommended choice is VXLAN (only one VXLAN network is created); other backends are host-gw (for direct routing; the remote gateway must be reachable via layer 2), UDP (for debugging purposes) and also AWS, GCE and AliCloud VPC backends (experimental).

Summary

Looking at a possible list of requirements to be taken into account when picking a solution to build up the DC network and accommodate cloud native networking, the following areas can be helpful to consider:
● Native support for multi-stack
○ Connecting and integrating multiple orchestration stacks like K8s, OpenStack, etc.
○ Only very few of the current solutions are multi-stack, but most do support both OpenStack and Kubernetes.
● Networking and security across legacy, virtualised and containerised applications
○ The ability to provide a networking solution covering not only multiple orchestrators but also multiple types of compute endpoints (bare metal, VMs, containers).
● Multi-tenancy/isolation
○ Support for application/experiment-level networking. This is supported by all the solutions described.
● Native support for multi-cloud
○ In case extending DC networks to commercial clouds is a requirement, or federations between DCs are to be created, it is necessary to look at how the current solution can support multi-cloud (or hybrid-cloud).
● Network automation
○ Deploying most of the existing solutions relies on some form of automation, both at the compute level and at the network level. This is usually complex if both network and compute are involved, as compute and network don't usually share the same automation tools.
● Security and observability
○ Tools to support tracing and debugging are essential; solutions should natively provide the core ability to trace and debug, or some form of analytics platform that can help in debugging issues.
● Across-stack policy control, visibility and analytics
○ If applications spanning different stacks (bare metal, VM, containers) are to be deployed, then it is important to have a solution that can apply policies across the different stacks as necessary.

Network Disaggregation

Network disaggregation refers to the breaking apart of the router/bridge into its hardware and software components, allowing each to be purchased separately. This trend has been started and accelerated by cloud-native networking requirements due to the following factors:
● Adoption of the Clos topology, which requires standard small-form-factor equipment in large quantities with a small set of basic features and functions
● The rise of merchant packet switching silicon (e.g. Broadcom, Mellanox, Barefoot, Innovium, Marvell, etc.), which is now at the core of most network switches (including those from traditional vendors), as well as the rise of the manufacturing of bare metal switches, so-called white boxes (Edgecore, Quanta, Agema, Celestica)
● Cost reduction and the technical difficulties/limitations of building up large cloud providers' networks with traditional approaches and vendors [AWS]

Network disaggregation is important as it significantly impacts the evolution of the network in the same way server disaggregation impacted compute in the past century. It is also important for controlling the cost of the infrastructure, not only capital costs but also operational costs, by enabling standard open source tools to be easily used to operate and automate the network equipment. This trend also helps avoid vendor lock-in and enables a transition to a more standards-based environment (Open Compute Project, Open Network Install Environment, etc.).

Building up large-scale network infrastructures following the Clos topology requires buying a significant amount of equipment, making operators very sensitive to the cost/device ratio and to ways of operating a large number of devices efficiently. This wasn't easy with the classical approaches and vendors and thus opened up the door for open network platforms and operating systems.
The basic requirements include the ability to program the devices (to allow for network automation; to run one's own components, such as routing frameworks, where new approaches can be tested quickly and/or bugs fixed) and the ability to run third-party applications (for monitoring, tracing, etc.). While there are on-going discussions on the feasibility of open source network operating systems and ODMs, significant progress has been made in the past few years and we're well past the early adopters stage [​SONiC​].

Another important area of network disaggregation is the network operating system (NOS). Currently, the two most common elements in modern network operating systems are Linux (on Intel x86/ARM processors) and some form of packet switching silicon, supporting the core task of processing the packets. The major exception to this is Juniper's Junos, which is still using FreeBSD, while most of the other vendors have already migrated to Linux. The current NOSes usually differ in the following two areas:
● Switch network state location - a common model is to run vendor-specific code in user space (all DPDK-based NOSes, Cisco NXOS); the hybrid model uses partially user space and kernel space (common in software switches, OVS/vrouter, but also in Arista EOS, MS SONiC, Dell's OpenSwitch, etc.); and the third one tries to implement all network state directly in the kernel (mainly Cumulus)
● Driver placement and programming - the most common placement for the drivers of the switching silicon is in user space, but kernel space implementations also exist. Depending on the placement, different interfaces can be used to communicate with the kernel and abstract away the silicon driver details: ​netlink​ (Linux kernel default), the ​Switch Abstraction Interface (SAI)​ - a hardware abstraction layer driven by Microsoft and Dell - and ​switchdev​ - a hardware abstraction layer driven by Cumulus and Mellanox - among others

The list of existing network operating systems is now quite large and can be found in the ​OCP​ pages. There are also additional open source network operating systems, such as for example ​Stratum​ by ​ONF​, which was recently released as open source.

Programmable Network Interfaces

Programmable network interfaces (line cards/switches) are currently an area of intense interest for network interface vendors, data center architects and end-users trying to optimize network performance. The ability to program the network interfaces has a number of different use cases:
● Reducing the amount of CPU workload that network virtualisation imposes on the servers and network equipment. The aim is to use dedicated hardware to completely offload packet processing to the network interface card(s) or network switches. Depending on how much functionality is being offloaded, this could range from standard network cards (Mellanox cards offer a significant amount of offloading features in their drivers - tc/vxlan/mpls/gre offloading) up to intelligent NICs (so-called smartNICs, e.g. Netronome Agilio adapters can offload the entire OVS or Tungsten's vrouter switches); a sketch of how such offload capabilities can be inspected from Linux follows this list.
● Security isolation by means of encryption or intrusion detection, which can be performed directly on the smartNIC or on the bare metal switch
● Virtualizing storage and other cloud resources, such as GPUs or TPUs, through protocols such as NVMe-oF (NVMe over Fabrics), which can help bring storage or graphics to any server over the network while at the same time making it look like a local resource.
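As referenced in the first bullet above, the offload features that a given NIC and driver actually expose can be inspected directly from Linux with ethtool. The following minimal sketch (assuming a Linux host with ethtool installed; the interface name eth0 and the feature names queried are illustrative and vary by driver) parses that output:

    import subprocess

    def nic_offloads(iface: str = "eth0") -> dict:
        """Parse `ethtool -k <iface>` output into a {feature: enabled} map."""
        out = subprocess.run(["ethtool", "-k", iface],
                             capture_output=True, text=True, check=True).stdout
        features = {}
        for line in out.splitlines()[1:]:          # skip the "Features for eth0:" header
            if ":" in line:
                name, state = line.split(":", 1)
                features[name.strip()] = state.strip().startswith("on")
        return features

    if __name__ == "__main__":
        feats = nic_offloads("eth0")
        for f in ("tx-udp_tnl-segmentation", "hw-tc-offload", "rx-gro-hw"):
            print(f, "->", feats.get(f, "not reported"))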
Currently three hardware approaches are being followed for NICs:
● FPGA based - good performance, but difficult to program; workload-specific optimisation
● ASIC based - best price/performance, easy to program, but extensibility usually limited to pre-defined capabilities
● SOC based - usually higher price, but easily programmable, providing the highest flexibility

Recently, new ASIC-based NICs with full data path programmability have appeared. The combination of a network programming language and new programmable ASICs is proposed to overcome the limitations of fixed-function switch ASICs. The ​P4​ language has been proposed in 2015 and is now maintained by ONF. P4 compilers support various chips (FPGAs, ​Barefoot​ ​Tofino​, etc.), while Broadcom has recently started to support the NPL language for its own products. A programmable data plane transforms the network ASICs, allowing the programming of new forwarding behaviours in the packet processor itself. Full programming control over processor memory and functions permits a substantial degree of freedom in packet header processing, including information insertion, change and removal, enabling in-band telemetry for detailed monitoring and real-time dataplane functionalities.

Network programmability (​tutorial​) in both NICs and switches has been possible in the past, but significant advances were made recently due to the network disaggregation effort, the invention of new programming languages and compilers (such as P4) as well as efficient hardware implementations. Different layers of programmability are possible and most of them can be found in the existing offers:
● Application level - OpenVSwitch, Tungsten vRouter, in which the entire control and data plane can be offloaded to the smartNIC.
● Packet movement infrastructure (part of the data path) - BPF (Berkeley Packet Filter)/eBPF
● Full description of the data path - the ​P4​ language, also available for network switches such as those from ​Barefoot​, etc. (a good overview is provided in the ​ACM SIGARCH article​, ​Netronome​, ​Mellanox​)

Overall, smartNICs provide a powerful blend of hardware capability and software programmability appropriate for implementing new dynamic compute, storage and network architectures. The challenge will be how to incorporate them into future HEP infrastructures to achieve the best impact.

Linux Kernel Networking Models

Network programmability is not limited to hardware. In the past couple of years, there has been a lot of active development in the Linux kernel related to programmable networking (and the corresponding interfaces). As the Linux kernel networking stack is rather complex, requiring significant processing resources, various "fast path" networking approaches were invented. They can be divided into three different areas (with some overlaps):
● Linux kernel bypass models​, which try to improve networking performance by going around the Linux networking stack; this in turn needs modified device drivers (or additional kernel modules for higher-level APIs) as well as handling of the upper protocol layers in user space. Some notable approaches in this area are:
○ Intel Data Plane Development Kit (DPDK) - an open source toolkit (community established by 6WIND) to accelerate packet processing workloads (on a variety of CPU architectures: x86, Power, ARM); it works by running a user space forwarding plane that interacts with a PMD driver (poll mode driver/kernel module) that supports many different NICs.
It has many additional features, such as custom accelerators, QoS and extensions for monitoring and metering. It's used as an optional backend in most of the existing projects (OVS, Tungsten, Contiv). The downside of DPDK is that it needs a dedicated CPU/core fraction to be allocated for the packet processing, thus reducing the overall resources available for compute.
○ VPP (FD.io) - an open source version of Cisco's vector packet processing (VPP); runs in user space (used in Contiv/VPP).
○ Dataplane processing with dedicated kernel modules is also supported by software switches (OVS, Tungsten) and is often the default deployment model (offering easy deployment with reasonable performance)
○ There are other projects that could fall into this area, e.g. OpenDataPlane, OpenFastPath, netmap, Snabb, pf_ring, etc., but their use in production environments is unclear at the moment
● Linux kernel fast path models​, which try to process as much data as early on the data path as possible, in Linux driver code or on the NIC itself:
○ eBPF - Linux socket filtering (best known from tcpdump), i.e. the extended Berkeley Packet Filter (BPF), provides a very flexible and efficient way to program packet processing directly in the Linux kernel (it's a VM-like construct for the Linux kernel that allows safe bytecode execution at various hook points, for example traffic control/tc). It's used primarily in Cilium, which claims to have significant performance advantages over Calico (by avoiding extensive use of iptables and using eBPF hooks instead).
○ XDP - eXpress Data Path - uses eBPF programs to process RX packet pages directly in the driver, before the rest of the kernel networking stack. It can run in native mode (in the driver) or offloaded (BPF executed on the NIC), and user space frameworks such as DPDK can also consume packets via AF_XDP sockets.
● Linux kernel hardware offloading models​, usually implemented in the drivers (already described in the previous section).

DC Edge

The data center edge is very important for high energy physics and deserves some discussion. HEP data centers typically require significant Wide Area Networking (WAN) bandwidth between centers. High bandwidth capacity is an area in which modern data centre networks differ from traditional ones: a typical 2-tier Clos topology with 4 spines, 16 leaves and 100GbE connectivity between leaf and spine gives each leaf 400GbE of east-west capacity (a back-of-the-envelope sketch of this sizing follows below). Connecting to the external world in the typical HEP site wouldn't require capacity that high, but north-south throughput is still a very important factor as it will determine how to design the networking to the external world. In a Clos topology there are two possible choices. One is to create another set of leaf switches (called border/exit leaves in an HA setup, each connected to all spines or superspines) and, instead of connecting them to the servers, connecting them to the edge routers; the number of border leaves would be determined by the oversubscription ratio between north-south and east-west traffic in the DC. The other option is to connect the spines to the routers directly, which is rare in cloud-native data centres, but it could make sense if the expected north-south traffic is higher than (or equal to) east-west (it can quickly add up in cost as router ports are expensive in the traditional brands, so it would only make sense if there are only a few spines or if using commodity hardware).
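To make the sizing above concrete, the back-of-the-envelope sketch below reproduces the 4-spine/16-leaf example and estimates how many border-leaf uplink ports a given north-south oversubscription ratio would imply. The numbers are illustrative assumptions, not a design recommendation:

    def clos_sizing(spines=4, leaves=16, link_gbps=100, ns_oversubscription=4.0):
        """Rough sizing of a 2-tier Clos fabric and its border-leaf uplinks."""
        leaf_uplink = spines * link_gbps              # east-west capacity per leaf
        fabric_capacity = leaves * leaf_uplink        # total leaf-to-spine capacity
        # North-south capacity implied by the chosen east-west:north-south ratio.
        north_south = fabric_capacity / ns_oversubscription
        border_ports = -(-north_south // link_gbps)   # ceiling division, in 100G ports
        return leaf_uplink, fabric_capacity, north_south, int(border_ports)

    uplink, fabric, ns, ports = clos_sizing()
    print(f"per-leaf east-west capacity : {uplink} Gbps")        # 400 Gbps
    print(f"total leaf-spine capacity   : {fabric} Gbps")        # 6400 Gbps
    print(f"north-south at 4:1          : {ns:.0f} Gbps ({ports} x 100G border uplinks)")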
Border leaves would also be a good place where one can instrument additional equipment, such as firewalls, perfSONAR nodes and Cloud gateways, as well as strip off any routing data that shouldn't be exposed to the outside world (such as internal AS numbers, etc.).

For the Cloud (or SDN) gateways, which essentially serve as a way to extend the DC network to the cloud provider (or to another site), the most common model is called the virtual private cloud (​VPC​). VPC connectivity is usually purely layer-3 (no broadcast/multicast) and uses a virtual private router (or virtual private gateway) that routes between subnets and controls access to the external world (via NAT or by providing public IPs that are routed to the endpoints via the gateway). Connecting to a VPC involves either establishing a VPN or a direct routed connection with the cloud provider (recommended). Most of the major cloud providers (Amazon, MS, Google) currently provide mostly equivalent forms of VPC [​ESnet-AWS​, ​LHCONECloud​]. In a similar way, it's also possible to extend some of the existing network virtualisation solutions to other centres or to the public clouds (via Data Centre Interconnect - DCI). Tungsten Fabric provides different ways to extend the DC network, by translating the MPLS labels from the internal LAN to the external WAN and back (expecting an MPLS WAN backbone) or by running vrouter switches remotely, connecting to a centralised Tungsten Controller [​TungstenPrimer​].

One of the goals of this working group is to try to document, provision and test compatible ways of matching LAN technologies (driven by the data center) with evolving SD-WAN technologies such that the resulting end-to-end network best meets our HEP needs.

The extensive use of data centers by large content providers created the need for simple, cost-effective, very high capacity connectivity between their sites. The equipment defined by the traditional ITU-T Optical Transport Network (OTN) does not provide the form factor and the easy scalability needed to fulfil these requirements at an affordable cost. More compact and simpler equipment is now available, named Data Centre Interconnect (DCI), which reduces the optical element to its core functionalities while preserving DWDM capabilities. DCI equipment builds on the high level of integration in optical components as well as technologies that offer large-scale fabrication at low cost. Such equipment provides optical connectivity in the data centre and between data centres (as a function of distance) at a small fraction of the cost of OTN equipment, with the lasers or pluggable optics being the larger part of the cost. DCI equipment is also more suited to dynamic provisioning and agile utilization than previous optical equipment. DCI optics are also frequently coupled on the same chip with advanced in-line forward error correction algorithms, performed by DSPs, substantially improving WAN performance and range at very high transfer rates. Recent advances in optical technologies for data centers were surveyed by ​Cheng et al​. Additional information and vendor materials are available from ​Infinera​ and ​CIENA​.
Finally, another potential use case for the DC edge is the integration of experiments' readout systems and/or other instrumentation into the existing High-Level Trigger (HLT) topology, in a similar way as is currently pursued by the telco industry, which is integrating its instrumentation and edge services directly into its core topologies. The figure below shows a possible Clos-motivated topology organisation for integrating the readout systems with the HLTs. It remains to be seen if this would be feasible in practice, but monitoring the existing approaches in the integration of edge services and hyper-converged infrastructures can provide valuable input on how to run large-scale experiment readout infrastructures.

Challenges and Outlook

Cloud native networking and related technologies provide a way to design and cost-effectively operate large scale DC networks. Still, a number of challenges remain, both technological and non-technological, that could impact a broader adoption by the HEP community:
● Most of the existing HEP sites won't be able to re-design their DC networking from scratch, which will pose additional challenges in finding ways to progressively migrate while accommodating the existing constraints.
● Historically, there has been a very clear separation between network and compute, but this no longer applies for the cloud native approaches, where complex networking is present at the level of servers/hypervisors. This means that network and compute engineers must work together and build up expertise in the cross-domain areas. Collaboration between the sites will also be very important to bridge the gap and come up with more effective approaches that better fit the existing HEP use cases. Encouraging closer collaboration between network and compute engineers within and across sites will therefore be an important factor in the adoption.
● Automation of the networking is another important area, as relying on open source network operating systems usually requires migrating to standard open source configuration tools. In addition, the usual approaches that work for configuration of compute might not work for network infrastructure, for various reasons [​CNDCN​].
● Additional work on performance studies and performance tuning of the previously mentioned systems will be needed. This is mainly to evaluate their performance for the HEP-specific network patterns (fat flows, frequent remote access, etc.). In addition, comprehensive performance studies comparing the different approaches are currently missing (performance tests over LAN/WAN via SR-IOV, OVS, vRouter, DPDK, with/without flows, with/without smartNICs, etc.) [​10Gbps benchmarks​].
● DCIs (both HW and software-based) are quite novel approaches that will require initial deployments and testbeds to evaluate how they could benefit inter-DC networking (federations/data lakes) and how they could be integrated into the existing R&E infrastructures (commercial DCI software approaches usually rely on MPLS backends; it's unclear if there are alternatives and what developments are needed to make this work over the existing R&E networks).
● DC network extensions towards the Cloud via VPC have been demonstrated and different trade-offs have been identified [​ESNet-AWS​]. Further tests and experiences are needed to better understand if some of the new technologies could help in this quickly evolving area.
● A range of other approaches that were only mentioned briefly, such as GPUs, virtualised storage (using NVMe-oF, e.g. ​CephNVMEof​, ​MellanoxNVME​), hyper-converged architectures and edge services for HEP instrumentation and experiments, will require initial testbeds and evaluations.

Overall, this chapter summarises core solutions and technologies that could help our community rethink the way we design and operate our data centre networks. They provide a great opportunity to build large scale data centers (centralised or distributed) that could benefit from economies of scale, simplification of the operational models and a potential reduction in the overall operational cost, but apart from technology this will require new policies, priorities and funding to materialise. Some of the most promising areas of R&D that could lead to a broader adoption of the mentioned technologies are container-based compute platforms such as SLATE edge services, the HTCondor K8s backend or native K8s sites, which could offer the best opportunity to test and evaluate cloud native networking. Another potential area is batch system deployments using VMs, where an isolated region of OpenStack/VMware can be provisioned and tested using the native integration with Tungsten, OVS or other approaches. Provisioning of storage servers with software switches and/or virtualized storage solutions is another area that has the potential to be easily deployed and tested. Finally, it's also very important that cloud native approaches are considered for any planned extensions of existing centres or new data centers right from the start.

Programmable Wide-Area Networks

Motivation

The paradigm shift in computing and its impact on network technologies described in the previous chapter will make it easier to design and develop bigger data centres that will be inter-connected at very high capacities and lower latencies (by means of DCIs) and could easily off-load to nearby Clouds, HPC centres or other opportunistic resources. A cluster of such centres can then create a federated site that will be exposed behind a single endpoint/interface for the experiments. This transformation has already started within the WLCG data lakes activities (​WLCG DOMA WG​) and some of the participating sites are already running their storage/compute in a federated setup [ref DOMA pilots]. At the same time, small to medium sites will be able to complement the functionality of the core sites by off-loading some of the compute activities that can be orchestrated and provisioned directly from a core site by means of Kubernetes or other federated orchestrators (such approaches are being followed by ​OSG​, the ​NSF-funded SLATE and other "lightweight" site projects). This can have a profound impact as the design and development of the offline HEP distributed computing model can be radically simplified. Recently, SLATE edge services have been released, which are based on Helm and K8s, and the ​HTCondor​ team announced that it plans to add the ability to use Kubernetes as one of its backends. This could become a key enabler for HEP sites that could support different workloads by only running a single container orchestration or edge system (this would be revolutionary and would impact many different aspects of running a HEP site apart from networking, such as security, operations and management policies). From a network perspective, operating a set of clustered DCs will bring its own challenges and will require close collaboration with the R&E providers.
Provisioning of the networks, network telemetry, packet tracing and inspection, as well as overall security and network automation, will need to improve in order to make it easier for the federations to operate not only their inter-DC activities, but also to easily expose their services to the outside.

Historically the National Research and Education Networks (NRENs) have managed to meet HEP networking needs by strategically purchasing capacity when network use exceeded trigger thresholds (over-provisioning). This has been a straightforward method to provide seemingly unlimited capacity for HEP, requiring no new technologies, policies or capabilities. There were occasional "bumps" when regional or local capacities didn't keep up, but overall over-provisioning resulted in excellent networking for HEP.

While over-provisioning has been very good at meeting overall HEP capacity needs, it hasn't addressed localized over-subscription (in time or space) that leads to packet-loss. This is important, since any packet-loss on our long-fat network pipes significantly degrades the throughput TCP can achieve. For example, losing 1 packet out of 22,000 on a Trans-Atlantic 10 Gbps path can reduce the throughput by a factor of around 85, turning a 1.2 GByte/sec transfer into a 15 MByte/sec transfer. Unfortunately there are many ways to lose packets on the network, including locations where the network bandwidth changes, where heavily used network paths converge to a smaller number of paths (congestion), where device buffers are insufficient to handle micro-bursts, or where there are misconfigurations, failing hardware or lossy network cabling. Some of the issues are solved by using good network design principles or by purchasing upgraded hardware, but in practice these kinds of issues will persist in our infrastructure, simply because sites and NRENs continuously evolve their infrastructure with no global (or even regional) coherence.

While the R&E end-to-end network paths are not under any individual organization's control, there is something that can be done to improve the performance of flows while simultaneously creating a less volatile network environment for other users of the network: ​packet pacing​. Some background: by default, network interfaces try to inject their data buffers into the network at the speed the interface operates at. A 100 Gbps interface with a buffer of data to transmit chunks the data into packets corresponding to the Maximum Transmission Unit (MTU) and puts them on the network at a rate corresponding to 100 Gbps until the outgoing buffer is emptied or until TCP limits the sending of packets. This is fine as long as all the network segments and the receiver are running at 100 Gbps. However, if any part of the path has a speed less than 100 Gbps, packets begin to queue up until packet loss occurs and TCP responds by halving the sending rate. The resulting ramp-up and back-off of TCP leads to an average transmission rate significantly below the theoretical bottleneck link capacity. In addition, this kind of bursty flow can have an adverse impact on other users of the network, as packet trains at the higher rate (micro-bursts) fill various buffers and cause packet-loss to other flows. Wouldn't it be much better to pace the original packets to match the bottleneck rate? For example, if a 100Gbps server was sending to a 50Gbps server, a logical starting point would be to pace the packets leaving the 100Gbps network interface at 50Gbps.
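Both points above (how little loss it takes to collapse TCP throughput on a long path, and how pacing is applied in practice) can be made concrete. The sketch below uses the well-known Mathis approximation, throughput ≈ (MSS/RTT) · C/√p, for the loss estimate, and the Linux per-socket SO_MAX_PACING_RATE option for the pacing (a similar host-wide effect can be obtained with the fq qdisc, e.g. "tc qdisc add dev eth0 root fq maxrate 50gbit"). The RTT, MSS and rates are illustrative assumptions:

    import math
    import socket
    import struct

    def mathis_throughput_bps(mss_bytes, rtt_s, loss_prob, c=1.22):
        """Mathis et al. approximation of steady-state TCP throughput, in bits/s."""
        return 8 * mss_bytes / rtt_s * c / math.sqrt(loss_prob)

    # Roughly the Trans-Atlantic example from the text: ~90 ms RTT, jumbo frames,
    # one packet lost in 22,000 -> throughput collapses to O(100) Mbps.
    print(mathis_throughput_bps(8948, 0.090, 1 / 22000) / 1e6, "Mbps")

    # Sender-side pacing of a single TCP socket to ~50 Gbps.
    SO_MAX_PACING_RATE = 47                  # Linux value from asm-generic/socket.h
    rate_bytes_per_s = int(50e9 / 8)         # the option is expressed in bytes/s
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Passed as a 64-bit value: rates above ~34 Gbps do not fit the older 32-bit
    # form of this option, so a reasonably recent kernel is assumed here.
    sock.setsockopt(socket.SOL_SOCKET, SO_MAX_PACING_RATE,
                    struct.pack("=Q", rate_bytes_per_s))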
Assuming the bottleneck for the transfer was the 50 Gbps interface on the destination, this pacing would result in a much smoother flow able to hold the 50 Gbps rate. Of course the bottleneck could be elsewhere along the path, or even in the destination storage which might not be able to keep up with data coming in at 50 Gbps, but still the impact on the shared network would be less and the achievable rate would be at least as good as in the unpaced 100 Gbps case.

This chapter surveys the existing approaches in Wide-Area Programmable Networks that could address some of the previously mentioned challenges. The focus is mainly on the following set of use cases:
● Traffic engineering - additional capacity is available in the existing networks and can be provisioned on-demand by steering traffic via alternate paths.
● Network provisioning - with DC networking moving towards WAN protocols and network virtualisation, there is an opportunity to leverage this to find alternative ways to organise/manage current HEP networks (L3VPNs/LHCONE).
● QoS transfers - our current networking is usually segmented into private networks providing bandwidth guarantees and public/R&E networks. With other experiments coming online it can be expected that they will come up with similar requirements, and it would therefore be beneficial to look for synergies and technical approaches that can eliminate additional networks and operations.
● Improved network-to-storage performance - currently there is often a mismatch between target storage and network performance. In addition, inefficiencies arise when existing data transfer tools run on DTNs (NUMA I/O, scheduling overheads, caches, etc.).
● Capacity sharing - sharing between multiple domains is currently done on a first-come first-served basis. Treating the network as a resource (like compute/storage today) is becoming likely in the future, but it depends on the network requirements (and real usage) of the new experiments coming online.
● Effective use of HPCs and Clouds across the WAN, as well as of other opportunistic resources, will be an important area. One of the major obstacles in fully utilising current HPCs is existing network limitations (mostly wrt. capabilities, not capacity) (see the EuroHPC initiative/PRACE in the EU).
● Autonomous networking (cyber-attacks/security relevant) and network telemetry.

WAN programmable networks address many of the use cases outlined above and have the potential to change the way HEP sites and experiments connect and interact with the network. This chapter aims to highlight key projects in this area as well as provide an overview of the plans of the major R&E providers in the domain of orchestration, automation and virtualisation of WAN. As WAN programmable networks involve several major technologies, the main focus of this chapter is to introduce the core concepts and projects as well as highlight key challenges in their adoption. The chapter is organised as follows:
● Programmable Networks for Data-Intensive Science introduces the basic concepts of software-defined WAN (SD-WAN), software-defined exchanges (SDX), network orchestrators, provisioning systems and network-aware data transfer systems.
● Research & Education Networks Programmable Services surveys the current plans of the major R&E providers in the area of orchestration, automation and virtualisation of WAN.

Programmable Networks for Data-Intensive Science

Software-defined WAN (SD-WAN)

What is a Software-Defined WAN?

The SD-WAN paradigm appears to come from the telecom industry.
SD-WAN is defined as a specific application of software-defined networking (SDN) technology applied to WAN connections, such as broadband internet, 4G, LTE, or MPLS. The aim of an SD-WAN is to connect enterprise networks - including branch offices and data centers - over large geographic distances [​SDWAN​]. For HEP, enterprise networks translate to a campus network or a research laboratory (facility) network. Branch office refers to a campus where mainly people (researchers/faculty, students, administrators) are located, and which also serves as a site for compute, storage and network resources. Data center is used in the same sense as in HEP: a physical site, a cloud environment, or a hybrid of the two. A key characteristic of the definition is connecting the networks at all of these different sites over large geographic distances. The term Wide-Area Network (WAN) is broadly used to describe a connection between an auxiliary campus site and a central campus site, where the WAN connection, for example, is a service provided by a metro-ethernet service provider. Likewise, a WAN connection could link a campus participating in ATLAS or CMS to an LHC data provider, such as Fermilab. The difference between WAN and SD-WAN is that SD-WAN decouples the network from the management plane, and detaches the traffic management and monitoring from the hardware.

What is the SD-WAN Standard?

The SD-WAN standard describes requirements for an application-aware, over-the-top WAN connectivity service that uses policies to determine how application flows are directed over multiple underlay networks, irrespective of the underlay technologies or service providers who deliver them [​SDWAN-standard​]. A goal of the SD-WAN standard is to distinguish the components of the architecture that describe the SD-WAN Service and the Underlay Connectivity Services over which the SD-WAN operates [​SDWANas​]. An SD-WAN Service Provider provides the SD-WAN Service to the Subscriber. The SD-WAN Service Provider may provide one or more of the Underlay Connectivity Services, but some or all of them may be provided by the SD-WAN Subscriber or other Service Providers. The following is a description of SD-WAN from the MEF SD-WAN standard [​SDWANas​]:

An SD-WAN Service provides a virtual overlay network that enables application-aware, policy-driven, and orchestrated connectivity between SD-WAN User Network Interfaces, and provides the logical construct of a L3 Virtual Private Routed Network for the Subscriber that conveys IP packets between Subscriber sites. An SD-WAN Service operates over one or more Underlay Connectivity Services (IP service, carrier ethernet, MPLS, etc). Since the SD-WAN service can use multiple disparate Underlay Connectivity Services, it can offer more differentiated service delivery capabilities than connectivity services based on a single transport facility. An SD-WAN service is aware of, and forwards traffic based on, Application Flows. The Service agreement includes specification of Application Flows - IP Packets that match a set of criteria - and Policies that describe rules and constraints on the forwarding of the Application Flows.

A motivation for SD-WAN is its ability to adjust aspects of the service in near real time to meet business needs.
This is done by the Subscriber by specifying desired behaviors at the level of familiar business concepts, such as applications and locations, and by the Service Provider by monitoring the performance of the service and modifying how packets in each Application Flow are forwarded, based on the assignment of policy and the real-time telemetry from the underlying network components.

Network Provisioning (multi-ONE)

This project aims to find technical solutions that can allow a computing site to separate the data traffic generated by the resources allocated to different research collaborations and then use dedicated VPNs similar to LHCONE. LHCONE is a Virtual Private Network (VPN) that was conceived to connect the WLCG Tier1s and Tier2s, a relatively small number of trusted data-centres that share common security practices. Its main advantage is that it can be connected directly into the data-centres, bypassing slow security border devices. Today LHCONE has reached a critical number of connected sites, at a point where trust may start to decrease, undermining the most valuable characteristic it has.

Here is a list of a few possible solutions that could allow the traffic separation. The list is not exhaustive and doesn't exclude other ideas. Individual sites can use different solutions, as long as the objective of routing traffic into the correct VPN is achieved.
- Dynamic, software-defined allocation of computing resources: using network virtualization techniques (VXLAN, EVPN, Linux namespaces, etc.), create virtual groups of computing resources that interface with the network to separate the traffic at the sources.
- Paired use of source-destination addresses: making use of secondary IP addresses, computing resources should make sure to use appropriate source-destination address pairs, to allow simple destination routing.
- Packet tagging: using available fields in the IP header (DSCP, flow label, etc.) set by the applications, have the network policy-route those packets into the correct VPN (a short application-level tagging sketch is given at the end of this subsection).

In case of solutions based on mapping by IP addresses, it would be necessary for every site to allocate dedicated IP prefixes to the different collaborations. Since IPv4 is a scarce resource, it is envisaged that the proposed solutions have to work with IPv6. DUNE and protoDUNE have been proposed as a collaboration that can allow CERN, FNAL and other interested sites to prototype and test these solutions.
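As a small illustration of the packet-tagging option, an application or transfer agent can mark its own traffic by setting the DSCP bits of outgoing packets, which the network can then match on to policy-route the flow into the appropriate VPN. The sketch below assumes a Linux host; the per-collaboration codepoints are hypothetical and would have to be agreed between the sites and their network providers, as would the matching routing policy:

    import socket

    # Hypothetical per-collaboration DSCP codepoints (6-bit values), shifted left
    # by 2 to fill the 8-bit TOS / Traffic Class field.
    DSCP = {"lhcone": 18, "dune": 20}

    def tagged_socket(collab: str, ipv6: bool = False) -> socket.socket:
        """Return a TCP socket whose packets carry the collaboration's DSCP mark."""
        tos = DSCP[collab] << 2
        if ipv6:
            s = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
            s.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_TCLASS, tos)
        else:
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            s.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, tos)
        return s

    # Example: all traffic sent through this socket is marked as DUNE traffic.
    s = tagged_socket("dune")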
Network Service Interface (NSI)

NSI​ is a web-service based API that operates between a requester software agent and a provider software agent. The full suite of NSI services allows an application or network provider to request and manage circuit service instances. Apart from the Connection Service, these include the Topology Service and the Discovery Service. The complete set of NSI services is described in the Network Services Framework v2.0. The Connection Service document describes the protocol, state machine, architecture and the associated processes and environment in which software agents interact to deliver a Connection. A Connection is a point-to-point network circuit that can transit multiple networks belonging to different providers.

Software defined exchanges (SDX)

Currently there is no single agreed-upon definition of what a Software-Defined Exchange (SDX) means. The concept of an SDX (from an academic perspective) was first introduced by ​Gupta et al​, motivating the role of Internet Exchange Points (IXPs) as a compelling place to introduce SDN for wide-area traffic delivery by offering direct control over packet-processing rules that match on multiple header fields and perform a variety of actions. To further motivate the SDN community, a report from the ​NSF workshop on Software Defined Infrastructures and Software Defined Exchanges​ widened the scope to position SDX in the context of Software Defined Infrastructure (SDI). A taxonomy for SDX was published by ​Chung et al​ to describe a classification of SDXs and their implementations by organizations and projects. This taxonomy defined the Layer-3 SDX, which exchanges BGP routes at Internet Exchange Points; the Layer-2 SDX, which exchanges multi-domain ethernet circuits in R&E networks; and the SDN SDX, which interconnects SDN islands.

For HEP, where R&E networks are used to interconnect research facilities across long geographic distances, having SDXs in the path can diminish, or even eliminate, some of the challenges of provisioning end-to-end circuits between research facilities. Network operators can take from days to several weeks to provision an intercontinental end-to-end circuit over multiple R&E networks, because these tasks are often done manually [​SDNAmLight​]. Adding SDN to the provisioning process can significantly reduce the provisioning cost, in time and effort, of science network services [​AdvRes​]. HEP can benefit from the advanced networking capabilities of an SDX that is underpinned by SDN even if the R&E networks connected to the exchange point are not SDN enabled. In the SDN architecture, the control and data planes are decoupled, network intelligence and state are logically centralized, and the underlying network infrastructure is abstracted from the applications [​ONFwp​]. Since an SDX is logically centralized, controllers have global visibility of the whole network topology, unlike in conventional networking, making it possible to maintain related states for the full path of a flow. Hence, an SDX can dynamically optimize flow management and resources. Furthermore, per-flow or application-level QoS provisioning becomes easier and feasible for network administrators [​QoS​].

The U.S. National Science Foundation (NSF) International Research Network Connections (IRNC) program supports three projects to build and operate SDXs that connect the U.S. R&E networks to Europe, Asia-Pacific, and South America. References for these three projects are: ​StarLight SDX​: A Software Defined Networking Exchange for Global Science Research and Education; ​AtlanticWave-Software Defined Exchange​: A Distributed Intercontinental Experimental Software Defined Exchange (SDX); ​Pacific Wave Expansion Supporting SDX & Experimentation​. A detailed description of these projects, their goals and accomplishments can be found in the Appendix.

Orchestrators

SENSE

The Software-defined network for End-to-end Networked Science at Exascale (SENSE) system is a model-based orchestration and resource management system which enables end-to-end multi-domain and multi-resource Layer 2/3 network service instantiation. The SENSE system includes an Orchestrator, a Network Resource Manager, and an End-Site Resource Manager, which can be deployed in flexible ways to adapt to a variety of site and infrastructure deployments.
SENSE allows the science applications to manage the network as a first-class schedulable resource, in a manner similar to instruments, compute, and storage. SENSE operates between the SDN layer controlling the individual networks/end-sites and the science workflow agents/middleware. The SENSE system defines the mechanisms needed to dynamically build end-to-end deterministic and policy-guided Layer 2/3 network services. An intent-based interface allows applications to express service requirements in a high-level, domain-science-specific context, with mechanisms for interactive and full-service lifecycle coordination with workflow automation systems.

The SENSE system has two key objectives:
● Facilitate science workflow related provisioning and life-cycle management for a variety of end-to-end network services.
● Provide the intelligence and control to enable more optimal and efficient use of network infrastructures.

The SENSE approach to end-to-end at-scale networking is based on software programmability and intelligent service orchestration. These are enabled by several novel technologies, including a) a hierarchical service-resource architecture, b) unified network and end-site resource modeling and computation, c) model-based real-time control, d) application-driven orchestration workflow, and e) future plans for end-to-end network data collection and analytics integration. There are four main functions of the SENSE system:
● SENSE Orchestrator North Bound Interface - A highly customizable interface for application workflow agents to query regarding possible actions and recommendations, and/or to request specific service instantiations. While a standard northbound interface has been defined, this interface is designed to be easily and rapidly customized for individual user requirements. The SENSE system has significant amounts of data and intelligence regarding the underlying networked systems. This information can be customized for user consumption in a highly detailed or abstract manner.
● SENSE Orchestration - This includes the integration of resource model-based descriptions from the underlying network infrastructures, the computation services to process resource models in response to user requests, and the coordination of provisioning actions.
● SENSE Orchestrator South Bound Interface - Provides for a continuous exchange of topology descriptions, which includes an ability for the resource owners to tailor the level of abstraction and real-time state in accordance with local policies and service objectives. This is one of the key innovations of the SENSE system and is based on semantic web-based graph models, which provide for a high degree of service flexibility and infrastructure owner-controlled customizations.
● Multi-Resource "SDN" Layer - The SENSE architecture does rely on an underlying SDN layer; however, it does not require a particular SDN controller or system implementation. The SENSE architecture accepts that there will be a variety of deployed SDN solutions which will cover specific network and administrative regions. The SENSE system provides the mechanisms and infrastructure to leverage these systems and provides guidance for how they can be fully integrated into the SENSE system. One method is for existing SDN systems to implement the SENSE Orchestrator Southbound Interface as their controller Northbound Interface. Existing systems may realize this via a native implementation of the SENSE API or via a thin layer on top of their existing API which provides the proper interface.
This thin SENSE layer can actually be embedded into the SENSE Orchestrator, such that no changes are required to many existing SDN systems. Adapting to underlying SDN systems for SENSE system integration has been used successfully as part of the SENSE system deployment on ESnet and other R&E infrastructures. Systems based on OpenDaylight (ODL), the Network Services Interface (NSI), the On-Demand Secure Circuits and Advance Reservation System (OSCARS), and the Open Network Operating System (ONOS) have all been integrated into SENSE orchestrator operations. The SENSE development and testing activities have demonstrated that valuable orchestrated services can be provided using these existing SDN systems as they are, with no internal modifications. In addition, more advanced functions can be enabled via the use of a native SENSE API implementation. To demonstrate these more advanced features, SENSE has developed a "Network Resource Manager" and an "End-Site Resource Manager". These resource managers utilize the SENSE native model-based API to realize advanced features in the areas of multi-resource integration, real-time responsiveness, computation to support negotiations, and user interactions. Additional information regarding the SENSE system is available via the ​SENSE INDIS Paper​ and the ​SENSE WebSite​.

NOTED

NOTED​ (Network-Optimized Transfer of Experimental Data) is a project led by CERN which aims to improve the efficient use of WAN networks and accelerate the transfers of large amounts of data between remote data centres. Engineering the traffic to avoid a chokepoint is not complicated to implement when traffic volumes and end points can be predicted. Unfortunately it's not easy to understand how the traffic could be optimized by just looking at the network, because large flows can end just seconds after they have been identified.

NOTED will implement a Transfer Broker, a framework that can understand when a large data transfer has started and publish such information for the use of network controllers. The Transfer Broker will query transfer services like Rucio and FTS to understand when a large data transfer has started, the total amount of data that will be exchanged, between which sites, and the percentage of the transfer completed. It will make this information available over a programmable interface, to allow network controllers to decide when and where to implement network optimizations. NOTED will also implement a Network Controller that can query the Transfer Broker and implement network optimizations using SDN.

Data Transfer Services

BigData Express

BigData Express (​http://bigdataexpress.fnal.gov​) is a U.S. Department of Energy (DOE) funded network research project. It seeks to provide a schedulable, predictable, and high-performance data transfer service for big data science.

A. System architecture and design

BigData Express will typically run in a data center. As illustrated in Figure 1, a typical site will feature a dedicated cluster of high-performance Data Transfer Nodes (DTNs), an SDN-enabled LAN, and a large-scale storage system. BigData Express can optionally use an on-demand site-to-site WAN connection service to provide the path(s) between source and destination sites. Normally, the WAN service supports guaranteed bandwidth and designated time slot reservations. ESnet and Internet2 are currently capable of providing such a WAN service via OSCARS and AL2S, respectively.
Such a WAN service is needed for BigData Express to establish end-to-end network paths with guaranteed QoS to support real-time and deadline-bound data transfer; without it, BigData Express provides best-effort data transfer.

Figure 1: BigData Express Architecture

BigData Express adopts a distributed, peer-to-peer model. A logically centralized BigData Express scheduler coordinates all activities at each BigData Express site. This BigData Express scheduler manages and schedules local resources (DTNs, storage, and the BigData Express LAN) through agents (DTN agents, storage agents, and AmoebaNet). Each type of resource may require one or multiple agents. The scheduler communicates with agents through an MQTT-based message bus. This architecture offers flexibility, robustness, and scalability. BigData Express schedulers located at different sites negotiate and collaborate to execute data transfer tasks. They execute a distributed rate-based resource brokering mechanism to coordinate resource allocation across autonomous sites.

The Web Portal allows users and applications to access BigData Express services. For a data transfer task, the following information will be conveyed to BigData Express via the Web Portal: the X.509 certificates of the task submitter, the paths and filenames of the data source, the paths of the data destination, the task deadline, and the QoS requirements. BigData Express uses this information to schedule and broker resources for the data transfer task, and then launches the data transfers. The Web Portal also allows users to browse file folders, check the data transfer status, or monitor the system/site status.

DTN agents collect and report the DTN configuration and status. They also assign and configure DTNs for data transfer tasks as requested by the BigData Express scheduler.

AmoebaNet keeps track of the BigData Express LAN topology and traffic status with the aid of SDN controllers. As requested by the BigData Express scheduler, AmoebaNet programs local networks at run-time to provide custom network services.

Storage agents keep track of local storage system usage, provide information regarding storage resource availability and status to the scheduler, and execute storage assignments.

Data Transfer Launching Agents initiate data transfer jobs as requested by the BigData Express scheduler. Typically, Data Transfer Launching Agents launch 3rd-party data transfers between DTNs using X.509 certificates on behalf of users. The Data Transfer Launching Agent features an extensible plugin framework that is capable of supporting different data transfer protocols, such as mdtmFTP, GridFTP, and XrootD.

The BigData Express scheduler implements a time-constraint-based scheduling mechanism to schedule resources for data transfer tasks. Each resource is estimated, calculated, and converted into a rate that can be apportioned among data transfer tasks. The scheduler ​assigns rates to data transfer tasks in the following order of priority: real-time data transfer tasks, then deadline-bound data transfer tasks, then background data transfer tasks.

B. A High-performance Data Transfer Engine

mdtmFTP is BigData Express' default data transfer engine. It offers high-performance data transfer capabilities. mdtmFTP achieves high performance through several key mechanisms. First, mdtmFTP adopts a pipelined, I/O-centric design. A data transfer task is carried out in a pipelined manner across multiple cores. Dedicated I/O threads are spawned to perform network and disk I/O operations in parallel.
Second, mdtmFTP utilizes the MDTM middleware services to make optimal use of the underlying multicore system. Finally, mdtmFTP implements a large virtual file mechanism to address the Lots of Small Files (LOSF) problem. Evaluations have shown that mdtmFTP achieves higher performance than data transfer tools such as GridFTP, FDT, and BBCP. mdtmFTP supports third-party data transfer. It also supports GSI-based security. Figure 2 illustrates a BigData Express data transfer example: a Data Transfer Launching Agent launches a third-party data transfer between two DTNs using X.509 certificates.

Figure 2: BigData Express launches data transfer jobs

C. On-Demand Provisioning of End-to-End Network Paths with Guaranteed QoS

BigData Express intelligently programs the network at run-time to suit data transfer requirements. It dynamically provisions end-to-end network paths with guaranteed QoS between DTNs. An end-to-end network path typically consists of LAN and WAN segments. In the BigData Express end-to-end data transfer model, LAN segments are provisioned and guaranteed by AmoebaNet, while WAN segments are provisioned through on-demand WAN path services such as ESnet OSCARS or Internet2 AL2S to provide paths between the data source and destination sites.

AmoebaNet applies SDN technologies to provide "application-aware" network services in the local network environment. It offers several capabilities to support BigData Express operations. To support network programmability, AmoebaNet provides a rich set of network programming primitives to allow BigData Express to program the local area network at run-time. To support QoS guarantees, ​AmoebaNet provides two classes of services, ​priority​ and ​best-effort​. Priority traffic flows are typically specified with designated rates or bandwidth. AmoebaNet uses QoS queues to differentiate priority and best-effort traffic at each SDN switch. Priority traffic is transmitted first, but metered to enforce rate control. In addition, AmoebaNet supports QoS-based routing and path selection. Finally, AmoebaNet supports fine-grained control of network traffic.

WAN QoS can be provisioned and guaranteed by utilizing ESnet OSCARS or Internet2 AL2S to reserve bandwidth between Service Termination Points (STPs), where AmoebaNet services end. Typically, AmoebaNet gateways (GWs) are either logically or physically connected to WAN STPs. VLAN popping, pushing, and/or swapping operations are performed at AmoebaNet gateways to concatenate WAN and LAN segments.

As illustrated in Figure 3, BigData Express typically performs the following operations to provision an end-to-end network path:
1) Estimate and calculate the DTN-to-DTN traffic matrix and the related QoS requirements (e.g. throughput, delay).
2) Negotiate and broker network resources to determine the end-to-end rate for the path.
3) Call the ESnet OSCARS or Internet2 AL2S circuit service to set up a site-to-site WAN path.
4) Call AmoebaNet at each site to program and configure the LAN paths.
5) Send PING traffic to verify that a contiguous end-to-end network path has been successfully established.

A large data transfer job typically involves many DTNs, and a correspondingly large number of data flows. To avoid the necessity of establishing many WAN paths between the source and destination sites, multiple LAN segments can be multiplexed/de-multiplexed to/from a single WAN path, which in turn is configured to support the aggregated bandwidth of its component paths.
This strategy helps to reduce the burden on WAN path services.

Figure 3: Provisioning of an end-to-end path with guaranteed QoS

D. Security

BigData Express runs in secure environments. At each site, BigData Express systems run in trusted security zones protected by security appliances. All DTNs are secured using X.509 certificates. All BigData Express sites use a common single sign-on service (CILogon) to obtain X.509 certificates for secure access to DTNs. In addition, each site publishes its public key so that different sites can establish trust. Communication channels between two sites are secured by HTTPS.

Users are authenticated and authorized to access BigData Express services. From a user's perspective, BigData Express provides two layers of security:
(1) A user must first use his/her username and password to log in to a particular BigData Express web portal. Once login is successful, the user can manage data transfer tasks (submission, cancellation, and monitoring), or monitor the system/site status.
(2) Within a logged-in web portal, a user must further log in to the data transfer source and/or destination site(s) to obtain X.509 certificates for secure access to the local DTNs. Once authenticated locally, the user can browse files and/or launch data transfer tasks. With the CILogon-issued X.509 certificates, the BigData Express scheduler will request Data Transfer Launching Agents to launch data transfer tasks on behalf of the user.

The BigData Express software is currently deployed and being evaluated at multiple research institutions, including UMD, StarLight, FNAL, KISTI, KSTAR, SURFnet, and Ciena. The BigData Express research team is deploying BigData Express on various research platforms, including the Pacific Research Platform, the National Research Platform, and the Global Research Platform.

Research & Education Networks Programmable Services

ESnet6

ESnet is the High-Performance Network (HPN) user facility for the DOE Office of Science. As the current ESnet network reaches almost 10 years of operations, the organization is in the middle of a significant upgrade to its network, executed in parallel with ongoing operations.

ESnet6 Architecture for programmability

The ESnet6 network was conceived with three main objectives in mind:
1. Capability​: Meeting the bandwidth requirements of the DOE Office of Science research mission. With the current exponential growth that ESnet is experiencing, ESnet staff project that the network will carry approximately 1 exabyte of traffic per month in the FY 2021-2022 timeframe.
2. Reliability​: Addressing the resiliency requirement of scientific workflows. With increasing reliance on the network to support data movements ranging from bulk data transfers to low-latency remote control applications, the network is a critical resource in the larger application workflow that includes instrument, compute, and storage elements.
3. Flexibility​: Enabling the evolution of network services. With the increasing need for data management within complex workflows involving distributed resources (e.g., instruments, local computing, remote high performance computing (HPC), storage, and networks), it is necessary for network services to evolve from simply providing best-effort connectivity to higher-level intelligent services.

To achieve these objectives, the ESnet6 network was designed with five overarching goals:
1. No single point of failure.​ The primary consideration for this design goal was to ensure that the physical network infrastructure met this objective.
The single-point-of-failure protection also extends to the network elements deployed within the ESnet6 network. This entails properties such as redundant routing engines, switching fabrics, and power supplies. In addition, management of compute processes (e.g., Virtual Network Functions (VNFs)) can be run in a high-availability (HA) fashion, utilizing local compute clusters with software to manage process migration.
2. Agility to add and/or move bandwidth capacity​. Having the capability to dynamically add and/or move bandwidth capacity within the network greatly increases the ability to address short-term capacity over-subscription and long-term growth.
3. Fine-grained traffic engineering​. The ability to perform fine-grained traffic engineering at scale within the wide-area network is a crucial building block for many advanced network services. In addition to supporting data transfers that require predictable service guarantees (e.g., deadline scheduling, real-time remote control, etc.), fine-grained traffic engineering can be used to run the network more efficiently by implementing network-wide, optimized load-balancing and global (vs. local) resiliency strategies.
4. Comprehensive automated management​. As the complexity of the network increases, manual management at scale becomes exponentially more difficult and untenable. Automated functions are needed to administer the network beyond human-scale limitations, as well as to provide a basis for more intelligent proactive management. The use of abstraction and modeling can facilitate intent-based management to simplify administration by allowing engineers to dictate what should be done, instead of how it is done.
5. Highly programmable and flexible services​. As scientific workflows become more complex, there is an increasing dependency on the network to provide more intelligent services beyond best-effort connectivity. In a traditional Internet Service Provider (ISP) environment, the creation of new protocols or software features needed to support advanced services is typically gated by what commercial vendors are willing to develop. Therefore, it is astute to augment the network with flexible programmable platforms, such as P4 switches or FPGAs, to enable the development of new services that cannot be run on more traditional ASIC-based network equipment.

Apart from a fundamental change in the network architecture from ESnet5 to ESnet6, there is an intentional focus on comprehensive orchestration and automation for ESnet6. As part of this effort, ESnet has formalized a software architecture which dictates how both in-house developed and purchased software will interoperate and integrate within this software framework. The ESnet6 software architecture covers five main areas necessary for the day-to-day operations of the ESnet network: i. Assurance (e.g., fault management, root cause analysis, incident management); ii. Provisioning (e.g., workflow orchestration, automated provisioning); iii. Security (e.g., audit logging, black hole routing); iv. Analytics (e.g., time series data, streaming telemetry, traffic planning); and v. a unified data model for network intent, discovery, and topology.

"High-Touch" services

ESnet6 is implementing a new class of programmable services called "High-Touch". These services will motivate the deployment of programmable data plane hardware like P4 switches, NPUs, and/or FPGAs into the network, along with associated compute and storage platforms.
While many ideas can be implemented using this platform, from NFV-like services to in-network caching, the first service ESnet is working on is Precision Network Telemetry, described below.

Precision Network Telemetry

Precision Network Telemetry is a key new capability being developed as part of the High-Touch services portfolio that will be present in ESnet6. As a network operator, ESnet will be able to select flows based on a variety of criteria, such as L3VPN, L2VPN, and IP 5-tuples within these VPNs, to select anything from a single data transfer to groups of transfers between a range of sites. For any of these flows, it will be possible to monitor every packet header in the flow. With the capability to analyze each packet, along with the augmentation of precision timestamps of packet arrival times, obtaining detailed per-flow statistics becomes a reality. This in turn opens a myriad of possibilities for understanding how data transfers are performing (e.g., real-time auditing of packet loss per flow, monitoring throughput vs. goodput), discerning usage patterns and trends (e.g., mapping data movement across the entire data lake), and developing models to characterize distributed workflows (e.g., correlating per-flow information of data movements with job submissions). Translating this deep visibility into a useful tool for HEP workflows will take iterative dialogue and joint work, especially to understand what user experience needs to be improved vs. what can be improved. It is critical to understand that developing relevant High-Touch services will require a level of engagement and collaboration between HEP and ESnet. It is important to have representatives who understand every detail of HEP technologies like data lakes, XRootD, Rucio and compute schedulers, along with ESnet counterparts who understand what level of detail ESnet's high-precision telemetry can expose, in order to design a solution that works and has real benefit to HEP.
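As a purely illustrative sketch (not ESnet's implementation), the fragment below shows the kind of per-flow accounting that timestamped packet-header records make possible: given (timestamp, flow id, TCP sequence number, payload length) records, as might be exported by a programmable data plane, it derives throughput, goodput and an estimate of retransmitted bytes per flow. The record format and field names are assumptions made only for this example.

```python
"""Illustrative only: per-flow statistics from timestamped packet-header records.
The (t_seconds, flow_id, tcp_seq, payload_len) record format is hypothetical."""
from collections import defaultdict

def per_flow_stats(records):
    flows = defaultdict(lambda: {"first": None, "last": None,
                                 "bytes": 0, "retx_bytes": 0, "seen_seq": set()})
    for t, flow_id, seq, length in records:
        f = flows[flow_id]
        f["first"] = t if f["first"] is None else f["first"]
        f["last"] = t
        f["bytes"] += length
        # A sequence number observed twice is counted as a retransmission,
        # i.e. wire bytes that did not contribute to goodput.
        if seq in f["seen_seq"]:
            f["retx_bytes"] += length
        else:
            f["seen_seq"].add(seq)
    stats = {}
    for flow_id, f in flows.items():
        duration = max(f["last"] - f["first"], 1e-9)
        stats[flow_id] = {
            "throughput_gbps": 8 * f["bytes"] / duration / 1e9,
            "goodput_gbps": 8 * (f["bytes"] - f["retx_bytes"]) / duration / 1e9,
            "retransmitted_bytes": f["retx_bytes"],
        }
    return stats

if __name__ == "__main__":
    # Three toy packets of one flow; the last one repeats a sequence number.
    sample = [(0.000, "a->b", 1, 1500), (0.001, "a->b", 1501, 1500), (0.002, "a->b", 1501, 1500)]
    print(per_flow_stats(sample))
```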
In-network caching/computing

In-network caching and in-transit computing services: in-network services such as temporary data caching and computing could potentially have a big impact on how remote data is accessed and processed while in transit. For example, the data volume moving through the network increases with newer scientific experiments and simulations, and network bandwidth requirements grow proportionally in order to deliver the data within a given time frame. We observe that a significant portion of popular datasets is transferred multiple times to different users, as well as to the same user, for various reasons, including a lack of local storage capacity. Sharing data among users and multiple jobs reduces redundant data transfers and consequently saves network traffic volume. Sharing data within a site can be accomplished with the site's own large storage systems; sharing data among geographically distributed users, however, can only be accommodated with some kind of content-delivery network. We are experimenting with how an in-network caching mechanism helps network traffic and application performance, how much data can be shared within the network, how much network traffic volume can consequently be reduced, and how much efficiency a temporary in-network cache brings to application performance. We expect significant network traffic savings from sharing data, and an overall performance improvement in applications, because access to shared data in a network cache has lower latency.

In our preliminary study with a ~14 TB dataset used by multiple analysis jobs, about 60% of the dataset is cached in the end, and about 19 TB of network traffic is saved by sharing the data among the analysis jobs. A shared data caching mechanism is therefore expected to reduce redundant data transfers and consequently save network traffic. There will be further studies on the capabilities of the in-network data caching mechanism and of in-transit computing to provide better services for the HEP community.

FABRIC

FABRIC is a unique research infrastructure to enable cutting-edge and exploratory research at scale in networking, cybersecurity, distributed computing and storage systems, machine learning, and science applications. It is an everywhere-programmable nationwide instrument composed of novel extensible network elements equipped with large amounts of compute and storage, interconnected by high speed, dedicated optical links. It will connect a number of high-performance computing facilities and specialized testbeds (5G+/IoT, NSF Clouds) to create a rich fabric for a wide variety of experimental activities. It will create opportunities to explore innovative solutions not previously possible and will provide a platform on which to educate and train the next generation of researchers on future advanced distributed systems designs.

This National Science Foundation (NSF) funded project creates a nation-wide high-speed (100-1000 Gigabits per second) network interconnecting major research centers and national computing facilities that will allow researchers and scientists at these facilities to develop and experiment with new distributed application, compute and network architectures not possible today. FABRIC nodes can store and process information "in the network" in ways not possible in the current Internet, enabling research into completely new networking protocols, architectures and applications that address pressing problems with performance, security and adaptability in the Internet. Reaching deep into university campuses, FABRIC will connect university researchers and their local compute clusters and scientific instruments to the larger FABRIC infrastructure. The figure below shows the planned FABRIC deployment, with initial nodes being available by the end of 2020. The FABRIC network will include a set of core nodes deployed at ESnet6 points of presence, and multiple edge nodes deployed at a variety of regional network, campus, and scientific facility locations.

The FABRIC infrastructure is not intended to be an isolated testbed, but will serve as a programmable interconnect fabric to a variety of experimental and production resources, allowing the community to deploy novel applications that range from one-time wide-scale deployments with highly experimental software to near-production-ready persistent experimental services serving a wide range of users from campuses, industry and labs. In addition to the network-embedded compute, storage, and data plane programmability, the testbed will include Layer 3/2 peering with the Research and Education (R&E) infrastructure such that experiment topologies can access production resources and data sets. FABRIC will also support hybrid infrastructures interconnecting FABRIC's physical network infrastructure with cloud-based virtual end systems or clusters.
This will include dedicated Cloud Connect Services, including Amazon AWS Direct Connect, Google Dedicated Interconnect, and Microsoft ExpressRoute, between FABRIC core nodes and commercial cloud providers. This will allow the virtual networks, virtual machines and application stacks in these cloud providers to "peer" with the FABRIC network to form hybrid systems to support the development of customized application workflow functions.

The High Energy Physics (HEP) experiment community could leverage the FABRIC infrastructure to enhance, validate and test pre-production workflows at scale associated with LHC data distribution, computation, and analysis. The network-embedded compute, storage, and data plane programmability can be combined with access to NSF supercomputer centers (like TACC), R&E production networks (like ESnet) and public cloud infrastructures and data repositories to facilitate research and development into new scientific methods and paradigms.

The FABRIC team is led by UNC-Chapel Hill RENCI (Renaissance Computing Institute) and includes the University of Kentucky, the Department of Energy's Energy Sciences Network (ESnet), Clemson University and the Illinois Institute of Technology. Contributors from the University of Kentucky and ESnet will be instrumental in designing and deploying the platform's hardware and developing new software. Clemson and Illinois Institute of Technology researchers will work with a wide variety of user communities—including those focused on security, distributed architectures, scientific applications and data transfer protocols—to ensure FABRIC can serve their needs. In addition, researchers from many other universities will help test the platform and integrate their computing infrastructure and scientific instruments into FABRIC.

GEANT Higher-level Services

GÉANT community survey on orchestration, automation and virtualization

In 2019 the GN4-3 project executed a survey to learn more about the current status and plans of NRENs (National Research and Education Networks) in the GÉANT community for adopting and implementing orchestration, automation and virtualization (OAV) principles [D6.2]. The main goals of the initiative were as follows:

● Learn about the strategy and actions of each NREN related to network and service orchestration, automation and virtualisation (OAV).
● Explore if there are common use cases, ideas, needs and issues in the community in the areas of automation, orchestration and virtualisation.
● Recognise possible areas of collaboration both amongst NRENs and between NRENs and GÉANT.
● Determine and recommend possible future work of the project that could be of benefit to as many partners as possible for identified use cases.

The survey participants have shown a high level of interest in OAV. The benefits of adopting OAV in NRENs are very similar to those seen by commercial communication service providers or content providers and include:

● Faster service delivery / reduced delivery time.
● Reduced human error and manual work.
● Lower service delivery costs.
● Better reporting.
● Increased efficiency.
● Ensured and increased configuration uniformity and consistency.

There is a wide variety of components and systems being used by NRENs. These are mostly OSS, with rather fewer BSS, and very little OSS-BSS integration (whether proprietary, in-house or open source).
Further, there is no obvious common single direction or best practice for OAV, and currently little use of the standard data models and APIs that facilitate OAV. A wide variety of use cases were reported. Connectivity services were of very broad interest to most if not all NRENs, with configuration integrity ranking highest of the application areas, followed by problem troubleshooting and security operations centre enhancements. While most NRENs do not yet have inter-domain OAV, NRENs were mostly interested in the following cross-domain use cases: NREN and a campus (42%), NREN-GÉANT network (35%) and then NREN-cloud (23%). Interestingly, some 80% of NRENs would be willing to allow changes to their configurations resulting from requests initiated in other domains, assuming the agreed policies, procedures, authentication and authorisation are in place.

GÉANT Testbeds Service

The GÉANT Testbeds Service (GTS) was initially developed to allow researchers to set up individual testbed environments in an automated fashion. However, the word 'testbed' is misleading, as GTS is a general provisioning system that allows users or network engineers to automatically set up any virtual network for any purpose. GTS is based on the concept of a Generalized Virtualization Model (GVM), which allows networks to be sliced in an abstracted way where abstracted virtualized objects, called resources, can be defined, instantiated and arranged to create application-specific insulated networks. GVM defines this set of service objects (drawn from the NSI standards) to create virtual objects or resources and describe their data flows through their life-cycles. Everything in GVM is considered a resource, i.e. a resource could refer to computational facilities, transport circuits, custom switching, forwarding capabilities or integrated storage, and all resources are building blocks for dynamically provisioned networks. In GVM the purpose of a resource, its function or behavior, is not constrained by the GVM architecture. GVM simply describes how types or classes of resources are defined, how they can be reserved and managed through their life-cycles, and how their data flow relationships can be structured and controlled. Since a resource in GVM can represent anything, GVM is technology agnostic; in other words, it does not promote a certain underlying physical infrastructure or hardware technology to be a resource or service capability. This virtualisation of service objects in GTS allows the automation of resource provisioning: users can construct network environments using these virtual building blocks via a web page, and the GTS software-based resource manager then coordinates and orchestrates the setup of the virtual network environments. Available resource building blocks in GTS are currently Virtual Machines (VMs), Virtual Circuits (VCs), Virtual SDN switches (which are mapped to completely virtualized hardware switch instances (VSIs)) and Bare Metal Servers (BMSs). Different resource classes can be defined at the virtualization layer, representing certain types of physical resources and software components. Software modules referred to as Resource Control Agents (RCAs) are responsible for instantiating these abstract resource classes in the physical infrastructure.
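To make the GVM abstraction more concrete, the following minimal sketch models a resource class with a reserve/activate/release life-cycle and an RCA that maps it onto infrastructure. All class and method names are invented for illustration; this is not the actual GTS/GVM code.

```python
"""Illustrative sketch of the GVM idea: abstract resource classes with a
life-cycle, instantiated on infrastructure by Resource Control Agents (RCAs).
All names are hypothetical; this is not the GTS implementation."""
from abc import ABC, abstractmethod

class Resource(ABC):
    """Any building block of a dynamically provisioned network (VM, circuit, switch...)."""
    def __init__(self, name: str):
        self.name = name
        self.state = "defined"          # defined -> reserved -> active -> released

    def reserve(self):
        self.state = "reserved"

    def release(self):
        self.state = "released"

class ResourceControlAgent(ABC):
    """Maps an abstract resource class onto a concrete piece of infrastructure."""
    @abstractmethod
    def instantiate(self, resource: Resource) -> None: ...

class VirtualCircuit(Resource):
    def __init__(self, name, endpoint_a, endpoint_b, capacity_mbps):
        super().__init__(name)
        self.endpoints = (endpoint_a, endpoint_b)
        self.capacity_mbps = capacity_mbps

class CircuitRCA(ResourceControlAgent):
    def instantiate(self, resource: VirtualCircuit) -> None:
        # A real RCA would talk to an NSI provider or SDN controller here.
        print(f"provisioning {resource.capacity_mbps} Mb/s circuit "
              f"{resource.endpoints[0]} <-> {resource.endpoints[1]}")
        resource.state = "active"

if __name__ == "__main__":
    vc = VirtualCircuit("demo-vc", "site-A", "site-B", capacity_mbps=1000)
    vc.reserve()
    CircuitRCA().instantiate(vc)
    print(vc.name, vc.state)
```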
GTS therefore allows the integration of any type of resource for automatic provisioning and integration into the system, such as science instruments (e.g. radio telescopes, gene sequencers, etc.), licensed mobile spectrum or sensors, as long as there is an RCA available that maps the virtual service object to the underlying infrastructure. The concept of abstract service objects extends to fine-grained functional capabilities such as traffic handlers (rate shapers, policers, load balancers, filters, IDS functions, etc.) in the emerging field of network function virtualization (NFV). GTS and its underlying GVM architecture were developed in parallel to the ETSI NFV industry specification group's virtualisation model [ETSI], and both models show many overlapping components: the ETSI-NFV orchestrator can be compared to the GUI of GTS, and the ETSI Service, VNF and Infrastructure descriptions correspond to the DSL code used in GTS to describe a virtual network's desired topology. ETSI-NFV Virtual Computing and Computing Hardware components resemble the RCA-VM and RCA-BMS components in GVM; similarly the Virtual Network component matches the RCA-VC for Virtual Circuits.

AmLight Express and Protect (AmLight-ExP)

AmLight-ExP is a project with support from the U.S. National Science Foundation (NSF) International Research Network Connections (IRNC) program, in collaboration with project partners Florida International University, AURA, LSST, RNP, ANSP, Clara, REUNA, FLR, Telecom Italia Sparkle, Angola Cables, and Internet2. This figure represents the topology of the AmLight-ExP network infrastructure. The Express Ring from Boca Raton to Fortaleza to Sao Paulo consists of six 100G links (green lines; four managed by RNP and two managed by FIU/ANSP/LSST). The 100G Protect Ring from Miami to Fortaleza, Fortaleza to Sao Paulo, Sao Paulo to Santiago, Santiago to Panama, and Panama to Miami has one 100G link (solid orange). The 10G ring from Miami to Sao Paulo and Sao Paulo to Miami for protection has one 10G link (red dashed). There is an additional 10G link from Miami to Santiago for protection (orange dashed). The 100G and 10G rings are diverse, operating on multiple submarine cables. Current total upstream capacity is 630 Gbps. The total aggregated capacity of all the segments is 1,230 Gbps / 1.23 Tbps. All the links are configured to support Jumbo frames, OpenFlow flow entries, IPv4/6, and multicast.

Support for Network Virtualization and Programmability on AmLight-ExP

AmLight has been supporting SDN in production since 2014. Researchers may use slicing or virtualization to prototype network-aware applications. Researchers may implement testbeds with real network devices. They can validate their research in a production environment, and at scale. In the figure above, the boxes in lavender to the right of the yellow OESS box represent testbeds. The green boxes above are applications that use the northbound API of the SDN applications below in the figure. For example, the production SDN controller OESS uses OpenFlow to connect to the data plane and provides a REST API to be used by OpenNSA. For interdomain connections, BGP for Layer 3 and NSI for Layer 2 are supported.

Challenges and Outlook

Programmable WANs are still an area of intensive research and development, and while the existing projects have a well-defined scope and match the HEP use cases well, a number of challenges remain:

● One of the core challenges has for some time been the difficulty of bringing existing projects from the testbed/prototype stage into production.
Within LHCONE R&D efforts, a number of projects were successfully demonstrated in the past, but it has proven to be very challenging to deploy them in the production infrastructure. What appears to be missing are network infrastructures where prototypes can be tested at scale and then easily deployed/migrated to production.

● While the situation differs between R&Es, there is currently a significant lack of telemetry and monitoring data that can be programmatically accessed via APIs. This makes it difficult to develop systems that would improve efficiency in using the existing capacities (avoiding spikes in traffic when multiple experiments transfer large amounts of data at the same time).

● The lack of available telemetry, tracing and insight into how the current networks operate is to some extent currently balanced out by the deployment and operation of a complex mesh of end-to-end testing infrastructure (perfSONAR), but this is only applicable to a limited set of use cases (mostly to debugging/tracing of end-to-end network performance issues).

● From the perspective of data transfer systems, the concept of a transfer broker that could provide planned/anticipated transfers across multiple experiments is currently missing. It should be relatively easy to accomplish this, provided that the experiments agree on ways to publish their planned/anticipated or even real-time transfers.

● As another alternative, automated methods for traffic engineering that would automatically adapt to the existing workloads have been proposed by different projects (both in SDX and orchestrators). Such systems promise to keep the existing status quo, where the state of the underlying network and its operations are transparent to the experiments. It remains to be seen if such approaches will be feasible in a large-scale federated environment (such as LHCONE).

● There is a wide range of functionalities and services planned, and it can be expected that significant variations will exist between different R&Es. It is less clear how synergies will be found and how software defined networks will be offered across R&E boundaries.

● As was mentioned in the previous chapter, DCIs and other inter-DC connectivity options are available, but it is unclear how they will be supported and how they will work across R&Es.

● One of the major challenges that remains is to understand how we can effectively propose and run cross-domain projects involving sites, experiments and R&Es (there are existing projects that have faced similar challenges, such as EU OCRE and HNSciCloud, but cross-continental projects still remain a major challenge).

Proposed Areas of Future Work

A primary goal of this document is to seed a collaboration between the experiments, the sites and the research and education networks to deliver capabilities needed by HEP for their future infrastructure while enabling the sites and NRENs to most effectively support HEP with the resources they have. In this section we will outline possible areas of future work that can help tie together activities within and among the experiments and sites with network engineers, NRENs and researchers. It is critical that we identify projects that are useful to the experiments, deployable by sites, and that involve a range of participants spanning the sites, the experiments and the (N)RENs. Without the involvement of each, we risk creating something unusable, irrelevant or incompatible.
This document completes the HEPiX NFV/SDN working group Phase I effort, which targeted surveying the various technologies and projects relevant in this area for WLCG and suggesting an outline of a possible Phase II. The Phase I results were presented at the Fall 2019 HEPiX workshop in mid-October, and this report will be broadly distributed thereafter for comment. We will use the next LHCOPN/LHCONE meeting at CERN in mid-January 2020 as a decision point for a Phase II program of work. Below is a list of areas we have discussed that we believe are relevant to at least some of our stakeholders for a second phase of this working group:

● Data lakes, federated sites, cross-site deployment technologies; specifically projects focusing on inter-DC connectivity technologies such as DCI (software or hardware), exploring the possibilities for automated cross-site deployment and operations as well as the possibility to co-locate caches (such as XCache) at network hubs.

● Container-based edge services (such as SLATE), container-native and/or federated prototypes (Kubernetes-only sites), HTCondor Kubernetes backend prototype testing. In particular, projects focusing on deployment strategies of container-based systems in HEP and adoption of the previously listed industry standards and approaches.

● Federated sites, cross-site deployment of network virtualisation and cloud-native networking. Two potential areas would be orchestration of a separated VM region running on cloud-native networking and inter-connecting to other sites via DCIs; the other would be federated storage using DCIs and cloud-native networking.

● Storage/GPU virtualisation for container/VM-based sites - integration of block storage and GPUs in the data centres using NVMe-oF (NVMe over Fabrics) or similar technologies; DC Edge hyper-converged infrastructures and edge service architectures for the HLT/readout systems.

● SD-WAN integrations at the DC edge (multi-ONE and other SD-WAN approaches) - transfer broker prototypes (that would include anticipated and existing transfers across experiments, or existing transfers per domain/experiment; also tied to WLCG DOMA). Additional projects in this area might involve programmatic access to the network topologies (LHCONE), routing tables and their visualisation.

● SDX use cases in network provisioning, domain failure limitation, automated/intelligent brokering of the experiments' traffic in the background, APIs (for multi-ONE/LHCONE?).

● Clouds - extending DC site networking to clouds (via VPC) and exploring transit gateways and similar approaches [AWS].

● GÉANT connectivity services, ESnet6 telemetry/higher-level services, FABRIC and the GÉANT Testbeds Service (GTS) as testbeds for data management R&D. Packet tracing mechanisms across different R&Es, monitoring and network telemetry APIs.

However, given the diversity and length of this list, individual items are either too generic, too specific or not sufficiently relevant to all stakeholders to create a broad collaboration around. To make progress in the near term, we would like to suggest a few specific areas where the sites, experiments and NRENs might profitably collaborate. Below are three examples we believe deserve discussion to see if we can identify interest in near-term collaboration with targeted deliverables. These are not meant to be exclusive, merely suggestions based upon the working group's interactions and discussions amongst its members.
1. Making our network use visible - Understanding the HEP traffic flows in detail is critical for understanding how our complex systems are actually using the network. The potential work here is to identify how we might label our traffic to indicate which experiment and task it is a part of. This is especially important for sites which support many experiments simultaneously, where any worker node or storage system may quickly change between different users. With a standardized way of marking traffic, any NREN or end-site could quickly provide detailed visibility into HEP traffic to and from their site. The technical work would encompass how to mark traffic at the network level, defining a standard set of markings and providing the tools to the experiments to make it easy for them to participate. We note that the increasing use of VMs and containers might make marking traffic easier where those technologies are in use.

2. Shaping data flows - It remains a challenge for HEP storage endpoints to utilize the network efficiently and fully. An area of potential interest to the experiments is traffic shaping (see the packet pacing discussion above in the Programmable WAN Motivation section). Without traffic shaping, network packets are emitted by the network interface in bursts corresponding to the wire speed of the interface. Packets sent from a 10 Gbps network interface can create a micro-burst which can overflow buffers along the path or at the destination, causing packet loss. The impact on TCP, especially for high-bandwidth transfers on long network paths, can be significant. If instead flows are shaped to better match the end-to-end usable throughput, the result is much smoother flows at higher bandwidth. A significant extra benefit is that these smooth flows are much friendlier to other users of the network by not bursting and causing buffer overflows. (A minimal socket-level sketch of marking and pacing a flow is given after this list.)

3. Network orchestration to enable multi-site infrastructures - Within our data centers, technologies like OpenStack and Kubernetes are being leveraged to create very dynamic infrastructures to meet a range of needs. Critical for these technologies is a level of automation for the required networking, using both software defined networking and network function virtualization. As we look toward the high luminosity LHC, the experiments are trying to find tools, technologies and improved workflows that may help bridge the anticipated gap between the resources we can afford and what will actually be required to extract new physics from the massive data we expect to produce. The ways in which we may organize our computing and storage resources will need to evolve. Architectures like Data Lakes, federated or distributed Kubernetes and multi-site resource orchestration will certainly benefit from (or require) some level of WAN network orchestration to be effective. To support this type of resource organization evolution, we need to begin to prototype and understand what services and interactions are required from the network. We would suggest that a sequence of limited-scope proof-of-principle activities in this area would be beneficial for all our stakeholders.
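The following minimal sketch, for Linux hosts, illustrates what proposals 1 and 2 could look like at the socket level: a DSCP marking so that a flow can be attributed to an experiment or activity, and a kernel pacing rate so the flow is emitted smoothly rather than in line-rate bursts. The DSCP value, pacing rate and endpoint are hypothetical choices made only for illustration; an agreed WLCG-wide marking scheme and integration into the actual transfer tools would be part of the proposed work.

```python
"""Illustrative sketch only: marking and pacing a single TCP flow on Linux.
The DSCP value, pacing rate and endpoint below are hypothetical."""
import socket

SO_MAX_PACING_RATE = 47          # Linux socket option number (asm-generic/socket.h)
DSCP_EXPERIMENT_X = 18           # hypothetical per-experiment/activity code point
PACING_RATE_BYTES = 500_000_000  # ~4 Gbit/s, i.e. well below a 10 Gbit/s line rate

def open_marked_paced_socket(host: str, port: int) -> socket.socket:
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Proposal 1: label the flow so sites/NRENs can attribute it to an
    # experiment/activity (the DSCP is carried in the IP TOS byte).
    s.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, DSCP_EXPERIMENT_X << 2)
    # Proposal 2: ask the kernel to pace the flow instead of emitting
    # wire-speed micro-bursts (honoured by the fq qdisc / TCP internal pacing).
    s.setsockopt(socket.SOL_SOCKET, SO_MAX_PACING_RATE, PACING_RATE_BYTES)
    s.connect((host, port))
    return s

if __name__ == "__main__":
    sock = open_marked_paced_socket("storage.example.org", 1094)  # hypothetical endpoint
    sock.sendall(b"...")   # transfer payload would go here
    sock.close()
```

In practice the same markings would need to be defined once, collision-free across experiments, and applied inside the storage and transfer middleware rather than in ad-hoc scripts, so that sites and NRENs can rely on them.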
Conclusions

The primary challenge we face is ensuring that WLCG and its constituent collaborations will have the networking capabilities required to most effectively exploit LHC data for the lifetime of the LHC. To deliver on this challenge, automation is a must. The dynamism and agility of our evolving applications, tools, middleware and infrastructure require automation of at least part of our networks, which is a significant challenge in itself. While there are many technology choices that need discussion and exploration, the most important thing is ensuring that the experiments and sites collaborate with the RENs, network engineers and researchers to develop, prototype and implement a useful, agile network infrastructure that is well integrated with the computing and storage frameworks being evolved by the experiments, as well as with the technology choices being implemented at the sites and RENs.

Terminology

Switch/bridge - networking hardware that connects devices on a computer network by using packet switching to receive and forward data to the destination device. In the context of this report, switch and bridge are used interchangeably and refer to a multiport network bridge that uses media access control addresses to forward data at the data link layer (layer 2) of the OSI model.

Router - a networking device that forwards data packets between computer networks. Routers perform the traffic directing functions on the Internet. When multiple routers are used in interconnected networks, the routers can exchange information about destination addresses using a routing protocol. In the context of this report we usually refer to routers/routing at layer 3, where packets are sent to a specific next-hop IP address, based on the destination IP address.

VPN - Virtual Private Network - extends a private network across a public network, and enables users to send and receive data across shared or public networks as if their computing devices were directly connected to the private network.

VLAN - Virtual Local Area Network - provides a layer 2 abstraction; in a traditional bridged network it is used together with the Spanning Tree Protocol (STP), which constructs a single tree across a bridged network to forward all packets. With VLANs, there is usually a separate tree per VLAN.

VRF - Virtual Routing and Forwarding - a layer 3 abstraction which provides a separate routing table for each virtual private network (VPN); usually this is done by adding some sort of VRF ID to the routing table lookup.

VXLAN - Virtual eXtensible Local Area Network - a typical example of an overlay model, used to build traffic isolation at layer 2 across a layer 3 infrastructure. Usually it is coupled with VRF to provide a full bridging and routing overlay solution.

NVGRE - Network Virtualisation over GRE protocol [RFC7637]

STT - Stateless Transport Tunneling

DWDM - Dense Wavelength Division Multiplexing - an optical technology used to increase bandwidth over existing fiber optic backbones.

GENEVE - Generic Network Virtualisation Encapsulation

GRE - Generic Routing Encapsulation protocol

FRR - Free Range Routing - an open source framework implementing a number of common routing protocols (OSPF, IS-IS, BGP), deployed as part of open source network operating systems.

SDN - Software Defined Networking - an approach to network management that enables dynamic, programmatically efficient network configuration in order to improve network performance and monitoring, making it more like cloud computing than traditional network management.
AS - Autonomous System - a collection of connected Internet Protocol (IP) routing prefixes under the control of one or more network operators on behalf of a single administrative entity or domain that presents a common, clearly defined routing policy to the Internet.

BGP - Border Gateway Protocol - a standardized exterior gateway protocol designed to exchange routing and reachability information among autonomous systems (AS) on the Internet.

eBGP - exterior Border Gateway Protocol - the protocol used to transport information to other BGP-enabled systems in different autonomous systems (AS).

iBGP - interior Border Gateway Protocol - the protocol used between the routers in the same autonomous system (AS).

OSPF - Open Shortest Path First - a routing protocol for Internet Protocol (IP) networks. It uses a link-state routing (LSR) algorithm and falls into the group of interior gateway protocols (IGPs), operating within a single autonomous system (AS).

IS-IS - Intermediate System to Intermediate System (also written ISIS) - a routing protocol designed to move information efficiently within a computer network, a group of physically connected computers or similar devices. It accomplishes this by determining the best route for data through a packet-switched network.

MPLS - Multiprotocol Label Switching - a routing and forwarding protocol; it operates between layer 2 and layer 3, so it is often referred to as a layer 2.5 protocol. It assigns labels to packets and uses the labels to decide on subsequent routing and forwarding. It has been around since the 1990s, and one of its core use cases is to establish and manage layer-3 VPNs (L3VPN).

Acknowledgments

We would like to thank the WLCG, HEPiX, OSG and the many national research and education networks for their work on the topics we have discussed in this paper. In addition we want to explicitly acknowledge the support of various national funding agencies which supported this work, including:

The National Science Foundation in the United States
● OSG: NSF MPS-1148698
● IRIS-HEP: NSF OAC-1836650
● AmLight Express and Protect (AmLight-ExP): NSF OAC-1451018
● AtlanticWave-SDX: NSF OAC-1451024
The Department of Energy in the United States...

Appendix

StarLight SDX

The NSF IRNC-funded StarLight International/National Software Defined Networking Exchange (SDX) focuses on research, development, and deployment of services, architectures, and technologies designed to provide scientists, engineers, and educators with highly advanced, diverse, reliable, persistent, and secure networking services, enabling them to optimally access resources in North America, South America, Asia, South Asia, Australia, New Zealand, Europe, Africa, and other sites around the world. The StarLight SDX ensures continued innovation and development of advanced networking services and technologies. The goal of the StarLight SDX initiative is to expand upon existing and emerging experimental and prototype SDN/SDX/SDI networking architectures and technologies through the development of advanced international communication services for data-intensive science, in part by leveraging existing designs and developments as well as by collaborating with science-domain research communities. This NSF IRNC SDX initiative is creating and implementing innovative methods for transporting large-capacity data over long-distance WANs, including over many thousands of miles.
The StarLight SDX initiative specifically focuses on three main activities: (1) designing and implementing an SDN Development Environment – the StarAX (StarLight Advanced eXchange) Innovation Engine – that enables facilities around the world to deploy SDXs that are compatible and interoperable with others; (2) deploying instances of this system at the StarLight facility; and (3) coordinating and collaborating with others around the world to deploy such systems at their exchange points. This project is developing an SDN Innovation Platform (software closely integrated with select hardware) to provide multiple network functions and features quickly and reliably. The StarLight SDX initiative is creating methods for programmable networking using SDN and related virtualization techniques to enable higher levels of abstraction for network control and management functions at all layers and across all underlying technologies. A virtualization-based architecture leads to new capabilities for network services and infrastructure, allowing for a wide range of services, capabilities and options to closely integrate networks with other resources, such as compute resources, storage, instruments and sensors (e.g., through an SDI architecture). These new fabrics are composed of multiple components, implemented within an emerging flexible architecture. The architecture is based on the concept of distributed Science DMZs. Components include high-level APIs, protocols, data commons, compute systems, file systems, storage systems, and specialized network systems and services. Key enablers of these emerging services and capabilities are SDXs, which leverage the programmability of virtualized network resources made possible by SDN techniques. SDN has been widely implemented in large-scale data centers and on private WANs interconnecting multiple data centers. The increased deployment of SDNs has motivated the creation of SDXs, not only to extend SDN capabilities across multiple domains but also to provide exchanges with many more capabilities, especially for data-intensive science. Multiple demonstrations at the SC International Conferences for High Performance Computing, Networking, Storage and Analysis have showcased a wide range of capabilities of the StarLight SDX, including capabilities for dynamic provisioning, SD-WAN provisioning, dynamically implementing WAN L2 paths, and supporting large-scale file transfers and high-capacity data streams through Data Transfer Nodes (DTNs).
SDXs have significant potential for: highly granulated policy implementation; enabling slice exchanges that include more than network resources; migrating away from restrictive protocols such as BGP to provide a richer set of peering options, reduce routing complexity and allow more path flexibility; enabling highly distributed and granulated domains; enhancing capabilities for specialized access to resource descriptions, brokerages, and clearinghouses; and providing enhanced FlowVisor capabilities, including: (1) managing large-scale network capacity, such as end-to-end 100 Gbps flows over WANs, as high-volume individual data streams; (2) highly granulated programmability for network services and resources, including ultra-high-capacity flows; (3) direct edge access to and control over these capabilities, e.g., by individual processes and applications; (4) direct management of individual workflow data streams; (5) individual stream attribute management; (6) topology exchange services; and (7) controller federation, including support for East-West control channels across multiple domains (a major challenge today for SDN/OpenFlow networks). These services and capabilities are being demonstrated and showcased at major national and international conferences and workshops.

SDN technology not only specifies a separate control plane and data plane, it also creates non-traditional data-plane traffic. For example, a national OpenFlow backbone project coined the term OpenFlowvlan (OFvlan), which means an OpenFlow switch can add a VLAN tag to the traffic header and send it to the next hop's network equipment, and it might not carry features that traditionally tagged VLAN frames support. Allowing different types of SDN traffic to traverse data planes at different SDXs, and to traverse non-SDN networks without problems, is therefore a high-priority challenge.

This IRNC SDX initiative has established a project to make networks more application aware, for example through the use of Jupyter, a scientific workflow manager that is being integrated with extensions for managing network services and resources. Today, science workflow management and network orchestration are accomplished with completely separate software stacks. The StarLight SDX has been used to investigate how science workflow management and network orchestration can be integrated using a combined software stack vs. two separate software stacks. One project established a technique using Jupyter to integrate both within a common package that can eventually be integrated into an SDX service. This project is demonstrating the utility of using Jupyter for integrating scientific workflows with the network orchestration techniques required for data-intensive science. Optimal management of large-scale data-intensive science workflows is often dependent on the orchestration of infrastructure resources, including network paths and attributes. This project leverages the utility of using Jupyter for integrating science workflow management and programmable network orchestration, using SDN techniques. A major advantage of this approach is that it allows scientists to avoid learning the complexities of network provisioning and orchestration by providing a layer of abstraction above network orchestration processes.
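To illustrate the kind of abstraction being described, the cell below sketches how a notebook might request a network path through an SDX-style REST service before launching a transfer. The endpoint URL, JSON fields and return values are invented for the example; they are not the actual StarLight SDX interface.

```python
# Illustrative notebook cell only: the REST endpoint and payload are hypothetical,
# not the actual StarLight SDX (or any other SDX) API.
import requests

SDX_API = "https://sdx.example.org/api/v1/connections"   # hypothetical orchestrator endpoint

def request_path(src_port: str, dst_port: str, gbps: int, hours: int) -> str:
    """Ask the (hypothetical) SDX controller for a point-to-point circuit and
    return its identifier, hiding provisioning details from the scientist."""
    payload = {
        "endpoints": [src_port, dst_port],
        "bandwidth_gbps": gbps,
        "duration_hours": hours,
    }
    reply = requests.post(SDX_API, json=payload, timeout=30)
    reply.raise_for_status()
    return reply.json()["connection_id"]

# In a workflow notebook the scientist would then simply do something like:
# circuit = request_path("starlight:e1/1", "ampath:e3/2", gbps=100, hours=4)
# ...launch the data transfer over the provisioned path, then release the circuit.
```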
By using a well-known scientific workflow management tool, Jupyter, integrated with network orchestration processes, data-intensive science can more easily provision and configure network resources without a deep understanding of network management and control processes.

Another project is investigating the potential for integrating Jupyter high-level processes with an underlying set of network programming processes, including P4 (Programming Protocol-Independent Packet Processors). P4 is an emerging network programming language, a domain-specific language for network protocols. This SDX project is exploring how P4 can be used to support large-scale data-intensive science workflows on high-capacity, high-performance WANs and LANs, by introducing into the network a high degree of differentiation not previously possible. It has long been recognized that managing large-scale data-intensive science workflows on high-performance LANs and WANs requires special programmable networking techniques, because most networks are designed for millions of small flows. The StarLight SDX supports an international P4 testbed for computer science research in partnership with the GENI initiative.

This IRNC SDX initiative has established a collaborative partnership that is exploring specialized software stacks useful for creating a multi-institution, hyperconverged Science DMZ, using Kubernetes as a core orchestrating resource. This consortium has been demonstrating the utility of the services provided by a prototype model to support data-intensive science. The JupyterHub Kubernetes Spawner enables JupyterHub to spawn single-user notebook servers on a Kubernetes cluster. Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications. Kubernetes enables the deployment of a JupyterHub setup that can scale across multiple nodes, for example supporting over 50 simultaneous users, and it can scale up to more nodes, and back down, easily. JupyterHub can be run from inside Kubernetes. Consequently, many JupyterHub deployments can be run with Kubernetes only, eliminating the need for tools such as Ansible, Puppet and Bash scripts. It has utilities for integrated monitoring and failover for the hub process. It allows for spawning multiple hubs in the same Kubernetes cluster, with support for namespaces. Kubernetes also has comprehensive resource and security parameter controls.

Other components being developed to support these services are Data Transfer Nodes (DTNs). The PetaTrans (high-performance transport for petascale science) 100 Gbps Data Transfer Node (DTN) research project is directed at improving large-scale WAN services for high-performance, long-duration, large-capacity single data flows. iCAIR is designing, developing, and experimenting with multiple designs and configurations for 100 Gbps DTNs over 100 Gbps Wide Area Networks (WANs), especially trans-oceanic WANs, including demonstrations at major conferences and workshops.
These DTNs are being designed specifically to optimize capabilities for supporting E2E (e.g., edge servers with 100 Gbps NICs) large-scale, high-capacity, high-performance, reliable, high-quality, sustained individual data streams for science research. To provide multiple-domain capabilities, this IRNC SDX consortium is developing and demonstrating services, architecture, and technologies for SDXs that can directly control dynamic 100 Gbps paths. These SDX capabilities are also based on advanced switching capabilities, explored in part through experiments. Several of the planned SC19 IRNC SDX demonstrations include a fundamentally new concept of "consistent network operations", where stable, load-balanced, high-throughput workflows crossing optimally chosen network paths, up to preset high-water marks to accommodate other traffic, are provided by autonomous site-resident services, dynamically interacting with network-resident services, in response to requests from the science programs' principal data distribution and management systems. This is empowered by end-to-end SDN control at Layer 2 (switching) and Layer 1 (optical), extending to autoconfigured DTNs, combined with advanced transfer applications such as mdtmFTP and BigData Express.

The StarLight consortium and its research partners have also been working on a DTN-as-a-Service project so that these capabilities can be integrated with SDXs. The DTN-as-a-Service software stack provides users with an advanced science workflow environment for high-speed network data transfer using DTNs. The DTN-as-a-Service software stack consists of multiple modules implementing core functions; the modules are controlled by a science workflow controller which allows users to orchestrate high-speed network data transfers with predefined parameters. The system configuration module is responsible for managing storage devices for the data transfer. Users can define a transfer setup to configure RAIDs or individual drives, choose a file system, and choose the desired files or create test files for the transfer. The system optimization module provides autonomous system tuning for DTNs. It checks the system parameters relevant to a high-speed network data transfer and informs users of the current settings, and users can have their DTNs optimized autonomously for the transfer. Transfer tools are managed by the transfer tools module, which executes high-speed network data transfers based on the defined transfer setup. The module manages predefined transfer tools and executes them with the transfer setup, which includes the number of parallel flows and the storage devices to use. The network provisioning module orchestrates network paths between the DTNs using OpenNSA; this module also provides interfaces for other types of network provisioning. The monitoring module allows users to watch transfer activities in real time and evaluate each executed transfer. Users can generate graphs from each transfer and evaluate them using the monitoring module. It also allows users to observe the activity in Grafana dashboards. Below is an overview of the software stack.
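Separately from the stack overview figure, and purely for illustration, the snippet below sketches the kind of declarative "transfer setup" such a stack might consume and a skeleton controller dispatching it to the modules described above. The field names and functions are hypothetical stand-ins, not the actual DTN-as-a-Service interfaces.

```python
"""Hypothetical sketch of a DTN-as-a-Service style transfer setup and the order
in which a workflow controller might drive the stack modules. All field names
and functions are invented for illustration; this is not the real interface."""

transfer_setup = {
    "source_dtn": "dtn1.example.org",
    "destination_dtn": "dtn2.example.org",
    "storage": {"devices": ["/dev/nvme0n1", "/dev/nvme1n1"], "filesystem": "xfs"},
    "transfer_tool": "mdtmftp",                       # one of the predefined tools
    "parallel_flows": 8,
    "network": {"provisioner": "OpenNSA", "bandwidth_gbps": 100},
    "monitoring": {"dashboard": "dtn-transfers"},
}

# Stand-ins for the stack modules (system configuration, system optimization,
# network provisioning, transfer tools, monitoring).
def configure_storage(cfg):
    print("configure", cfg["devices"], "as", cfg["filesystem"])

def tune_host(host):
    print("apply host tuning (TCP buffers, NUMA, IRQ affinity) on", host)

def provision_path(net):
    print("request", net["bandwidth_gbps"], "Gb/s path via", net["provisioner"])
    return "path-1"

def execute_transfer(setup):
    print("run", setup["transfer_tool"], "with", setup["parallel_flows"], "flows")

def collect_metrics(mon, path):
    print("publish metrics for", path, "to dashboard", mon["dashboard"])

def run_transfer(setup):
    configure_storage(setup["storage"])
    tune_host(setup["source_dtn"])
    path = provision_path(setup["network"])
    execute_transfer(setup)
    collect_metrics(setup["monitoring"], path)

if __name__ == "__main__":
    run_transfer(transfer_setup)
```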
This DTN service provides: a) a network and system test point (software stack, architecture, connectivity and performance, including over WANs) for large data transfer projects; b) a platform for large-flow projects; c) additional capacity for large-flow projects; and d) access to experimental scalable DTN technology, including over WANs. In preparation for SC19, a project has been initiated to evaluate the standards, technologies, and switches that are emerging to support 400 Gbps Ethernet NICs. Plans are being developed to determine how to showcase: a) 400 Gbps LAN capabilities at SC19, including in the StarLight booth; b) 400 Gbps WAN capabilities; and c) the interface between the 400 Gbps WAN and the 400 Gbps LAN.

AtlanticWave-SDX

The AtlanticWave-SDX (AW-SDX) is being built as a distributed, multi-domain, wide-area SDX platform that controls many network switches across the U.S. and South America. Its primary user audience is network administrators and domain scientists. Because of its distributed nature, the AW-SDX architecture is split into multiple layers: (1) Users, Orchestrators and Applications; (2) the AW-SDX Controller; and (3) the Local Controllers (LC). The figure below provides a representation of the current AW-SDX architecture.

The SDX Controller provides northbound interfaces for external requests from orchestrators, users and applications. Orchestrators that are currently supported are SENSE [14] and NSI [15]. Users consist of domain scientists and network operators. They are supported through different web interfaces with meaningful input templates for each external request, as well as through a REST API. Applications, such as scientific workflow management systems, that consume end-to-end services composed by an SDX controller are also supported through a REST API. Authentication and authorization of external requests are the responsibility of the AW-SDX Controller. The AW-SDX Controller breaks down policies from the northbound interfaces into rules for the Local Controllers. A Local Controller-to-SDX API (LC/SDX API) enables the AW-SDX Controller to maintain a full topology of the network. Local Controllers (LCs) operate at each local exchange point site. Each site has an LC and local switches that collectively define the Local Data Plane. LCs translate rules into a switch's lower-level southbound interface. LCs are responsible for relaying status information from a resource to the SDX Controller. Southbound interfaces currently supported are OpenFlow 1.3, and OpenFlow 1.3 with Corsa extensions. The southbound layer is designed so that new interfaces, such as P4, and compute and storage, can be added easily. AtlanticWave-SDX will be deploying SDX Controllers at the following SDXs: SoX in Atlanta, AMPATH in Miami, SouthernLight in Sao Paulo, Brazil, and AndesLight in Santiago, Chile.
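As a toy illustration of the layering just described (not AW-SDX code), the sketch below shows how a controller might break a single northbound policy into per-site rules that Local Controllers would then translate into their southbound (e.g. OpenFlow) interfaces. The policy fields, path representation and rule format are assumptions made only for this example.

```python
"""Toy illustration of SDX layering: one northbound policy is broken down into
per-exchange-point rules for Local Controllers. Not the AW-SDX implementation."""

def breakdown(policy, path):
    """Turn an end-to-end L2 connection policy into one rule per site on the path."""
    rules = []
    for site, in_port, out_port in path:
        rules.append({
            "site": site,                       # handled by that site's Local Controller
            "match": {"vlan": policy["vlan"]},
            "actions": [{"in_port": in_port, "out_port": out_port}],
            "bandwidth_mbps": policy["bandwidth_mbps"],
        })
    return rules

if __name__ == "__main__":
    policy = {"type": "l2-connection", "vlan": 3001, "bandwidth_mbps": 10000,
              "endpoints": ["sox:port1", "andeslight:port7"]}
    # Hypothetical path through three exchange points: (site, ingress port, egress port)
    path = [("sox", 1, 5), ("ampath", 2, 9), ("andeslight", 3, 7)]
    for rule in breakdown(policy, path):
        print(rule)
```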
Pacific Wave Expansion Supporting SDX & Experimentation

References:
1. MEF Publishes Industry's First SD-WAN Standard, https://newswire.telecomramblings.com/2019/08/mef-publishes-industrys-first-sd-wan-standard/
2. SD-WAN Service Attributes and Services, July 2019. https://www.mef.net/mef-3-0-sd-wan
3. What is Software-Defined WAN (or SD-WAN or SDWAN)? https://www.sdxcentral.com/networking/sd-wan/definitions/software-defined-sdn-wan/
4. Gupta, Arpit, et al. "SDX: A software defined internet exchange." ACM SIGCOMM Computer Communication Review, Vol. 44, No. 4, ACM, 2014.
5. Feamster, N., Ricci, R., "Report of the NSF Workshop on Software Defined Infrastructures and Software Defined Exchanges." February 4-5, 2016.
6. Chung, Joaquín, et al. "AtlanticWave-SDX: An international SDX to support science data applications." Software Defined Networking (SDN) for Scientific Networking Workshop, SC'15, 2015.
7. J. Ibarra, J. Bezerra, H. Morgan, L. Fernandez Lopez, M. Stanton, I. Machado, E. Grizendi, D. Cox, "Benefits brought by the use of OpenFlow/SDN on the AmLight intercontinental research and education network," in: 2015 IFIP/IEEE International Symposium on Integrated Network Management (IM), 2015, pp. 942–947, http://dx.doi.org/10.1109/INM.2015.7140415.
8. J. Chung, E.-S. Jung, R. Kettimuthu, N.S. Rao, I.T. Foster, R. Clark, H. Owen, "Advance reservation access control using software-defined networking and tokens," Future Gener. Comput. Syst. 79 (Part 1) (2018) 225–234.
9. Software-Defined Networking: The New Norm for Networks, Technical Report, Open Networking Foundation (ONF), April 2012.
10. Karakus, Murat, and Arjan Durresi. "Quality of service (QoS) in software defined networking (SDN): A survey." Journal of Network and Computer Applications 80 (2017): 200-218.
11. StarLight SDX: A Software Defined Networking Exchange for Global Science Research and Education. https://www.nsf.gov/awardsearch/showAward?AWD_ID=1450871&HistoricalAwards=false
12. AtlanticWave-Software Defined Exchange: A Distributed Intercontinental Experimental Software Defined Exchange (SDX). https://www.nsf.gov/awardsearch/showAward?AWD_ID=1451024&HistoricalAwards=false
13. Pacific Wave Expansion Supporting SDX & Experimentation. https://www.nsf.gov/awardsearch/showAward?AWD_ID=1451050&HistoricalAwards=false
14. I. Monga, C. Guok, J. MacAuley, A. Sim, H. Newman, J. Balcas, P. Demar, L. Winkler, T. Lehman, X. Yang, "SDN for end-to-end networked science at the exascale (SENSE)," in: The 5th Innovating the Network for Data-Intensive Science (INDIS) Workshop, Dallas, TX, USA, 2018.
15. Open Grid Forum, "Network Service Interface." https://redmine.ogf.org/projects/nsi-wg.
16. SDN for End-to-end Networked Science at the Exascale (SENSE), http://sense.es.net