VL2: A Scalable and Flexible Data Center Network

Albert Greenberg, Srikanth Kandula, David A. Maltz, James R. Hamilton, Changhoon Kim, Parveen Patel, Navendu Jain, Parantap Lahiri, Sudipta Sengupta
Microsoft Research

Abstract

Agility promises improved risk management and cost savings. Without agility, each service must pre-allocate enough servers to meet difficult-to-predict demand spikes, or risk failure at the brink of success. With agility, the data center operator can meet the fluctuating demands of individual services from a large shared server pool, resulting in higher server utilization and lower costs.

Unfortunately, the designs for today's data center network prevent agility in several ways. First, existing architectures do not provide enough capacity between the servers they interconnect. Conventional architectures rely on tree-like network configurations built from high-cost hardware. Due to the cost of the equipment, the capacity between different branches of the tree is typically oversubscribed by factors of 1:5 or more, with paths through the highest levels of the tree oversubscribed by factors of 1:80 to 1:240. This limits communication between servers to the point that it fragments the server pool — congestion and computation hot-spots are prevalent even when spare capacity is available elsewhere. Second, while data centers host multiple services, the network does little to prevent a traffic flood in one service from affecting the other services around it — when one service experiences a traffic flood, it is common for all those sharing the same network sub-tree to suffer collateral damage. Third, the routing design in conventional networks achieves scale by assigning servers topologically significant IP addresses and dividing servers among VLANs. Such fragmentation of the address space limits the utility of virtual machines, which cannot migrate out of their original VLAN while keeping the same IP address. Further, the fragmentation of address space creates an enormous configuration burden when servers must be reassigned among services, and the human involvement typically required in these reconfigurations limits the speed of deployment.

To overcome these limitations in today's design and achieve agility, we arrange for the network to implement a familiar and concrete model: give each service the illusion that all the servers assigned to it, and only those servers, are connected by a single non-interfering Ethernet switch — a Virtual Layer 2 — and maintain this illusion even as the size of each service varies from one server to 100,000. Realizing this vision concretely translates into building a network that meets the following three objectives:

To be agile and cost effective, data centers should allow dynamic resource allocation across large server pools. In particular, the data center network should enable any server to be assigned to any service. To meet these goals, we present VL2, a practical network architecture that scales to support huge data centers with uniform high capacity between servers, performance isolation between services, and Ethernet layer-2 semantics. VL2 uses (1) flat addressing to allow service instances to be placed anywhere in the network, (2) Valiant Load Balancing to spread traffic uniformly across network paths, and (3) end-system based address resolution to scale to large server pools, without introducing complexity to the network control plane.
VL’s design is driven by detailed measurements of traffic and fault data from a large operational cloud service provider. VL’s implementation leverages proven network technologies, already available at low cost in high-speed hardware implementations, to build a scalable and reliable network architecture. As a result, VL networks can be deployed today, and we have built a working prototype. We evaluate the merits of the VL design using measurement, analysis, and experiments. Our VL prototype shuffles . TB of data among  servers in  seconds – sustaining a rate that is  of the maximum possible. Categories and Subject Descriptors: C.. [Computer-Communication Network]: Network Architecture and Design General Terms: Design, Performance, Reliability Keywords: Data center network, commoditization 1. INTRODUCTION Cloud services are driving the creation of data centers that hold tens to hundreds of thousands of servers and that concurrently support a large number of distinct services (e.g., search, email, mapreduce computations, and utility computing). The motivations for building such shared data centers are both economic and technical: to leverage the economies of scale available to bulk deployments and to benefit from the ability to dynamically reallocate servers among services as workload changes or equipment fails [, ]. The cost is also large – upwards of  million per month for a , server data center — with the servers themselves comprising the largest cost component. To be profitable, these data centers must achieve high utilization, and key to this is the property of agility — the capacity to assign any server to any service. • Uniform high capacity: The maximum rate of a server-to-server traffic flow should be limited only by the available capacity on the network-interface cards of the sending and receiving servers, and assigning servers to a service should be independent of network topology. • Performance isolation: Traffic of one service should not be affected by the traffic of any other service, just as if each service was connected by a separate physical switch. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGCOMM’09, August 17–21, 2009, Barcelona, Spain. Copyright 2009 ACM 978-1-60558-594-9/09/08 ...$10.00. • Layer- semantics: Just as if the servers were on a LAN—where any IP address can be connected to any port of an Ethernet switch due to flat addressing—data-center management software should be able to easily assign any server to any service and configure 51 ✂✄ ✂ ✂✄ ✂ ✁ ☎ ✁ ✄ ✁ ☎ ✁ ✄ ✆ ✝ ✆ ✝ ☞ ✡ ✂ ✂✄ ☛ ☛ ✄ ✁ ☎ ✌ ☛ ✄ ☎✎ ✞ ✍ ✝ ✞ ✝✟ ✞ ✝ ✞ ✝ ✌ ☛ ✄ ☎✑ ✞ ✍ ✠ ✞ ✠ ✦ ✧ ★ ✙ ✬ ✪ ✪ ✮✘ ✤ ✜ ✘ ✤ ✭ ✜ ✫ ✩ •✩ ✒ ✙ ✬ ✒ ✪ ✪ ✮ ✯ ✯ ✘ ✰ ✰ ✤ ✭ ✘ ✜ ✫ • ✠ ✠ ✠ ✠✟ ✲ ✒ ✓✙ ✢ ✒ ✓ ✔✮✯ ✖ ✜ ✱ • ✫✖ ✲ ✓ ✙ ✢ ✓ ✔ ✮ ✯ • ✫✱ ✶✵✪ ✷ ✲ ✓ ✔✮✯ ✪ ✤ ✤ ✴ ✚ ✯ ✱ ✫ ✵✤ ✸ ✺ ✸ ✺✸ ✺ ✸ ✺ •✳ ✳ ✹ ✹ ✹ ✹ ✻✼✽✾✼✽✿✏ ✻✼✽✾✼✽✿ ✻✼✽✾✼✽✿✏ ✻✼✽✾✼✽✿ ✗✘ ✒ ✓ ✔✕ ✙ ✔✕ ✣ ✖ ✚ ✛ ✘ ✜✢ ✤ ✥ ✚ that server with whatever IP address the service expects. Virtual machines should be able to migrate to any server while keeping the same IP address, and the network configuration of each server should be identical to what it would be if connected via a LAN. 
Finally, features like link-local broadcast, on which many legacy applications depend, should work. In this paper we design, implement and evaluate VL, a network architecture for data centers that meets these three objectives and thereby provides agility. In creating VL, a goal was to investigate whether we could create a network architecture that could be deployed today, so we limit ourselves from making any changes to the hardware of the switches or servers, and we require that legacy applications work unmodified. However, the software and operating systems on data-center servers are already extensively modified (e.g., to create hypervisors for virtualization or blob file-systems to store data). Therefore, VL’s design explores a new split in the responsibilities between host and network — using a layer . shim in servers’ network stack to work around limitations of the network devices. No new switch software or APIs are needed. VL consists of a network built from low-cost switch ASICs arranged into a Clos topology [] that provides extensive path diversity between servers. Our measurements show data centers have tremendous volatility in their workload, their traffic, and their failure patterns. To cope with this volatility, we adopt Valiant Load Balancing (VLB) [, ] to spread traffic across all available paths without any centralized coordination or traffic engineering. Using VLB, each server independently picks a path at random through the network for each of the flows it sends to other servers in the data center. Common concerns with VLB, such as the extra latency and the consumption of extra network capacity caused by path stretch, are overcome by a combination of our environment (propagation delay is very small inside a data center) and our topology (which includes an extra layer of switches that packets bounce off of). Our experiments verify that our choice of using VLB achieves both the uniform capacity and performance isolation objectives. The switches that make up the network operate as layer- routers with routing tables calculated by OSPF, thereby enabling the use of multiple paths (unlike Spanning Tree Protocol) while using a well-trusted protocol. However, the IP addresses used by services running in the data center cannot be tied to particular switches in the network, or the agility to reassign servers between services would be lost. Leveraging a trick used in many systems [], VL assigns servers IP addresses that act as names alone, with no topological significance. When a server sends a packet, the shim-layer on the server invokes a directory system to learn the actual location of the destination and then tunnels the original packet there. The shim-layer also helps eliminate the scalability problems created by ARP in layer- networks, and the tunneling improves our ability to implement VLB. These aspects of the design enable VL to provide layer- semantics, while eliminating the fragmentation and waste of server pool capacity that the binding between addresses and locations causes in the existing architecture. Taken together, VL’s choices of topology, routing design, and software architecture create a huge shared pool of network capacity that each pair of servers can draw from when communicating. 
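To make the division of labor concrete, the following sketch (Python, with hypothetical class and method names; the real agent is a layer-2.5 shim inside the server's networking stack, not user-level code) illustrates the behavior described above: resolve an AA to the destination ToR's LA through the directory system, cache the result, and tunnel the original packet there.

class VL2AgentSketch:
    """Illustrative model of the VL2 agent's forwarding path (hypothetical names)."""
    def __init__(self, directory, intermediate_anycast_la):
        self.directory = directory                  # resolves AA -> LA of the destination's ToR
        self.anycast_la = intermediate_anycast_la   # LA shared by the Intermediate switches
        self.cache = {}                             # AA -> ToR LA, analogous to an ARP cache

    def resolve(self, dst_aa):
        # Replaces the broadcast ARP a normal stack would send with a unicast
        # query to the directory system; the answer is cached for later packets.
        if dst_aa not in self.cache:
            self.cache[dst_aa] = self.directory.resolve(dst_aa)
        return self.cache[dst_aa]

    def encapsulate(self, packet, dst_aa):
        # Tunnel the unmodified AA-addressed packet (IP-in-IP): the outer header
        # targets an Intermediate switch (the VLB "bounce"), the next header the
        # destination ToR's LA; both are stripped before delivery to the server.
        return {"outer_dst": self.anycast_la,
                "tor_dst": self.resolve(dst_aa),
                "inner": packet}

Because only the host-side agent changes, switches continue to forward on LAs and never see the churn in AA-to-location bindings.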
We implement VLB by causing the traffic between any pair of servers to bounce off a randomly chosen switch in the top level of the Clos topology and leverage the features of layer- routers, such as EqualCost MultiPath (ECMP), to spread the traffic along multiple subpaths for these two path segments. Further, we use anycast addresses and an implementation of Paxos [] in a way that simplifies the design of the Directory System and, when failures occur, provides consistency properties that are on par with existing protocols. Figure : A conventional network architecture for data centers (adapted from figure by Cisco []). The feasibility of our design rests on several questions that we experimentally evaluate. First, the theory behind Valiant Load Balancing, which proves that the network will be hot-spot free, requires that (a) randomization is performed at the granularity of small packets, and (b) the traffic sent into the network conforms to the hose model []. For practical reasons, however, VL picks a different path for each flow rather than packet (falling short of (a)), and it also relies on TCP to police the offered traffic to the hose model (falling short of (b), as TCP needs multiple RTTs to conform traffic to the hose model). Nonetheless, our experiments show that for data-center traffic, the VL design choices are sufficient to offer the desired hot-spot free properties in real deployments. Second, the directory system that provides the routing information needed to reach servers in the data center must be able to handle heavy workloads at very low latency. We show that designing and implementing such a directory system is achievable. In the remainder of this paper we will make the following contributions, in roughly this order. • We make a first of its kind study of the traffic patterns in a production data center, and find that there is tremendous volatility in the traffic, cycling among - different patterns during a day and spending less than  s in each pattern at the th percentile. • We design, build, and deploy every component of VL in an server cluster. Using the cluster, we experimentally validate that VL has the properties set out as objectives, such as uniform capacity and performance isolation. We also demonstrate the speed of the network, such as its ability to shuffle . TB of data among  servers in  s. • We apply Valiant Load Balancing in a new context, the interswitch fabric of a data center, and show that flow-level traffic splitting achieves almost identical split ratios (within  of optimal fairness index) on realistic data center traffic, and it smoothes utilization while eliminating persistent congestion. • We justify the design trade-offs made in VL by comparing the cost of a VL network with that of an equivalent network based on existing designs. 2. BACKGROUND In this section, we first explain the dominant design pattern for data-center architecture today []. We then discuss why this architecture is insufficient to serve large cloud-service data centers. As shown in Figure , the network is a hierarchy reaching from a layer of servers in racks at the bottom to a layer of core routers at the top. There are typically  to  servers per rack, each singly connected to a Top of Rack (ToR) switch with a  Gbps link. ToRs connect to two aggregation switches for redundancy, and these switches aggregate further connecting to access routers. 
At the top of the hierarchy, core routers carry traffic between access routers and manage traffic into and out of the data center. All links use Ethernet as a physical-layer protocol, with a mix of copper and fiber cabling. All switches below each pair of access routers form a single layer-2 domain, typically connecting several thousand servers. To limit overheads (e.g., packet flooding and ARP broadcasts) and to isolate different services or logical server groups (e.g., email, search, web front ends, web back ends), servers are partitioned into virtual LANs (VLANs). Unfortunately, this conventional design suffers from three fundamental limitations:

Limited server-to-server capacity: As we go up the hierarchy, we are confronted with steep technical and financial barriers in sustaining high bandwidth. Thus, as traffic moves up through the layers of switches and routers, the over-subscription ratio increases rapidly. For example, servers typically have 1:1 over-subscription to other servers in the same rack — that is, they can communicate at the full rate of their interfaces (e.g., 1 Gbps). We found that up-links from ToRs are typically 1:5 to 1:20 oversubscribed (i.e., the up-link capacity is a small fraction of the aggregate capacity of the servers below), and paths through the highest layer of the tree can be 1:240 oversubscribed. This large over-subscription factor fragments the server pool by preventing idle servers from being assigned to overloaded services, and it severely limits the entire data-center's performance.

Fragmentation of resources: As the cost and performance of communication depends on distance in the hierarchy, the conventional design encourages service planners to cluster servers nearby in the hierarchy. Moreover, spreading a service outside a single layer-2 domain frequently requires reconfiguring IP addresses and VLAN trunks, since the IP addresses used by servers are topologically determined by the access routers above them. The result is a high turnaround time for such reconfiguration. Today's designs avoid this reconfiguration lag by wasting resources; the plentiful spare capacity throughout the data center is often effectively reserved by individual services (and not shared), so that each service can scale out to nearby servers to respond rapidly to demand spikes or to failures. Despite this, we have observed instances when the growing resource needs of one service have forced data center operations to evict other services from nearby servers, incurring significant cost and disruption.

Poor reliability and utilization: Above the ToR, the basic resilience model is 1:1, i.e., the network is provisioned such that if an aggregation switch or access router fails, there must be sufficient remaining idle capacity on a counterpart device to carry the load. This forces each device and link to be run up to at most 50% of its maximum utilization. Further, multiple paths either do not exist or aren't effectively utilized. Within a layer-2 domain, the Spanning Tree Protocol causes only a single path to be used even when multiple paths between switches exist. In the layer-3 portion, Equal Cost Multipath (ECMP), when turned on, can use multiple paths to a destination if paths of the same cost are available. However, the conventional topology offers at most two paths.
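To see how these ratios compound, a short back-of-the-envelope calculation (the uplink and aggregation values below are illustrative assumptions, not measurements from any particular network):

servers_per_rack = 40          # typical rack size cited above
server_nic_gbps = 1.0
tor_uplink_gbps = 4.0          # illustrative ToR up-link capacity

# Oversubscription at a layer = capacity demanded from below / capacity offered above.
tor_oversub = servers_per_rack * server_nic_gbps / tor_uplink_gbps
print(f"ToR layer: 1:{tor_oversub:.0f}")            # 1:10, inside the 1:5-1:20 range above

# Ratios multiply across layers, which is how top-of-tree paths reach factors like 1:240.
core_oversub = 24.0            # hypothetical aggregation/core oversubscription
print(f"end to end: 1:{tor_oversub * core_oversub:.0f}")   # 1:240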
[Figure: PDF and CDF of flow size, and of total bytes, as functions of flow size (bytes).] Figure: Mice are numerous; 99% of flows are smaller than 100 MB. However, more than 90% of bytes are in flows between 100 MB and 1 GB.

Our measurement studies found two key results with implications for the network design. First, the traffic patterns inside a data center are highly divergent (as even a large set of representative traffic matrices only loosely covers the actual traffic matrices seen), and they change rapidly and unpredictably. Second, the hierarchical topology is intrinsically unreliable—even with huge effort and expense to increase the reliability of the network devices close to the top of the hierarchy, we still see failures on those devices resulting in significant downtimes.

3.1 Data-Center Traffic Analysis

Analysis of Netflow and SNMP data from our data centers reveals several macroscopic trends. First, the ratio of traffic volume between servers in our data centers to traffic entering/leaving our data centers is currently around 4:1 (excluding CDN applications). Second, data-center computation is focused where high speed access to data on memory or disk is fast and cheap. Although data is distributed across multiple data centers, intense computation and communication on data does not straddle data centers due to the cost of long-haul links. Third, the demand for bandwidth between servers inside a data center is growing faster than the demand for bandwidth to external hosts. Fourth, the network is a bottleneck to computation. We frequently see ToR switches whose uplinks are above 80% utilization.

To uncover the exact nature of traffic inside a data center, we instrumented a highly utilized 1,500 node cluster in a data center that supports data mining on petabytes of data. The servers are distributed roughly evenly across 75 ToR switches, which are connected hierarchically as shown in Figure . We collected socket-level event logs from all machines over two months.

3.2 Flow Distribution Analysis

3. MEASUREMENTS & IMPLICATIONS

Distribution of flow sizes: Figure  illustrates the nature of flows within the monitored data center. The flow size statistics (marked as '+'s) show that the majority of flows are small (a few KB); most of these small flows are hellos and meta-data requests to the distributed file system. To examine longer flows, we compute a statistic termed total bytes (marked as 'o's) by weighting each flow size by its number of bytes. Total bytes tells us, for a random byte, the distribution of the flow size it belongs to. Almost all the bytes in the data center are transported in flows whose lengths vary from about 100 MB to about 1 GB. The mode at around 100 MB springs from the fact that the distributed file system breaks long files into 100-MB size chunks. Importantly, flows over a few GB are rare.

To design VL2, we first needed to understand the data center environment in which it would operate. Interviews with architects, developers, and operators led to the objectives described in Section , but developing the mechanisms on which to build the network requires a quantitative understanding of the traffic matrix (who sends how much data to whom and when?) and churn (how often does the state of the network change due to changes in demand or switch/link failures and recoveries, etc.?).
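The "total bytes" statistic above is simply the flow-size distribution re-weighted by bytes; a minimal sketch of how it can be computed from a flow log (the input here is synthetic, for illustration only):

import numpy as np

def flow_and_byte_weighted_cdf(flow_sizes_bytes, bin_edges):
    """Return (per-flow CDF, total-bytes CDF) of flow size over the given bins."""
    sizes = np.asarray(flow_sizes_bytes, dtype=float)
    flow_hist, _ = np.histogram(sizes, bins=bin_edges)                 # each flow counts once
    byte_hist, _ = np.histogram(sizes, bins=bin_edges, weights=sizes)  # each flow weighted by its bytes
    return (np.cumsum(flow_hist) / flow_hist.sum(),
            np.cumsum(byte_hist) / byte_hist.sum())

# Synthetic example: many small "mice" plus a modest number of ~100 MB chunk transfers.
rng = np.random.default_rng(0)
mice = rng.integers(2_000, 20_000, size=9_000)                 # a few KB each
chunks = rng.integers(90_000_000, 110_000_000, size=1_000)     # ~100 MB chunks
edges = np.logspace(0, 12, 49)                                 # 1 B .. 1 TB, log-spaced bins
flow_cdf, byte_cdf = flow_and_byte_weighted_cdf(np.concatenate([mice, chunks]), edges)
# flow_cdf shows that almost all flows are mice; byte_cdf shows that almost all
# bytes nevertheless travel in the ~100 MB flows, as in the figure above.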
We analyze these aspects by studying production data centers of a large cloud service provider and use the results to justify our design choices as well as the workloads used to stress the VL2 testbed.

[Figure: PDF and CDF of the number of concurrent flows in/out of each machine.] Figure: Number of concurrent connections has two modes: (1) 10 flows per node more than 50% of the time and (2) 80 flows per node for at least 5% of the time.

[Figure panels: (a) index of the containing cluster vs. time in 100 s intervals; (b) distribution of run lengths; (c) log(time to repeat).] Figure: Lack of short-term predictability: The cluster to which a traffic matrix belongs, i.e., the type of traffic mix in the TM, changes quickly and randomly.

Similar to Internet flow characteristics [], we find that there are myriad small flows (mice). On the other hand, as compared with Internet flows, the distribution is simpler and more uniform. The reason is that in data centers, internal flows arise in an engineered environment driven by careful design decisions (e.g., the 100-MB chunk size is driven by the need to amortize disk-seek times over read times) and by strong incentives to use storage and analytic tools with well understood resilience and performance.

Number of Concurrent Flows: Figure  shows the probability density function (as a fraction of time) for the number of concurrent flows going in and out of a machine, computed over all 1,500 monitored machines for a representative day's worth of flow data. There are two modes. More than 50% of the time, an average machine has about ten concurrent flows, but at least 5% of the time it has greater than 80 concurrent flows. We almost never see more than 100 concurrent flows.

The distributions of flow size and number of concurrent flows both imply that VLB will perform well on this traffic. Since even big flows are only 100 MB (1 s of transmit time at 1 Gbps), randomizing at flow granularity (rather than packet) will not cause perpetual congestion if there is unlucky placement of a few flows. Moreover, adaptive routing schemes may be difficult to implement in the data center, since any reactive traffic engineering will need to run at least once a second if it wants to react to individual flows.

Surprisingly, the number of representative traffic matrices in our data center is quite large. On a timeseries of 864 TMs, representing a day's worth of traffic in the datacenter, even when approximating with 50–60 clusters, the fitting error remains high and only decreases moderately beyond that point. This indicates that the variability in datacenter traffic is not amenable to concise summarization and hence engineering routes for just a few traffic matrices is unlikely to work well for the traffic encountered in practice.

Instability of traffic patterns: Next we ask how predictable is the traffic in the next interval given the current traffic? Traffic predictability enhances the ability of an operator to engineer routing as traffic demand changes. To analyze the predictability of traffic in the network, we find the best-fitting TM clusters using the technique above and classify the traffic matrix for each 100 s interval to the best fitting cluster. Figure (a) shows that the traffic pattern changes nearly constantly, with no periodicity that could help predict the future.
Figure (b) shows the distribution of run lengths - how many intervals does the network traffic pattern spend in one cluster before shifting to the next. The run length is  to the th percentile. Figure (c) shows the time between intervals where the traffic maps to the same cluster. But for the mode at s caused by transitions within a run, there is no structure to when a traffic pattern will next appear. The lack of predictability stems from the use of randomness to improve the performance of data-center applications. For example, the distributed file system spreads data chunks randomly across servers for load distribution and redundancy. The volatility implies that it is unlikely that other routing strategies will outperform VLB. 3.3 Traffic Matrix Analysis Poor summarizability of traffic patterns: Next, we ask the question: Is there regularity in the traffic that might be exploited through careful measurement and traffic engineering? If traffic in the DC were to follow a few simple patterns, then the network could be easily optimized to be capacity-efficient for most traffic. To answer, we examine how the Traffic Matrix(TM) of the , server cluster changes over time. For computational tractability, we compute the ToR-to-ToR TM — the entry T M (t)i,j is the number of bytes sent from servers in ToR i to servers in ToR j during the  s beginning at time t. We compute one TM for every  s interval, and servers outside the cluster are treated as belonging to a single “ToR”. Given the timeseries of TMs, we find clusters of similar TMs using a technique due to Zhang et al. []. In short, the technique recursively collapses the traffic matrices that are most similar to each other into a cluster, where the distance (i.e., similarity) reflects how much traffic needs to be shuffled to make one TM look like the other. We then choose a representative TM for each cluster, such that any routing that can deal with the representative TM performs no worse on every TM in the cluster. Using a single representative TM per cluster yields a fitting error (quantified by the distances between each representative TMs and the actual TMs they represent), which will decrease as the number of clusters increases. Finally, if there is a knee point (i.e., a small number of clusters that reduces the fitting error considerably), the resulting set of clusters and their representative TMs at that knee corresponds to a succinct number of distinct traffic matrices that summarize all TMs in the set. 3.4 Failure Characteristics To design VL to tolerate the failures and churn found in data centers, we collected failure logs for over a year from eight production data centers that comprise hundreds of thousands of servers, host over a hundred cloud services and serve millions of users. We analyzed hardware and software failures of switches, routers, load balancers, firewalls, links and servers using SNMP polling/traps, syslogs, server alarms, and transaction monitoring frameworks. In all, we looked at M error events from over K alarm tickets. What is the pattern of networking equipment failures? We define a failure as the event that occurs when a system or component is unable to perform its required function for more than  s. As expected, most failures are small in size (e.g.,  of network device failures involve <  devices and  of network device failures involve <  devices) while large correlated failures are rare (e.g., the largest correlated failure involved  switches). 
However, downtimes can be significant: most failures are resolved within minutes and nearly all within a day, but a small fraction last more than ten days.

What is the impact of networking equipment failure? As discussed in Section , conventional data center networks apply 1:1 redundancy to improve reliability at higher layers of the hierarchical tree.

Separating names from locators: The data center network must support agility, which means, in particular, support for hosting any service on any server, for rapid growing and shrinking of server pools, and for rapid virtual machine migration. In turn, this calls for separating names from locations. VL2's addressing scheme separates server names, termed application-specific addresses (AAs), from their locations, termed location-specific addresses (LAs). VL2 uses a scalable, reliable directory system to maintain the mappings between names and locators. A shim layer running in the network stack on every server, called the VL2 agent, invokes the directory system's resolution service. We evaluate the performance of the directory system in §..

Embracing End Systems: The rich and homogeneous programmability available at data-center hosts provides a mechanism to rapidly realize new functionality. For example, the VL2 agent enables fine-grained path control by adjusting the randomization used in VLB. The agent also replaces Ethernet's ARP functionality with queries to the VL2 directory system. The directory system itself is also realized on servers, rather than switches, and thus offers flexibility, such as fine-grained, context-aware server access control and dynamic service re-provisioning.

We next describe each aspect of the VL2 system and how they work together to implement a virtual layer-2 network. These aspects include the network topology, the addressing and routing design, and the directory that manages name-locator mappings.

[Figure: example topology — the Internet connects to a link-state network carrying only LAs (e.g., 10/8), built from DA/2 Intermediate switches, DI Aggregation switches, and DADI/4 ToR switches, below which sit 20(DADI/4) servers forming a fungible pool owning AAs (e.g., 20/8).] Figure: An example Clos network between Aggregation and Intermediate switches provides a richly-connected backbone well-suited for VLB. The network is built with two separate address families — topologically significant Locator Addresses (LAs) and flat Application Addresses (AAs).

Despite these redundancy techniques, we find that in 0.3% of failures all redundant components in a network device group became unavailable (e.g., the pair of switches that comprise each node in the conventional network (Figure ) or both the uplinks from a switch). In one incident, the failure of a core switch (due to a faulty supervisor card) affected ten million users for about four hours. We found the main causes of these downtimes are network misconfigurations, firmware bugs, and faulty components (e.g., ports). With no obvious way to eliminate all failures from the top of the hierarchy, VL2's approach is to broaden the topmost levels of the network so that the impact of failures is muted and performance degrades gracefully, moving from 1:1 redundancy to n:m redundancy.

4.1 Scale-out Topologies

As described in §., conventional hierarchical data-center topologies have poor bisection bandwidth and are also susceptible to major disruptions due to device failures at the highest levels.
Rather than scale up individual network devices with more capacity and features, we scale out the devices — build a broad network offering huge aggregate capacity using a large number of simple, inexpensive devices, as shown in Figure . This is an example of a folded Clos network [] where the links between the Intermediate switches and the Aggregation switches form a complete bipartite graph. As in the conventional topology, ToRs connect to two Aggregation switches, but the large number of paths between every two Aggregation switches means that if there are n Intermediate switches, the failure of any one of them reduces the bisection bandwidth by only 1/n–a desirable graceful degradation of bandwidth that we evaluate in §.. Further, it is easy and less expensive to build a Clos network for which there is no over-subscription (further discussion on cost is in §). For example, in Figure , we use DA -port Aggregation and DI -port Intermediate switches, and connect these switches such that the capacity between each layer is DI DA /2 times the link capacity. The Clos topology is exceptionally well suited for VLB in that by indirectly forwarding traffic through an Intermediate switch at the top tier or “spine” of the network, the network can provide bandwidth guarantees for any traffic matrices subject to the hose model. Meanwhile, routing is extremely simple and resilient on this topology — take a random path up to a random intermediate switch and a random path down to a destination ToR switch. VL leverages the fact that at every generation of technology, switch-to-switch links are typically faster than server-to-switch links, and trends suggest that this gap will remain. Our current design uses G server links and G switch links, and the next design point will probably be G server links with G switch links. By leveraging this gap, we reduce the number of cables required to implement the Clos (as compared with a fat-tree []), and we simplify the task of spreading load over the links (§.). 4. VIRTUAL LAYER TWO NETWORKING Before detailing our solution, we briefly discuss our design principles and preview how they will be used in the VL design. Randomizing to Cope with Volatility: VL copes with the high divergence and unpredictability of data-center traffic matrices by using Valiant Load Balancing to do destinationindependent (e.g., random) traffic spreading across multiple intermediate nodes. We introduce our network topology suited for VLB in §., and the corresponding flow spreading mechanism in §.. VLB, in theory, ensures a non-interfering packet switched network [], the counterpart of a non-blocking circuit switched network, as long as (a) traffic spreading ratios are uniform, and (b) the offered traffic patterns do not violate edge constraints (i.e., line card speeds). To meet the latter condition, we rely on TCP’s end-to-end congestion control mechanism. While our mechanisms to realize VLB do not perfectly meet either of these conditions, we show in §. that our scheme’s performance is close to the optimum. Building on proven networking technology: VL is based on IP routing and forwarding technologies that are already available in commodity switches: link-state routing, equal-cost multi-path (ECMP) forwarding, IP anycasting, and IP multicasting. VL uses a link-state routing protocol to maintain the switch-level topology, but not to disseminate end hosts’ information. 
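Using the port-count notation of the figure above (DA ports per Aggregation switch, DI per Intermediate switch, 20 servers per ToR), the scale of such a Clos follows from simple counting. A small sketch; the 144-port design point is hypothetical and chosen only for illustration:

def clos_sizing(d_a, d_i, link_gbps=10, servers_per_tor=20, server_gbps=1):
    """Counting argument for the folded Clos above: d_a ports per Aggregation
    switch, d_i ports per Intermediate switch; each ToR uses two 10G uplinks."""
    n_int = d_a // 2                          # Intermediate switches
    n_agg = d_i                               # Aggregation switches
    n_tor = d_a * d_i // 4                    # ToR switches
    n_srv = servers_per_tor * n_tor
    fabric_gbps = (d_a * d_i // 2) * link_gbps   # Aggregation<->Intermediate capacity
    return {"int": n_int, "agg": n_agg, "tor": n_tor, "servers": n_srv,
            "fabric_gbps": fabric_gbps, "server_gbps": n_srv * server_gbps,
            "bisection_lost_per_int_failure": 1.0 / n_int}

# Hypothetical design point: 144-port switches in both layers.
print(clos_sizing(144, 144))
# -> 72 Intermediate, 144 Aggregation, 5184 ToR switches and 103,680 servers;
#    fabric capacity equals total server capacity (no oversubscription), and
#    losing one Intermediate switch removes only 1/72 of the bisection bandwidth.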
This strategy protects switches from needing to learn voluminous, frequently-changing host information. Furthermore, the routing design uses ECMP forwarding along with anycast addresses to enable VLB with minimal control plane messaging or churn.

Because the directory service mediates all address resolution, it can enforce access-control policies. Further, since the directory system knows which server is making the request when handling a lookup, it can enforce fine-grained isolation policies. For example, it could enforce the policy that only servers belonging to the same service can communicate with each other. An advantage of VL2 is that, when inter-service communication is allowed, packets flow directly from a source to a destination, without being detoured to an IP gateway as is required to connect two VLANs in the conventional architecture.

These addressing and forwarding mechanisms were chosen for two reasons. First, they make it possible to use low-cost switches, which often have small routing tables (typically just 16K entries) that can hold only LA routes, without concern for the huge number of AAs. Second, they reduce overhead in the network control plane by preventing it from seeing the churn in host state, tasking it to the more scalable directory system instead.

[Figure: a packet from sender S (AA 20.0.0.55) to destination D (AA 20.0.0.56) is encapsulated with the destination ToR's LA (10.0.0.6) and the Intermediate switches' anycast LA (10.1.1.1); the link-state network carries only LAs (10/8).] Figure: VLB in an example VL2 network. Sender S sends packets to destination D via a randomly-chosen intermediate switch using IP-in-IP encapsulation. AAs are from 20/8, and LAs are from 10/8. H(ft) denotes a hash of the five tuple.

4.2 VL2 Addressing and Routing

4.2.2 Random traffic spreading over multiple paths

This section explains how packets flow through a VL2 network, and how the topology, routing design, VL2 agent, and directory system combine to virtualize the underlying network fabric — creating the illusion that hosts are connected to a big, non-interfering datacenter-wide layer-2 switch.

To offer hot-spot-free performance for arbitrary traffic matrices, VL2 uses two related mechanisms: VLB and ECMP. The goals of both are similar — VLB distributes traffic across a set of intermediate nodes and ECMP distributes across equal-cost paths — but each is needed to overcome limitations in the other. VL2 uses flows, rather than packets, as the basic unit of traffic spreading and thus avoids out-of-order delivery.

Figure  illustrates how the VL2 agent uses encapsulation to implement VLB by sending traffic through a randomly-chosen Intermediate switch. The packet is first delivered to one of the Intermediate switches, decapsulated by the switch, delivered to the ToR's LA, decapsulated again, and finally sent to the destination. While encapsulating packets to a specific, but randomly chosen, Intermediate switch correctly realizes VLB, it would require updating a potentially huge number of VL2 agents whenever an Intermediate switch's availability changes due to switch/link failures. Instead, we assign the same LA address to all Intermediate switches, and the directory system returns this anycast address to agents upon lookup.
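A sketch of the per-flow spreading decision this paragraph and the next describe — pick an Intermediate-switch anycast LA per flow and carry five-tuple entropy where ECMP can see it. The names and hash function are illustrative, not the production implementation:

import hashlib
import ipaddress

def outer_headers(five_tuple, anycast_las):
    """Per-flow path choice, sketched.
    five_tuple: (src_aa, dst_aa, proto, sport, dport);
    anycast_las: one anycast LA per group of Intermediate switches."""
    digest = hashlib.sha1(repr(five_tuple).encode()).digest()
    flow_hash = int.from_bytes(digest[:4], "big")
    # All packets of a flow hash identically, so the flow stays on one path
    # (no reordering); different flows spread across the anycast groups.
    outer_dst = anycast_las[flow_hash % len(anycast_las)]
    # Expose per-flow entropy in the outer source address, since some switches
    # cannot read the inner five-tuple through the encapsulation headers.
    outer_src = str(ipaddress.IPv4Address(flow_hash))
    return outer_src, outer_dst

print(outer_headers(("20.0.0.55", "20.0.0.56", 6, 4321, 80), ["10.1.1.1", "10.1.1.2"]))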
Since all Intermediate switches are exactly three hops away from a source host, ECMP takes care of delivering packets encapsulated with the anycast address to any one of the active Intermediate switches. Upon switch or link failures, ECMP will react, eliminating the need to notify agents and ensuring scalability. In practice, however, the use of ECMP leads to two problems. First, switches today only support up to -way ECMP, with way ECMP being released by some vendors this year. If there are more paths available than ECMP can use, then VL defines several anycast addresses, each associated with only as many Intermediate switches as ECMP can accommodate. When an Intermediate switch fails, VL reassigns the anycast addresses from that switch to other Intermediate switches so that all anycast addresses remain live, and servers can remain unaware of the network churn. Second, some inexpensive switches cannot correctly retrieve the five-tuple values (e.g., the TCP ports) when a packet is encapsulated with multiple IP headers. Thus, the agent at the source computes a hash of the fivetuple values and writes that value into the source IP address field, which all switches do use in making ECMP forwarding decisions. The greatest concern with both ECMP and VLB is that if “elephant flows” are present, then the random placement of flows could lead to persistent congestion on some links while others are underutilized. Our evaluation did not find this to be a problem on datacenter workloads (§.). Should it occur, initial results show the VL agent can detect and deal with such situations with simple mechanisms, such as re-hashing to change the path of large flows when TCP detects a severe congestion event (e.g., a full window loss). 4.2.1 Address resolution and packet forwarding VL uses two different IP-address families, as illustrated in Figure . The network infrastructure operates using location-specific IP addresses (LAs); all switches and interfaces are assigned LAs, and switches run an IP-based (layer-) link-state routing protocol that disseminates only these LAs. This allows switches to obtain the complete switch-level topology, as well as forward packets encapsulated with LAs along shortest paths. On the other hand, applications use application-specific IP addresses (AAs), which remain unaltered no matter how servers’ locations change due to virtual-machine migration or re-provisioning. Each AA (server) is associated with an LA, the identifier of the ToR switch to which the server is connected. The VL directory system stores the mapping of AAs to LAs, and this mapping is created when application servers are provisioned to a service and assigned AA addresses. The crux of offering layer- semantics is having servers believe they share a single large IP subnet (i.e., the entire AA space) with other servers in the same service, while eliminating the ARP and DHCP scaling bottlenecks that plague large Ethernets. Packet forwarding: To route traffic between servers, which use AA addresses, on an underlying network that knows routes for LA addresses, the VL agent at each server traps packets from the host and encapsulates the packet with the LA address of the ToR of the destination as shown in Figure . Once the packet arrives at the LA (the destination ToR), the switch decapsulates the packet and delivers it to the destination AA carried in the inner header. Address resolution: Servers in each service are configured to believe that they all belong to the same IP subnet. 
Hence, when an application sends a packet to an AA for the first time, the networking stack on the host generates a broadcast ARP request for the destination AA. The VL2 agent running on the host intercepts this ARP request and converts it to a unicast query to the VL2 directory system. The directory system answers the query with the LA of the ToR to which packets should be tunneled. The VL2 agent caches this mapping from AA to LA addresses, similar to a host's ARP cache, such that subsequent communication need not entail a directory lookup.

Access control via the directory service: A server cannot send packets to an AA if the directory service refuses to provide it with an LA through which it can route its packets. This means that the directory service can enforce access-control policies on all communication.

Hence, we choose a maximum acceptable response time on the order of milliseconds. For updates, however, the key requirement is reliability, and response time is less critical. Further, for updates that are scheduled ahead of time, as is typical of planned outages and upgrades, high throughput can be achieved by batching updates.

Consistency requirements: Conventional L2 networks provide eventual consistency for the IP to MAC address mapping, as hosts will use a stale MAC address to send packets until the ARP cache times out and a new ARP request is sent. VL2 aims for a similar goal, eventual consistency of AA-to-LA mappings coupled with a reliable update mechanism.

Figure: VL2 Directory System Architecture

4.3.2 Directory System Design

The differing performance requirements and workload patterns of lookups and updates led us to a two-tiered directory system architecture. Our design consists of (1) a modest number of read-optimized, replicated directory servers that cache AA-to-LA mappings and handle queries from VL2 agents, and (2) a small number of write-optimized, asynchronous replicated state machine (RSM) servers that offer a strongly consistent, reliable store of AA-to-LA mappings. The directory servers ensure low latency, high throughput, and high availability for a high lookup rate. Meanwhile, the RSM servers ensure strong consistency and durability, using the Paxos [] consensus algorithm, for a modest rate of updates.

Each directory server caches all the AA-to-LA mappings stored at the RSM servers and independently replies to lookups from agents using the cached state. Since strong consistency is not required, a directory server lazily synchronizes its local mappings with the RSM every 30 seconds. To achieve high availability and low latency, an agent sends a lookup to k (two in our prototype) randomly-chosen directory servers. If multiple replies are received, the agent simply chooses the fastest reply and stores it in its cache.

The network provisioning system sends directory updates to a randomly-chosen directory server, which then forwards the update to a RSM server. The RSM reliably replicates the update to every RSM server and then replies with an acknowledgment to the directory server, which in turn forwards the acknowledgment back to the originating client. As an optimization to enhance consistency, the directory server can optionally disseminate the acknowledged updates to a few other directory servers.
If the originating client does not receive an acknowledgment within a timeout (e.g., s), the client sends the same update to another directory server, trading response time for reliability and availability. Updating caches reactively: Since AA-to-LA mappings are cached at directory servers and in VL agents’ caches, an update can lead to inconsistency. To resolve inconsistency without wasting server and network resources, our design employs a reactive cacheupdate mechanism. The cache-update protocol leverages this observation: a stale host mapping needs to be corrected only when that mapping is used to deliver traffic. Specifically, when a stale mapping is used, some packets arrive at a stale LA—a ToR which does not host the destination server anymore. The ToR may forward a sample of such non-deliverable packets to a directory server, triggering the directory server to gratuitously correct the stale mapping in the source’s cache via unicast. 4.2.3 Backwards Compatibility This section describes how a VL network handles external traffic, as well as general layer- broadcast traffic. Interaction with hosts in the Internet: 20 of the traffic handled in our cloud-computing data centers is to or from the Internet, so the network must be able to handle these large volumes. Since VL employs a layer- routing fabric to implement a virtual layer network, the external traffic can directly flow across the highspeed silicon of the switches that make up VL, without being forced through gateway servers to have their headers rewritten, as required by some designs (e.g., Monsoon []). Servers that need to be directly reachable from the Internet (e.g., front-end web servers) are assigned two addresses: an LA in addition to the AA used for intra-data-center communication with backend servers. This LA is drawn from a pool that is announced via BGP and is externally reachable. Traffic from the Internet can then directly reach the server, and traffic from the server to external destinations will exit toward the Internet from the Intermediate switches, while being spread across the egress links by ECMP. Handling Broadcast: VL provides layer- semantics to applications for backwards compatibility, and that includes supporting broadcast and multicast. VL completely eliminates the most common sources of broadcast: ARP and DHCP. ARP is replaced by the directory system, and DHCP messages are intercepted at the ToR using conventional DHCP relay agents and unicast forwarded to DHCP servers. To handle other general layer- broadcast traffic, every service is assigned an IP multicast address, and all broadcast traffic in that service is handled via IP multicast using the servicespecific multicast address. The VL agent rate-limits broadcast traffic to prevent storms. 4.3 Maintaining Host Information using the VL2 Directory System The VL directory provides three key functions: () lookups and () updates for AA-to-LA mappings; and () a reactive cache update mechanism so that latency-sensitive updates (e.g., updating the AA to LA mapping for a virtual machine undergoing live migration) happen quickly. Our design goals are to provide scalability, reliability and high performance. 4.3.1 Characterizing requirements We expect the lookup workload for the directory system to be frequent and bursty. As discussed in Section ., servers can communicate with up to hundreds of other servers in a short time period with each flow generating a lookup for an AA-to-LA mapping. 
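One reason lookups stay fast under this bursty workload is the agent-side strategy described in §4.3.2: query k randomly chosen directory servers and take whichever answers first. A minimal sketch, where the directory servers are stand-in callables rather than real network endpoints:

import concurrent.futures
import random

def lookup(aa, directory_servers, k=2, timeout=0.01):
    """Query k randomly chosen directory servers and return the first reply,
    as the VL2 agent does; directory_servers are stand-in callables aa -> LA."""
    chosen = random.sample(directory_servers, k)
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=k)
    futures = [pool.submit(server, aa) for server in chosen]
    done, _ = concurrent.futures.wait(futures, timeout=timeout,
                                      return_when=concurrent.futures.FIRST_COMPLETED)
    pool.shutdown(wait=False, cancel_futures=True)   # don't wait for the slower replica
    if done:
        return next(iter(done)).result()             # fastest reply wins
    return None  # timed out: caller retries elsewhere, trading latency for availability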
For updates, the workload is driven by failures and server startup events. As discussed in Section ., most failures are small in size and large correlated failures are rare.

Performance requirements: The bursty nature of the workload implies that lookups require high throughput and low response time.

5. EVALUATION

In this section we evaluate VL2 using a prototype running on an 80-server testbed and 10 commodity switches (Figure ). Our goals are first to show that VL2 can be built from components that are available today, and second, that our implementation meets the objectives described in Section .

[Figure: aggregate goodput (Gbps) and number of active flows vs. time (s).] Figure: Aggregate goodput during a 2.7 TB shuffle among 75 servers.

All flows attempt to connect beginning at time 0 (some flows start late due to a bug in our traffic generator). VL2 completes the shuffle in 395 s. During the run, the sustained utilization of the core links in the Clos network is high. For the majority of the run, VL2 achieves an aggregate goodput of 58.8 Gbps. The goodput is evenly divided among the flows for most of the run, with a fairness index between the flows of 0.995 [], where 1.0 indicates perfect fairness (the per-flow goodput shows only a small standard deviation around its mean). This goodput is many times what the network in our current data centers can achieve with the same investment (see §).

How close is VL2 to the maximum achievable throughput in this environment? To answer this question, we compute the goodput efficiency for this data transfer. The goodput efficiency of the network for any interval of time is defined as the ratio of the sent goodput summed over all interfaces divided by the sum of the interface capacities. An efficiency of 1 would mean that all the capacity on all the interfaces is entirely used carrying useful bytes from the time the first flow starts to when the last flow ends.

To calculate the goodput efficiency, two sources of inefficiency must be accounted for. First, to achieve a performance efficiency of 1, the server network interface cards must be completely full-duplex: able to both send and receive 1 Gbps simultaneously. Measurements show our interfaces are able to support a sustained rate of 1.8 Gbps (summing the sent and received capacity), introducing an inefficiency of 1 − 1.8/2 = 10%. The source of this inefficiency is largely the device driver implementation. Second, for every two full-size data packets there is a TCP ACK, and these three frames carry unavoidable overhead from Ethernet, IP and TCP headers for every two packets' worth of data sent over the network. This results in an inefficiency of roughly 7%. Therefore, our current testbed has an intrinsic inefficiency of about 17%, resulting in a maximum achievable goodput for our testbed of 75 × 0.83 ≈ 62.3 Gbps. We derive this number by noting that every unit of traffic has to sink at a server, of which there are 75 instances and each has a 1 Gbps link. Taking this into consideration, the VL2 network sustains an efficiency of 58.8/62.3 = 94%, with the difference from perfect due to the encapsulation headers, TCP congestion control dynamics, and TCP retransmissions.
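The arithmetic behind these efficiency figures can be laid out explicitly (the header-overhead factor below is chosen to reproduce the 62.3 Gbps figure above, not measured independently):

servers = 75
nic_ideal_gbps = 2.0           # full duplex: 1 Gbps out + 1 Gbps in
nic_measured_gbps = 1.8        # sustained send+receive rate measured on our NICs
nic_eff = nic_measured_gbps / nic_ideal_gbps      # 0.90 -> the 10% driver-related loss
header_eff = 0.923             # ~8% Ethernet/IP/TCP overhead for two data frames + one ACK
intrinsic_eff = nic_eff * header_eff              # ~0.83
max_goodput = servers * 1.0 * intrinsic_eff       # 75 servers x 1 Gbps x 0.83 = 62.3 Gbps
print(round(max_goodput, 1), round(58.8 / max_goodput, 2))   # 62.3  0.94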
To put this number in perspective, we note that a conventional hierarchical design with  servers per rack and : oversubscription at the first level switch would take x longer to shuffle the same amount of data as traffic from each server not in the rack () to each server within the rack () needs to flow through the Gbps downlink from first level switch to the ToR switch. The  efficiency combined with the fairness index of . demonstrates that VL promises to achieve uniform high bandwidth across all servers in the data center. Figure : VL testbed comprising  servers and  switches. The testbed is built using the Clos network topology of Figure , consisting of  Intermediate switches,  Aggregation switches and  ToRs. The Aggregation and Intermediate switches have  Gbps Ethernet ports, of which  ports are used on each Aggregation switch and  ports on each Intermediate switch. The ToRs switches have  Gbps ports and  Gbps ports. Each ToR is connected to two Aggregation switches via Gbps links, and to  servers via Gbps links. Internally, the switches use commodity ASICs — the Broadcom  and  — although any switch that supports line rate L forwarding, OSPF, ECMP, and IPinIP decapsulation will work. To enable detailed analysis of the TCP behavior seen during experiments, the servers’ kernels are instrumented to log TCP extended statistics [] (e.g., congestion window (cwnd) and smoothed RTT) after each socket buffer is sent (typically KB in our experiments). This logging only marginally affects goodput, i.e., useful information delivered per second to the application layer. We first investigate VL’s ability to provide high and uniform network bandwidth between servers. Then, we analyze performance isolation and fairness between traffic flows, measure convergence after link failures, and finally, quantify the performance of address resolution. Overall, our evaluation shows that VL provides an effective substrate for a scalable data center network; VL achieves ()  optimal network capacity, () a TCP fairness index of ., () graceful degradation under failures with fast reconvergence, and () K lookups/sec under ms for fast address resolution. 5.1 VL2 Provides Uniform High Capacity A central objective of VL is uniform high capacity between any two servers in the data center. How closely does the performance and efficiency of a VL network match that of a Layer  switch with : over-subscription? To answer this question, we consider an all-to-all data shuffle stress test: all servers simultaneously initiate TCP transfers to all other servers. This data shuffle pattern arises in large scale sorts, merges and join operations in the data center. We chose this test because, in our interactions with application developers, we learned that many use such operations with caution, because the operations are highly expensive in today’s data center network. However, data shuffles are required, and, if data shuffles can be efficiently supported, it could have large impact on the overall algorithmic and data storage strategy. We create an all-to-all data shuffle traffic matrix involving  servers. Each of  servers must deliver MB of data to each of the  other servers - a shuffle of . TB from memory to memory. Figure  shows how the sum of the goodput over all flows varies with time during a typical run of the . TB data shuffle. 
All data is carried over TCP connections.

5.2 VL2 Provides VLB Fairness

Due to its use of an anycast address on the intermediate switches, VL2 relies on ECMP to split traffic in equal ratios among the intermediate switches. Because ECMP does flow-level splitting, coexisting elephant and mice flows might be split unevenly at small time scales.

[Figure: Jain's fairness index of the split from each Aggregation switch (Agg1, Agg2, Agg3) to the Intermediate switches vs. time (s).] Figure: Fairness measures how evenly flows are split to intermediate switches from aggregation switches.

[Figure: aggregate goodput (Gbps) of Service 1 and Service 2 vs. time (s).] Figure: Aggregate goodput of two services with servers intermingled on the ToRs. Service one's goodput is unaffected as service two ramps traffic up and down.

To evaluate the effectiveness of VL2's implementation of Valiant Load Balancing in splitting traffic evenly across the network, we created an experiment on our testbed with traffic characteristics extracted from the DC workload of Section . Each server initially picks a value from the distribution of number of concurrent flows and maintains this number of flows throughout the experiment. At the start, or after a flow completes, it picks a new flow size from the associated distribution and starts the flow(s). Because all flows pass through the Aggregation switches, it is sufficient to check at each Aggregation switch for the split ratio among the links to the Intermediate switches. We do so by periodically collecting SNMP counters for all links from Aggregation to Intermediate switches. Before proceeding further, we note that, unlike the efficiency experiment above, the traffic mix here is indicative of actual data center workload. We mimic the flow size distribution and the number of concurrent flows observed by measurements in §.

In Figure , for each Aggregation switch, we plot Jain's fairness index [] for the traffic to Intermediate switches as a time series. The average utilization of the links was modest. As shown in the figure, the VLB split ratio fairness index stays close to 1 for all Aggregation switches over the duration of this experiment. VL2 achieves such high fairness because there are enough flows at the Aggregation switches that randomization benefits from statistical multiplexing. This evaluation validates that our implementation of VLB is an effective mechanism for preventing hot spots in a data center network.

Our randomization-based traffic splitting in Valiant Load Balancing takes advantage of the 10x gap in speed between server line cards and core network links. If the core network were built out of links with the same speed as the server line cards, then only one full-rate flow would fit on each link, and the spreading of flows would have to be perfect in order to prevent two long-lived flows from traversing the same link and causing congestion. However, splitting at a sub-flow granularity (for example, flowlet switching []) might alleviate this problem.

[Figure: aggregate goodput (Gbps) of service one and the number of mice started by service two vs. time (s).] Figure: Aggregate goodput of service one as service two creates bursts containing successively more short TCP connections.
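For reference, the split-ratio fairness plotted above is Jain's index computed over the per-link rates from each Aggregation switch to the Intermediate switches; a minimal version with made-up rates:

def jains_index(rates):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2); 1.0 = perfectly even split."""
    n, s = len(rates), sum(rates)
    return (s * s) / (n * sum(r * r for r in rates)) if s > 0 else 1.0

# Rates come from SNMP byte counters on each Aggregation-to-Intermediate link:
print(jains_index([9.8, 10.1, 10.0]))   # ~1.0: an even VLB split
print(jains_index([28.0, 1.0, 1.0]))    # ~0.38: a badly skewed split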
5.3 VL2 Provides Performance Isolation

One of the primary objectives of VL2 is agility, which we define as the ability to assign any server, anywhere in the data center, to any service (§). Achieving agility critically depends on providing sufficient performance isolation between services, so that if one service comes under attack or a bug causes it to spray packets, it does not adversely impact the performance of other services. Performance isolation in VL2 rests on the mathematics of VLB: any traffic matrix that obeys the hose model is routed by splitting to intermediate nodes in equal ratios (through randomization) to prevent any persistent hot spots. Rather than have VL2 perform admission control or rate shaping to ensure that the traffic offered to the network conforms to the hose model, we instead rely on TCP to ensure that each flow offered to the network is rate-limited to its fair share of its bottleneck.

A key question we need to validate for performance isolation is whether TCP reacts sufficiently quickly to control the offered rate of flows within services. TCP works with packets and adjusts their sending rate at the time-scale of RTTs. Conformance to the hose model, however, requires instantaneous feedback to avoid oversubscription of traffic ingress/egress bounds. Our next set of experiments shows that TCP is "fast enough" to enforce the hose model for traffic in each service so as to provide the desired performance isolation across services.

In this experiment, we add two services to the network. The first service has  servers allocated to it, and each server starts a single TCP transfer to one other server at time ; these flows last for the duration of the experiment. The second service starts with one server at  seconds, and a new server is assigned to it every  seconds, for a total of  servers. Every server in service two starts an  GB transfer over TCP as soon as it starts up. Both services' servers are intermingled among the  ToRs to demonstrate agile assignment of servers. Figure  shows the aggregate goodput of both services as a function of time. As seen in the figure, there is no perceptible change to the aggregate goodput of service one as the flows in service two start or complete, demonstrating performance isolation when the traffic consists of large long-lived flows. Through extended TCP statistics, we inspected the congestion window size (cwnd) of service one's TCP flows and found that the flows fluctuate around their fair share briefly due to service two's activity but stabilize quickly.

Figure : Aggregate goodput of two services with servers intermingled on the ToRs. Service one's goodput is unaffected as service two ramps traffic up and down.

We would expect that a service sending unlimited rates of UDP traffic might violate the hose model and hence performance isolation. We do not observe such UDP traffic in our data centers, although techniques such as STCP to make UDP "TCP friendly" are well known if needed []. However, large numbers of short TCP connections (mice), which are common in DCs (Section ), have the potential to cause problems similar to UDP, as each flow can transmit small bursts of packets during slow start. To evaluate this aspect, we conduct a second experiment with service one sending long-lived TCP flows, as in experiment one. Servers in service two create bursts of short TCP connections ( to  KB), each burst containing progressively more connections. Figure  shows the aggregate goodput of service one's flows along with the total number of TCP connections created by service two. Again, service one's goodput is unaffected by service two's activity. We inspected the cwnd of service one's TCP flows and found only brief fluctuations due to service two's activity.

Figure : Aggregate goodput of service one as service two creates bursts containing successively more short TCP connections.

The two experiments above demonstrate that TCP's natural enforcement of the hose model, combined with VLB and a network with no oversubscription, is sufficient to provide performance isolation between services.

5.4 VL2 Convergence After Link Failures

In this section, we evaluate VL2's response to a link or a switch failure, which could be caused by a physical failure or by the routing protocol converting a link flap into a link failure. We begin an all-to-all data shuffle and then disconnect links between Intermediate and Aggregation switches until only one Intermediate switch remains connected and the removal of one additional link would partition the network. According to our study of failures, this type of mass link failure has never occurred in our data centers, but we use it as an illustrative stress test.

Figure  shows a time series of the aggregate goodput achieved by the flows in the data shuffle, with the times at which links were disconnected and then reconnected marked by vertical lines. The figure shows that OSPF re-converges quickly (sub-second) after each failure. Both Valiant Load Balancing and ECMP work as expected, and the maximum capacity of the network gracefully degrades. Restoration, however, is delayed by the conservative defaults for OSPF timers, which are slow to act on link restoration; hence, VL2 fully uses a link roughly  s after it is restored. We note, however, that restoration does not interfere with traffic, and the aggregate goodput eventually returns to its previous level.

Figure : Aggregate goodput as all links to switches Intermediate  and Intermediate  are unplugged in succession and then reconnected in succession. Approximate times of link manipulation are marked with vertical lines. The network re-converges in < 1 s after each failure and demonstrates graceful degradation.

This experiment also demonstrates the behavior of VL2 when the network is structurally oversubscribed, i.e., when the Clos network has less capacity than the capacity of the links from the ToRs. For the over-subscription ratios between : and : created during this experiment, VL2 continues to carry the all-to-all traffic at roughly  of maximum efficiency, indicating that the traffic spreading in VL2 fully utilizes the available capacity.

5.5 Directory-system performance

Finally, we evaluate the performance of the VL2 directory system through macro- and micro-benchmark experiments. We run our prototype on up to  machines, with - RSM nodes, - directory server nodes, and the rest emulating multiple instances of VL2 agents that generate lookups and updates. In all experiments, the system is configured such that an agent sends a lookup request to two directory servers chosen at random and accepts the first response. An update request is sent to a directory server chosen at random. The response timeout for lookups and updates is set to  s to measure the worst-case latency. To stress test the directory system, the VL2 agent instances generate lookups and updates following a bursty random process, emulating storms of lookups and updates. Each directory server refreshes all mappings ( K) from the RSM once every  seconds.
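As an illustration of the lookup path used in these experiments, the sketch below races a lookup against two randomly chosen directory servers and returns whichever response arrives first. The query_server RPC helper, server list, and timeout value are hypothetical stand-ins for the agent's actual transport.

    # Sketch: resolve a server address by querying two random directory servers
    # and accepting the first reply. `query_server(server, name)` is a
    # hypothetical blocking RPC; the real agent's wire format differs.
    import random
    from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

    def lookup(name, directory_servers, query_server, timeout_sec=2.0):
        picked = random.sample(directory_servers, 2)
        pool = ThreadPoolExecutor(max_workers=2)
        futures = [pool.submit(query_server, server, name) for server in picked]
        done, _ = wait(futures, timeout=timeout_sec, return_when=FIRST_COMPLETED)
        pool.shutdown(wait=False)              # do not block on the slower replica
        for future in done:
            if future.exception() is None:
                return future.result()         # the mapping for `name`
        raise TimeoutError("no directory server answered for " + name)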
Our evaluation supports four main conclusions. First, the directory system provides high throughput and fast response time for lookups; three directory servers can handle  K lookups/sec with latency under  ms (th percentile latency). Second, the directory system can handle updates at rates significantly higher than the expected churn rate in typical environments: three directory servers can handle  K updates/sec within  ms (th percentile latency). Third, our system is incrementally scalable; each directory server increases the processing rate by about  K for lookups and  K for updates. Finally, the directory system is robust to component (directory or RSM server) failures and offers high availability under network churn.

Throughput: In the first micro-benchmark, we vary the lookup and update rate and observe the response latencies (st, th, and th percentile). We observe that a directory system with three directory servers handles  K lookups/sec within  ms, which we set as the maximum acceptable latency for an "ARP request". Up to  K lookups/sec, the system offers a median response time of <  ms. Updates, however, are more expensive, as they require executing a consensus protocol [] to ensure that all RSM replicas are mutually consistent. Since high throughput is more important than latency for updates, we batch updates over a short time interval (i.e.,  ms). We find that three directory servers backed by three RSM servers can handle  K updates/sec within  ms and about  K updates/sec within  s.

Scalability: To understand the incremental scalability of the directory system, we measured the maximum lookup rates (ensuring sub- ms latency for  requests) with , , and  directory servers. The results confirm that the maximum lookup rate increases linearly with the number of directory servers (with each server offering a capacity of 17K lookups/sec). Based on this result, we estimate the worst-case number of directory servers needed for a  K-server data center. From the concurrent flow measurements (Figure ), we select as a baseline the median of  correspondents per server. In the worst case, all  K servers may perform  simultaneous lookups at the same time, resulting in a million simultaneous lookups per second. As noted above, each directory server can handle about  K lookups/sec under  ms at the th percentile. Therefore, handling this worst case requires a modest-sized directory system of about  servers (about 0.06% of the server pool).
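This worst-case estimate is straightforward arithmetic; the sketch below redoes it with illustrative inputs. The pool size and per-server lookup count are hypothetical placeholders; only the roughly 17K lookups/sec per directory server comes from the measurement above.

    import math

    # Back-of-the-envelope sizing of the directory system. The pool size and
    # simultaneous lookups per server are hypothetical; 17,000 lookups/sec per
    # directory server reflects the capacity measured above.
    def directory_servers_needed(server_pool, lookups_per_server,
                                 per_directory_server_rate=17_000):
        peak_lookups_per_sec = server_pool * lookups_per_server
        return math.ceil(peak_lookups_per_sec / per_directory_server_rate)

    # e.g., 100,000 servers each resolving 10 addresses at the same instant
    print(directory_servers_needed(100_000, 10))   # about 59 directory servers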
Figure (b) shows that failures of individual directory servers do not collapse the entire system’s processing capacity to handle updates. The step pattern on the curves is due to a batching of updates (occurring every ms). We also find that the primary RSM server’s failure leads restoring links Figure : Aggregate goodput as all links to switches Intermediate and Intermediate are unplugged in succession and then reconnected in succession. Approximate times of link manipulation marked with vertical lines. Network re-converges in < 1s after each failure and demonstrates graceful degradation. service one’s goodput is unaffected by service two’s activity. We inspected the cwnd of service one’s TCP flows and found only brief fluctuations due to service two’s activity. The two experiments above demonstrate TCP’s natural enforcement of the hose model combined with VLB and a network with no oversubscription is sufficient to provide performance isolation between services. 5.4 VL2 Convergence After Link Failures In this section, we evaluate VL’s response to a link or a switch failure, which could be caused by a physical failure or due to the routing protocol converting a link flap to a link failure. We begin an all-to-all data shuffle and then disconnect links between Intermediate and Aggregation switches until only one Intermediate switch remains connected and the removal of one additional link would partition the network. According to our study of failures, this type of mass link failure has never occurred in our data centers, but we use it as an illustrative stress test. Figure  shows a time series of the aggregate goodput achieved by the flows in the data shuffle, with the times at which links were disconnected and then reconnected marked by vertical lines. The figure shows that OSPF re-converges quickly (sub-second) after each failure. Both Valiant Load Balancing and ECMP work as expected, and the maximum capacity of the network gracefully degrades. Restoration, however, is delayed by the conservative defaults for OSPF timers that are slow to act on link restoration. Hence, VL fully uses a link roughly s after it is restored. We note, however, that restoration does not interfere with traffic and, the aggregate goodput eventually returns to its previous level. This experiment also demonstrates the behavior of VL when the network is structurally oversubscribed, i.e., the Clos network has less capacity than the capacity of the links from the ToRs. For the over-subscription ratios between : and : created during this experiment, VL continues to carry the all-to-all traffic at roughly  of maximum efficiency, indicating that the traffic spreading in VL fully utilizes the available capacity. 5.5 Directory-system performance Finally, we evaluate the performance of the VL directory system through macro- and micro-benchmark experiments. We run our prototype on up to  machines with - RSM nodes, - directory server nodes, and the rest emulating multiple instances of VL agents that generate lookups and updates. In all experiments, the system is configured such that an agent sends a lookup request to two directory servers chosen at random and accepts the first response. An update request is sent to a directory server chosen at random. The response timeout for lookups and updates is set to s to measure the worst-case latency. 
Figure : The directory system provides high throughput and fast response time for lookups and updates.

Fast reconvergence and robustness: Finally, we evaluate the convergence latency of updates, i.e., the time between when an update occurs and when a lookup response reflects that update. As described in Section ., we minimize convergence latency by having each directory server proactively send its committed updates to other directory servers. Figure (c) shows that the convergence latency is within  ms for  of the updates, and  of updates have convergence latency within  ms.

Figure : CDF of normalized link utilizations for VLB, adaptive, and best oblivious routing schemes, showing that VLB (and best oblivious routing) comes close to matching the link utilization performance of adaptive routing.

6. DISCUSSION

In this section, we address several remaining concerns about the VL2 architecture, including whether other traffic engineering mechanisms might be better suited to the data center than Valiant Load Balancing, and the cost of a VL2 network.

Optimality of VLB: As noted in §.., VLB uses randomization to cope with volatility, potentially sacrificing some performance for a best-case traffic pattern by turning all traffic patterns (including both best-case and worst-case) into the average case. This performance loss will manifest itself as the utilization of some links being higher than it would be under a more optimal traffic engineering system. To quantify the increase in link utilization VLB will suffer, we compare VLB's maximum link utilization with that achieved by other routing strategies on the VL2 topology for a full day's traffic matrices (TMs) (at  min intervals) from the data center traffic data reported in Section . We first compare to adaptive routing (e.g., TeXCP []), which routes each TM separately so as to minimize the maximum link utilization for that TM, essentially upper-bounding the best performance that real-time adaptive traffic engineering could achieve. Second, we compare to the best oblivious routing over all TMs, chosen to minimize the maximum link utilization. (Note that VLB is just one among many oblivious routing strategies.) For adaptive and best oblivious routing, the routings are computed using linear programs in cplex. The overall utilization for a link in all schemes is computed as the maximum utilization over all routed TMs. In Figure , we plot the CDF of link utilizations for the three schemes. We normalize the link utilization numbers so that the maximum utilization on any link under adaptive routing is 1.0. The results show that, for the median-utilization link in each scheme, VLB performs about the same as the other two schemes. For the most heavily loaded link in each scheme, VLB's link capacity usage is about  higher than that of the other two schemes. Thus, evaluations on actual data center workloads show that the simplicity and universality of VLB cost relatively little capacity when compared to much more complex traffic engineering schemes.
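The comparison above reduces to a small post-processing step over per-link utilizations. The sketch below assumes the per-TM utilizations for each scheme have already been produced (the VLB split and the LP-based adaptive and oblivious routings are outside its scope), and the data layout is an assumption for illustration.

    # Sketch: per-scheme worst-case link utilization, normalized so that the
    # most-loaded link under adaptive routing sits at 1.0, sorted for a CDF.
    # `schemes[name]` is a list of {link: utilization} dicts, one per routed TM.
    def per_link_max(tm_utilizations):
        links = {}
        for tm in tm_utilizations:
            for link, u in tm.items():
                links[link] = max(links.get(link, 0.0), u)
        return links

    def normalized_cdf_points(schemes):
        scale = max(per_link_max(schemes["adaptive"]).values())
        return {name: sorted(u / scale for u in per_link_max(tms).values())
                for name, tms in schemes.items()}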
Cost and Scale: With the range of low-cost commodity devices currently available, the VL2 topology can scale to create networks with no over-subscription between all the servers of even the largest data centers. For example, switches with  ports (D = 144) are available today for  K, enabling a network that connects  K servers using the topology in Figure  and up to  K servers using a slight variation. Using switches with D = 24 ports (which are available today for  K each), we can connect about  K servers. Comparing the cost of a VL2 network for  K servers with a typical network found in our data centers shows that a VL2 network with no over-subscription can be built for the same cost as the current network that has : over-subscription. Building a conventional network with no over-subscription would cost roughly x the cost of an equivalent VL2 network with no over-subscription. We find that the same factor of - cost difference holds across a range of over-subscription ratios from : to :. (We use street prices for switches in both architectures and leave out ToR and cabling costs.) Building an oversubscribed VL2 network does save money (e.g., a VL2 network with : over-subscription costs  less than a non-oversubscribed VL2 network), but the savings are probably not worth the loss in performance.
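To show how port count translates into server count, here is a back-of-the-envelope sketch. It assumes a non-oversubscribed scale-out Clos in which each ToR hosts 20 servers on 1 Gbps links behind two 10 Gbps uplinks, Aggregation switches split their ports evenly between ToRs and Intermediate switches, and Intermediate switches devote all ports to Aggregation switches; these structural constants are assumptions for illustration, not figures taken from the text above.

    # Rough Clos sizing under the assumed structural constants stated above.
    def clos_capacity(d_a, d_i, servers_per_tor=20):
        n_intermediate = d_a // 2                 # one Agg uplink per Intermediate
        n_aggregation = d_i                       # one Intermediate port per Aggregation
        n_tor = (d_a * d_i) // 4                  # each ToR consumes two Agg down-ports
        return {"intermediate": n_intermediate, "aggregation": n_aggregation,
                "tor": n_tor, "servers": n_tor * servers_per_tor}

    print(clos_capacity(144, 144)["servers"])     # ~104K servers with 144-port switches
    print(clos_capacity(24, 24)["servers"])       # ~2.9K servers with 24-port switches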
7. RELATED WORK

Data-center network designs: Monsoon [] and Fat-tree [] also propose building a data center network using commodity switches and a Clos topology. Monsoon is designed on top of layer  and reinvents fault-tolerant routing mechanisms already established at layer . Fat-tree relies on a customized routing primitive that does not yet exist in commodity switches. VL2, in contrast, achieves hot-spot-free routing and scalable layer- semantics using forwarding primitives available today and minor, application-compatible modifications to host operating systems. Further, our experiments using traffic patterns from a real data center show that random flow spreading leads to a network utilization fairly close to the optimum, obviating the need for the complicated and expensive optimization scheme suggested by Fat-tree. We cannot empirically compare with these approaches because they do not provide results on communication-intensive operations (e.g., data shuffles) that stress the network, they require special hardware [], and they do not support agility and performance isolation.

DCell [] proposes a dense interconnection network built by adding multiple network interfaces to servers and having the servers forward packets. VL2 also leverages the programmability of servers; however, it uses servers only to control the way traffic is routed, as switch ASICs are much better at forwarding. Furthermore, DCell incurs significant cabling complexity that may prevent large deployments. BCube [] builds on DCell, incorporating switches for faster processing and active probing for load-spreading.

Valiant Load Balancing: Valiant introduced VLB as a randomized scheme for communication among parallel processors interconnected in a hypercube topology []. Among its recent applications, VLB has been used inside the switching fabric of a packet switch []. VLB has also been proposed, with modifications and generalizations [, ], for oblivious routing of variable traffic on the Internet under the hose traffic model [].

Scalable routing: The Locator/ID Separation Protocol [] proposes "map-and-encap" as a key principle to achieve scalability and mobility in Internet routing. VL2's control plane takes a similar approach (i.e., demand-driven host-information resolution and caching) but adapts it to the data center environment and implements it on end hosts. SEATTLE [] proposes a distributed host-information resolution system running on switches to enhance Ethernet's scalability. VL2 takes an end-host-based approach to this problem, which allows its solution to be implemented today, independent of the switches being used. Furthermore, SEATTLE does not provide scalable data-plane primitives, such as multi-path, which are critical for scalability and for increasing utilization of network resources.

Commercial networks: Data Center Ethernet (DCE) [] by Cisco and other switch manufacturers shares VL2's goal of increasing network capacity through multi-path. However, these industry efforts are primarily focused on consolidation of IP and storage area network (SAN) traffic, which is rare in cloud-service data centers. Due to the requirement to support loss-less traffic, their switches need much bigger buffers (tens of MBs) than commodity Ethernet switches do (tens of KBs), driving their cost higher.

8. SUMMARY

VL2 is a new network architecture that puts an end to the need for oversubscription in the data center network, a result that would be prohibitively expensive with the existing architecture.

VL2 benefits the cloud service programmer. Today, programmers have to be aware of network bandwidth constraints and constrain server-to-server communications accordingly. VL2 instead provides programmers the simpler abstraction that all servers assigned to them are plugged into a single layer  switch, with hot-spot-free performance regardless of where the servers are actually connected in the topology. VL2 also benefits the data center operator, as today's bandwidth and control-plane constraints fragment the server pool, leaving servers (which account for the lion's share of data center cost) under-utilized even while demand elsewhere in the data center is unmet. Instead, VL2 enables agility: any service can be assigned to any server, while the network maintains uniform high bandwidth and performance isolation between services.

VL2 is a simple design that can be realized today with available networking technologies, and without changes to switch control and data plane capabilities. The key enablers are an addition to the end-system networking stack, through well-established and public APIs, and a flat addressing scheme, supported by a directory service.

VL2 is efficient. Our working prototype, built using commodity switches, approaches in practice the high level of performance that the theory predicts. Experiments with two data-center services showed that churn (e.g., dynamic re-provisioning of servers, change of link capacity, and micro-bursts of flows) has little impact on TCP goodput. VL2's implementation of Valiant Load Balancing splits flows evenly, and VL2 achieves high TCP fairness. On all-to-all data shuffle communications, the prototype sustains an efficiency of  with a TCP fairness index of ..

Acknowledgements

The many comments from our shepherd, David Andersen, and the anonymous reviewers greatly improved the final version of this paper. John Dunagan provided invaluable help implementing the Directory System.

9. REFERENCES

[] M. Al-Fares, A. Loukissas, and A. Vahdat. A scalable, commodity data center network architecture. In SIGCOMM, .
[] M. Armbrust, A. Fox, R. Griffith, et al. Above the Clouds: A Berkeley View of Cloud Computing. UC Berkeley TR UCB/EECS--.
[] C. Chang, D. Lee, and Y. Jou. Load balanced Birkhoff-von Neumann switches, part I: one-stage buffering. IEEE HPSR, .
[] Cisco. Data center Ethernet. http://www.cisco.com/go/dce.
[] Cisco. Data center: Load balancing data center services, .
[] K. C. Claffy, H.-W. Braun, and G. C. Polyzos. A parameterizable methodology for Internet traffic flow profiling. JSAC, , .
[] W. J. Dally and B. Towles. Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers, .
[] N. G. Duffield, P. Goyal, A. G. Greenberg, P. P. Mishra, K. K. Ramakrishnan, and J. E. van der Merwe. A flexible model for resource management in virtual private networks. In SIGCOMM, .
[] D. Farinacci, V. Fuller, D. Oran, D. Meyer, and S. Brim. Locator/ID Separation Protocol (LISP). Internet draft, Dec. .
[] A. Greenberg, J. R. Hamilton, D. A. Maltz, and P. Patel. The cost of a cloud: research problems in data center networks. CCR, (), .
[] A. Greenberg, P. Lahiri, D. A. Maltz, P. Patel, and S. Sengupta. Towards a next generation data center architecture: Scalability and commoditization. In PRESTO Workshop at SIGCOMM, .
[] C. Guo, H. Wu, K. Tan, L. Shi, Y. Zhang, and S. Lu. DCell: A scalable and fault-tolerant network structure for data centers. In SIGCOMM, .
[] C. Guo, H. Wu, K. Tan, L. Shi, Y. Zhang, and S. Lu. BCube: A high performance, server-centric network architecture for modular data centers. In SIGCOMM, .
[] M. Handley, S. Floyd, J. Padhye, and J. Widmer. TCP friendly rate control (TFRC): Protocol specification. RFC , .
[] R. Jain. The Art of Computer Systems Performance Analysis. John Wiley and Sons, Inc., .
[] S. Kandula, D. Katabi, B. Davie, and A. Charny. Walking the Tightrope: Responsive yet Stable Traffic Engineering. In SIGCOMM, .
[] C. Kim, M. Caesar, and J. Rexford. Floodless in SEATTLE: a scalable Ethernet architecture for large enterprises. In SIGCOMM, .
[] M. Kodialam, T. V. Lakshman, and S. Sengupta. Efficient and Robust Routing of Highly Variable Traffic. In HotNets, .
[] L. Lamport. The part-time parliament. ACM Transactions on Computer Systems, :–, .
[] M. Mathis, J. Heffner, and R. Raghunarayan. TCP extended statistics MIB. RFC , .
[] S. Sinha, S. Kandula, and D. Katabi. Harnessing TCP's burstiness with flowlet switching. In HotNets, .
[] Y. Zhang and Z. Ge. Finding critical traffic matrices. In DSN, June .
[] R. Zhang-Shen and N. McKeown. Designing a Predictable Internet Backbone Network. In HotNets, .