Datacenters
Data center networks
10's to 100's of thousands of hosts, often closely coupled, in close proximity:
• e-business (e.g., Amazon)
• content-servers (e.g., YouTube, Akamai, Apple, Microsoft)
• search engines, data mining (e.g., Google)
challenges:
• multiple applications, each serving massive numbers of clients
• managing/balancing load; avoiding processing, networking, and data bottlenecks
[Photo: inside a 40-ft Microsoft container, Chicago data center]
Data center networks
load balancer: application-layer routing
• receives external client requests
• directs workload within data center
• returns results to external client (hiding data center internals from client)
[Figure: Internet connects via a border router and access routers to load balancers, Tier-1 switches, Tier-2 switches (A, B, C), TOR switches, and server racks 1-8]
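A toy sketch (mine, not from the slides) of the dispatch role described above; the backend addresses are hypothetical, and round-robin stands in for whatever policy a real load balancer would use:

```python
import itertools

# hypothetical internal servers behind the load balancer
BACKENDS = ["10.0.1.11", "10.0.1.12", "10.0.1.13"]
_next_backend = itertools.cycle(BACKENDS)

def dispatch(request: dict) -> dict:
    """Pick an internal server for an external request (application-layer routing)."""
    backend = next(_next_backend)
    # forward the request inside the data center, then relay the result so the
    # external client only ever sees the load balancer's address
    return {"backend": backend, "request": request}

# dispatch({"path": "/search", "client": "198.51.100.7"}) -> sent to 10.0.1.11,
# the next request goes to 10.0.1.12, and so on
```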
Data center networks
rich interconnection among switches, racks:
• increased throughput between racks (multiple routing paths possible)
• increased reliability via redundancy
[Figure: Tier-1 switches, Tier-2 switches, TOR switches, server racks 1-8]
Broad questions
How are massive numbers of commodity machines networked inside a data center?
Virtualization: how to effectively manage physical machine resources across client virtual machines?
Operational costs:
• Server equipment
• Power and cooling
[Figure: breakdown with respect to data center size. Source: NRDC research paper]
Computer Networking: A Top-Down Approach, 6th edition, Jim Kurose and Keith Ross, Addison-Wesley, March 2012.
[Figure: sending and receiving network adapter cards; the sending controller encapsulates a datagram in a frame, the receiving controller extracts the datagram]
sender:
• treat segment contents as sequence of 16-bit integers
• checksum: addition (1's complement sum) of segment contents
• sender puts checksum value into UDP checksum field
receiver:
• compute checksum of received segment
• check if computed checksum equals checksum field value:
  - NO: error detected
  - YES: no error detected. But maybe errors nonetheless?
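As a concrete illustration (mine, not from the slides), a minimal sketch of the 16-bit 1's complement checksum described above; the function names and padding choice are my own:

```python
def ones_complement_sum(data: bytes) -> int:
    """1's complement sum of the data viewed as 16-bit big-endian words."""
    if len(data) % 2:
        data += b"\x00"                              # pad odd-length input
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)     # wrap carries back around
    return total

def internet_checksum(segment: bytes) -> int:
    """Value the sender places in the UDP checksum field."""
    return ~ones_complement_sum(segment) & 0xFFFF

# receiver side: recompute the 1's complement sum over the received segment with
# the checksum field included; a result of 0xFFFF means no error detected
# (though, as the slide notes, errors may still be present)
```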
CRC: R = remainder[ (D · 2^r) / G ]
(D: data bits; G: the (r+1)-bit generator; division is modulo 2; the r-bit remainder R is appended to D)
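A minimal sketch (my own) of computing R by modulo-2 long division, assuming D and G are given as Python integers and r is the number of CRC bits:

```python
def crc_remainder(D: int, G: int, r: int) -> int:
    """Return R = remainder of (D * 2^r) / G under modulo-2 (XOR) division."""
    rem = D << r                        # append r zero bits to the data
    g_bits = G.bit_length()             # generator has r+1 bits
    while rem.bit_length() >= g_bits:
        shift = rem.bit_length() - g_bits
        rem ^= G << shift               # XOR generator aligned with the leading 1
    return rem                          # r-bit remainder to append to D

# example: D = 101110, G = 1001 (r = 3) gives R = 011
assert crc_remainder(0b101110, 0b1001, 3) == 0b011
```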
[Figures: TDMA, 6-slot frames with slots 1, 3, 4 occupied over time; FDMA, frequency bands assigned to nodes on an FDM cable]
Slotted ALOHA: pros and cons
Pros:
• single active node can continuously transmit at full rate of channel
• highly decentralized: only slots in nodes need to be in sync
• simple
Cons:
• collisions, wasting slots
• idle slots
• nodes may be able to detect collision in less than time to transmit packet
• clock synchronization
Slotted ALOHA: efficiency
• prob that given node has success in a slot = p(1-p)^(N-1)
• prob that any node has a success = Np(1-p)^(N-1)
• maximizing over p and letting N grow large gives max efficiency = 1/e
at best: channel used for useful transmissions only 37% of time!
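A short derivation, standard but not spelled out on the slide, of where the 37% comes from:

```latex
\[
\text{efficiency}(p) = N p (1-p)^{N-1}, \qquad
\frac{d}{dp}\bigl[N p (1-p)^{N-1}\bigr] = 0 \;\Rightarrow\; p^{*} = \frac{1}{N},
\]
\[
\text{efficiency}(p^{*}) = \Bigl(1-\frac{1}{N}\Bigr)^{N-1}
\;\longrightarrow\; \frac{1}{e} \approx 0.37 \quad (N \to \infty).
\]
```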
Pure (unslotted) ALOHA
unslotted ALOHA: simpler, no synchronization
• when frame first arrives: transmit immediately
• collision probability increases: frame sent at t0 collides with other frames sent in [t0-1, t0+1]
P(success by given node) = P(node transmits) · P(no other node transmits in [t0-1, t0]) · P(no other node transmits in [t0, t0+1])
  = p · (1-p)^(N-1) · (1-p)^(N-1)
  = p · (1-p)^(2(N-1))
choosing optimum p and letting N grow large: max efficiency = 1/(2e) = .18, even worse than slotted ALOHA!
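Similarly, a sketch (not on the slide) of the optimization behind the 1/(2e) figure:

```latex
\[
\text{efficiency}(p) = N p (1-p)^{2(N-1)}, \qquad p^{*} = \frac{1}{2N-1},
\]
\[
\text{efficiency}(p^{*}) = \frac{N}{2N-1}\Bigl(1-\frac{1}{2N-1}\Bigr)^{2(N-1)}
\;\longrightarrow\; \frac{1}{2}\cdot\frac{1}{e} = \frac{1}{2e} \approx 0.18 \quad (N \to \infty).
\]
```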
CSMA/CD efficiency
efficiency = 1 / (1 + 5·t_prop/t_trans)
(t_prop: max propagation delay between 2 nodes in the LAN; t_trans: time to transmit a max-size frame)
• efficiency goes to 1 as t_prop goes to 0
• efficiency goes to 1 as t_trans goes to infinity
better performance than ALOHA: and simple, cheap, decentralized!
Cable access network
Internet frames, TV channels, control transmitted downstream at different frequencies
[Figure: cable headend with CMTS (cable modem termination system) connected via splitter and shared cable to cable modems; downstream channel i and upstream channel j carried at different frequencies]
[Figure: each adapter on the LAN (wired or wireless) has a unique MAC address, e.g., 1A-2F-BB-76-09-AD, 71-65-F7-2B-08-53, 58-23-D7-FA-20-B0, 0C-C4-11-6F-E3-98]
[Figure, repeated across several walkthrough steps (A sends a datagram to B via router R): host A (IP 111.111.111.111, MAC 74-29-9C-E8-FF-55), host B (IP 222.222.222.222, MAC 49-BD-D2-C7-56-2A), router R with interface 222.222.222.220 / MAC 1A-23-F9-CD-06-9B; IP/Eth/Phy protocol stacks shown at A, R, and B]
Ethernet physical topologies: bus (coaxial cable) and star (with a switch at the center)
Ethernet frame structure
sending adapter encapsulates IP datagram (or other network layer protocol packet) in Ethernet frame
frame layout: preamble | dest. address | source address | type | data (payload) | CRC
preamble:
• 7 bytes with pattern 10101010 followed by one byte with pattern 10101011
• used to synchronize receiver, sender clock rates
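A small sketch (mine, not from the slides) of pulling the destination address, source address, and type field out of a raw frame with Python's struct module; the frame bytes are assumed to exclude the preamble, which the hardware strips:

```python
import struct

def _fmt_mac(mac: bytes) -> str:
    return "-".join(f"{b:02X}" for b in mac)

def parse_ethernet_header(frame: bytes):
    """Return (dest MAC, source MAC, EtherType) from a raw frame (preamble already removed)."""
    dest, src, eth_type = struct.unpack("!6s6sH", frame[:14])
    return _fmt_mac(dest), _fmt_mac(src), eth_type

# example: a type value of 0x0800 means the payload (frame[14:]) is an IP datagram
```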
[Figure: protocol stack (application, transport, network, link, physical); the link layer shares a common MAC protocol and frame format across many physical-layer Ethernet standards: 100BASE-TX, 100BASE-T2, 100BASE-FX, 100BASE-T4, 100BASE-SX, 100BASE-BX]
A A A’
switch learns which hosts
can be reached through B
C’
which interfaces
when frame received, 6 1 2
switch “learns”
location of sender: 5 4 3
incoming LAN segment
records sender/location B’ C
pair in switch table
A’
A A A’
frame destination, A’,
B
locaton unknown: flood C’
1
destination A location 6 2
known:selectively send A A’
5 4 3
on just one link B’ C
A’ A
A’
[Figure, shown twice: switches S1, S2, S3, S4 interconnecting hosts A-I; self-learning across multiple switches works as in the single-switch case]
VLANs
switch(es) supporting VLAN capabilities can be configured to define multiple virtual LANs over a single physical LAN infrastructure
[Figure: a single switch partitioned into Electrical Engineering (VLAN ports 1-8) and Computer Science (VLAN ports 9-15), each forming its own IP subnet]
802.1Q frame: preamble | dest. address | source address | 802.1Q VLAN tag | type | data (payload) | CRC
MPLS header
frame layout: PPP or Ethernet header | MPLS header | IP header | remainder of link-layer frame
MPLS header fields: label (20 bits), Exp (3 bits), S (1 bit), TTL (5 bits)
MPLS capable routers
a.k.a. label-switched router
• forward packets to outgoing interface based only on label value (don't inspect IP address)
• MPLS forwarding table distinct from IP forwarding tables
flexibility: MPLS forwarding decisions can differ from those of IP
• use destination and source addresses to route flows to same destination differently (traffic engineering)
• re-route flows quickly if link fails: pre-computed backup paths (useful for VoIP)
MPLS signaling and forwarding tables
[Figure: MPLS network with routers R1-R6 and destinations A, D; label-switched paths set up via RSVP-TE and modified link-state flooding]
example MPLS forwarding table entries for destination A:
router            in label   out label   dest   out interface
R1                6          -           A      0
(upstream of R1)  8          6           A      0
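A toy sketch (my own) of the label-swap forwarding step a label-switched router performs, using Python dicts keyed by incoming label; the labels and interfaces mirror the table above:

```python
# per-LSR forwarding table: in label -> (out label, out interface)
# (the dest column on the slide is informational; forwarding uses only the label)
R1_TABLE = {6: (None, 0)}        # None means pop the label (next hop is plain IP)
UPSTREAM_TABLE = {8: (6, 0)}     # swap label 8 -> 6, send out interface 0

def mpls_forward(table, packet):
    """Label-swap forwarding: no IP address is inspected."""
    out_label, out_iface = table[packet["label"]]
    if out_label is None:
        return {"payload": packet["payload"]}, out_iface          # label popped
    return {"label": out_label, "payload": packet["payload"]}, out_iface

# example:
# pkt = {"label": 8, "payload": "IP datagram to A"}
# pkt, iface = mpls_forward(UPSTREAM_TABLE, pkt)   # now label 6, interface 0
# pkt, iface = mpls_forward(R1_TABLE, pkt)         # label popped, deliver toward A
```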
Link layer, LANs: outline
5.1 introduction, services
5.2 error detection, correction
5.3 multiple access protocols
5.4 LANs
• addressing, ARP
• Ethernet
• switches
• VLANs
5.5 link virtualization: MPLS
5.6 data center networking
5.7 a day in the life of a web request
A day in the life: scenario
[Figure: laptop on the school network (68.80.2.0/24) requests a web page from www.google.com]
A day in the life… connecting to the Internet
connecting laptop needs to get its own IP address, addr of first-hop router, addr of DNS server: use DHCP
• DHCP request encapsulated in UDP, encapsulated in IP, encapsulated in 802.3 Ethernet
• Ethernet frame broadcast (dest: FFFFFFFFFFFF) on LAN, received at router running DHCP server
• Ethernet demuxed to IP demuxed, UDP demuxed to DHCP
[Figure: DHCP/UDP/IP/Eth/Phy protocol stacks at the laptop and at the router running the DHCP server]
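As an illustration of that encapsulation chain (DHCP in UDP in IP in 802.3 Ethernet), here is a hedged sketch using the scapy library; the interface name is hypothetical, and this is my example rather than anything from the slides:

```python
from scapy.all import Ether, IP, UDP, BOOTP, DHCP, sendp, get_if_hwaddr

iface = "eth0"                                   # hypothetical interface name
client_mac = get_if_hwaddr(iface)

discover = (
    Ether(src=client_mac, dst="ff:ff:ff:ff:ff:ff")        # link-layer broadcast
    / IP(src="0.0.0.0", dst="255.255.255.255")            # client has no IP address yet
    / UDP(sport=68, dport=67)                             # DHCP client -> server ports
    / BOOTP(chaddr=bytes.fromhex(client_mac.replace(":", "")), xid=0x1234)
    / DHCP(options=[("message-type", "discover"), "end"])
)

sendp(discover, iface=iface)   # the DHCP server's offer/ack will carry the client's IP,
                               # the first-hop router, and the DNS server address
```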
A day in the life… connecting to the Internet
• DHCP server formulates DHCP ACK containing client's IP address, IP address of first-hop router for client, name & IP address of DNS server
• encapsulation at DHCP server, frame forwarded (switch learning) through LAN, demultiplexing at client
• DHCP client receives DHCP ACK reply
A day in the life… ARP (before DNS, before HTTP)
• before sending HTTP request, need IP address of www.google.com: DNS
• DNS query created, encapsulated in UDP, encapsulated in IP, encapsulated in Eth. To send frame to router, need MAC address of router interface: ARP
• ARP query broadcast, received by router, which replies with ARP reply giving MAC address of router interface
• client now knows MAC address of first-hop router, so can now send frame containing DNS query
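A hedged sketch (my own, again assuming scapy and a hypothetical gateway address) of the ARP exchange described above: broadcast a who-has query for the first-hop router and read the MAC address out of the reply:

```python
from scapy.all import Ether, ARP, srp

GATEWAY_IP = "68.80.2.1"        # hypothetical first-hop router address

# who-has broadcast: "who has GATEWAY_IP? tell me"
query = Ether(dst="ff:ff:ff:ff:ff:ff") / ARP(op="who-has", pdst=GATEWAY_IP)

answered, _ = srp(query, timeout=2, verbose=False)
for _, reply in answered:
    router_mac = reply[ARP].hwsrc      # MAC address of the router interface
    print(f"first-hop router {GATEWAY_IP} is at {router_mac}")
    # subsequent frames (e.g., the one carrying the DNS query) use router_mac
    # as their destination MAC address
```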
A day in the life… using DNS
• IP datagram containing DNS query forwarded via LAN switch from client to 1st-hop router
• IP datagram forwarded from campus network into Comcast network (68.80.0.0/13), routed (tables created by RIP, OSPF, IS-IS and/or BGP routing protocols) to DNS server
• demuxed to DNS server
• DNS server replies to client with IP address of www.google.com
[Figure: DNS/UDP/IP/Eth/Phy stacks at the client and at the Comcast DNS server]
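To make the "DNS query encapsulated in UDP" step concrete, a minimal sketch (mine, not from the slides) that hand-builds an A-record query for www.google.com and sends it over UDP port 53; the resolver address is a hypothetical placeholder for whatever DHCP returned:

```python
import socket
import struct

RESOLVER = "75.75.75.75"        # hypothetical DNS server address learned via DHCP

def build_dns_query(hostname: str, query_id: int = 0x1234) -> bytes:
    # header: id, flags (RD=1), 1 question, 0 answer/authority/additional records
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # question: length-prefixed labels, then QTYPE=A (1), QCLASS=IN (1)
    qname = b"".join(bytes([len(p)]) + p.encode() for p in hostname.split(".")) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.settimeout(2)
sock.sendto(build_dns_query("www.google.com"), (RESOLVER, 53))
response, _ = sock.recvfrom(512)   # reply carries the IP address(es) of www.google.com
# (a real client would now parse the answer section; omitted here for brevity)
```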
A day in the life… TCP connection carrying HTTP
• to send the HTTP request, the client first opens a TCP connection to the web server: SYN sent, SYNACK returned (three-way handshake)
[Figure: HTTP/TCP/IP/Eth/Phy stack at the client; SYN and SYNACK segments exchanged with the web server]
A day in the life… HTTP request/reply
• HTTP request sent into the TCP connection; HTTP reply containing the page returned
• web page finally (!!!) displayed
[Figure: HTTP/TCP/IP/Eth/Phy stack at the client exchanging HTTP request and reply]
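A minimal sketch (mine) of the last two steps together: the three-way handshake happens inside connect(), and the HTTP request/reply then travel over the resulting TCP connection. Plain HTTP on port 80 and a hypothetical server address are assumed for simplicity:

```python
import socket

server_ip = "142.250.80.36"      # hypothetical address returned by the DNS step

# create_connection() triggers the SYN / SYNACK / ACK three-way handshake
sock = socket.create_connection((server_ip, 80), timeout=5)

request = (
    "GET / HTTP/1.1\r\n"
    "Host: www.google.com\r\n"
    "Connection: close\r\n"
    "\r\n"
)
sock.sendall(request.encode())

reply = b""
while chunk := sock.recv(4096):   # read until the server closes the connection
    reply += chunk
sock.close()

print(reply.split(b"\r\n")[0])    # status line of the HTTP reply carrying the page
```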
Chapter 5: let's take a breath
• journey down protocol stack complete (except PHY)
• solid understanding of networking principles, practice
• ….. could stop here …. but lots of interesting topics!
  - wireless
  - multimedia
  - security
  - network management
DATACENTER NETWORK DESIGNS
Typical DC network components
rich interconnection among switches, racks:
• increased throughput between racks (multiple routing paths possible)
• increased reliability via redundancy
[Figure: Tier-1 (core) switches, Tier-2 (aggregation) switches, TOR switches, server racks 1-8]
DC network design questions
Core and aggregation switches are much faster than ToR switches.
• How much faster do core and aggregation switches need to be than ToR switches?
• How many ports do core/aggregation switches need to support for a given number of ToR switch ports?
• How many cables need to be run in total for an N-machine datacenter?
• What bisection bandwidth can be achieved?
Q: Why can't we just build a single BIG switch to interconnect all machines?
DC network topologies
Fat-tree (used ambiguously to mean Clos as well as a
simple hierarchical design)
Clos family
Hypercube
Torus
Why are simpler hierarchies not good enough?
• High cost
• High oversubscription (ratio of worst-case aggregate bandwidth among end-hosts to bisection bandwidth); see the worked example below
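A worked example with hypothetical numbers (not from the slides): a rack of 40 servers with 1 Gbps NICs whose ToR switch has a single 10 Gbps uplink is roughly

```latex
\[
\text{oversubscription} \;=\; \frac{40 \times 1\ \text{Gbps}}{10\ \text{Gbps}} \;=\; 4\!:\!1,
\]
```

so under traffic that must leave the rack, each server can count on only about 250 Mbps of the bandwidth its NIC nominally provides.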
Fat tree topology
Core branches, i.e., those near the top of the hierarchy, are fatter, or higher in capacity
Example: uniform Clos topology [UCSD]
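The uniform (UCSD-style) fat-tree referenced above is usually parameterized by the switch port count k; the sketch below follows the standard k-ary fat-tree construction, so treat the formulas as an assumption about that design rather than something stated on the slide:

```python
def fat_tree_parameters(k: int) -> dict:
    """Standard k-ary fat-tree built entirely from identical k-port switches."""
    assert k % 2 == 0, "k must be even"
    return {
        "pods": k,
        "edge_switches": k * (k // 2),         # k/2 per pod
        "aggregation_switches": k * (k // 2),  # k/2 per pod
        "core_switches": (k // 2) ** 2,
        "hosts": (k ** 3) // 4,                # each edge switch serves k/2 hosts
    }

# example with 48-port commodity switches:
# fat_tree_parameters(48) -> 48 pods, 1152 edge + 1152 aggregation + 576 core
#                            switches, and 27648 hosts
```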
VL2: Clos case study (Microsoft)
VL2: Addressing and routing
Valiant load balancing
Randomization for efficient, load-balanced routing [VLB]
[VLB] Valiant Load-Balancing: Building Networks That Can Support All Traffic Matrices
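A toy sketch (mine, and a simplification of what VL2 actually does) of the Valiant idea: bounce each flow off a randomly chosen intermediate switch so that any traffic matrix gets spread over all core links; the switch names are hypothetical:

```python
import random

CORE_SWITCHES = ["core-1", "core-2", "core-3", "core-4"]   # hypothetical names

def vlb_route(src_tor: str, dst_tor: str) -> list[str]:
    """Two-stage route: source ToR -> random intermediate core switch -> destination ToR."""
    intermediate = random.choice(CORE_SWITCHES)   # fresh random choice per flow
    return [src_tor, intermediate, dst_tor]

# example: successive flows from tor-3 to tor-7 may each take a different core
# switch, which balances load regardless of the offered traffic matrix
# vlb_route("tor-3", "tor-7") -> ["tor-3", "core-2", "tor-7"]
```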
VL2: Directory service for AA (application-specific address) <-> LA (location-specific address) mappings
BCube: relies on more server ports
Other topologies from “supercomputing”
Hypercube
Optical in data centers
• Optical switching (100s of Gbps) is faster than traditional packet switches (40-160 Gbps)
• Optical is cheaper per 10 Gbps port
• But optical circuit establishment delay is high: MEMS (micro-electro-mechanical systems) reconfiguration time is ~10 ms
• Optically enhanced data center designs migrate heavy flows ("elephants") onto optical paths
Energy usage numbers
• Typical US household: ~1000 kWh per month (about 30 kWh per day)
• Typical desktop computer: 80-250 W
• Typical 1U rack-mounted server: ~300 W (can be a few thousand W for high-end servers)
• Switches and networking equipment?
Switch power consumption
Generally a small fraction (5-25%) of server power consumption in typical topologies
Techniques to reduce energy
• Dynamic voltage and frequency scaling (DVFS): reduces dynamic power (proportional to CV^2 f) by reducing the voltage V (and frequency f); see the note after this list
  - Generally not power-proportional, i.e., power does not go down proportionally with decreased usage
• Shutting down ("consolidating") servers and parts of the network: widely studied but cautiously used, if at all, in practice
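As a note on why DVFS is attractive (standard CMOS reasoning, not from the slides), dynamic power is roughly

```latex
\[
P_{\text{dyn}} \;\approx\; \alpha\, C\, V^{2} f ,
\]
```

and since the sustainable clock frequency f scales roughly with V, lowering both together cuts dynamic power roughly with the cube of the slowdown. The catch, as noted above, is that idle and static power do not shrink the same way, which is why DVFS alone is not power-proportional.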