Data Center TCP (DCTCP)
Stanford University · Microsoft Research
Partition/Aggregate Application Structure

• Time is money → strict deadlines (SLAs)
• A request is partitioned down an aggregation tree and the answers are aggregated back up:
– TLA (Deadline = 250ms) → MLAs (Deadline = 50ms) → Worker Nodes (Deadline = 10ms)
• Example: a search for "Picasso" fans out to the workers, which return quotes:
– "Everything you can imagine is real."
– "I'd like to live as a poor man with lots of money."
– "Art is a lie that makes us realize the truth."
– "Computers are useless. They can only give you answers."
– "Bad artists copy. Good artists steal."
– "Inspiration does exist, but it must find you working."
– "It is your work in life that is the ultimate seduction."
– "The chief enemy of creativity is good sense."
Workloads

• Partition/Aggregate (Query) – Delay-sensitive
Impairments

• Incast
• Queue Buildup
• Buffer Pressure
Incast

• Synchronized fan-in congestion: caused by Partition/Aggregate.
[Diagram: Worker 1, Worker 2, and Worker 3 respond simultaneously to one Aggregator]
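The fan-in problem above can be illustrated with a hedged back-of-envelope sketch (all numbers below are illustrative assumptions, not figures from the slides): when many workers answer a query at the same instant, their combined burst easily exceeds the shared buffer of the aggregator's switch port.

```python
# Toy model of incast: N synchronized workers each burst a short
# response at one switch port whose buffer is much smaller than the
# combined burst, so the excess is dropped (triggering TCP timeouts).

def incast_drops(workers: int, burst_pkts: int, buffer_pkts: int) -> int:
    """Packets dropped if all bursts arrive before the port can drain."""
    arriving = workers * burst_pkts
    return max(0, arriving - buffer_pkts)

# 40 workers x 10-packet responses vs. a 128-packet shared buffer
# (hypothetical sizes):
print(incast_drops(40, 10, 128))  # -> 272 packets dropped
```

This ignores the drain rate during the burst, but the qualitative point matches the slide: the loss is caused by synchronization, not by aggregate load.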
Incast in Bing

• Measurements in Bing cluster:
– For 90% of packets: RTT < 1ms
– For 10% of packets: 1ms < RTT < 15ms
[Figure: MLA query completion time (ms) under incast]
Data Center Transport Requirements

3. High Throughput
– Continuous data updates, large file transfers
The DCTCP Algorithm

TCP Buffer Sizing
[Figure: buffer of size B; throughput held at 100%]
Buffer Sizing Impacts Latency

• Widespread conception: increase link speed to reduce latency
– E.g., upgrade from 1Gbps to 10Gbps network.
• However, increasing link speed doesn't lower queuing delay, because:
– Switch buffers also need to be 10 times larger and 10 times faster.
[Figure: buffer size 10G × RTT; queue occupancy over time at 1G vs. 10G]
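The point above can be checked with a small sketch (link speeds and RTT are illustrative assumptions): if the buffer is sized at C × RTT, a full buffer takes one RTT to drain no matter how fast the link is, so the worst-case queuing delay is unchanged by the upgrade.

```python
# Why a faster link alone doesn't cut queuing delay: with a buffer of
# C x RTT bits, drain time = buffer / C = RTT, independent of C.

def full_buffer_delay_ms(link_gbps: float, rtt_ms: float) -> float:
    """Worst-case queuing delay (ms) when a C x RTT buffer is full."""
    buffer_bits = link_gbps * 1e9 * (rtt_ms / 1e3)       # C x RTT
    return round(buffer_bits / (link_gbps * 1e9) * 1e3, 6)  # drain time

print(full_buffer_delay_ms(1, 0.3))   # 1 Gbps link  -> 0.3 ms
print(full_buffer_delay_ms(10, 0.3))  # 10 Gbps link -> still 0.3 ms
```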
Reducing Buffer Requirements

• Appenzeller rule of thumb (SIGCOMM '04):
– Large # of flows: C × RTT / √N is enough.
[Figure: cwnd and buffer size; throughput at 100%]
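The Appenzeller rule can be made concrete with a hedged sketch (link speed and RTT below are assumptions for illustration): with N desynchronized long-lived flows, the buffer needed for full throughput shrinks by a factor of √N relative to the full bandwidth-delay product.

```python
import math

# Buffer sizing per the Appenzeller et al. (SIGCOMM '04) rule of thumb:
# with N long-lived flows, C x RTT / sqrt(N) of buffering suffices.

def buffer_kbytes(link_gbps: float, rtt_ms: float, n_flows: int) -> float:
    """Required buffer in KBytes under the sqrt(N) rule."""
    bdp_bits = link_gbps * 1e9 * (rtt_ms / 1e3)           # C x RTT
    return round(bdp_bits / math.sqrt(n_flows) / 8e3, 1)  # bits -> KB

# Hypothetical 10 Gbps link, 300 us RTT:
print(buffer_kbytes(10, 0.3, 1))    # -> 375.0 (single flow: full BDP)
print(buffer_kbytes(10, 0.3, 100))  # -> 37.5  (100 flows: 10x smaller)
```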
1. React in proportion to the extent of congestion, not its presence.
– Reduce window size based on fraction of marked packets.

Sender side:
– Maintain running average of the fraction of packets marked (α).
– Each RTT: F = (# of marked ACKs) / (Total # of ACKs) ⇒ α ← (1 − g)α + gF
– Adaptive window decrease: W ← (1 − α/2)W
– Note: the window is cut by a factor between 1 (α → 0, mild congestion) and 2 (α = 1, halving as in TCP).
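The sender-side rules above can be sketched in Python. This is a minimal illustration, not the real implementation: the gain g = 1/16 and the starting window are assumptions, switch-side ECN marking is stubbed out as an input, and real DCTCP applies the cut once per window of data rather than unconditionally per RTT.

```python
# Minimal sketch of the DCTCP sender update from the slides:
# per RTT, F = (# marked ACKs) / (total ACKs); alpha is an EWMA of F
# with gain g; on marks, the window is cut by the factor (1 - alpha/2).

class DctcpSender:
    def __init__(self, cwnd: float = 10.0, g: float = 1.0 / 16):
        self.cwnd = cwnd   # congestion window (packets); assumed start
        self.alpha = 0.0   # running estimate of marked fraction
        self.g = g         # EWMA gain (assumed value)

    def on_rtt(self, marked_acks: int, total_acks: int) -> None:
        """Apply one RTT's worth of ECN feedback."""
        f = marked_acks / total_acks
        self.alpha = (1 - self.g) * self.alpha + self.g * f  # EWMA of F
        if marked_acks:  # react in proportion to congestion extent
            self.cwnd *= 1 - self.alpha / 2

s = DctcpSender()
s.on_rtt(marked_acks=10, total_acks=10)  # a fully marked RTT
print(s.alpha, s.cwnd)  # -> 0.0625 9.6875
```

Note how a single marked RTT cuts the window only slightly (α is still small), which is exactly the "extent, not presence" idea: sustained marking drives α toward 1 and the cut toward TCP's halving.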
DCTCP vs TCP
[Figure: queue length (KBytes) over time, DCTCP vs. TCP]
Why it Works

3. High Throughput
– ECN averaging → smooth rate adjustments, low variance.
Evaluation

• Implemented in Windows stack.
• Real hardware, 1Gbps and 10Gbps experiments
– 90 server testbed
– Broadcom Triumph: 48 1G ports, 4MB shared memory
– Cisco Cat4948: 48 1G ports, 16MB shared memory
– Broadcom Scorpion: 24 10G ports, 4MB shared memory
• Numerous micro-benchmarks
– Throughput and Queue Length
– Fairness and Convergence
– Multi-hop
– Incast
– Queue Buildup
– Static vs Dynamic Buffer Mgmt
– Buffer Pressure
• Metric:
– Flow completion time for queries and background flows.
Baseline
[Figures: completion times for background flows and query flows]