Data Center TCP (DCTCP): Stanford University, Microsoft Research

Data Center TCP (DCTCP)
Mohammad Alizadeh, Albert Greenberg, David A. Maltz, Jitendra Padhye,
Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, Murari Sridharan
Stanford University, Microsoft Research

Case Study: Microsoft Bing
• Measurements from 6000 server production cluster
• Instrumentation passively collects logs
‒ Application-level
‒ Socket-level
‒ Selected packet-level
• More than 150TB of compressed data over a month
Partition/Aggregate Application Structure
• Time is money
→ Strict deadlines (SLAs)
• Missed deadline
→ Lower quality result
[Figure: a "Picasso" query enters at the TLA (deadline = 250ms) and fans out through MLAs (deadline = 50ms) to worker nodes (deadline = 10ms); each level returns a ranked list of results (1. Art is a lie…, 2. The chief…, 3. …) that is aggregated on the way back up.]
Workloads
• Partition/Aggregate (Query): delay-sensitive
• Short messages [50KB-1MB] (Coordination, Control state): delay-sensitive
• Large flows [1MB-50MB] (Data update): throughput-sensitive
Impairments
• Incast
• Queue Buildup
• Buffer Pressure
Incast
• Synchronized fan-in congestion:
→ Caused by Partition/Aggregate.
[Figure: Workers 1 to 4 respond to a single Aggregator at the same time; Worker 4 hits a TCP timeout, with RTOmin = 300 ms.]
Incast in Bing
• Requests are jittered over a 10ms window.
• Jittering trades off median against high percentiles.
• Jittering switched off around 8:30 am.
[Figure: MLA Query Completion Time (ms).]
Queue Buildup
• Big flows build up queues.
→ Increased latency for short flows.
[Figure: Sender 1 and Sender 2 share a switch queue toward one Receiver.]
• Measurements in Bing cluster
→ For 90% packets: RTT < 1ms
→ For 10% packets: 1ms < RTT < 15ms
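The RTT tail in these measurements is what a standing queue looks like from the end host. As a rough sketch, the extra delay behind a backlog is the backlog divided by the link rate. The backlog sizes below are illustrative assumptions, not figures from the slides, and `queuing_delay_ms` is a hypothetical helper name.

```python
def queuing_delay_ms(backlog_bytes: float, link_bps: float) -> float:
    """Time to drain `backlog_bytes` at `link_bps`, in milliseconds."""
    return backlog_bytes * 8 / link_bps * 1e3


LINK_BPS = 1e9  # 1Gbps port, the link speed used elsewhere in the talk

# Hypothetical backlogs, chosen only to show the scale of the effect.
for backlog_kb in (30, 500, 1500):
    delay = queuing_delay_ms(backlog_kb * 1000, LINK_BPS)
    print(f"{backlog_kb:>5} KB queued: {delay:.2f} ms extra delay")
```

Under these assumed numbers, a backlog of about 1.5 MB on a 1Gbps port adds roughly 12 ms, the same order as the 1ms to 15ms tail reported for 10% of packets.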
Data Center Transport Requirements
1. High Burst Tolerance
– Incast due to Partition/Aggregate is common.
2. Low Latency
– Short flows, queries
3. High Throughput
– Continuous data updates, large file transfers
The challenge is to achieve these three together.
The DCTCP Algorithm
TCP Buffer Sizing
• Bandwidth-delay product rule of thumb:
– A single flow needs C×RTT buffers for 100% Throughput.
[Figure: buffer occupancy and throughput for two cases, B ≥ C×RTT and B < C×RTT, with the 100% throughput level marked.]
Buffer Sizing Impacts Latency
• Widespread conception: increase link speed to reduce latency
– E.g. upgrade from 1Gbps to 10Gbps network.
• However, increasing link speed doesn't lower queuing delay, because:
– Switch buffers also need to be 10 times larger and 10 times faster.
• To reduce latency, we MUST reduce the buffering requirements of the transport protocol.
[Figure: buffer size over time for 1G and 10G links, with levels 1G x RTT and 10G x RTT.]
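The argument on this slide can be checked with a few lines of arithmetic. The sketch below sizes the buffer at C×RTT for each link speed and then computes how long a full buffer takes to drain. The 100 microsecond RTT is an assumed value for illustration, and the helper names are hypothetical.

```python
def bdp_bytes(link_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product C x RTT, in bytes."""
    return link_bps * rtt_s / 8


def drain_time_s(buffer_bytes: float, link_bps: float) -> float:
    """Worst-case queuing delay: time to drain a full buffer."""
    return buffer_bytes * 8 / link_bps


RTT_S = 100e-6  # assumed RTT (illustrative, not from the slides)

for name, link_bps in (("1G", 1e9), ("10G", 10e9)):
    buf = bdp_bytes(link_bps, RTT_S)
    delay_us = drain_time_s(buf, link_bps) * 1e6
    print(f"{name}: buffer = {buf / 1000:.1f} KB, full-buffer delay = {delay_us:.0f} us")
```

The buffer grows tenfold with the link speed, but the time to drain it stays at one RTT, which is why a faster link alone does not shrink the queuing delay.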
Reducing Buffer Requirements
• Appenzeller rule of thumb (SIGCOMM '04):
– Large # of flows: C×RTT/√N is enough.
[Figure: Cwnd, buffer size, and throughput (100%).]
• Can't rely on stat-mux benefit in the DC.
– Measurements show typically 1-2 big flows at each server, at most 4.
• Real Rule of Thumb:
– Low Variance in Sending Rates → Small Buffers Suffice.
• Both QCN & DCTCP reduce variance in sending rates.
– QCN: Explicit multi-bit feedback.
– DCTCP: Implicit multi-bit feedback from ECN marks.
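To see why statistical multiplexing helps so little here, the sketch below assumes the usual statement of the Appenzeller rule, a buffer of C×RTT/√N for N long-lived flows, and evaluates it for the flow counts mentioned above. The RTT value and the function name are illustrative assumptions.

```python
import math


def appenzeller_buffer_bytes(link_bps: float, rtt_s: float, n_flows: int) -> float:
    """Buffer of C x RTT / sqrt(N) for N long-lived flows."""
    return link_bps * rtt_s / 8 / math.sqrt(n_flows)


LINK_BPS = 1e9   # 1Gbps port
RTT_S = 100e-6   # assumed RTT, illustrative only

full = appenzeller_buffer_bytes(LINK_BPS, RTT_S, 1)
for n in (1, 2, 4, 10_000):
    share = appenzeller_buffer_bytes(LINK_BPS, RTT_S, n) / full
    print(f"N = {n:>6}: buffer is {share:.0%} of C x RTT")
```

With at most 4 big flows the rule still asks for half of C×RTT; the large savings only appear at flow counts far beyond what a data center server sees.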
DCTCP: Two Main Ideas
1. React in proportion to the extent of congestion, not its presence.
→ Reduce window size based on fraction of marked packets.

ECN Marks              TCP                  DCTCP
1 0 1 1 1 1 0 1 1 1    Cut window by 50%    Cut window by 40%
0 0 0 0 0 0 0 0 0 1    Cut window by 50%    Cut window by 5%

2. Mark based on instantaneous queue length.
→ Fast feedback to better deal with bursts.
→ Simplifies hardware.
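The two rows of the ECN-marks comparison can be reproduced with a short sketch. It is a simplification: it uses the raw fraction of marks in a single window, where the full algorithm uses a running average, and the function names are hypothetical.

```python
def tcp_cut(marks: list[int]) -> float:
    """Standard TCP with ECN: halve the window if any packet was marked."""
    return 0.5 if any(marks) else 0.0


def dctcp_cut(marks: list[int]) -> float:
    """DCTCP-style cut: half the fraction of marked packets."""
    return sum(marks) / len(marks) / 2


for marks in ([1, 0, 1, 1, 1, 1, 0, 1, 1, 1], [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]):
    print(marks, f"TCP cut {tcp_cut(marks):.0%}, DCTCP cut {dctcp_cut(marks):.0%}")
```

Eight marks out of ten give a 40% cut and one mark out of ten gives a 5% cut, while TCP halves the window in both cases.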
DCTCP: Algorithm
Switch side:
– Mark packets when Queue Length > K.
[Figure: switch buffer of size B with threshold K; packets are marked above K and not marked below it.]
Sender side:
– Maintain running average of fraction of packets marked (α).
  Each RTT: F = (# of marked ACKs) / (Total # of ACKs) ⇒ α ← (1 − g)α + gF
– Adaptive window decrease: W ← (1 − α/2)W
– Note: the window is divided by a factor between 1 (α = 0) and 2 (α = 1).
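The switch-side and sender-side rules can be sketched as a toy model. This is not the Windows implementation described in the talk: the class and function names are hypothetical, g = 1/16 is an assumed weight, window growth is left out, and cutting the window only in RTTs that saw at least one mark is an assumption of this sketch.

```python
class DctcpSender:
    """Toy model of the sender-side rules: update alpha once per RTT, then cut."""

    def __init__(self, cwnd: float, g: float = 1 / 16):
        self.cwnd = cwnd    # congestion window W, in packets
        self.g = g          # averaging weight g (1/16 is an assumed value)
        self.alpha = 0.0    # running estimate of the marked fraction

    def on_rtt(self, marked_acks: int, total_acks: int) -> None:
        f = marked_acks / total_acks                          # F for this RTT
        self.alpha = (1 - self.g) * self.alpha + self.g * f   # alpha <- (1-g)alpha + gF
        if marked_acks > 0:                                   # assumed: cut only when marks arrived
            self.cwnd *= 1 - self.alpha / 2                   # W <- (1 - alpha/2)W


def switch_marks(queue_bytes: int, k_bytes: int) -> bool:
    """Switch-side rule: mark when the instantaneous queue exceeds K."""
    return queue_bytes > k_bytes


# Usage: two fully marked RTTs between two unmarked ones.
sender = DctcpSender(cwnd=100.0)
for marked in (0, 10, 10, 0):
    sender.on_rtt(marked_acks=marked, total_acks=10)
    print(f"alpha = {sender.alpha:.3f}, cwnd = {sender.cwnd:.1f}")
```

Because α moves slowly for small g, a brief burst of marks produces only a gentle cut, while sustained marking drives α toward 1 and the cut toward TCP's 50%.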
DCTCP vs TCP
Setup: Win 7, Broadcom 1Gbps Switch
Scenario: 2 long-lived flows, K = 30KB
[Figure: DCTCP vs TCP plot; axis in Kbytes.]
Why it Works
1. High Burst Tolerance
– Large buffer headroom → bursts fit.
– Aggressive marking → sources react before packets are dropped.
2. Low Latency
– Small buffer occupancies → low queuing delay.
3. High Throughput
– ECN averaging → smooth rate adjustments, low variance.
Evaluation
• Implemented in Windows stack.
• Real hardware, 1Gbps and 10Gbps experiments
– 90 server testbed
– Broadcom Triumph: 48 1G ports, 4MB shared memory
– Cisco Cat4948: 48 1G ports, 16MB shared memory
– Broadcom Scorpion: 24 10G ports, 4MB shared memory
• Numerous micro-benchmarks
– Throughput and Queue Length
– Multi-hop
– Queue Buildup
– Buffer Pressure
– Fairness and Convergence
– Incast
– Static vs Dynamic Buffer Mgmt
• Cluster traffic benchmark
Cluster Traffic Benchmark
• Emulate traffic within 1 Rack of Bing cluster
– 45 1G servers, 10G server for external traffic
• Generate query and background traffic
– Flow sizes and arrival times follow distributions seen in Bing
• Metric:
– Flow completion time for queries and background flows.
We use RTOmin = 10ms for both TCP & DCTCP.
Baseline
[Figure: two panels, Background Flows and Query Flows.]
→ Low latency for short flows.
→ High throughput for long flows.
→ High burst tolerance for query flows.
Scaled Background & Query
10x Background, 10x Query