
3 Transport Layer


Chapter 3 outline
3.1 transport-layer services
3.2 multiplexing and demultiplexing
3.3 connectionless transport: UDP
3.4 principles of reliable data transfer
3.5 connection-oriented transport: TCP
  - segment structure
  - reliable data transfer
  - flow control
  - connection management
3.6 principles of congestion control
3.7 TCP congestion control

Kurose & Ross 3-1


TCP/IP Protocol Suite

- The TCP/IP protocol stack is a layered architecture. TCP and IP are the two most important protocols of this stack.
- It was originally developed by DARPA (the Defense Advanced Research Projects Agency) for an experimental packet-switched network.
- It was later included in the Berkeley Software Distribution of UNIX.
- It maps closely to the OSI layers and supports all standard physical and data link protocols.
- It also includes specifications for such common applications as e-mail, remote login, terminal emulation, and file transfer.
TCP/IP Protocol Suite (Transport Layer)
[Figure: protocol stack - distributed applications (HTTP, SMTP, DNS, RTP) run over TCP's reliable stream service or UDP's user datagram service; both run over IP's best-effort connectionless packet transfer (with ICMP, ARP), which runs over network interfaces 1, 2, 3 and a variety of network technologies]


Transport services and protocols
* provide logical communication between app processes running on different hosts
* transport protocols run in end systems
  - send side: breaks app messages into segments, passes to network layer
  - rcv side: reassembles segments into messages, passes to app layer
* more than one transport protocol available to apps
  - Internet: TCP and UDP
[Figure: two end hosts with full protocol stacks (application, transport, network, data link, physical); only the network layer and below run in the routers between them]

Kurose & Ross 3-4


Internet transport-layer protocols
* reliable, in-order delivery (TCP)
  - congestion control
  - flow control
  - connection setup
* unreliable, unordered delivery: UDP
  - no-frills extension of “best-effort” IP
* services not available:
  - delay guarantees
  - bandwidth guarantees

Kurose & Ross 3-5


Transport vs. network layer
- network layer: logical communication between hosts
- transport layer: logical communication between processes
  - relies on, enhances, network layer services
- For an application to run, the corresponding application processes at the end-nodes talk to each other through their respective transport layers

Kurose & Ross 3-6


TCP/IP Encapsulation
Example application: HTTP
- TCP header contains source & destination port numbers for identifying the application:
    [TCP header | HTTP Request]
- IP header contains source and destination IP addresses, and the transport protocol type (TCP or UDP):
    [IP header | TCP header | HTTP Request]
- Ethernet header contains source & destination MAC addresses:
    [Ethernet header | IP header | TCP header | HTTP Request | FCS]
Chapter 3 outline
3.1 transport-layer services
3.2 multiplexing and demultiplexing
3.3 connectionless transport: UDP
3.4 principles of reliable data transfer
3.5 connection-oriented transport: TCP
  - segment structure
  - reliable data transfer
  - flow control
  - connection management
3.6 principles of congestion control
3.7 TCP congestion control

Kurose & Ross 3-8


Transport Layer Multiplexing/Demultiplexing
- Multiplexing at sender: gathering data from multiple sockets, enveloping data with header (later used for demultiplexing)
- Demultiplexing at receiver: delivering received segments to correct socket
[Figure: three hosts, each with an application/transport/network/link/physical stack; processes P1-P4 attach to sockets (one socket per process) above the transport layer]
Demultiplexing at the Transport Layer
- Host receives IP datagrams
  - Each datagram has source IP address, destination IP address
  - Each datagram carries one transport-layer segment
  - Each segment has source, destination port number
- Host uses IP addresses & port numbers to direct segment to appropriate socket

TCP/UDP segment format (32 bits wide): Source Port # | Dest Port #; other header fields; Application Data (Message)

Port number ranges:
- 0-255: well-known ports
- 256-1023: less well-known ports
- 1024-65535: ephemeral client ports
Connectionless Demultiplexing (UDP)
- Create sockets with port numbers:
    DatagramSocket mySocket1 = new DatagramSocket(12534);
    DatagramSocket mySocket2 = new DatagramSocket(12535);
- UDP socket identified by two-tuple: (dest IP address, dest port number)
- When host receives UDP segment:
  - checks destination port number in segment
  - directs UDP segment to socket with that port number
- IP datagrams with different source IP addresses and/or source port numbers are directed to the same socket if the destination port number is the same in the destination host

Note that TCP does this differently, using a 4-tuple (S-IP, SP, D-IP, DP) to identify a socket!
Connectionless Demultiplexing (UDP)
Example: Server creates socket at 6428 to provide UDP service to some application:
    DatagramSocket serverSocket = new DatagramSocket(6428);
[Figure: Client IP A sends SP: 9157, DP: 6428 and Client IP B sends SP: 5775, DP: 6428; Server IP C replies with SP: 6428, DP: 9157 and SP: 6428, DP: 5775]
• Same socket (6428) at server for both clients in this example (a runnable sketch follows below)
• DP specifies the process to which data should be delivered at the receiver
• SP specifies the process from which data is coming, for the specified source IP address; acts like a return address for replies/responses if required to be sent back
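As a concrete companion to the DatagramSocket calls above, here is a minimal sketch of a UDP server bound to port 6428; the echo-style reply is illustrative only and not part of the slide:

import java.net.DatagramPacket;
import java.net.DatagramSocket;

public class UdpServer {
    public static void main(String[] args) throws Exception {
        DatagramSocket serverSocket = new DatagramSocket(6428);  // every datagram with DP=6428 is delivered here
        byte[] buf = new byte[1024];
        while (true) {
            DatagramPacket request = new DatagramPacket(buf, buf.length);
            serverSocket.receive(request);                       // demultiplexed by destination port only
            // the source IP/port of the request is the "return address" used for the reply (SP/DP swapped)
            DatagramPacket reply = new DatagramPacket(request.getData(), request.getLength(),
                                                      request.getAddress(), request.getPort());
            serverSocket.send(reply);
        }
    }
}

Segments from different clients (different source IPs and source ports) all arrive on this one socket, exactly as in the figure.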
Connection-oriented Demultiplexing (TCP)
- TCP socket identified by 4-tuple:
  - source IP address
  - source port number
  - destination IP address
  - destination port number
- Receiver host uses all four values to direct segment to appropriate socket; socket is uniquely identified by 4-tuple (S-IP, SP, D-IP, DP)
- Server host may support many simultaneous TCP sockets:
  - each socket identified by its own 4-tuple
- Web servers have different sockets for each connecting client
  - non-persistent HTTP will have different socket for each request
Connection-oriented Demultiplexing (TCP)
[Figure: Client IP A (SP: 9157, DP: 80, S-IP: A, D-IP: C) and Client IP B (SP: 9157, DP: 80 and SP: 5775, DP: 80, S-IP: B, D-IP: C) connect to Server IP C, which has a separate process per connection socket]
• This is a Web Server example, as the segments are being sent to port 80 of the server, which corresponds to the HTTP service
• Note that in this case, the server is creating a separate process for each of the sockets. This would be inefficient (see next slide for a more efficient example with “threading”)
Connection-oriented Demultiplexing (TCP)
Threaded Web Server
[Figure: same scenario as the previous slide, but a single server process owns all the connection sockets]
• This is also a Web Server example, as the segments are being sent to port 80 of the server, which corresponds to the HTTP service
• Note that in this case, the server creates one process for all the sockets. A new thread (kind of like a sub-process) is created for each socket, as in the sketch below
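A minimal sketch of the threaded pattern described above, using the standard java.net API (port 80 and the fixed response are placeholders for illustration): accept() returns a new connection socket per client 4-tuple, and each socket is handled in its own thread of the single server process.

import java.net.ServerSocket;
import java.net.Socket;

public class ThreadedServer {
    public static void main(String[] args) throws Exception {
        ServerSocket welcomeSocket = new ServerSocket(80);      // one welcoming socket for all clients
        while (true) {
            Socket connectionSocket = welcomeSocket.accept();   // new socket per (S-IP, SP, D-IP, DP)
            new Thread(() -> handle(connectionSocket)).start(); // one thread per connection socket
        }
    }

    static void handle(Socket s) {
        try {
            s.getOutputStream().write("HTTP/1.0 200 OK\r\n\r\n".getBytes());
            s.close();
        } catch (Exception e) {
            // per-connection errors are ignored in this sketch
        }
    }
}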
Chapter 3 outline
3.1 transport-layer services
3.2 multiplexing and demultiplexing
3.3 connectionless transport: UDP
3.4 principles of reliable data transfer
3.5 connection-oriented transport: TCP
  - segment structure
  - reliable data transfer
  - flow control
  - connection management
3.6 principles of congestion control
3.7 TCP congestion control

Kurose & Ross 3-16


UDP: User Datagram Protocol [RFC 768]
- “no frills,” “bare bones” Internet transport protocol
- “best effort” service; UDP segments may be:
  - lost
  - delivered out-of-order to app
- connectionless:
  - no handshaking between UDP sender, receiver
  - each UDP segment handled independently of others
- UDP use:
  - streaming multimedia apps (loss tolerant, rate sensitive)
  - DNS
  - SNMP
- reliable transfer over UDP:
  - add reliability at application layer
  - application-specific error recovery!

Kurose & Ross 3-17


UDP: User Datagram Protocol
- Commonly used for streaming multimedia applications, which tend to be loss tolerant but rate sensitive
- UDP also used for DNS and SNMP
- For reliable transfer over UDP one must add reliability at the level of the application layer, e.g. application-specific error recovery!
- UDP Checksum: standard Internet checksum added by the sender. Used by the receiver to check for bit errors. (See next slide)

UDP segment format (32 bits wide): Source Port # | Dest Port #; Length | Checksum; Application Data (Message). Length gives the length, in bytes, of the UDP segment, including the header.
UDP Checksum Calculation

UDP pseudoheader (used in checksum calculation but never actually transmitted, nor is it included in the “Length”):
    Source IP Address | Destination IP Address | 00000000 | Protocol = 17 | UDP Length

- UDP checksum covers the pseudoheader followed by the UDP datagram
- IP addresses included to detect misdelivery
- Receiver recalculates the checksum and silently discards the datagram if errors are detected (i.e. no error message generated)
- Using UDP checksums is optional, but hosts are required to have checksums enabled

Note that the IP address information will come from another layer (the Network Layer). Strictly speaking, this goes against the philosophy of keeping the layers separate from each other.
Internet Checksum
- Several Internet protocols (e.g. IP, TCP, UDP) use check bits to detect errors in the IP header (or in the header and data for TCP/UDP)
- A checksum is calculated for header contents and included in a special field.
- Checksum recalculated at every router, so the algorithm is selected for ease of implementation in software
- Let the header consist of L 16-bit words, b0, b1, b2, ..., b(L-1)
- The algorithm appends a 16-bit checksum bL

Kurose & Ross 3-20


Internet Checksum Calculation
The checksum bL is calculated as follows:
- Treating each 16-bit word as an integer, find
    x = b0 + b1 + b2 + ... + b(L-1) modulo 2^16 - 1
- The checksum is then given by:
    bL = -x modulo 2^16 - 1
Thus, the headers must satisfy the following pattern:
    0 = b0 + b1 + b2 + ... + b(L-1) + bL modulo 2^16 - 1
- The checksum calculation is carried out in software using one's complement arithmetic
(Reminder on modular arithmetic: 10 mod 7 = 3, -2 mod 7 = 5)

Kurose & Ross 3-21


Internet Checksum Example
(A simple one, using 4-bit words and mod 2^4 - 1 arithmetic)
If b0 = 1100 and b1 = 1010, what is their Internet checksum b2?
(x mod y is the residue of x/y; -x mod y = y - x)

Using mod 15 arithmetic:
- b0 = 1100 = 12
- b1 = 1010 = 10
- b0 + b1 = (12 + 10) mod 15 = 7
- b2 = -7 mod 15 = 8
- Therefore, b2 = 1000

Using binary arithmetic:
- Note that 16 mod 15 = 1, so 10000 mod 15 = 0001 (i.e. the leading bit wraps around)
- b0 + b1 = 1100 + 1010 = 10110
- Mod 15: 10110 = 10000 + 0110 = 0001 + 0110 = 0111
- Take the 1's complement to find b2 = complement of 0111 = 1000

Don't take this example too seriously as it uses only 4-bit words for simplicity! Real Internet checksum calculations (shown next) use 16-bit words.
Kurose & Ross 3-22
Internet Checksum Example
(A more complex one, using 16-bit words and mod 2^16 - 1 arithmetic)

If b0 = 1000 0001 1011 1100 and b1 = 0100 1000 0101 1010, what will be their Internet checksum b2?

Using mod 2^16 - 1 = 65535 arithmetic:
- b0 = 1000 0001 1011 1100 = 81BC (hex) = 33212
- b1 = 0100 1000 0101 1010 = 485A (hex) = 18522
- b0 + b1 = (33212 + 18522) mod 65535 = 51734
- b2 = -51734 mod 65535 = 65535 - 51734 = 13801 = 35E9 (hex)
- Therefore, b2 = 0011 0101 1110 1001

Kurose & Ross 3-23
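A minimal sketch of the 16-bit Internet checksum in code; the end-around-carry folding plays the role of the mod 2^16 - 1 arithmetic above, and the main method reproduces the 81BC/485A example:

public class InternetChecksum {

    // One's complement of the one's complement sum of the 16-bit words in data.
    static int checksum(byte[] data) {
        long sum = 0;
        for (int i = 0; i + 1 < data.length; i += 2) {
            sum += ((data[i] & 0xFF) << 8) | (data[i + 1] & 0xFF);
        }
        if (data.length % 2 != 0) {
            sum += (data[data.length - 1] & 0xFF) << 8;   // odd length: pad the last byte with zeros
        }
        while ((sum >> 16) != 0) {
            sum = (sum & 0xFFFF) + (sum >> 16);           // fold end-around carries (mod 2^16 - 1)
        }
        return (int) (~sum & 0xFFFF);                     // take the one's complement
    }

    public static void main(String[] args) {
        byte[] words = { (byte) 0x81, (byte) 0xBC, (byte) 0x48, (byte) 0x5A };  // b0 = 81BC, b1 = 485A
        System.out.printf("b2 = %04X%n", checksum(words));                      // prints 35E9, as above
    }
}

A receiver verifies a segment by summing all the words including the checksum; after folding, the result should be all ones (0xFFFF).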


UDP Destination Port Usage
[Figure: an arriving UDP datagram passes up from the IP layer and is demultiplexed to Port 1, Port 2 or Port 3 based on its destination port #]
- Datagram demultiplexed to its appropriate port (based on destination port #)
- An error message is sent back if the destination port # indicated in the datagram does not exist!
UDP Port Numbers

Well-known port numbers:
- Universally assigned and accepted port #s providing some designated service
- Typically, lower port numbers are used for this
- Examples: 37 Time, 53 Domain Name Server, 67 DHCP Server, 68 DHCP Client

Dynamically assigned port numbers:
- Ports are not globally known
- When a program needs a port, it asks for and gets one from the network software
- The destination machine needs to be queried to find the port number at which it may be offering the service to be accessed
- Typically higher port numbers are used for this
Chapter 3 outline
3.1 transport-layer services
3.2 multiplexing and demultiplexing
3.3 connectionless transport: UDP
3.4 principles of reliable data transfer
3.5 connection-oriented transport: TCP
  - segment structure
  - reliable data transfer
  - flow control
  - connection management
3.6 principles of congestion control
3.7 TCP congestion control

Kurose & Ross 3-26


Principles of reliable data transfer
- important in application, transport, link layers
- characteristics of unreliable channel will determine complexity of reliable data transfer protocol (rdt)
Kurose & Ross 3-27
Reliable data transfer model
- rdt_send(): called from above (e.g., by app.); passes data to rdt to deliver to receiver upper layer
- udt_send(): called by rdt, to transfer packet over unreliable channel to receiver
- rdt_rcv(): called when packet arrives on rcv-side of channel
- deliver_data(): called by rdt to deliver data to upper layer
[Figure: send side and receive side of rdt, connected by an unreliable channel]

Kurose & Ross 3-30


Stop and Wait
- Source transmits a single frame
- Start transmitter timeout timer
- Wait for ACK from receiver
- If received frame damaged, discard it
  - Transmitter has timeout
  - If no ACK within timeout, retransmit
- If ACK damaged, transmitter will not recognize it
  - Transmitter will retransmit
  - Receiver gets two copies of frame
  - Use ACK0 and ACK1 to detect duplicate copy
- Simple but inefficient
  - Only 1 frame at a time
  - Transmit next frame only after ACK is received for the earlier frame (see the sketch below)
[Figure: timelines for the frame damaged/lost case and the ACK damaged case, showing when the timeout timer starts]

31
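A minimal stop-and-wait sender sketch over UDP illustrating the transmit / start-timer / retransmit-on-timeout logic; the peer address 127.0.0.1:9999, the one-byte sequence-number framing and the 1-second timeout are assumptions made for the example, not part of the slide:

import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.net.SocketTimeoutException;

public class StopAndWaitSender {
    public static void main(String[] args) throws Exception {
        DatagramSocket socket = new DatagramSocket();
        socket.setSoTimeout(1000);                              // retransmission timeout: 1 second
        InetAddress peer = InetAddress.getByName("127.0.0.1");
        byte seq = 0;                                           // alternating 0/1 sequence number
        byte[] payload = "hello".getBytes();
        byte[] frame = new byte[payload.length + 1];
        frame[0] = seq;
        System.arraycopy(payload, 0, frame, 1, payload.length);
        while (true) {
            socket.send(new DatagramPacket(frame, frame.length, peer, 9999));  // transmit, (re)start timer
            try {
                byte[] ackBuf = new byte[1];
                socket.receive(new DatagramPacket(ackBuf, ackBuf.length));     // wait for ACK
                if (ackBuf[0] == seq) break;   // ACK for this frame: next frame could now be sent with seq 1
            } catch (SocketTimeoutException e) {
                // no ACK within the timeout: fall through and retransmit the same frame
            }
        }
        socket.close();
    }
}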
Pipelined protocols
pipelining: sender allows multiple, “in-flight”, yet-to-be-acknowledged pkts
- range of sequence numbers must be increased
- buffering at sender and/or receiver
- two generic forms of pipelined protocols: go-Back-N, selective repeat
Kurose & Ross 3-32
Pipelining: increased utilization
[Figure: sender/receiver timing diagram: first packet bit transmitted at t = 0; last bit transmitted at t = L/R; the last bit of each of the 1st, 2nd and 3rd packets arrives and an ACK is sent; the first ACK arrives and the next packet is sent at t = RTT + L/R]

3-packet pipelining increases utilization by a factor of 3!

    U_sender = (3 L / R) / (RTT + L / R) = 0.024 / 30.008 ≈ 0.0008

Kurose & Ross 3-33
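The numbers in the formula above correspond to the usual textbook parameters, assumed here to be L = 1000 bytes, R = 1 Gbps and RTT = 30 ms; a small computation that reproduces them:

public class PipelineUtilization {
    public static void main(String[] args) {
        double L = 1000 * 8;              // packet size in bits
        double R = 1e9;                   // link rate in bits/sec
        double RTT = 30e-3;               // round-trip time in seconds
        double tTransmit = L / R;         // time to push one packet onto the link (0.008 ms)
        double uStopAndWait = tTransmit / (RTT + tTransmit);
        double uPipelined3 = 3 * tTransmit / (RTT + tTransmit);
        System.out.printf("stop-and-wait U = %.5f, 3-packet pipelining U = %.5f%n",
                uStopAndWait, uPipelined3);   // roughly 0.00027 and 0.0008
    }
}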


Pipelined protocols: overview
Go-Back-N:
- sender can have up to N unacked packets in pipeline
- receiver only sends cumulative ack
  - doesn't ack packet if there's a gap
- sender has timer for oldest unacked packet
  - when timer expires, retransmit all unacked packets

Selective Repeat:
- sender can have up to N unacked packets in pipeline
- rcvr sends individual ack for each packet
- sender maintains timer for each unacked packet
  - when timer expires, retransmit only that unacked packet

Kurose & Ross 3-34


Go-Back-N: sender
- k-bit seq # in pkt header
- “window” of up to N consecutive unack'ed pkts allowed
- ACK(n): ACKs all pkts up to, including seq # n - “cumulative ACK”
  - may receive duplicate ACKs (see receiver)
- timer for oldest in-flight pkt
- timeout(n): retransmit packet n and all higher seq # pkts in window

Kurose & Ross 3-35


GBN: receiver extended FSM

Single “Wait” state; initialization: expectedseqnum = 1; sndpkt = make_pkt(expectedseqnum, ACK, chksum)

  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt, expectedseqnum):
      extract(rcvpkt, data)
      deliver_data(data)
      sndpkt = make_pkt(expectedseqnum, ACK, chksum)
      udt_send(sndpkt)
      expectedseqnum++
  default:
      udt_send(sndpkt)

- ACK-only: always send ACK for correctly-received pkt with highest in-order seq #
  - may generate duplicate ACKs
  - need only remember expectedseqnum
- out-of-order pkt:
  - discard (don't buffer): no receiver buffering!
  - re-ACK pkt with highest in-order seq #
Kurose & Ross 3-36
GBN in action (sender window N=4)
sender:                              receiver:
send pkt0
send pkt1
send pkt2 (lost)                     receive pkt0, send ack0
send pkt3                            receive pkt1, send ack1
(wait)                               receive pkt3, discard, (re)send ack1
rcv ack0, send pkt4                  receive pkt4, discard, (re)send ack1
rcv ack1, send pkt5                  receive pkt5, discard, (re)send ack1
ignore duplicate ACK
pkt2 timeout:
  send pkt2                          rcv pkt2, deliver, send ack2
  send pkt3                          rcv pkt3, deliver, send ack3
  send pkt4                          rcv pkt4, deliver, send ack4
  send pkt5                          rcv pkt5, deliver, send ack5

Kurose & Ross 3-37


Selective repeat
- receiver individually acknowledges all correctly received pkts
  - buffers pkts, as needed, for eventual in-order delivery to upper layer
- sender only resends pkts for which ACK not received
  - sender timer for each unACKed pkt
- sender window
  - N consecutive seq #'s
  - limits seq #s of sent, unACKed pkts

Kurose & Ross 3-38


Selective repeat: sender, receiver windows
[Figure: sender and receiver sliding windows over the sequence number space]

Kurose & Ross 3-39


Selective repeat
sender:
- data from above: if next available seq # in window, send pkt
- timeout(n): resend pkt n, restart timer
- ACK(n) in [sendbase, sendbase+N]:
  - mark pkt n as received
  - if n smallest unACKed pkt, advance window base to next unACKed seq #

receiver:
- pkt n in [rcvbase, rcvbase+N-1]:
  - send ACK(n)
  - out-of-order: buffer
  - in-order: deliver (also deliver buffered, in-order pkts), advance window to next not-yet-received pkt
- pkt n in [rcvbase-N, rcvbase-1]: ACK(n)
- otherwise: ignore

Kurose & Ross 3-40


Selective repeat in action (sender window N=4)
sender:                              receiver:
send pkt0
send pkt1
send pkt2 (lost)                     receive pkt0, send ack0
send pkt3                            receive pkt1, send ack1
(wait)                               receive pkt3, buffer, send ack3
rcv ack0, send pkt4                  receive pkt4, buffer, send ack4
rcv ack1, send pkt5                  receive pkt5, buffer, send ack5
record ack3 arrived
pkt2 timeout:
  send pkt2
record ack4 arrived                  rcv pkt2; deliver pkt2, pkt3, pkt4, pkt5; send ack2
record ack5 arrived

Q: what happens when ack2 arrives?

Kurose & Ross 3-41


Selective repeat: dilemma

example:
- seq #'s: 0, 1, 2, 3
- window size = 3
- receiver sees no difference in the two scenarios below!
- duplicate data accepted as new in (b)

(a) no problem: pkt0, pkt1, pkt2 are sent and ACKed, both windows advance, and the sender sends a new pkt0; the receiver, whose window now covers seq #s 3, 0, 1, will accept the packet with seq number 0 (correctly, as new data).

(b) oops! pkt0, pkt1, pkt2 are sent but all three ACKs are lost; on timeout the sender retransmits the old pkt0; the receiver's window has nevertheless advanced to seq #s 3, 0, 1, so it will accept the packet with seq number 0 as if it were new.

- receiver can't see sender side; receiver behavior identical in both cases - something's (very) wrong!

Q: what relationship between seq # size and window size is needed to avoid the problem in (b)?
Kurose & Ross 3-42
Chapter 3 outline
3.1 transport-layer services
3.2 multiplexing and demultiplexing
3.3 connectionless transport: UDP
3.4 principles of reliable data transfer
3.5 connection-oriented transport: TCP
  - segment structure
  - reliable data transfer
  - flow control
  - connection management
3.6 principles of congestion control
3.7 TCP congestion control

Kurose & Ross 3-43


TCP: Overview   RFCs: 793, 1122, 1323, 2018, 2581
- point-to-point:
  - one sender, one receiver
- reliable, in-order byte stream:
  - no “message boundaries”
- pipelined:
  - TCP congestion and flow control set window size
- full duplex data:
  - bi-directional data flow in same connection
  - MSS: maximum segment size
- connection-oriented:
  - handshaking (exchange of control msgs) inits sender, receiver state before data exchange
- flow controlled:
  - sender will not overwhelm receiver

Kurose & Ross 3-44


TCP Segment Format
[Figure: 32-bit-wide header: Source Port | Destination Port; Sequence Number; Acknowledgment Number; Header Length | Reserved | flags (URG, ACK, PSH, RST, SYN, FIN) | Window Size; Checksum | Urgent Pointer; Options | Padding; then Data]
Each TCP segment has a header of 20 or more bytes, plus 0 or more bytes of data
TCP Header

Port numbers:
- A socket identifies a connection endpoint
  - IP address + port
- A connection is specified by a socket pair
- Well-known ports:
  - FTP 20
  - Telnet 23
  - DNS 53
  - HTTP 80

Sequence number:
- Byte count (number of the first byte in the segment)
- 32 bits long
- 0 <= SN <= 2^32 - 1
- Initial sequence number selected during connection setup
TCP Header

Acknowledgement number:
- SN of next byte expected by receiver
- Acknowledges that all prior bytes in the stream have been received correctly
- Valid if ACK flag is set

Header length:
- 4 bits
- Length of header in multiples of 32-bit words
- Minimum header length is 20 bytes
- Maximum header length is 60 bytes
TCP Header

Reserved:
- 6 bits

Control:
- 6 bits
- URG: urgent pointer flag
  - Urgent message end = SN + urgent pointer
- ACK: ACK packet flag
- PSH: override TCP buffering
- RST: reset connection
  - Upon receipt of RST, connection is terminated and the application layer is notified
- SYN: establish connection
- FIN: close connection
TCP Header

Window size:
- 16 bits to advertise window size
- Used for flow control
- Sender will accept bytes with SN from ACK to ACK + window
- Maximum window size is 65535 bytes

TCP checksum:
- Internet checksum method
- Computed over TCP pseudo header + TCP segment (header + data)
  (See next slide for the TCP pseudo header)
TCP Pseudo Header
(for checksum calculation)
    Source IP address | Destination IP address | 00000000 | Protocol = 6 | TCP Segment Length
- Used in checksum calculation but never actually transmitted, nor is it included in the “Length”
- Usage similar to that of the UDP pseudoheader
TCP Header

Options (variable length):
- NOP (No Operation) option is used to pad the TCP header to a multiple of 32 bits
- Time stamp option is used for round trip measurements
- Maximum Segment Size (MSS) option specifies the largest segment a receiver wants to receive
- Window Scale option increases the TCP window from 16 to 32 bits
TCP Services
- Provides a full duplex, connection-oriented and reliable byte-stream service using sliding-window flow control.
- User data are broken into segments not exceeding 64 kbytes (usually about 1500 bytes) and sent to the destination by encapsulating them in IP datagrams
- IP provides unreliable packet delivery
  - packets can get lost, duplicated or delivered out of sequence
- Receiver sends an acknowledgment back after receiving a segment
- Retransmission of a segment if necessary


TCP seq. numbers, ACKs
sequence numbers:
- byte stream “number” of first byte in segment's data
acknowledgements:
- seq # of next byte expected from other side
- cumulative ACK
Q: how does the receiver handle out-of-order segments?
- A: TCP spec doesn't say; up to implementor
[Figure: outgoing and incoming segments showing source/dest port #s, sequence number, acknowledgement number, rwnd, checksum, urg pointer; the sender's sequence number space is divided into sent-and-ACKed, sent-not-yet-ACKed (“in-flight”), usable-but-not-yet-sent, and not-usable bytes]

Kurose & Ross 3-53


TCP seq. numbers, ACKs (simple telnet scenario)
- Host A: user types ‘C’; A sends Seq=42, ACK=79, data = ‘C’
- Host B: ACKs receipt of ‘C’ and echoes back ‘C’; B sends Seq=79, ACK=43, data = ‘C’
- Host A: ACKs receipt of echoed ‘C’; A sends Seq=43, ACK=80

Kurose & Ross 3-54


TCP round trip time, timeout
Q: how to set TCP timeout value?
- longer than RTT
  - but RTT varies
- too short: premature timeout, unnecessary retransmissions
- too long: slow reaction to segment loss

Q: how to estimate RTT?
- SampleRTT: measured time from segment transmission until ACK receipt
  - ignore retransmissions
- SampleRTT will vary; want estimated RTT “smoother”
  - average several recent measurements, not just current SampleRTT

Kurose & Ross 3-55


TCP round trip time, timeout

    EstimatedRTT = (1 - α) * EstimatedRTT + α * SampleRTT

- exponential weighted moving average
- influence of past sample decreases exponentially fast
- typical value: α = 0.125
[Figure: SampleRTT and EstimatedRTT in milliseconds vs. time in seconds, gaia.cs.umass.edu to fantasia.eurecom.fr]

Kurose & Ross 3-56


TCP round trip time, timeout
- timeout interval: EstimatedRTT plus “safety margin”
  - large variation in EstimatedRTT -> larger safety margin
- estimate SampleRTT deviation from EstimatedRTT:

    DevRTT = (1 - β) * DevRTT + β * |SampleRTT - EstimatedRTT|
    (typically, β = 0.25)

    TimeoutInterval = EstimatedRTT + 4 * DevRTT
                      (estimated RTT)   (“safety margin”)

Kurose & Ross 3-57
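A minimal sketch of the estimator above in code (α = 0.125, β = 0.25); the way the first sample initializes the state follows common practice and is an assumption, not something stated on the slide:

public class RttEstimator {
    static final double ALPHA = 0.125, BETA = 0.25;
    private double estimatedRtt = -1;    // seconds; -1 means "no sample yet"
    private double devRtt = 0;

    // Call once per ACK of a segment that was not retransmitted.
    void onSample(double sampleRtt) {
        if (estimatedRtt < 0) {
            estimatedRtt = sampleRtt;            // first measurement seeds the estimate
            devRtt = sampleRtt / 2;
        } else {
            devRtt = (1 - BETA) * devRtt + BETA * Math.abs(sampleRtt - estimatedRtt);
            estimatedRtt = (1 - ALPHA) * estimatedRtt + ALPHA * sampleRtt;
        }
    }

    double timeoutInterval() {
        return estimatedRtt + 4 * devRtt;        // EstimatedRTT plus the "safety margin"
    }
}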


Chapter 3 outline
3.1 transport-layer services
3.2 multiplexing and demultiplexing
3.3 connectionless transport: UDP
3.4 principles of reliable data transfer
3.5 connection-oriented transport: TCP
  - segment structure
  - reliable data transfer
  - flow control
  - connection management
3.6 principles of congestion control
3.7 TCP congestion control

Kurose & Ross 3-58


TCP reliable data transfer
- TCP creates rdt service on top of IP's unreliable service
  - pipelined segments
  - cumulative acks
  - single retransmission timer
- retransmissions triggered by:
  - timeout events
  - duplicate acks

let's initially consider a simplified TCP sender:
- ignore duplicate acks
- ignore flow control, congestion control

Kurose & Ross 3-59


TCP sender events:
data rcvd from app:
- create segment with seq #
- seq # is byte-stream number of first data byte in segment
- start timer if not already running
  - think of timer as for oldest unacked segment
  - expiration interval: TimeOutInterval

timeout:
- retransmit segment that caused timeout
- restart timer

ack rcvd:
- if ack acknowledges previously unacked segments
  - update what is known to be ACKed
  - start timer if there are still unacked segments

Kurose & Ross 3-60


TCP sender (simplified)

  NextSeqNum = InitialSeqNum
  SendBase = InitialSeqNum

  wait for event:

  data received from application above:
      create segment, seq. #: NextSeqNum
      pass segment to IP (i.e., “send”)
      NextSeqNum = NextSeqNum + length(data)
      if (timer currently not running)
          start timer

  timeout:
      retransmit not-yet-acked segment with smallest seq. #
      start timer

  ACK received, with ACK field value y:
      if (y > SendBase) {
          SendBase = y
          /* SendBase-1: last cumulatively ACKed byte */
          if (there are currently not-yet-acked segments)
              start timer
          else stop timer
      }

Kurose & Ross 3-61
TCP: retransmission scenarios

lost ACK scenario:
- Host A (SendBase=92) sends Seq=92, 8 bytes of data; Host B's ACK=100 is lost; A times out and retransmits Seq=92, 8 bytes of data; B re-sends ACK=100; SendBase advances to 100

premature timeout:
- Host A sends Seq=92 (8 bytes) and Seq=100 (20 bytes); B returns ACK=100 and ACK=120, but A's timer expires before they arrive, so A retransmits Seq=92, 8 bytes of data; B replies with ACK=120; SendBase advances to 120


Kurose & Ross 3-62
TCP: retransmission scenarios

cumulative ACK:
- Host A sends Seq=92 (8 bytes) and Seq=100 (20 bytes); ACK=100 is lost, but ACK=120 arrives before the timeout, so A does not retransmit and continues with Seq=120, 15 bytes of data
Kurose & Ross 3-63
TCP ACK generation [RFC 1122, RFC 2581]

event at receiver                                      TCP receiver action
arrival of in-order segment with expected seq #;       delayed ACK. Wait up to 500 ms for next segment.
all data up to expected seq # already ACKed            If no next segment, send ACK

arrival of in-order segment with expected seq #;       immediately send single cumulative ACK,
one other segment has ACK pending                      ACKing both in-order segments

arrival of out-of-order segment with                   immediately send duplicate ACK,
higher-than-expected seq. #; gap detected              indicating seq. # of next expected byte

arrival of segment that partially or                   immediately send ACK, provided that
completely fills gap                                   segment starts at lower end of gap

Kurose & Ross 3-64


TCP fast retransmit
- time-out period often relatively long:
  - long delay before resending lost packet
- detect lost segments via duplicate ACKs
  - sender often sends many segments back-to-back
  - if segment is lost, there will likely be many duplicate ACKs

TCP fast retransmit: if sender receives 3 ACKs for the same data (“triple duplicate ACKs”), resend unacked segment with smallest seq #
- likely that unacked segment lost, so don't wait for timeout

Kurose & Ross 3-65


TCP fast retransmit
[Figure: Host A sends Seq=92 (8 bytes) and Seq=100 (20 bytes); the Seq=100 segment is lost; Host B returns ACK=100 four times; on receipt of the triple duplicate ACK, A resends Seq=100, 20 bytes of data, before its timeout expires]

fast retransmit after sender receipt of triple duplicate ACK
Kurose & Ross 3-66
Chapter 3 outline
3.1 transport-layer services
3.2 multiplexing and demultiplexing
3.3 connectionless transport: UDP
3.4 principles of reliable data transfer
3.5 connection-oriented transport: TCP
  - segment structure
  - reliable data transfer
  - flow control
  - connection management
3.6 principles of congestion control
3.7 TCP congestion control

Kurose & Ross 3-67


TCP flow control
- the application may remove data from TCP socket buffers slower than the TCP receiver is delivering (i.e. slower than the sender is sending)
- flow control: receiver controls sender, so sender won't overflow receiver's buffer by transmitting too much, too fast
[Figure: receiver protocol stack: data from the sender passes up through the IP and TCP code into the TCP socket receiver buffers in the OS, from which the application process reads]

Kurose & Ross 3-68


TCP flow control
- receiver “advertises” free buffer space by including the rwnd value in the TCP header of receiver-to-sender segments
  - RcvBuffer size set via socket options (typical default is 4096 bytes)
  - many operating systems autoadjust RcvBuffer
- sender limits amount of unacked (“in-flight”) data to receiver's rwnd value
- guarantees receive buffer will not overflow
[Figure: receiver-side buffering: RcvBuffer holds buffered data; rwnd is the free buffer space; TCP segment payloads arrive from the sender and are read by the application process]

Kurose & Ross 3-69


Maximum Segment Size
- Maximum Segment Size (MSS): the largest block of data that TCP sends to the other end
- Each end can announce its MSS during connection establishment
- Default is 576 bytes, including 20 bytes for the IP header and 20 bytes for the TCP header
- Slight difference between the MSS of Ethernet and IEEE 802.3:
  - Ethernet MSS = 1460 bytes
  - IEEE 802.3 MSS = 1452 bytes
TCP Window Flow Control
[Figure: Hosts A and B exchanging segments at times t0-t4; Win = advertised window size; Host A has 1024-byte blocks and a 128-byte block to transmit]
- Only 512 bytes are sent at t4, as that is the advertised value of Win
Nagle Algorithm
- Situation: user types one character at a time
  - Transmitter sends a TCP segment per character (41 B)
  - Receiver sends ACK (40 B)
  - Receiver echoes received character (41 B)
  - Transmitter ACKs echo (40 B)
  - 162 bytes transmitted to transfer one character! Problem!
- Solution (see the sketch below):
  - TCP sends data & waits for ACK
  - New characters buffered
  - Send new characters when ACK arrives
  - Algorithm adjusts to RTT as follows:
    - Short RTT: send frequently at low efficiency
    - Long RTT: send less frequently at greater efficiency
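Nagle's algorithm is enabled by default on TCP sockets; an interactive application that cannot tolerate the extra buffering delay can disable it. A minimal sketch using the standard java.net.Socket API (the host and port are placeholders):

import java.net.Socket;

public class NagleToggle {
    public static void main(String[] args) throws Exception {
        Socket s = new Socket("example.com", 7);   // placeholder host/port
        // Nagle's algorithm (buffer small writes until the ACK arrives) is on by default;
        // setTcpNoDelay(true) disables it so each small write is sent immediately.
        s.setTcpNoDelay(true);
        s.getOutputStream().write('C');            // a one-character write goes out right away
        s.close();
    }
}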
Silly Window Syndrome
- Situation:
  - Transmitter sends a large amount of data
  - Receiver's buffer is depleted slowly, so the buffer fills up
  - Every time a few bytes are read from the buffer, a new advertisement is sent to the transmitter
  - Sender immediately sends data & fills the buffer
  - This leads to many small, inefficient segments being transmitted
- Solution:
  - Receiver does not advertise window until window is at least 1/2 of receiver buffer or is equal to the maximum segment size (MSS)
  - Transmitter refrains from sending small segments
Sequence Number Wraparound
(Potential problem at high data rates)
- 2^32 = 4.29x10^9 bytes = 34.3x10^9 bits (TCP has a 32-bit seq. no.)
  Therefore, at 1 Gbps, sequence numbers wrap around in just 34.3 seconds, so the transmitter could only transmit for very brief periods
- Solution: use the Timestamp option in the TCP option field. The transmitter inserts a 32-bit timestamp in each transmitted segment; the receiver echoes it in the ACK. This option must be requested in the SYN segment and is negotiated during connection setup.
  - Timestamp + sequence no -> effectively a 64-bit seq. no (much larger than the original 32 bits)
  - The timestamp clock must:
    - Tick forward at least once every 2^31 bits
    - Not complete a cycle in less than one MSL
    - Example: a clock tick every 1 ms @ 8 Tbps wraps around in 25 days
Delay-BW Product & Advertised Window Size
- Suppose RTT = 100 ms and R = 2.4 Gbps; then the number of bits in the pipe = RTT x bit rate = 240 Mbits = 30 Mbytes
- If a single TCP process is to occupy the pipe, the required advertised window size is RTT x bit rate = 30 Mbytes
- But the normal maximum window size is only 65535 bytes, which is clearly inadequately small
- Solution: use the “Window Scale Option”, which allows the window to be scaled upward by a factor of up to 2^14. A window size of up to 65535 x 2^14 ≈ 1 Gbyte is then allowed. This window scaling option must be requested in the SYN segment and is negotiated during connection setup. (A worked computation follows below.)
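A small worked computation of the delay-bandwidth product and the window-scale limit discussed above:

public class DelayBandwidthProduct {
    public static void main(String[] args) {
        double rtt = 0.100;                      // seconds
        double rate = 2.4e9;                     // bits per second
        double pipeBytes = rtt * rate / 8;       // bits in flight, expressed in bytes: 30,000,000 (30 Mbytes)
        long maxPlainWindow = 65535;             // 16-bit window field
        long maxScaledWindow = 65535L << 14;     // with the maximum window scale factor 2^14 (~1 Gbyte)
        System.out.printf("pipe = %.0f bytes, plain window limit = %d, scaled window limit = %d%n",
                pipeBytes, maxPlainWindow, maxScaledWindow);
    }
}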
Chapter 3 outline
3.1 transport-layer services
3.2 multiplexing and demultiplexing
3.3 connectionless transport: UDP
3.4 principles of reliable data transfer
3.5 connection-oriented transport: TCP
  - segment structure
  - reliable data transfer
  - flow control
  - connection management
3.6 principles of congestion control
3.7 TCP congestion control

Kurose & Ross 3-76


Connection Management
before exchanging data, sender/receiver “handshake”:
- agree to establish connection (each knowing the other is willing to establish connection)
- agree on connection parameters

Both client and server then have connection state ESTAB and connection variables: seq #s (client-to-server and server-to-client), rcvBuffer size at server and client.

client:  Socket clientSocket = new Socket("hostname", "port number");
server:  Socket connectionSocket = welcomeSocket.accept();
Kurose & Ross 3-77
Agreeing to establish a connection

2-way handshake: “Let's talk” -> “OK” (or: choose x, send req_conn(x) -> acc_conn(x)); both sides enter ESTAB

Q: will a 2-way handshake always work in a network with
- variable delays
- retransmitted messages (e.g. req_conn(x)) due to message loss
- message reordering
- can't “see” other side

Kurose & Ross 3-78


TCP 3-way handshake
- client and server both start in LISTEN
- client: choose init seq num x; send TCP SYN msg (SYNbit=1, Seq=x); enter SYNSENT
- server: choose init seq num y; send TCP SYNACK msg, acking SYN (SYNbit=1, Seq=y, ACKbit=1, ACKnum=x+1); enter SYN RCVD
- client: received SYNACK(x) indicates server is live; send ACK for SYNACK (ACKbit=1, ACKnum=y+1); this segment may contain client-to-server data; enter ESTAB
- server: received ACK(y) indicates client is live; enter ESTAB

Kurose & Ross 3-79


TCP Connection Establishment
- Three-Way Handshake
  - A sends a SYN segment specifying the port number of the other party B, the initial sequence number (ISN) that A will use, and other info (e.g. max. segment size)
  - B responds with its own SYN segment containing its ISN. B also acknowledges A's SYN by ACKing A's ISN plus one
  - A acknowledges B's SYN by ACKing B's ISN plus one
- The Initial Sequence Number (ISN) may be randomly chosen, but with some important considerations
Initial Sequence Number (ISN)
- Select initial sequence numbers (ISN) to protect against segments from prior connections (that may circulate in the network and arrive at a much later time)
- Select ISN to avoid overlap with sequence numbers of prior connections
- Use local clock to select ISN sequence number
  - Time for clock to go through a full cycle should be greater than the maximum lifetime of a segment (MSL); typically MSL = 120 seconds
  - High bandwidth connections pose a problem
Three Way Handshake
(TCP Connection Setup)
[Figure: Host A and Host B exchanging SYN, SYN+ACK and ACK segments]
Protects the ISN against responding falsely to old segments from prior connections
TCP: closing a connection
- client, server each close their side of connection
  - send TCP segment with FIN bit = 1
- respond to received FIN with ACK
  - on receiving FIN, ACK can be combined with own FIN
- simultaneous FIN exchanges can be handled

Kurose & Ross 3-83


TCP: closing a connection
- client (ESTAB): clientSocket.close(); send FINbit=1, seq=x; enter FIN_WAIT_1 (can no longer send, but can receive data)
- server (ESTAB): reply ACKbit=1, ACKnum=x+1; enter CLOSE_WAIT (can still send data)
- client: enter FIN_WAIT_2 (wait for server close)
- server: send FINbit=1, seq=y; enter LAST_ACK (can no longer send data)
- client: reply ACKbit=1, ACKnum=y+1; enter TIMED_WAIT (timed wait for 2 * max segment lifetime), then CLOSED
- server: on receiving the ACK, enter CLOSED

Kurose & Ross 3-84


Closing a TCP Connection
(Graceful Close)
[Figure: Host A initiates the TCP connection termination and sends its FIN. Host B sends an ACK but does not yet send its own FIN; B still delivers 150 bytes, then sends its own FIN. Host A ACKs B's FIN, closing its side of the connection; B gets A's ACK and closes its side of the connection.]
After sending its FIN, Host A cannot send any more data, but it cannot close the connection either, as B may still be sending something
TIME_WAIT state
- The TIME_WAIT state is entered if the host sending a FIN (e.g. Host A in the previous slide) receives an ACK from the other side
- This protects future incarnations of the connection from delayed segments
- TIME_WAIT = 2 x MSL
  - Maximum Segment Lifetime (MSL) is the maximum time that an IP packet can live in the network
- The only valid segment that can arrive while in the TIME_WAIT state is a FIN retransmission. If such a segment arrives, resend the ACK & restart the TIME_WAIT timer
- When the timer expires, close the TCP connection
Chapter 3 outline
3.1 transport-layer services
3.2 multiplexing and demultiplexing
3.3 connectionless transport: UDP
3.4 principles of reliable data transfer
3.5 connection-oriented transport: TCP
  - segment structure
  - reliable data transfer
  - flow control
  - connection management
3.6 principles of congestion control
3.7 TCP congestion control

Kurose & Ross 3-87


Principles of congestion control

congestion:
- informally: “too many sources sending too much data too fast for network to handle”
- different from flow control!
- manifestations:
  - lost packets (buffer overflow at routers)
  - long delays (queueing in router buffers)

Kurose & Ross 3-88


Different Phases of Congestion Behavior
[Figure: throughput (bps) and delay (sec) vs. arrival rate, with the link capacity R marked]
1. Light traffic
   - Arrival rate << R
   - Low delay
   - Can accommodate more
2. Knee (congestion onset)
   - Arrival rate approaches R
   - Delay increases rapidly
   - Throughput begins to saturate
3. Congestion collapse
   - Arrival rate > R
   - Large delays, packet loss
   - Useful application throughput drops
Causes/costs of congestion: scenario 1
- two senders, two receivers
- one router, infinite buffers
- output link capacity: R
- no retransmission
[Figure: Hosts A and B send original data (λin) through one router with unlimited shared output link buffers; delivered throughput is λout]
- maximum per-connection throughput: R/2
- large delays as the arrival rate, λin, approaches capacity
[Graphs: λout vs. λin saturates at R/2; delay grows without bound as λin approaches R/2]
Kurose & Ross 3-90
Causes/costs of congestion: scenario 2
- one router, finite buffers
- sender retransmission of timed-out packet
  - application-layer input = application-layer output: λin = λout
  - transport-layer input includes retransmissions: λ'in >= λin
[Figure: Host A sends λin (original data) and λ'in (original data plus retransmitted data) through a router with finite shared output link buffers to Host B; delivered throughput is λout]
Kurose & Ross 3-91
Causes/costs of congestion: scenario 2
idealization: perfect knowledge
- sender sends only when router buffers available
[Figure: a copy of each packet is sent only when there is free buffer space at the router; graph: λout equals λin up to R/2]
Kurose & Ross 3-92
Causes/costs of congestion: scenario 2
idealization: known loss (packets can be lost, dropped at router due to full buffers)
- sender only resends if packet known to be lost
[Figure: a copy of the packet is retransmitted when there is no buffer space at the router]
Kurose & Ross 3-93
Causes/costs of congestion: scenario 2
idealization: known loss (packets can be lost, dropped at router due to full buffers)
- sender only resends if packet known to be lost
[Graph: when sending at R/2, some packets are retransmissions, but the asymptotic goodput is still R/2]
Kurose & Ross 3-94
Causes/costs of congestion: scenario 2
realistic: duplicates
- packets can be lost, dropped at router due to full buffers
- sender times out prematurely, sending two copies, both of which are delivered
[Graph: when sending at R/2, some packets are retransmissions, including duplicates that are delivered!]
Kurose & Ross 3-95
Causes/costs of congestion: scenario 2
realistic: duplicates
- packets can be lost, dropped at router due to full buffers
- sender times out prematurely, sending two copies, both of which are delivered

“costs” of congestion:
- more work (retrans) for given “goodput”
- unneeded retransmissions: link carries multiple copies of pkt
  - decreasing goodput

Kurose & Ross 3-96


Causes/costs of congestion: scenario 3
- four senders
- multihop paths
- timeout/retransmit
Q: what happens as λin and λ'in increase?
A: as red λ'in increases, all arriving blue pkts at the upper queue are dropped, blue throughput -> 0
[Figure: Hosts A, B, C, D sending λin (original data) and λ'in (original plus retransmitted data) over multihop paths through routers with finite shared output link buffers]

Kurose & Ross 3-97


Causes/costs of congestion: scenario 3
[Graph: λout vs. λ'in rises and then collapses toward 0 as λ'in approaches C/2]
another “cost” of congestion:
- when a packet is dropped, any upstream transmission capacity used for that packet was wasted!

Kurose & Ross 3-98


Approaches towards congestion control

two broad approaches towards congestion control:

end-end congestion control:
- no explicit feedback from network
- congestion inferred from end-system observed loss, delay
- approach taken by TCP

network-assisted congestion control:
- routers provide feedback to end systems
  - single bit indicating congestion (SNA, DECbit, TCP/IP ECN, ATM)
  - explicit rate for sender to send at

Kurose & Ross 3-99


Chapter 3 outline
3.1 transport-layer services
3.2 multiplexing and demultiplexing
3.3 connectionless transport: UDP
3.4 principles of reliable data transfer
3.5 connection-oriented transport: TCP
  - segment structure
  - reliable data transfer
  - flow control
  - connection management
3.6 principles of congestion control
3.7 TCP congestion control

Kurose & Ross 3-100


TCP congestion control: additive increase, multiplicative decrease
- approach: sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs
  - additive increase: increase cwnd by 1 MSS every RTT until loss detected
  - multiplicative decrease: cut cwnd in half after loss
[Figure: AIMD saw tooth behavior, probing for bandwidth: the congestion window size (cwnd) at the TCP sender is additively increased until loss occurs, then cut in half, over and over, as time passes]
Kurose & Ross 3-101
TCP Congestion Control: details
- sender limits transmission:
    LastByteSent - LastByteAcked <= cwnd
- cwnd is dynamic, a function of perceived network congestion
- TCP sending rate:
  - roughly: send cwnd bytes, wait RTT for ACKs, then send more bytes
    rate ~ cwnd / RTT bytes/sec
[Figure: sender sequence number space: last byte ACKed, bytes sent but not yet ACKed (“in-flight”), last byte sent; cwnd spans the in-flight bytes]

Kurose & Ross 3-102


TCP Slow Start
- when connection begins, increase rate exponentially until first loss event:
  - initially cwnd = 1 MSS
  - double cwnd every RTT
  - done by incrementing cwnd for every ACK received
- summary: initial rate is slow but ramps up exponentially fast
[Figure: Host A / Host B timeline: one segment is sent in the first RTT, two in the second, four in the third]

Kurose & Ross 3-103


TCP: detecting, reacting to loss
- loss indicated by timeout:
  - cwnd set to 1 MSS;
  - window then grows exponentially (as in slow start) to threshold, then grows linearly
- loss indicated by 3 duplicate ACKs: TCP RENO
  - dup ACKs indicate network capable of delivering some segments
  - cwnd is cut in half; window then grows linearly
- TCP Tahoe always sets cwnd to 1 (timeout or 3 duplicate acks)

Kurose & Ross 3-104


TCP: switching from slow start to CA
Q: when should the exponential increase switch to linear?
A: when cwnd gets to 1/2 of its value before timeout.

Implementation:
- variable ssthresh
- on loss event, ssthresh is set to 1/2 of cwnd just before loss event

Kurose & Ross 3-105
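A toy simulation of the slow start / congestion avoidance switch described above, with a Tahoe-style reaction to loss; the loss threshold of 32 MSS and the initial ssthresh of 16 MSS are arbitrary assumptions, and units are MSS per RTT:

public class CwndSimulation {
    public static void main(String[] args) {
        double cwnd = 1, ssthresh = 16;
        for (int rtt = 0; rtt < 40; rtt++) {
            System.out.printf("RTT %2d: cwnd = %.0f MSS%n", rtt, cwnd);
            if (cwnd >= 32) {                 // pretend a loss is detected at this window size
                ssthresh = cwnd / 2;          // ssthresh = half of cwnd just before the loss
                cwnd = 1;                     // Tahoe: back to slow start
            } else if (cwnd < ssthresh) {
                cwnd *= 2;                    // slow start: exponential growth per RTT
            } else {
                cwnd += 1;                    // congestion avoidance: additive increase
            }
        }
    }
}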


TCP throughput
- avg. TCP thruput as function of window size, RTT?
  - ignore slow start, assume there is always data to send
- W: window size (measured in bytes) where loss occurs
  - avg. window size (# in-flight bytes) is 3/4 W
  - avg. thruput is 3/4 W per RTT

    avg TCP thruput = (3/4) * W / RTT  bytes/sec

[Figure: window size sawtooth oscillating between W/2 and W]

Kurose & Ross 3-106


TCP Futures: TCP over “long, fat pipes”
- example: 1500 byte segments, 100 ms RTT, want 10 Gbps throughput
- requires W = 83,333 in-flight segments
- throughput in terms of segment loss probability, L [Mathis 1997]:

    TCP throughput = 1.22 * MSS / (RTT * sqrt(L))

- to achieve 10 Gbps throughput, need a loss rate of L = 2·10^-10, a very small loss rate!
- new versions of TCP for high-speed

Kurose & Ross 3-107
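A worked computation reproducing the figures above (MSS = 1500 bytes, RTT = 100 ms, target rate 10 Gbps), solving the Mathis formula for the loss rate:

public class LongFatPipe {
    public static void main(String[] args) {
        double mssBits = 1500 * 8;                 // segment size in bits
        double rtt = 0.100;                        // seconds
        double target = 10e9;                      // target throughput in bits/sec
        double w = target * rtt / mssBits;         // in-flight segments needed: ~83,333
        double loss = Math.pow(1.22 * mssBits / (rtt * target), 2);  // from throughput = 1.22*MSS/(RTT*sqrt(L))
        System.out.printf("W = %.0f segments, required loss rate L = %.1e%n", w, loss);  // ~2e-10
    }
}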


TCP Fairness
fairness goal: if K TCP sessions share the same bottleneck link of bandwidth R, each should have an average rate of R/K
[Figure: TCP connections 1 and 2 sharing a bottleneck router of capacity R]

Kurose & Ross 3-108


Why is TCP fair?
two competing sessions:
- additive increase gives slope of 1, as throughput increases
- multiplicative decrease decreases throughput proportionally
[Figure: connection 1 throughput vs. connection 2 throughput, bounded by R; repeated cycles of “loss: decrease window by factor of 2” followed by “congestion avoidance: additive increase” move the operating point toward the equal bandwidth share line]
Kurose & Ross 3-109
Fairness (more)

Fairness and UDP:
- multimedia apps often do not use TCP
  - do not want rate throttled by congestion control
- instead use UDP:
  - send audio/video at constant rate, tolerate packet loss

Fairness, parallel TCP connections:
- application can open multiple parallel connections between two hosts
- web browsers do this
- e.g., link of rate R with 9 existing connections:
  - new app asks for 1 TCP, gets rate R/10
  - new app asks for 11 TCPs, gets R/2

Kurose & Ross 3-110


Chapter 3: summary
- principles behind transport layer services:
  - multiplexing, demultiplexing
  - reliable data transfer
  - flow control
  - congestion control
- instantiation, implementation in the Internet
  - UDP
  - TCP

next:
- leaving the network “edge” (application, transport layers)
- into the network “core”

Kurose & Ross 3-111
