Lecture #24 - The Transport Layer: Part 2
TCP Retransmissions
TCP's original design uses a technique of "timeout and retransmit" for unacknowledged segments. This may occur when, for example, the IP layer drops a datagram somewhere in the network. Note that TCP does not use negative acknowledgements.
This means that the TCP stack at the "sending end" has to make a decision about how long it should wait for an ACK before re-transmitting. It is impossible for TCP to initially have this information, since it only communicates with its peer software on the remote system.
TCP maintains awareness of changing network conditions by continually monitoring how long each segment takes to be ACK'd (known as the Round Trip Time or RTT). In otherwords, TCP wants to know how the internetwork interconnecting the two TCP hosts is performing. It keeps two counters: the smoothed round trip time (SRTT) and the mean deviation of the RTT (SMDEV). Each of these is updated each time an ACK is received.
Timeouts
The SRTT is updated for each new ACK received by:
SRTT = a * SRTT + (1 - a)M
Where M is the time for the ACK to arrive, and a (alpha) is a tuneable parameter, typically 0.9.
The SMDEV is updated for each new value of M as follows:
SMDEV = a * SMDEV + (1 - a)|SRTT - M|
Where the value of a is not necessarily the same as for the calculation of SRTT.
The general algorithm for TCP timeout is[1]:
RTO = SRTT + 4 * SMDEV
This algorithm was developed by Van Jacobson and Mike Karel[2]. The problem of calculating a reasonable RTO is also complicated by other factors. Phil Karn proposed a modification to this algorithm to achieve a rational RTO in the face of varying network load, where occasionally segments are retransmitted due to timeouts.
Karn's algorithm simply adds the following:
- Do not include RTTs for retransmitted segments in the sample.
- On each successive retransmission double the current timeout value.
[1] Initially the number of SMDEVs was specified as 2, but it was later found that 4 gave better performance.
[2] See Jacobson's infamous paper entitled "Congestion Avoidance and Control".
Congestion Control
Initial versions of TCP implemented flow control via the use of the receiver's advertised window (the Window field in the TCP header). This allows the receiver to govern the number of segments received. However, this does not allow for the state of the underlying network to be monitored.
When packets arrive at a router faster than it is able to forward them, the datagrams need to be buffered or queued until they can be sent on to the next hop. If the buffer completely fills up datagrams have to be discarded since there is no space to store them. The queueing delay or the dropped datagram can result in TCP timing out and retransmitting the segment, potentially leading to further congestion. This can eventually lead to congestion collapse, whereby the network is unable to do any useful work.
As a result, TCP needs to be able to detect network congestion and react appropriately. The way in which this is done is through the use of congestion control algorithms.
Slow-Start and Congestion-Avoidance
Routers can send ICMP source quench messages in order to tell a sender to slow down, however this on its own is not overly effective.
As a result, we need to govern the rate at which TCP injects segments into the network. In order to achieve this TCP maintains a congestion window (cwnd). This is simply the number of segments that it is permitted to have "in flight", that is sent but not yet acknowledged.
In 1988 Van Jacobson introduced two congestion control algorithms to TCP, known as slow-start and congestion-avoidance. Slow-start is used to get to equalibrium, congestion-avoidance is then used to maintain it.
Slow-Start
Since a new TCP has no information about the underlying network, we need to probe the network for capacity. As a result, we start with a congestion window size of one segment[3], before increasing it in an exponential manner. This is simply implemented by increasing the congestion window size by one segment for each acknowledgment received.
At some point the network will be forced to drop one or more segments. When this occurs slow-start sets a variable known as slow-start threshold (ssthresh) to half the current size of the congestion window, the congestion window is reset to one segment and slow-start restarts. This implements a multiplicative decrease and ensures that congestion does not persist.
When the congestion window reaches or exceeds the value of ssthresh, slow-start is terminated and congestion avoidance is commenced.
[3] A later RFC, RFC2414, permits an initial window size of up to 4 segments, providing the initial window size does not exceed 4KB.
Congestion Avoidance
In the congestion avoidance phase the congestion window is increased by one segment each time the entire window is transmitted and acknowledged. This results in a steady linear growth that probes for additional bandwidth. If one or more segments are lost the same behaviour occurs as for slow-start (ssthresh set to half the current cwnd, cwnd reset to one, slow-start restarted).
Duplicate Acknowledgments
With the early versions of TCP (known as TCP Tahoe), a single packet loss would result in a retransmission timeout, resulting in degraded performance. Due to the sliding window design of TCP, duplicate acknowledgments (ie. two or more acknowledgments acknowledging the same sequence number) indicate that a segment is missing from the window.
When this occurs it implies that data is successfully leaving the network, hence a packet has arrived at the receiver, resulting in the acknowledgment being generated. It also implies that packets have arrived out or order, or a packet has gone missing.
Later versions of TCP (known as TCP Reno) make use of this information and provide fast retransmit and fast recovery algorithms. Upon the receipt of three duplicate acknowledgments the TCP segment is presumed to have be lost and is immediately retransmitted. The fast recovery process then works to maintain a reasonably sized congestion window. This prevents the need for a timeout, resulting in a complete restart of slow-start.
User Datagram Protocol
UDP provides a connectionless alternative transport service to TCP for applications where reliable stream service is not needed. UDP datagrams can be droppped, duplicated or delivered out of order, exactly as for IP.
The UDP transport service add to IP the ability to deliver a datagram to a specified destination process using a port abstraction, in an analogous way to that used by TCP.
Examples of applications where UDP is used include:
-
Any application where loss of a datagram is not critical because
later datagrams will imply the missing information. Examples include:
- A "timekeeper" host sends the current "wallclock" time to a slave host. (NB: This is not usually done in this way, but it'll serve as an example!).
- Routers broadcast copies of their routing tables every 30 seconds.
- An application process which performs its own error correction is used.
- The reduced overhead of connectionless operation suits some time-critical applications where (occasional) loss of data may be unimportant. An example is voice communications over the Internet.
UDP Datagram
- UDP Destination Port
- Defines (or names) the process which is to receive this datagram on the destination host.
- UDP Source Port
- This field is optional: it may be zero, or have a non-zero value to indicate a port to which reply UDP datagrams may be directed.
- UDP Message Length
- Specifies the length of the UDP datagram, including the eight bytes of header.
- UDP Checksum
- A "pseudo header" is prefixed to the datagram, and the checksum covers this plus the entire datagram itself. This is similar to the technique used in TCP. Computation of a checksum is optional in UDP.
References
Stallings: Chapter 20.2 (pp. 674-693)
RFC793 - Transmission Control Protocol (TCP)
RFC2414 - Increasing TCP's Initial Window
Copyright © 2006 Phil Rice