When packets traverse an internetwork they will most likely utilise links with varying bandwidth. Additionally, the Round Trip Time (RTT) will also vary depending on the physical distance, processing delays and transmission rate.
This is one of the greatest features of internetworking - we can interconnect networks of varying speeds and technologies. However, it is also what creates one of the biggest challenges for transport layer protocols.
The slowest link in the network path is referred to as the bottleneck link - effectively limiting the end-to-end transmission rate. If traffic arrives at this link at a rate faster than it can send it, packets will need to be queued, possibly leading to buffer exhaustion. Once queueing starts additional delays will result and the router will be forced to discard packets once the queue reaches maximum capacity.
When a router has more than two interfaces, incoming traffic may need to be aggregated from multiple links and transmitted out a single interface. In this situation congestion can occur unless the outbound interface has a bandwidth which exceeds the sum of the bandwidth of the inbound interfaces.
TCP is a self-clocking protocol. That is, once at equilibrium it will only transmit segments when it receives acknowledgments. Obviously the rate at which segments are received and acknowledgments are returned is governed by the bandwidth and RTT of the network. The challenge is getting to equilibrium - in other words, how fast should we send data?
When TCP was first introduced in 1981, it did not implement effective congestion control. In October 1986 the Internet suffered from the first of a series of "congestion collapses" - the internetwork had become congested and senders were continuing to resend their data in an attempt to get it through. This only serves to exacerbate the problem.
As a result, Van Jacobson and Mike Karels suggested changes to the way in which TCP responsed to congestion. The outcome was the addition of the slow-start and congestion-avoidance algorithms to TCP.
Central to the slow-start and congestion-avoidance algorithms is the use of a congestion window (or cwnd) variable. The congestion window specifies the maximum number of segments that can be "in flight" (sent but not yet acknowledged) at any point in time.
When a TCP connection is first established we have no information about the network which lies between the hosts. As a result, we have to probe the network for capacity. This is achieved via the slow-start algorithm, which opens the congestion window in an exponential manner.
This is implemented as follows:
If slow-start terminates due to packet loss (detected by the retransmit timer firing), we set ssthresh equal to half of the current cwnd value (the last known value that worked) and restart the slow-start process.
If slow-start terminated due to cwnd exceeding ssthresh we enter the congestion-avoidance phase.
The purpose of slow-start is to quickly open the congestion window to a point that allows us to utilise the network bandwidth. Once this point has been determined we should continue to transmit data at this rate, however there may be unused capacity still available. As a result, congestion-avoidance continues to increase the congestion window at a linear rate of one segment per RTT.
At some point congestion-avoidance will increase the congestion window to the point where the network is forced to drop packets. This will result in TCP setting ssthresh to half the current congestion window size and re-entering the slow-start phase.
Whilst Jacobson's congestion control algorithms proved to be highly beneficial and have prevented further congestion collapse, a number of performance issues were also introduced.
Jacobson based TCP variants assume that all packet loss is due to congestion. This largely holds true for wired networks, however it is very common for radio based communication to result in packet loss due to corruption. Furthermore, any packet loss will result in the congestion window being reset to one segment and slow-start being resumed, which impacts on performance.
Two additional changes known as fast-retransmit and fast-recovery were proposed in order to partly deal with these issues. It is quite common for a single packet to be lost (either due to corruption or congestion). This will result in the TCP receiver continuing to send acknowledgments for the same sequence number, for each segment received after the missing packet. If three consecutive duplicate acknowledgments are received TCP deems that the packet has been lost and immediately resends it. This is the fast-retransmit process and it potentially avoids a costly retransmit timeout from occurring, which would lead to slow-start being resumed.
The fast-recovery process is more complex, however after a packet is resent by fast-retransmit it effectively sets the congestion window to half the number of segments that are currently "in flight". For each acknowledgment received that acknowledges new data, cwnd is increased by one segment and a new segment is injected into the network. This process allows TCP to recover much faster from a loss event, potentially avoiding the need to resume the slow-start phase.
How large does the congestion window need to be in order for TCP to fully utilise the network capacity?
The answer to this question lies in a metric known as the network Bandwidth Delay Product or BDP. This is effectively the bandwidth of the bottleneck link (in bits per second) multipled by the time taken for a packet to travel from the sender to the receiver and back again (the Round Trip Time or RTT). This tells us how many bits need to be "in flight" in order to fully utilise the network.
For example, a 100Mbps Ethernet link with a RTT of 0.3ms:
BDP = 100,000,000 * 0.0003 = 30,000 bits or 3,750 bytes
Or a 512Kbps link with a 20ms RTT:
BDP = 512,000 * 0.020 = 10,240 bits or 1,280 bytes
Or a 2Mbps geostationary link with a 560ms RTT:
BDP = 2,000,000 * 0.560 = 1,120,000 bits or 140,000 bytes
This is a critical aspect to the operation of TCP's congestion control. The larger the BDP the larger cwnd has to be in order to allow TCP to fully utilise the network capacity. Furthermore the larger the RTT the longer it will take for TCP to reach equilibrium since the window increase is tied to the network RTT. Also the exponential increase results in larger bursts of traffic being injected into the network - in fact for a congestion window size of W bytes the buffer size at the first router needs to be at least W/2 bytes.
Another major challenge is that of fairness - the available bandwidth of the bottleneck link should be shared fairly amongst consuming flows. Likewise an existing flow should yield bandwidth to a new flow.
For TCP flows that have the same RTT and BDP, the slow-start and congestion-avoidance algorithms will result in the congestion window of each connection oscillating, which leads to a equal share of bandwidth over some period of time. However, Jacobson based TCP is RTT biased and a flow with a large RTT will get a lesser share of a bottleneck link than a flow with a smaller RTT.
Other protocols will generate IP packets, in addition to standard TCP traffic. ICMP is used as part of IP communication and other non-TCP traffic will also share the network. Examples of this include UDP based traffic and routing protocols (such as IGRP/EIGRP and OSPF).
From TCP's perspective this traffic is considered to be background traffic (aka noise) and in most situations the slow-start and congestion-avoidance algorithms will simply work around it. However, from a non-TCP perspective the aggressiveness and burstiness that results from TCP's congestion control algorithms can be a significant problem.
One of the biggest challenges in congestion control is creating protocols that play nicely with each other. Applications that use UDP as a transport protocol (or are directly encapsulated within IP) need to provide their own congestion control, since UDP does not provide this as part of the service. Obviously this should not be more aggressive than the approach used by TCP, however if it is less aggressive than TCP the protocol will not gain a fair share of the bandwidth.
Many multimedia and realtime applications generate a very different traffic profile to TCP - it is often consistent traffic with a Constant Bit Rate (CBR). Additionally the requirements for delivery of this traffic are almost the opposite to TCP. It needs to be delivered quickly, with minimal delay and lost packets are not retransmitted.
One of the more commonly used protocols for streaming video and telephony data is the Realtime Transport Protocol or RTP. RTP implements little in the way of congestion control, instead simply transmitting the stream at a constant bit rate. The RFC for RTP (RFC3550) states that implementations should monitor packet loss to ensure that packet loss remains within "acceptable parameters". The aim is to achieve equivalent throughput to a TCP flow operating under the same conditions.
A recent development is the Datagram Congestion Control Protocol or DCCP. This protocol provides a congestion-controlled transport using unreliable datagrams - in effect a UDP with built in congestion control. DCCP implements multiple congestion control mechanisms, one of which is selected by the application layer protocol depending on its requirements.
Currently two congestion control mechanisms are supported - "TCP-like Congestion Control" (RFC4341) and "TCP-Friendly Rate Control (TFRC)" (RFC4342). The first is a Variable Bit Rate (VBR) service which implements congestion control similar to TCP (hence the name), whilst the latter implements a CBR service which monitors packet loss and makes use of Explicit Congestion Notification (ECN).
Van Jacobson's infamous paper:
Congestion Avoidance and Control
(214KB PDF)
RFC 2814 - Congestion Control Principles