A Semi-Reliable, Congestion Controlled Transport Over UDP for RELOAD Skype RELOAD [ref] is the P2PSIP WG's peer-to-peer overlay network protocol. One of the main components of the protocol is the "overlay link" protocol, which is the overlay network's hop-by-hop protocol, manifested as a transport protocol as viewed in the Internet. This draft presents the requirements for the overlay link protocol and a proposed solution of a congestion-controlled semi-reliable transport protocol implemented over UDP. The purpose of this draft is to solicit comments on the problem and solutions from the transport community. This protocol will eventually be defined in the RELOAD base protocol draft. The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
RELOAD defines a protocol for creating, maintaining, and using a peer-to-peer overlay network. It is implemented as simple request/response protocol. The architecture of RELOAD is:
| Storage | | || Transport | +---------+ | |+------------------+ ^ | | ^ ^ | | | | v v Application | | | +-------------------+ | (Routing) | | | Topology | | | | | Plugin | | | | +-------------------+ | | | ^ | | v v | Network | +------------------+ | | | Forwarding & | | | | Link Management | | | +------------------+ | | ---------------------------------- Transport | Link | +-------+ +------+ | | |TLS | |DTLS | ... | | +-------+ +------+ --------------+-----------------+------------------------------------ Network | | Link | ]]>
Typical transactions are small. Overlay maintenance messages are typically sent between neighbors (single hop). Application data messages typically travel across the overlay network (multiple hops). In typical traffic patterns, the majority of the traffic sent across a link in the overlay will be application data. Large individual messages or streams of data across the overlay are not expected (and would be banned under typical configurations). The "Message Transport" layer handles end-to-end reliability with exponential backoff. (Exponential backoff should have been added to the -03 revision, but was missed. The initial RTO should be the average response latency of recent requests across the overlay.) Fragmentation is handled at the "Forwarding and Link Management" layer. The overlay's "Link" layer is responsible for queueing discipline for the link. If support for larger messages across the overlay is desired, the "Message Transport" protocol would need to be extended to support end-to-end congestion control and segment retransmission across the overlay network. Within the current architecture of RELOAD, this extension is not required as applications use the overlay network to locate resources and to establish direct connections with those resources (using conventional application protocols running directly across the Internet rather than through the overlay), therefore large messages are not sent across the overlay.
The "Overlay Link" protocol provides a service delivering fragments between nodes (peers) in the overlay network. The traffic it handles is similar to Internet traffic except that the typical exchange between nodes on the overlay is shorter (a single small request/response) than typical Internet traffic. The requirements for this layer are essentially similar to the requirements for an Internet link layer: Semi-reliability, to avoid unnecessary end-to-end retransmissions datagram-based (no ordering requirements) Because it is implemented across the Internet, there is an additional requirement that it implements Congestion control Overlay-level end-to-end congestion control is handled by the Message Transport layer's retransmission backoff and the queuing discipline at the overlay link layer. As the typical transactions are small, we anticipate introducing additional congestion control by limiting the rate at which a node may introduce new messages into the overlay, likely in a similar AIMD form to that proposed below. There is an additional requirement on RELOAD that the protocol must be capable of establishing connections between nodes behind NATs. Combining that with the currently available NAT traversal protocols (ICE [ref]), this essentially requires at least one overlay link protocol to be running over UDP, although other overlay link protocols relying on TCP or SCTP should be developed and may eventually supersede a UDP-based protocol as NAT traversal techniques and support in deployed NATs emerge for TCP and SCTP.
There is a separate proposal for TCP over UDP [ref], which intends to rely on established implementations of TCP, and benefit from the years of knowledge developing TCP. However, the stream-oriented nature of TCP is not required for an overlay-link protocol and may reduce performance of the overlay due to unnecessary head-of-line blocking. Another alternative solution is ICE-TCP [ice-tcp], although progress seems to be slow on that draft, and even if native TCP NAT traversal becomes more successful, head-of-line blocking may still be a concern. SCTP has a number of desireable characteristics for an overlay link protocol. However, at present time neither native SCTP support in NATs [sctpnat] nor SCTP over UDP [sctp-udp] appear to be on track to be fully deployable (and supported by hardware in the case of native SCTP through NATs) for use in RELOAD.
The proposed solution utilizes a simple receiver that is intended to be compatible with a variety of sender algorithms. The RELOAD draft currently proposes a stop and wait sender algorithm, but here the congestion-controlled semi-reliable AIMD algorithm proposal is presented. Please note that, unlike in TCP, the sequence numbers are neither cumulative nor are retransmissions made using the same sequence number. The following text is taken from the RELOAD base draft [ref], with editing for clarification and to improve its stand-alone presentation.
When RELOAD is carried over DTLS or another unreliable protocol, it needs to be used with a reliability and congestion control mechanism, which is provided on a hop-by-hop basis. The basic principle is that each fragment, regardless of if it carries a request or response, will get an ACK and be reliably retransmitted. The receiver's job is very simple, limited to just sending ACKs. All the complexity is at the sender's side. This allows the sending implementation to trade off performance versus implementation complexity without affecting the wire protocol. In order to support unreliable links, each fragment is wrapped in a very simple framing layer (FramedMessage) which is only used for each hop. This layer contains a sequence number that can then be used for ACKs.
The definition of FramedMessage is:
; case ack: uint32 ack_sequence; uint32 received; }; } FramedMessage; ]]>
The type field of the PDU is set to indicate whether the fragment is data or an acknowledgement. If the fragment is of type "data", then the remainder of the PDU is as follows: the sequence number. This increments by 1 for each framed fragment sent over this transport session, regardless of whether the fragment is an initial transmission or a retransmission. the message that is being transmitted. Each connection has its own sequence number space. Initially the value is zero and it increments by exactly one for each fragment sent over that connection, including retransmissions. (Note that because all connections are encrypted with DTLS, there is no danger of collision here.) When the receiver receives a fragment, it MUST immediately send an ACK fragment. The receiver MUST keep track of the 32 most recent sequence numbers received on this association in order to generate the appropriate ACK. If the PDU is of type "ack", the contents are as follows: The sequence number of the fragment being acknowledged. A bitmask indicating if each of the previous 32 sequence numbers before this packet had been received. When a packet is received with a sequence number N, the receiver looks at the sequence number of the previously 32 packets received on this connection. Call the previously received packet number M. And for each of the previous 32 packets, if the sequence number M is less than N but greater than N-32, the N-M bit of the received bitmask is set to one otherwise it is zero. Note that a bit being set to one indicates a particular packet was received, but if the bit is set to zero it only means it is unknown if it was received or not. It might have been received but not in the 32 most recently received window. The received field bits in the ACK provide a very high degree of redundancy for the sender to figure out which packets the receiver received and can then estimate packet loss rates. If the sender also keeps track of the time at which recent sequence numbers were sent, the RTT can be estimated.
The RTO is an estimate of the round-trip time (RTT). Implementations can use a static value for RTO or a dynamic estimate which will result in better performance. For implementations that use a static value, the default value for RTO is 500 ms. Nodes MAY use smaller values of RTO if it is known that all nodes are are within the local network. The default RTO MAY be chosen larger, and this is RECOMMENDED if it is known in advance (such as on high latency access links) that the round-trip time is larger. Implementations that use a dynamic estimate to compute the RTO MUST use the algorithm described in RFC 2988, with the exceptions that value of RTO SHOULD NOT be rounded up to the nearest second but instead rounded up to the nearest millisecond. The RTT of a successful STUN transaction from the ICE stage is used as the initial measurement for formula 2.2 of RFC 2988. The sender keeps track of the time each fragment was sent for all recently sent fragments. Any time an ACK is received, the sender can compute the RTT for that fragment by looking at the time the ACK was received and when time the fragment was sent. This is used as a subsequent RTT measurement for formula 2.3 of RFC 2988 to update the RTO estimate. (Note that because retransmissions receive new sequence numbers, all received ACKs are used.) The value for RTO is calculated separately for each DTLS session.
NOTE: this section is currently more descriptive than normative. Final copy of this will produce a normative section that is separate from the descriptive. This section specifies a sender retransmission algorithm based on the the AIMD algorithm in TCP. The algorithm here is only the AIMD portion of TCP. All other features are restricted to simplify the implementation, i.e. no slow start (initial window is 1) and no fast recovery. Note that because this is a datagram rather than stream-based protocol (i.e. not sliding window, no need to pause for previously lost packets), the motivation for these features are not as strong. Slow start MAY be implemented. A fragment is considered received when an ACK arrives for any of that fragment's transmissions, i.e. the sender must track all sequence numbers used to transmit the fragment. In order to simplify the implementation, this specification allows the sender to treat each unACKed fragment with separate timers used for retransmission. An implementation SHOULD implement fast retransmission, in which case a fragment is retransmitted (with a new sequence number) as soon as ACKs for three higher sequence numbers than the fragment's most recent transmission have been received, or when the fragment's timer expires. A sender implementing fast retransmission MAY maintain a single timer. Fragments are transmitted up to 5 times. When retransmitting based on timeouts, the RTO is doubled after each retransmission. (TODO: more details of timer management needed) For example, assuming an RTO of 500 ms, a fragment would be sent at times 0 ms, 500 ms, 1000 ms, 2000 ms, and 4000 ms. Retransmissions continue until a response is received or until the fragment has been transmitted 5 times, at which point the fragment is dropped after the final timeout. The sender allows w unacknowledged fragments to be outstanding at any given time. w is initially set to one. Every RTO interval that w ACKs are received, w is increased by one. If fewer than w-1 ACKs are received, but no loss has been observed, w is decreased by 1. When a loss is observed, w is halved. After reducing w, if there are more than w fragments for which an ACK is pending, no further retransmissions of the most recently initiated fragments in excess of w are performed until they fit in the window w, at which point those fragments begin the retransmission algorithm as if they were new fragments. New fragments are transmitted as normal if there are less than w outstanding fragments. w is held fixed for one RTO after being halved. After that point, the algorithm resumes adjusting w accordingly. If w drops to one and the one pending fragment is not ACKed by the other side after 5 requests are sent, the link is considered to have failed. Otherwise, unACKed fragments are simply dropped after all their transmissions are complete, and a new fragment replaces it in the window if there is room.
This document makes no request of IANA.
As all sessions of this protocol are encrypted within a DTLS connection, security risks should be minimal.
Cullen Jennings contributed some text used in this draft from an earlier version of the RELOAD document, and authorized that text to be incorporated here under the terms of the current IPR declaration.