A Semi-Reliable, Congestion Controlled
Transport Over UDP for RELOAD

Skype

RELOAD [ref] is the P2PSIP WG's peer-to-peer overlay network
protocol. One of the main components of the protocol is the "overlay
link" protocol, which is the overlay network's hop-by-hop protocol,
manifested as a transport protocol as viewed in the Internet. This draft
presents the requirements for the overlay link protocol and a proposed
solution of a congestion-controlled semi-reliable transport protocol
implemented over UDP. The purpose of this draft is to solicit comments
on the problem and solutions from the transport community. This protocol
will eventually be defined in the RELOAD base protocol draft.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119.

RELOAD defines a protocol for creating, maintaining, and using a
peer-to-peer overlay network. It is implemented as a simple
request/response protocol. The architecture of RELOAD is:

[Figure: RELOAD architecture mapped onto the Internet model. The
Storage, Message Transport, Topology Plugin, and Forwarding & Link
Management components occupy the Internet application layer and form
the overlay's Transport, (Routing), and Network layers. The overlay's
Link layer (TLS, DTLS, ...) sits at the Internet transport layer, above
the Internet network and link layers.]

Typical transactions are small. Overlay maintenance messages are
typically sent between neighbors (single hop). Application data messages
typically travel across the overlay network (multiple hops). In typical
traffic patterns, the majority of the traffic sent across a link in the
overlay will be application data. Large individual messages or streams
of data across the overlay are not expected (and would be banned under
typical configurations). The "Message Transport" layer handles
end-to-end reliability with exponential backoff. (Exponential backoff
should have been added to the -03 revision, but was missed. The initial
RTO should be the average response latency of recent requests across the
overlay.) Fragmentation is handled at the "Forwarding and Link
Management" layer. The overlay's "Link" layer is responsible for
queueing discipline for the link.

If support for larger messages across the overlay is desired, the
"Message Transport" protocol would need to be extended to support
end-to-end congestion control and segment retransmission across the
overlay network. Within the current architecture of RELOAD, this
extension is not required, because applications use the overlay network
to locate resources and to establish direct connections with those
resources (using conventional application protocols running directly
across the Internet rather than through the overlay). Large messages
are therefore not sent across the overlay.

The "Overlay Link" protocol provides a service delivering fragments
between nodes (peers) in the overlay network. The traffic it handles is
similar to Internet traffic except that the typical exchange between
nodes on the overlay is shorter (a single small request/response) than
typical Internet traffic. The requirements for this layer are
essentially similar to the requirements for an Internet link layer:

   o  Semi-reliability, to avoid unnecessary end-to-end retransmissions

   o  Datagram-based (no ordering requirements)

Because it is implemented across the Internet, there is an additional
requirement that it implements:

   o  Congestion control

Overlay-level end-to-end congestion control is handled by the Message
Transport layer's retransmission backoff and the queuing discipline at
the overlay link layer. As the typical transactions are small, we
anticipate introducing additional congestion control by limiting the
rate at which a node may introduce new messages into the overlay, likely
in a similar AIMD form to that proposed below.

There is an additional requirement on RELOAD that the protocol must
be capable of establishing connections between nodes behind NATs.
Combining that with the currently available NAT traversal protocols (ICE
[ref]), this essentially requires at least one overlay link protocol to
be running over UDP, although other overlay link protocols relying on
TCP or SCTP should be developed and may eventually supersede a UDP-based
protocol as NAT traversal techniques and support in deployed NATs emerge
for TCP and SCTP.

There is a separate proposal for TCP over UDP [ref], which intends to
rely on established implementations of TCP, and benefit from the years
of knowledge developing TCP. However, the stream-oriented nature of TCP
is not required for an overlay-link protocol and may reduce performance
of the overlay due to unnecessary head-of-line blocking.

Another alternative solution is ICE-TCP [ice-tcp], although progress
seems to be slow on that draft, and even if native TCP NAT traversal
becomes more successful, head-of-line blocking may still be a
concern.

SCTP has a number of desirable characteristics for an overlay link
protocol. However, at the present time neither native SCTP support in
NATs [sctpnat] nor SCTP over UDP [sctp-udp] appears to be on track to be
fully deployable (and supported by hardware in the case of native SCTP
through NATs) for use in RELOAD.

The proposed solution utilizes a simple receiver that is intended to
be compatible with a variety of sender algorithms. The RELOAD draft
currently proposes a stop-and-wait sender algorithm, but here the
congestion-controlled semi-reliable AIMD algorithm proposal is
presented.

Please note that, unlike in TCP, the sequence numbers are not
cumulative, and retransmissions do not reuse the sequence number of the
original transmission.
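As a rough illustration of what this numbering implies for a sender,
the following Python sketch assigns a fresh sequence number to every
transmission and maps each one back to the fragment it carried, so that
an ACK for any transmission identifies both the fragment and the send
time to compare against. The class and method names (SenderLedger,
transmit, on_ack) and the use of integer millisecond timestamps are
assumptions made for illustration; they are not defined by RELOAD.

```python
from dataclasses import dataclass, field


@dataclass
class SenderLedger:
    """Hypothetical sender-side bookkeeping; all names are illustrative."""
    next_seq: int = 0  # per-connection counter, starts at zero
    in_flight: dict = field(default_factory=dict)  # seq -> (fragment_id, sent_at_ms)
    acked_fragments: set = field(default_factory=set)

    def transmit(self, fragment_id, now_ms):
        """Record one transmission (initial or retransmission); return its sequence number."""
        seq = self.next_seq
        self.next_seq += 1  # every transmission consumes a new number
        self.in_flight[seq] = (fragment_id, now_ms)
        return seq

    def on_ack(self, ack_sequence, now_ms):
        """Return (fragment_id, rtt_ms) for a known sequence number, else None."""
        entry = self.in_flight.pop(ack_sequence, None)
        if entry is None:
            return None
        fragment_id, sent_at_ms = entry
        # An ACK for any one transmission marks the whole fragment as received.
        self.acked_fragments.add(fragment_id)
        return fragment_id, now_ms - sent_at_ms
```

For example, if a fragment is sent at 0 ms with sequence number 0 and
retransmitted at 500 ms with sequence number 1, an ACK for sequence
number 1 arriving at 620 ms yields an unambiguous 120 ms round-trip
sample.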

The following text is taken from the RELOAD base draft [ref], with
editing for clarification and to improve its stand-alone
presentation.

When RELOAD is carried over DTLS or another unreliable protocol, it
needs to be used with a reliability and congestion control mechanism,
which is provided on a hop-by-hop basis. The basic principle is that
each fragment, regardless of whether it carries a request or response,
will get an ACK and be reliably retransmitted. The receiver's job is
very simple, limited to just sending ACKs. All the complexity is at the
sender's side. This allows the sending implementation to trade off
performance versus implementation complexity without affecting the
wire protocol.

In order to support unreliable links, each fragment is wrapped in a
very simple framing layer (FramedMessage) which is only used for each
hop. This layer contains a sequence number that can then be used for
ACKs.

The definition of FramedMessage is:

            ...;
          case ack:
            uint32              ack_sequence;
            uint32              received;
        };
      } FramedMessage;
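To make the ack case more concrete, here is a hedged Python sketch of
how a receiver might compute the received bitmask, following the rule
given for that field in this document, and serialize the two ack
fields. It assumes that "the N-M bit" means the bit with value
1 << (N - M), that the fields are encoded in network byte order, and
that sequence numbers do not wrap. The function names are hypothetical,
and the type octet is omitted because its code point is not shown here.

```python
import struct


def build_received_mask(n, recent_seqs):
    """Build the 'received' bitmask for an ACK of sequence number n.

    recent_seqs holds the (up to 32) sequence numbers most recently
    received on this connection before n arrived.
    Assumption: bit N-M is the bit with value 1 << (N - M).
    """
    mask = 0
    for m in recent_seqs:
        if n - 32 < m < n:  # M less than N but greater than N-32
            mask |= 1 << (n - m)
    return mask


def pack_ack_body(ack_sequence, received):
    """Serialize the two uint32 ack fields, big-endian (assumed wire order)."""
    return struct.pack("!II", ack_sequence, received)
```

A zero bit in the result does not mean the packet was lost, only that
it was not among the sequence numbers the receiver was still tracking.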
The type field of the PDU is set to indicate whether the fragment
is data or an acknowledgement.

If the fragment is of type "data", then the remainder of the PDU
is as follows:

   o  The sequence number. This increments by 1 for each framed
      fragment sent over this transport session, regardless of whether
      the fragment is an initial transmission or a retransmission.

   o  The message that is being transmitted.

Each connection has its own sequence number space. Initially the
value is zero and it increments by exactly one for each fragment
sent over that connection, including retransmissions. (Note that
because all connections are encrypted with DTLS, there is no danger
of collision here.)

When the receiver receives a fragment, it MUST immediately send
an ACK fragment. The receiver MUST keep track of the 32 most recent
sequence numbers received on this association in order to generate
the appropriate ACK.

If the PDU is of type "ack", the contents are as follows:

   o  ack_sequence: The sequence number of the fragment being
      acknowledged.

   o  received: A bitmask indicating whether each of the 32 sequence
      numbers before this packet has been received. When a packet is
      received with a sequence number N, the receiver looks at the
      sequence numbers of the previous 32 packets received on this
      connection. For each of those packets, call its sequence number
      M; if M is less than N but greater than N-32, the N-M bit of the
      received bitmask is set to one; otherwise it is zero.

      Note that a bit being set to one indicates that a particular
      packet was received, but a bit set to zero means only that it is
      unknown whether the packet was received. It might have been
      received, but not among the 32 most recently received packets.

The received field bits in the ACK provide a very high degree of
redundancy, allowing the sender to determine which packets the receiver
received and then to estimate packet loss rates. If the sender also
keeps track of the time at which recent sequence numbers were sent,
the RTT can be estimated.

The RTO is an estimate of the round-trip time (RTT).
Implementations can use a static value for RTO or a dynamic estimate
which will result in better performance. For implementations that
use a static value, the default value for RTO is 500 ms. Nodes MAY
use smaller values of RTO if it is known that all nodes are
within the local network. The default RTO MAY be chosen larger, and
this is RECOMMENDED if it is known in advance (such as on
high-latency access links) that the round-trip time is larger.

Implementations that use a dynamic estimate to compute the RTO
MUST use the algorithm described in RFC 2988, with the exception that
the value of RTO
SHOULD NOT be rounded up to the nearest second but instead rounded
up to the nearest millisecond. The RTT of a successful STUN
transaction from the ICE stage is used as the initial measurement
for formula 2.2 of RFC 2988. The sender keeps track of the time each
fragment was sent for all recently sent fragments. Any time an ACK
is received, the sender can compute the RTT for that fragment by
comparing the time the ACK was received with the time the fragment
was sent. This is used as a subsequent RTT measurement for formula
2.3 of RFC 2988 to update the RTO estimate. (Note that because
retransmissions receive new sequence numbers, all received ACKs are
used.)

The value for RTO is calculated separately for each DTLS
session.

NOTE: this section is currently more descriptive than normative.
The final copy will include a normative section that is separate
from the descriptive text.

This section specifies a sender retransmission algorithm based on
the AIMD algorithm in TCP. The algorithm here is only the AIMD
portion of TCP. All other features are restricted to simplify the
implementation, i.e. no slow start (initial window is 1) and no fast
recovery. Note that because this is a datagram rather than
stream-based protocol (i.e. not sliding window, no need to pause for
previously lost packets), the motivation for these features is not
as strong. Slow start MAY be implemented.

A fragment is considered received when an ACK arrives for any of
that fragment's transmissions, i.e. the sender must track all
sequence numbers used to transmit the fragment.

In order to simplify the implementation, this specification
allows the sender to maintain a separate retransmission timer for
each unACKed fragment. An implementation SHOULD implement
fast retransmission, in which case a fragment is retransmitted (with
a new sequence number) as soon as ACKs for three higher sequence
numbers than the fragment's most recent transmission have been
received, or when the fragment's timer expires. A sender
implementing fast retransmission MAY maintain a single timer.

Fragments are transmitted up to 5 times. When retransmitting
based on timeouts, the RTO is doubled after each retransmission.
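The timeout schedule can be sketched as follows. This minimal Python
example assumes that each timeout is measured from the most recent
transmission, that no ACK arrives, and that fast retransmission is not
triggered; the function name and return shape are illustrative only.

```python
def transmission_offsets(initial_rto_ms, max_transmissions=5):
    """Return (send_offsets_ms, drop_offset_ms) relative to the first send.

    Assumption: the timer for each transmission starts when that
    transmission is sent, and the RTO doubles for the next one.
    """
    offsets = []
    t = 0
    rto = initial_rto_ms
    for _ in range(max_transmissions):
        offsets.append(t)  # this transmission goes out at offset t
        t += rto           # wait one RTO for an ACK
        rto *= 2           # back off before the next retransmission
    # After the last transmission, t is when the final timeout expires.
    return offsets, t
```

Under these assumptions the gaps between successive transmissions are
RTO, 2*RTO, 4*RTO, and 8*RTO, and the fragment is dropped 16*RTO after
its fifth transmission.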
(TODO: more details of timer management needed) For example,
assuming an RTO of 500 ms, a fragment would be sent at times 0 ms,
500 ms, 1500 ms, 3500 ms, and 7500 ms. Retransmissions continue
until an ACK is received or until the fragment has been
transmitted 5 times, at which point the fragment is dropped after
the final timeout.

The sender allows w unacknowledged fragments to be outstanding at
any given time. w is initially set to one. For every RTO interval in
which w ACKs are received, w is increased by one. If fewer than w-1
ACKs are received, but no loss has been observed, w is decreased by 1.
When a loss is observed, w is halved. After reducing w, if there are
more than w fragments for which an ACK is pending, no further
retransmissions of the most recently initiated fragments in excess
of w are performed until they fit in the window w, at which point
those fragments begin the retransmission algorithm as if they were
new fragments. New fragments are transmitted as normal if there are
fewer than w outstanding fragments. w is held fixed for one RTO after
being halved. After that point, the algorithm resumes adjusting w
accordingly.

If w drops to one and the one pending fragment is not ACKed by
the other side after 5 transmissions, the link is considered to
have failed. Otherwise, unACKed fragments are simply dropped after
all their transmissions are complete, and a new fragment replaces
each in the window if there is room.

This document makes no request of IANA.

As all sessions of this protocol are encrypted within a DTLS
connection, security risks should be minimal.

Cullen Jennings contributed some text used in this draft from an
earlier version of the RELOAD document, and authorized that text to be
incorporated here under the terms of the current IPR declaration.