Network Working Group B. Lowekamp Internet-DraftBSkype Intended status: ExperimentalNovember 06,September 14, 2009 Expires:May 10,March 18, 2010 A Semi-Reliable, Congestion Controlled Transport Over UDP for RELOAD draft-lowekamp-tsvwg-p2psip-aimd-00 Status of this Memo By submitting this Internet-Draft, each author represents that any applicable patent or other IPR claims of which he or she is aware have been or will be disclosed, and any of which he or she becomes aware will be disclosed, in accordance with Section 6 of BCP 79. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet- Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt. The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. This Internet-Draft will expire onMay 10,March 18, 2010. Abstract RELOAD [ref] is the P2PSIP WG's peer-to-peer overlay network protocol. One of the main components of the protocol is the "overlay link" protocol, which is the overlay network's hop-by-hop protocol, manifested as a transport protocol as viewed in the Internet.The purpose of thisThis draftis to presentpresents the requirements for the overlay link protocol and a proposed solution of a congestion-controlled semi-reliable transport protocol implementedusingover UDP. The purpose of this draft is to solicit comments on the problem and solutions from the transport community. This protocol will eventually be defined in the RELOAD base protocol draft. Requirements Language The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119]. Table of Contents 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 2. Overlay Link Protocol Requirements . . . . . . . . . . . . . .. . . .4 3.new section . . .Other Alternatives . . . . . . . . . . . . . . . . . . . . . .45 4. AIMD-based Protocol . . . . . . . . . . . . . . . . . . . . . 5 4.1. Reliability for Unreliable Links . . . . . . . . . . . . . 5 4.1.1. Framed Message Format . . . . . . . . . . . . . . . .56 4.1.2. RTO Calculation . . . . . . . . . . . . . . . . . . .78 4.1.3. AIMD Retransmission Scheme . . . . . . . . . . . . . . 8 5. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 9 6. Security Considerations . . . . . . . . . . . . . . . . . . . 9 7. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . .910 8. Normative References . . . . . . . . . . . . . . . . . . . . .910 Author's Address . . . . . . . . . . . . . . . . . . . . . . . . .910 Intellectual Property and Copyright Statements . . . . . . . . . .1011 1. Introduction RELOAD defines a protocol for creating, maintaining, and using a peer-to-peer overlay network. It is implemented as simple request/ response protocol. The architecture of RELOAD is: | Internet Model | Real | Equivalent | Reload Internet | in Overlay | Architecture --------------+-----------------+------------------------------------ | | +-------+ +-------+ | Application | | SIP | | XMPP | ... | | | Usage | | Usage | | | +-------+ +-------+ | | ---------------------------------- | |+------------------+ +---------+ | Transport || Message |<--->| Storage | | || Transport | +---------+ | |+------------------+ ^ | | ^ ^ | | | | v v Application | | | +-------------------+ | (Routing) | | | Topology | | | | | Plugin | | | | +-------------------+ | | | ^ | | v v | Network | +------------------+ | | | Forwarding & | | | | Link Management | | | +------------------+ | | ---------------------------------- Transport | Link | +-------+ +------+ | | |TLS | |DTLS | ... | | +-------+ +------+ --------------+-----------------+------------------------------------ Network | | Link | Typical transactions are small. Overlay maintenance messages are typically sent between neighbors (single hop). Application data messages typically travel across the overlay network (multiple hops). In typical traffic patterns, the majority of the traffic sent across a link in the overlay will be application data. Large individual messages or streams of data across the overlay are not expected (and would be banned under typical configurations). The "Message Transport" layer handles end-to-end reliability with exponential backoff. (Exponential backoff should have been added to thecurrent-03 revision, but was missed. The initial RTO should be the average response latency of recent requests across the overlay.) Fragmentation is handled at the "Forwarding and Link Management" layer. The overlay's "Link" layer is responsible for queueing discipline for the link. If support for larger messages across the overlay is desired, the "Message Transport" protocol would need to be extended to support end-to-end congestion control and segment retransmission across the overlay network. Within the current architecture of RELOAD, this extension is not required as applications use the overlay network to locate resources and to establish direct connections with those resources (using conventional application protocols running directly across the Internet rather than through the overlay), therefore large messages are not sent across the overlay. 2. Overlay Link Protocol Requirements The "Overlay Link" protocol provides a service deliveringpacketsfragments between nodes (peers) in the overlay network. The traffic it handles is similar to Internet traffic except that the typical exchange between nodes on the overlay is shorter (a singlerequest/response)small request/ response) than typical Internet traffic. The requirements for this layer are essentially similar to the requirements for an Internet link layer:*o Semi-reliability, to avoid unnecessary end-to-end retransmissions*o datagram-based (no ordering requirements)SinceBecause it is implemented across the Internet, there is an additional requirement that it implements*o Congestion control Overlay-level end-to-end congestion control is handled by theend-to-endMessage Transport layer's retransmission backoff andby queue managementthe queuing discipline at the overlay link layer. As the typical transactions are small, we anticipatethatintroducing additional congestion control by limiting the rate at which a node may introduce new messages into the overlay, likely in a similar AIMD form to that proposed below. There is an additional requirement on RELOAD that the protocol must be capable of establishing connections between nodes behind NATs. Combining that with the currently available NAT traversal protocols (ICE [ref]), this essentially requires at least one overlay link protocol to be running over UDP, although other overlay link protocols relying on TCP or SCTP should be developed and may eventually supersede a UDP-based protocol as NAT traversal techniques and support in deployed NATs emerge for TCP and SCTP. 3.new sectionOther Alternatives There is a separate proposal for TCP over UDP[ref]. This proposal hopes[ref], which intends to rely on established implementations of TCP,althoughand benefit from the years of knowledge developing TCP. However, the stream-oriented nature of TCP is notnecessarily appropriate for this purpose. Suggestions of other established (particularly in RFCs) alternativesrequired fora suitablean overlay-link protocolis available, would be appreciated. Other alternativesand may reduce performance of the overlay due to unnecessary head-of-line blocking. Another alternative solutionincludeis ICE-TCP [ice-tcp], although progress seems to be slow on that draft, and even if native TCP NAT traversal becomes more successful, head-of-line blocking may still be a concern. SCTP has a number of desireable characteristics for an overlay link protocol. However, at present time neither native SCTP support inNATS [sctpnat], orNATs [sctpnat] nor SCTP over UDP[sctp-udp], although at the present times none of those solutions[sctp-udp] appear to be on track to be fully deployable (and supported by hardware in the case of native SCTP through NATs) for use in RELOAD. 4. AIMD-based Protocol The proposed solutionconsists ofutilizes a simplereceiver,receiver that is intended to be compatible with a variety of sender algorithms. The RELOAD draft currently proposes a stop and wait sender algorithm, but here thecongestion- controlledcongestion-controlled semi-reliable AIMD algorithm proposal is presented. Please note that, unlike in TCP, the sequence numbers are neither cumulative nor are retransmissions made using the same sequence number. The following text is taken from the RELOAD base draft [ref], with editing for clarification and to improve its stand-alone presentation. 4.1. Reliability for Unreliable Links When RELOAD is carried over DTLS or another unreliablelinkprotocol, it needs to be used with a reliability and congestion control mechanism, which is provided on a hop-by-hop basis. The basic principle is that each fragment, regardless of if it carries a request or response, will get an ACK and be reliably retransmitted. The receiver's job is very simple, limited to just sending ACKs. All the complexity is at thesendersender's side. This allows the sending implementation to trade off performance versus implementation complexity without affecting the wire protocol. In order to support unreliable links, each fragment is wrapped in a very simple framing layer (FramedMessage) which is only used for each hop. This layer contains a sequence numberwhichthat can then be used for ACKs. 4.1.1. Framed Message Format The definition of FramedMessage is: enum {data (128), ack (129), (255)} FramedMessageType; struct { FramedMessageType type; select (type) { case data: uint32 sequence; opaque message<0..2^24-1>; case ack: uint32 ack_sequence; uint32 received; }; } FramedMessage; The type field of the PDU is set to indicate whether the fragment is data or an acknowledgement. If the fragment is of type "data", then the remainder of the PDU is as follows: sequence the sequence number. This increments by 1 for each framed fragment sent over this transport session, regardless of whether the fragment is an initial transmission or a retransmission. message the message that is being transmitted. Each connection hasitits own sequence number space. Initially the value is zero and it increments by exactly one for each fragment sent over that connection, including retransmissions. (Note that because all connections are encrypted with DTLS, there is no danger of collision here.) When the receiverreceivereceives a fragment, itSHOULDMUST immediately send an ACK fragment. The receiver MUST keep track of the 32 most recent sequence numbers received on this association in order to generate the appropriate ACK. If the PDU is of type "ack", the contents are as follows: ack_sequence The sequence number of the fragment being acknowledged. received A bitmask indicating if each of the previous 32 sequence numbers before this packet had been received. When a packet is received with a sequence number N, the receiver looks at the sequence number of the previously 32 packets received on this connection. Call the previously received packet number M. And for each of the previous 32 packets, if the sequence number M is less than N but greater than N-32, the N-M bit of the received bitmask is set to one otherwise it is zero. Note that a bit being set to one indicates a particular packet wasreceivedreceived, but if the bit is set to zero it only means it is unknown if it was received or not. It might have been received but not in the 32 most recently received window. The received field bits in the ACK provide a very high degree of redundancy for the sender to figure out which packets the receiver received and can then estimate packet loss rates. If the sender also keeps track of the time at which recent sequence numbers were sent, the RTT can be estimated. 4.1.2. RTO Calculation The RTO is an estimate of the round-trip time (RTT). Implementations can use a static value for RTO or a dynamic estimate which will result in better performance. For implementations that use a static value, the default value for RTO is 500 ms. Nodes MAY use smaller values of RTO if it is known that all nodes are are within the local network. The default RTO MAY be chosen larger, and this is RECOMMENDED if it is known in advance (such as on high latency access links) that the round-trip time is larger. Implementations that use a dynamic estimate to compute the RTO MUST use the algorithm described in RFC 2988[RFC2988], with the exceptions that value of RTO SHOULD NOT be rounded up to the nearest second but instead rounded up to the nearest millisecond. The RTT of a successful STUN transaction from the ICE stage is used as the initial measurement for formula 2.2 of RFC 2988. The sender keeps track of the time each fragment was sent for all recently sent fragments. Any time an ACK is received, the sender can compute the RTT for that fragment by looking at the time the ACK was received and when time the fragment was sent. This is used as a subsequent RTT measurement for formula 2.3 of RFC 2988 to update the RTO estimate. (Note that because retransmissions receive new sequence numbers, all received ACKs are used.) The value for RTO is calculated separately for each DTLS session. 4.1.3. AIMD Retransmission Scheme NOTE: this section is currently more descriptive than normative. Final copy of this will produce a normative section that is separate from the descriptive. This section specifies a sender retransmission algorithm based on the the AIMD algorithm in TCP. The algorithm here is only the AIMD portion of TCP. All other features are restricted to simplify the implementation, i.e. no slow start (initial window is 1) and no fast recovery. Note that because this is a datagram rather than stream- based protocol (i.e. not sliding window, no need to pause for previously lost packets), the motivation for these features are not as strong. Slow start MAY be implemented. A fragment is considered received when an ACK arrives for any of that fragment's transmissions, i.e. the sender must track all sequence numbers used to transmit the fragment. In order to simplify the implementation, this specification allows the sender to treat each unACKed fragment with separate timers used for retransmission. An implementation SHOULD implement fast retransmission, in which case a fragment is retransmitted (with a new sequence number) as soon as ACKs for three higher sequence numbers than the fragment's most recent transmission have been received, or when the fragment's timer expires. A sender implementing fast retransmission MAY maintain a single timer. Fragments are transmitted up to 5 times. When retransmitting based on timeouts, the RTO is doubled after each retransmission. (TODO: more details of timer management needed) For example, assuming an RTO of 500 ms, a fragment would be sent at times 0 ms, 500 ms, 1000 ms, 2000 ms, and 4000 ms. Retransmissions continue until a response is received or until the fragment has been transmitted 5 times, at which point the fragment is dropped after the final timeout. The sender allows w unacknowledged fragments to be outstanding at any given time. w is initially set to one. Every RTO interval that w ACKs are received, w is increased by one. If fewer than w-1 ACKs are received, but no loss has been observed, w is decreased by 1. When a loss is observed, w is halved. After reducing w, if there are more than w fragments for which an ACK is pending, no further retransmissions of the most recently initiated fragments in excess of w are performed until they fit in the window w, at which point those fragments begin the retransmission algorithm as if they were new fragments. New fragments are transmitted as normal if there are less than w outstanding fragments. w is held fixed for one RTO after being halved. After that point, the algorithm resumes adjusting w accordingly. If w drops to one and the one pending fragment is not ACKed by the other side after 5 requests are sent, the link is considered to have failed. Otherwise, unACKed fragments are simply dropped after all their transmissions are complete, and a new fragment replaces it in the window if there is room. 5. IANA Considerations This document makes no request of IANA.Note to RFC Editor: this section may be removed on publication as an RFC.6. Security Considerations As all sessions of this protocol are encrypted within a DTLS connection, security risks should be minimal. 7. Acknowledgements Cullen Jennings contributed some text used in this draft from an earlier version of the RELOAD document, and authorized that text to be incorporated here under the terms of the current IPR declaration. 8. Normative References [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997. [RFC2988] Paxson, V. and M. Allman, "Computing TCP's Retransmission Timer", RFC 2988, November 2000. Author's Address Bruce B. LowekampBSkype Full Copyright Statement Copyright (C) The IETF Trust (2009). This document is subject to the rights, licenses and restrictions contained in BCP 78, and except as set forth therein, the authors retain all their rights. This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Intellectual Property The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights. Information on the procedures with respect to rights in RFC documents can be found in BCP 78 and BCP 79. Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr. The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard. Please address the information to the IETF at ietf-ipr@ietf.org.