Network Working Group B. Lowekamp Internet-Draft B Intended status: Experimental November 06, 2009 Expires: May 10, 2010 A Semi-Reliable, Congestion Controlled Transport Over UDP for RELOAD draft-lowekamp-tsvwg-p2psip-aimd-00 Status of this Memo By submitting this Internet-Draft, each author represents that any applicable patent or other IPR claims of which he or she is aware have been or will be disclosed, and any of which he or she becomes aware will be disclosed, in accordance with Section 6 of BCP 79. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet- Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt. The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. This Internet-Draft will expire on May 10, 2010. Abstract RELOAD [ref] is the P2PSIP WG's peer-to-peer overlay network protocol. One of the main components of the protocol is the "overlay link" protocol, which is the overlay network's hop-by-hop protocol, manifested as a transport protocol as viewed in the Internet. The purpose of this draft is to present the requirements for the overlay link protocol and a proposed solution of a congestion-controlled semi-reliable transport protocol implemented using UDP. This protocol will eventually be defined in the RELOAD base protocol draft. Lowekamp Expires May 10, 2010 [Page 1] Internet-Draft RELOAD-AIMD November 2009 Requirements Language The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119]. Table of Contents 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 2. Overlay Link Requirements . . . . . . . . . . . . . . . . . . 4 3. new section . . . . . . . . . . . . . . . . . . . . . . . . . 4 4. AIMD-based Protocol . . . . . . . . . . . . . . . . . . . . . 5 4.1. Reliability for Unreliable Links . . . . . . . . . . . . . 5 4.1.1. Framed Message Format . . . . . . . . . . . . . . . . 5 4.1.2. RTO Calculation . . . . . . . . . . . . . . . . . . . 7 4.1.3. AIMD Retransmission Scheme . . . . . . . . . . . . . . 8 5. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 9 6. Security Considerations . . . . . . . . . . . . . . . . . . . 9 7. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 9 8. Normative References . . . . . . . . . . . . . . . . . . . . . 9 Author's Address . . . . . . . . . . . . . . . . . . . . . . . . . 9 Intellectual Property and Copyright Statements . . . . . . . . . . 10 Lowekamp Expires May 10, 2010 [Page 2] Internet-Draft RELOAD-AIMD November 2009 1. Introduction RELOAD defines a protocol for creating, maintaining, and using a peer-to-peer overlay network. It is implemented as simple request/ response protocol. The architecture of RELOAD is: | Internet Model | Real | Equivalent | Reload Internet | in Overlay | Architecture --------------+-----------------+------------------------------------ | | +-------+ +-------+ | Application | | SIP | | XMPP | ... | | | Usage | | Usage | | | +-------+ +-------+ | | ---------------------------------- | |+------------------+ +---------+ | Transport || Message |<--->| Storage | | || Transport | +---------+ | |+------------------+ ^ | | ^ ^ | | | | v v Application | | | +-------------------+ | (Routing) | | | Topology | | | | | Plugin | | | | +-------------------+ | | | ^ | | v v | Network | +------------------+ | | | Forwarding & | | | | Link Management | | | +------------------+ | | ---------------------------------- Transport | Link | +-------+ +------+ | | |TLS | |DTLS | ... | | +-------+ +------+ --------------+-----------------+------------------------------------ Network | | Link | Typical transactions are small. Overlay maintenance messages are typically sent between neighbors (single hop). Application data messages typically travel across the overlay network (multiple hops). In typical traffic patterns, the majority of the traffic sent across a link in the overlay will be application data. Large individual messages or streams of data across the overlay are not expected (and would be banned under typical configurations). The "Message Transport" layer handles end-to-end reliability with exponential Lowekamp Expires May 10, 2010 [Page 3] Internet-Draft RELOAD-AIMD November 2009 backoff. (Exponential backoff should have been added to the current revision, but was missed. The initial RTO should be the average response latency of recent requests across the overlay.) Fragmentation is handled at the "Forwarding and Link Management" layer. The overlay's "Link" layer is responsible for queueing discipline for the link. 2. Overlay Link Requirements The "Overlay Link" protocol provides a service delivering packets between nodes (peers) in the overlay network. The traffic it handles is similar to Internet traffic except that the typical exchange between nodes on the overlay is shorter (a single request/response) than typical Internet traffic. The requirements for this layer are essentially similar to the requirements for an Internet link layer: * Semi-reliability, to avoid unnecessary end-to-end retransmissions * datagram-based (no ordering requirements) Since it is implemented across the Internet, there is an additional requirement that it implements * Congestion control Overlay-level congestion control is handled by the end-to-end retransmission backoff and by queue management at the overlay link layer. As the typical transactions are small, we anticipate that introducing additional congestion control by limiting the rate at which a node may introduce new messages into the overlay, likely in a similar AIMD form to that proposed below. There is an additional requirement on RELOAD that the protocol must be capable of establishing connections between nodes behind NATs. Combining that with the currently available NAT traversal protocols (ICE [ref]), this essentially requires at least one overlay link protocol to running over UDP, although other overlay link protocols relying on TCP or SCTP should be developed and eventually supersede a UDP-based protocol as NAT traversal techniques and support in deployed NATs emerge for TCP and SCTP. 3. new section There is a separate proposal for TCP over UDP [ref]. This proposal hopes to rely on established implementations of TCP, although the stream-oriented nature of TCP is not necessarily appropriate for this purpose. Suggestions of other established (particularly in RFCs) alternatives for a suitable protocol is available, would be appreciated. Other alternatives solution include ICE-TCP [ice-tcp], native SCTP support in NATS [sctpnat], or SCTP over UDP [sctp-udp], although at the present times none of those solutions appear to be on track for use in RELOAD. Lowekamp Expires May 10, 2010 [Page 4] Internet-Draft RELOAD-AIMD November 2009 4. AIMD-based Protocol The proposed solution consists of a simple receiver, intended to be compatible with a variety of sender algorithms. The RELOAD currently proposes a stop and wait algorithm, but here the congestion- controlled semi-reliable AIMD algorithm proposal is presented. Please note that, unlike in TCP, the sequence numbers are neither cumulative nor are retransmissions made using the same sequence number. The following text is taken from the RELOAD base draft [ref], with editing for clarification and to improve its stand-alone presentation. 4.1. Reliability for Unreliable Links When RELOAD is carried over DTLS or another unreliable link protocol, it needs to be used with a reliability and congestion control mechanism, which is provided on a hop-by-hop basis. The basic principle is that each fragment, regardless of if it carries a request or response, will get an ACK and be reliably retransmitted. The receiver's job is very simple, limited to just sending ACKs. All the complexity is at the sender side. This allows the sending implementation to trade off performance versus implementation complexity without affecting the wire protocol. In order to support unreliable links, each fragment is wrapped in a very simple framing layer (FramedMessage) which is only used for each hop. This layer contains a sequence number which can then be used for ACKs. 4.1.1. Framed Message Format The definition of FramedMessage is: enum {data (128), ack (129), (255)} FramedMessageType; struct { FramedMessageType type; select (type) { case data: uint32 sequence; opaque message<0..2^24-1>; case ack: uint32 ack_sequence; uint32 received; }; Lowekamp Expires May 10, 2010 [Page 5] Internet-Draft RELOAD-AIMD November 2009 } FramedMessage; The type field of the PDU is set to indicate whether the fragment is data or an acknowledgement. If the fragment is of type "data", then the remainder of the PDU is as follows: sequence the sequence number. This increments by 1 for each framed fragment sent over this transport session, regardless of whether the fragment is an initial transmission or a retransmission. message the message that is being transmitted. Each connection has it own sequence number space. Initially the value is zero and it increments by exactly one for each fragment sent over that connection, including retransmissions. When the receiver receive a fragment, it SHOULD immediately send an ACK fragment. The receiver MUST keep track of the 32 most recent sequence numbers received on this association in order to generate the appropriate ACK. If the PDU is of type "ack", the contents are as follows: ack_sequence The sequence number of the fragment being acknowledged. received Lowekamp Expires May 10, 2010 [Page 6] Internet-Draft RELOAD-AIMD November 2009 A bitmask indicating if each of the previous 32 sequence numbers before this packet had been received. When a packet is received with a sequence number N, the receiver looks at the sequence number of the previously 32 packets received on this connection. Call the previously received packet number M. And for each of the previous 32 packets, if the sequence number M is less than N but greater than N-32, the N-M bit of the received bitmask is set to one otherwise it is zero. Note that a bit being set indicates a particular packet was received but if the bit is set to zero it only means it is unknown if it was received or not. It might have been received but not in the 32 most recently received window. The received field bits in the ACK provide a very high degree of redundancy for the sender to figure out which packets the receiver received and can then estimate packet loss rates. If the sender also keeps track of the time at which recent sequence numbers were sent, the RTT can be estimated. 4.1.2. RTO Calculation The RTO is an estimate of the round-trip time (RTT). Implementations can use a static value for RTO or a dynamic estimate which will result in better performance. For implementations that use a static value, the default value for RTO is 500 ms. Nodes MAY use smaller values of RTO if it is known that all nodes are are within the local network. The default RTO MAY be chosen larger, and this is RECOMMENDED if it is known in advance (such as on high latency access links) that the round-trip time is larger. Implementations that use a dynamic estimate to compute the RTO MUST use the algorithm described in RFC 2988[RFC2988], with the exceptions that value of RTO SHOULD NOT be rounded up to the nearest second but instead rounded up to the nearest millisecond. The RTT of a successful STUN transaction from the ICE stage is used as the initial measurement for formula 2.2 of RFC 2988. The sender keeps track of the time each fragment was sent for all recently sent fragments. Any time an ACK is received, the sender can compute the RTT for that fragment by looking at the time the ACK was received and when time the fragment was sent. This is used as a subsequent RTT measurement for formula 2.3 of RFC 2988 to update the RTO estimate. (Note that because retransmissions receive new sequence numbers, all received ACKs are used.) The value for RTO is calculated separately for each DTLS session. Lowekamp Expires May 10, 2010 [Page 7] Internet-Draft RELOAD-AIMD November 2009 4.1.3. AIMD Retransmission Scheme NOTE: this section is currently more descriptive than normative. Final copy of this will produce a normative section that is separate from the descriptive. This section specifies a sender retransmission algorithm based on the the AIMD algorithm in TCP. The algorithm here is only the AIMD portion of TCP. All other features are restricted to simplify the implementation, i.e. no slow start (initial window is 1) and no fast recovery. Note that because this is a datagram rather than stream- based protocol (i.e. not sliding window, no need to pause for previously lost packets), the motivation for these features are not as strong. Slow start MAY be implemented. A fragment is considered received when an ACK arrives for any of that fragment's transmissions, i.e. the sender must track all sequence numbers used to transmit the fragment. In order to simplify the implementation, this specification allows the sender to treat each unACKed fragment with separate timers used for retransmission. An implementation SHOULD implement fast retransmission, in which case a fragment is retransmitted (with a new sequence number) as soon as ACKs for three higher sequence numbers have been received, or when the fragment's timer expires. A sender implementing fast retransmission MAY maintain a single timer. Fragments are transmitted up to 5 times. When retransmitting based on timeouts, the RTO is doubled after each retransmission. (TODO: more details of timer management needed) For example, assuming an RTO of 500 ms, a fragment would be sent at times 0 ms, 500 ms, 1000 ms, 2000 ms, and 4000 ms. Retransmissions continue until a response is received or until the fragment has been transmitted 5 times, at which point the fragment is dropped after the final timeout. The sender allows w unacknowledged fragments to be outstanding at any given time. w is initially set to one. Every RTO interval that w ACKs are received, w is increased by one. If fewer than w-1 ACKs are received, but no loss has been observed, w is decreased by 1. When a loss is observed, w is halved. After reducing w, if there are more than w fragments for which an ACK is pending, no further retransmissions of the most recently initiated fragments in excess of w are performed until they fit in the window w, at which point those fragments begin the retransmission algorithm as if they were new fragments. New fragments are transmitted as normal if there are less than w outstanding fragments. w is held fixed for one RTO after being halved. After that point, the algorithm resumes adjusting w accordingly. Lowekamp Expires May 10, 2010 [Page 8] Internet-Draft RELOAD-AIMD November 2009 If w drops to one and the one pending fragment is not ACKed by the other side after 5 requests are sent, the link is considered to have failed. Otherwise, unACKed fragments are simply dropped after all their transmissions are complete, and a new fragment replaces it in the window if there is room. 5. IANA Considerations This document makes no request of IANA. Note to RFC Editor: this section may be removed on publication as an RFC. 6. Security Considerations 7. Acknowledgements 8. Normative References [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997. [RFC2988] Paxson, V. and M. Allman, "Computing TCP's Retransmission Timer", RFC 2988, November 2000. Author's Address Bruce B. Lowekamp B Lowekamp Expires May 10, 2010 [Page 9] Internet-Draft RELOAD-AIMD November 2009 Full Copyright Statement Copyright (C) The IETF Trust (2009). This document is subject to the rights, licenses and restrictions contained in BCP 78, and except as set forth therein, the authors retain all their rights. This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Intellectual Property The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights. Information on the procedures with respect to rights in RFC documents can be found in BCP 78 and BCP 79. Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr. The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard. Please address the information to the IETF at ietf-ipr@ietf.org. Lowekamp Expires May 10, 2010 [Page 10]