Network Working Group                                        B. Lowekamp
Internet-Draft                                                         B                                                     Skype
Intended status: Experimental                          November 06,                         September 14, 2009
Expires: May 10, March 18, 2010

  A Semi-Reliable, Congestion Controlled Transport Over UDP for RELOAD
                  draft-lowekamp-tsvwg-p2psip-aimd-00

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on May 10, March 18, 2010.

Abstract

   RELOAD [ref] is the P2PSIP WG's peer-to-peer overlay network
   protocol.  One of the main components of the protocol is the "overlay
   link" protocol, which is the overlay network's hop-by-hop protocol,
   manifested as a transport protocol as viewed in the Internet.  The
   purpose of this  This
   draft is to present presents the requirements for the overlay link protocol and a
   proposed solution of a congestion-controlled semi-reliable transport
   protocol implemented using over UDP.  The purpose of this draft is to
   solicit comments on the problem and solutions from the transport
   community.  This protocol will eventually be defined in the RELOAD
   base protocol draft.

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

Table of Contents

   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  3
   2.  Overlay Link Protocol Requirements . . . . . . . . . . . . . . . . . .  4
   3.  new section  . . .  Other Alternatives . . . . . . . . . . . . . . . . . . . . . .  4  5
   4.  AIMD-based Protocol  . . . . . . . . . . . . . . . . . . . . .  5
     4.1.  Reliability for Unreliable Links . . . . . . . . . . . . .  5
       4.1.1.  Framed Message Format  . . . . . . . . . . . . . . . .  5  6
       4.1.2.  RTO Calculation  . . . . . . . . . . . . . . . . . . .  7  8
       4.1.3.  AIMD Retransmission Scheme . . . . . . . . . . . . . .  8
   5.  IANA Considerations  . . . . . . . . . . . . . . . . . . . . .  9
   6.  Security Considerations  . . . . . . . . . . . . . . . . . . .  9
   7.  Acknowledgements . . . . . . . . . . . . . . . . . . . . . . .  9 10
   8.  Normative References . . . . . . . . . . . . . . . . . . . . .  9 10
   Author's Address . . . . . . . . . . . . . . . . . . . . . . . . .  9 10
   Intellectual Property and Copyright Statements . . . . . . . . . . 10 11

1.  Introduction

   RELOAD defines a protocol for creating, maintaining, and using a
   peer-to-peer overlay network.  It is implemented as simple request/
   response protocol.  The architecture of RELOAD is:

                 | Internet Model  |
     Real        |   Equivalent    |          Reload
   Internet      |   in Overlay    |       Architecture
   --------------+-----------------+------------------------------------
                 |                 |    +-------+  +-------+
                 |  Application    |    | SIP   |  | XMPP  |  ...
                 |                 |    | Usage |  | Usage |
                 |                 |    +-------+  +-------+
                 |                 |  ----------------------------------
                 |                 |+------------------+     +---------+
                 |   Transport     ||     Message      |<--->| Storage |
                 |                 ||    Transport     |     +---------+
                 |                 |+------------------+           ^
                 |                 |       ^       ^               |
                 |                 |       |       v               v
   Application   |                 |       |     +-------------------+
                 |   (Routing)     |       |     |    Topology       |
                 |                 |       |     |     Plugin        |
                 |                 |       |     +-------------------+
                 |                 |       |         ^
                 |                 |       v         v
                 |    Network      |    +------------------+
                 |                 |    |  Forwarding &    |
                 |                 |    | Link Management  |
                 |                 |    +------------------+
                 |                 |  ----------------------------------
   Transport     |      Link       |     +-------+  +------+
                 |                 |     |TLS    |  |DTLS  |  ...
                 |                 |     +-------+  +------+
   --------------+-----------------+------------------------------------
      Network    |
                 |
        Link     |

   Typical transactions are small.  Overlay maintenance messages are
   typically sent between neighbors (single hop).  Application data
   messages typically travel across the overlay network (multiple hops).
   In typical traffic patterns, the majority of the traffic sent across
   a link in the overlay will be application data.  Large individual
   messages or streams of data across the overlay are not expected (and
   would be banned under typical configurations).  The "Message
   Transport" layer handles end-to-end reliability with exponential
   backoff.  (Exponential backoff should have been added to the current -03
   revision, but was missed.  The initial RTO should be the average
   response latency of recent requests across the overlay.)
   Fragmentation is handled at the "Forwarding and Link Management"
   layer.  The overlay's "Link" layer is responsible for queueing
   discipline for the link.

   If support for larger messages across the overlay is desired, the
   "Message Transport" protocol would need to be extended to support
   end-to-end congestion control and segment retransmission across the
   overlay network.  Within the current architecture of RELOAD, this
   extension is not required as applications use the overlay network to
   locate resources and to establish direct connections with those
   resources (using conventional application protocols running directly
   across the Internet rather than through the overlay), therefore large
   messages are not sent across the overlay.

2.  Overlay Link Protocol Requirements

   The "Overlay Link" protocol provides a service delivering packets fragments
   between nodes (peers) in the overlay network.  The traffic it handles
   is similar to Internet traffic except that the typical exchange
   between nodes on the overlay is shorter (a single request/response) small request/
   response) than typical Internet traffic.  The requirements for this
   layer are essentially similar to the requirements for an Internet
   link layer: *

   o  Semi-reliability, to avoid unnecessary end-to-end retransmissions *

   o  datagram-based (no ordering requirements) Since

   Because it is implemented across the Internet, there is an additional
   requirement that it implements *

   o  Congestion control

   Overlay-level end-to-end congestion control is handled by the end-to-end Message
   Transport layer's retransmission backoff and by queue
   management the queuing discipline
   at the overlay link layer.  As the typical transactions are small, we
   anticipate that introducing additional congestion control by limiting the
   rate at which a node may introduce new messages into the overlay,
   likely in a similar AIMD form to that proposed below.

   There is an additional requirement on RELOAD that the protocol must
   be capable of establishing connections between nodes behind NATs.
   Combining that with the currently available NAT traversal protocols
   (ICE [ref]), this essentially requires at least one overlay link
   protocol to be running over UDP, although other overlay link
   protocols relying on TCP or SCTP should be developed and may
   eventually supersede a UDP-based protocol as NAT traversal techniques
   and support in deployed NATs emerge for TCP and SCTP.

3.  new section  Other Alternatives

   There is a separate proposal for TCP over UDP [ref].  This proposal
   hopes [ref], which intends to
   rely on established implementations of TCP, although and benefit from the
   years of knowledge developing TCP.  However, the stream-oriented
   nature of TCP is not necessarily appropriate for this
   purpose.  Suggestions of other established (particularly in RFCs)
   alternatives required for a suitable an overlay-link protocol is available, would be
   appreciated.  Other alternatives and may
   reduce performance of the overlay due to unnecessary head-of-line
   blocking.

   Another alternative solution include is ICE-TCP [ice-tcp], although progress
   seems to be slow on that draft, and even if native TCP NAT traversal
   becomes more successful, head-of-line blocking may still be a
   concern.

   SCTP has a number of desireable characteristics for an overlay link
   protocol.  However, at present time neither native SCTP support in NATS [sctpnat], or
   NATs [sctpnat] nor SCTP over UDP [sctp-udp],
   although at the present times none of those solutions [sctp-udp] appear to be on track to
   be fully deployable (and supported by hardware in the case of native
   SCTP through NATs) for use in RELOAD.

4.  AIMD-based Protocol

   The proposed solution consists of utilizes a simple receiver, receiver that is intended to
   be compatible with a variety of sender algorithms.  The RELOAD draft
   currently proposes a stop and wait sender algorithm, but here the congestion-
   controlled
   congestion-controlled semi-reliable AIMD algorithm proposal is
   presented.

   Please note that, unlike in TCP, the sequence numbers are neither
   cumulative nor are retransmissions made using the same sequence
   number.  The following text is taken from the RELOAD base draft
   [ref], with editing for clarification and to improve its stand-alone
   presentation.

4.1.  Reliability for Unreliable Links

   When RELOAD is carried over DTLS or another unreliable link protocol, it
   needs to be used with a reliability and congestion control mechanism,
   which is provided on a hop-by-hop basis.  The basic principle is that
   each fragment, regardless of if it carries a request or response,
   will get an ACK and be reliably retransmitted.  The receiver's job is
   very simple, limited to just sending ACKs.  All the complexity is at
   the sender sender's side.  This allows the sending implementation to trade
   off performance versus implementation complexity without affecting
   the wire protocol.

   In order to support unreliable links, each fragment is wrapped in a
   very simple framing layer (FramedMessage) which is only used for each
   hop.  This layer contains a sequence number which that can then be used for
   ACKs.

4.1.1.  Framed Message Format

   The definition of FramedMessage is:

         enum {data (128), ack (129), (255)} FramedMessageType;

         struct {
           FramedMessageType       type;

           select (type) {
             case data:
               uint32              sequence;
               opaque              message<0..2^24-1>;

             case ack:
               uint32              ack_sequence;
               uint32              received;
           };
         } FramedMessage;

   The type field of the PDU is set to indicate whether the fragment is
   data or an acknowledgement.

   If the fragment is of type "data", then the remainder of the PDU is
   as follows:

   sequence

      the sequence number.  This increments by 1 for each framed
      fragment sent over this transport session, regardless of whether
      the fragment is an initial transmission or a retransmission.

   message

      the message that is being transmitted.

   Each connection has it its own sequence number space.  Initially the
   value is zero and it increments by exactly one for each fragment sent
   over that connection, including retransmissions.  (Note that because
   all connections are encrypted with DTLS, there is no danger of
   collision here.)

   When the receiver receive receives a fragment, it SHOULD MUST immediately send an
   ACK fragment.  The receiver MUST keep track of the 32 most recent
   sequence numbers received on this association in order to generate
   the appropriate ACK.

   If the PDU is of type "ack", the contents are as follows:

   ack_sequence

      The sequence number of the fragment being acknowledged.

   received

      A bitmask indicating if each of the previous 32 sequence numbers
      before this packet had been received.  When a packet is received
      with a sequence number N, the receiver looks at the sequence
      number of the previously 32 packets received on this connection.
      Call the previously received packet number M. And for each of the
      previous 32 packets, if the sequence number M is less than N but
      greater than N-32, the N-M bit of the received bitmask is set to
      one otherwise it is zero.

      Note that a bit being set to one indicates a particular packet was
      received
      received, but if the bit is set to zero it only means it is
      unknown if it was received or not.  It might have been received
      but not in the 32 most recently received window.

   The received field bits in the ACK provide a very high degree of
   redundancy for the sender to figure out which packets the receiver
   received and can then estimate packet loss rates.  If the sender also
   keeps track of the time at which recent sequence numbers were sent,
   the RTT can be estimated.

4.1.2.  RTO Calculation

   The RTO is an estimate of the round-trip time (RTT).  Implementations
   can use a static value for RTO or a dynamic estimate which will
   result in better performance.  For implementations that use a static
   value, the default value for RTO is 500 ms.  Nodes MAY use smaller
   values of RTO if it is known that all nodes are are within the local
   network.  The default RTO MAY be chosen larger, and this is
   RECOMMENDED if it is known in advance (such as on high latency access
   links) that the round-trip time is larger.

   Implementations that use a dynamic estimate to compute the RTO MUST
   use the algorithm described in RFC 2988[RFC2988], with the exceptions
   that value of RTO SHOULD NOT be rounded up to the nearest second but
   instead rounded up to the nearest millisecond.  The RTT of a
   successful STUN transaction from the ICE stage is used as the initial
   measurement for formula 2.2 of RFC 2988.  The sender keeps track of
   the time each fragment was sent for all recently sent fragments.  Any
   time an ACK is received, the sender can compute the RTT for that
   fragment by looking at the time the ACK was received and when time
   the fragment was sent.  This is used as a subsequent RTT measurement
   for formula 2.3 of RFC 2988 to update the RTO estimate.  (Note that
   because retransmissions receive new sequence numbers, all received
   ACKs are used.)

   The value for RTO is calculated separately for each DTLS session.

4.1.3.  AIMD Retransmission Scheme

   NOTE: this section is currently more descriptive than normative.
   Final copy of this will produce a normative section that is separate
   from the descriptive.

   This section specifies a sender retransmission algorithm based on the
   the AIMD algorithm in TCP.  The algorithm here is only the AIMD
   portion of TCP.  All other features are restricted to simplify the
   implementation, i.e. no slow start (initial window is 1) and no fast
   recovery.  Note that because this is a datagram rather than stream-
   based protocol (i.e. not sliding window, no need to pause for
   previously lost packets), the motivation for these features are not
   as strong.  Slow start MAY be implemented.

   A fragment is considered received when an ACK arrives for any of that
   fragment's transmissions, i.e. the sender must track all sequence
   numbers used to transmit the fragment.

   In order to simplify the implementation, this specification allows
   the sender to treat each unACKed fragment with separate timers used
   for retransmission.  An implementation SHOULD implement fast
   retransmission, in which case a fragment is retransmitted (with a new
   sequence number) as soon as ACKs for three higher sequence numbers
   than the fragment's most recent transmission have been received, or
   when the fragment's timer expires.  A sender implementing fast
   retransmission MAY maintain a single timer.

   Fragments are transmitted up to 5 times.  When retransmitting based
   on timeouts, the RTO is doubled after each retransmission.  (TODO:
   more details of timer management needed) For example, assuming an RTO
   of 500 ms, a fragment would be sent at times 0 ms, 500 ms, 1000 ms,
   2000 ms, and 4000 ms.  Retransmissions continue until a response is
   received or until the fragment has been transmitted 5 times, at which
   point the fragment is dropped after the final timeout.

   The sender allows w unacknowledged fragments to be outstanding at any
   given time. w is initially set to one.  Every RTO interval that w
   ACKs are received, w is increased by one.  If fewer than w-1 ACKs are
   received, but no loss has been observed, w is decreased by 1.  When a
   loss is observed, w is halved.  After reducing w, if there are more
   than w fragments for which an ACK is pending, no further
   retransmissions of the most recently initiated fragments in excess of
   w are performed until they fit in the window w, at which point those
   fragments begin the retransmission algorithm as if they were new
   fragments.  New fragments are transmitted as normal if there are less
   than w outstanding fragments. w is held fixed for one RTO after being
   halved.  After that point, the algorithm resumes adjusting w
   accordingly.

   If w drops to one and the one pending fragment is not ACKed by the
   other side after 5 requests are sent, the link is considered to have
   failed.  Otherwise, unACKed fragments are simply dropped after all
   their transmissions are complete, and a new fragment replaces it in
   the window if there is room.

5.  IANA Considerations

   This document makes no request of IANA.

   Note to RFC Editor: this section may be removed on publication as an
   RFC.

6.  Security Considerations

   As all sessions of this protocol are encrypted within a DTLS
   connection, security risks should be minimal.

7.  Acknowledgements

   Cullen Jennings contributed some text used in this draft from an
   earlier version of the RELOAD document, and authorized that text to
   be incorporated here under the terms of the current IPR declaration.

8.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC2988]  Paxson, V. and M. Allman, "Computing TCP's Retransmission
              Timer", RFC 2988, November 2000.

Author's Address

   Bruce B. Lowekamp
   B
   Skype

Full Copyright Statement

   Copyright (C) The IETF Trust (2009).

   This document is subject to the rights, licenses and restrictions
   contained in BCP 78, and except as set forth therein, the authors
   retain all their rights.

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
   THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
   OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.  Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
   ietf-ipr@ietf.org.