Binary Syntax for SIP Common Log Format
Tekelec, 17210 Campbell Rd., Suite 250, Dallas, TX 75252, US
adam@nostrum.com
Real Time Applications and Infrastructure
This document proposes a binary syntax for the SIP common log format (CLF). It does not cover semantic issues, and is meant to be evaluated in the context of the other efforts discussing SIP CLF.
The Common Log File (CLF) format for the Session Initiation Protocol (SIP) proposes a syntax for logging SIP messages received and sent by SIP clients, servers, and proxies. The syntax proposed by that document was inspired by the common HTTP log format. However, experience with that format has shown that dealing with large quantities of log data can be very processor intensive, as doing so necessarily requires reading and parsing every byte in the log file(s) of interest. This document counter-proposes a format that is no more difficult for logging entities to generate, while being radically faster to process. In particular, the format is optimized both for rapidly scanning through log records and for quickly locating commonly accessed data fields. Both operations can be performed in constant time (as compared with the O(n) time associated with the current format, where n is the length of the log record). Further, the format proposed by this document retains the ability to be read by humans and processed using traditional Unix text processing tools, such as sed, awk, perl, cut, and grep.
Each data record is encoded according to the following format. Note that "hexadecimal encoded" indicates that the value is to be written out as a human-readable base-16 number using the ASCII characters 0x30 through 0x39 and 0x41 through 0x46 ('0' through '9' and 'A' through 'F'). Similarly, "decimal encoded" indicates that the value is to be written out as a human-readable base-10 number using the ASCII characters 0x30 through 0x39 ('0' through '9'). In both encodings, numbers always take up the number of bytes indicated, and are padded on the left with ASCII '0' characters to fill the entire space.
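[Figure: byte-level layout of a log record. Only the end of the diagram is recoverable here: in each Tag/Length/Value entry, the Length field is followed by a comma (0x2C) and then the Value; entries repeat as many times as necessary; and the record ends with a line feed (0x0A).]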
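The fixed-width number encodings described above can be sketched in a few lines of Python. This is only an illustration, not part of the proposed format; the function names `encode_hex` and `encode_dec` are hypothetical, and the choice to raise an error when a value does not fit its field is an assumption, since the text does not say how overflow should be handled.

```python
def _pad(digits: str, value: int, width: int) -> str:
    """Left-pad a digit string with ASCII '0' so it fills exactly `width` bytes."""
    if value < 0:
        raise ValueError("only non-negative values can be encoded")
    if len(digits) > width:
        # Assumption: overflow is treated as an error (not specified in the text).
        raise ValueError(f"{value} does not fit in {width} bytes")
    return digits.rjust(width, "0")


def encode_hex(value: int, width: int) -> str:
    """Hexadecimal encoding: characters '0'-'9' and upper-case 'A'-'F'."""
    return _pad(format(value, "X"), value, width)


def encode_dec(value: int, width: int) -> str:
    """Decimal encoding: characters '0'-'9'."""
    return _pad(format(value, "d"), value, width)
```

For instance, a length of 26 written into a four-byte hexadecimal field becomes `001A`, which matches the length that accompanies the 26-character Request URI in the example record later in this document.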
First, an 80-byte header indicates meta-data about the record. Note that the field lengths encoded in the header do not include the ASCII tab characters used to separate fields from each other.

Version: 0x41 for this document.

Flags:

   Request/Response flag (R = request, r = response)

   Retransmission flag (o = original transmission; d = duplicate transmission; s = server is stateless [i.e., retransmissions are not detected])

   Sent/Received flag (r = message received, s = message sent)

Record Length: Hexadecimal-encoded total length of this log record, including the "Flags" and "Record Length" fields and the terminating line feed.

Bytes 12 through 72 contain hexadecimal-encoded pointer/length pairs that point to the values of variable-length mandatory fields. The "Pointer" fields indicate absolute byte values within the record, and must be >= 103. They point to the start of the corresponding value within the "Mandatory Fields" area. The "Length" fields indicate the length of the corresponding value. The final pointer, "TLV Start Pointer," points to the ASCII Tab (0x09) character for the first entry in the Tag/Length/Value area; if no such entries are present, this value is set to zero. Note that the "Length" fields do not include the tab delimiters between fields. Further note that there are no delimiters between these pointer/length values -- they are packed together as a single, 68-character hexadecimal-encoded string.

Following the pointer/length pairs, several fixed-length fields are encoded. As before, all fields are completely filled, with values padded on the left with '0' characters as necessary.

   Date/Time: Seconds since midnight, January 1st, 1970, GMT; decimal encoded.

   Microseconds since the time in the Date/Time field; decimal encoded.

   CSeq number from the SIP message; decimal encoded.

   Set to the value of the response code for responses. Set to 0 for requests. Decimal encoded.

Mandatory Fields: Contains the actual values for the mandatory fields. This data must appear in the order listed, and each field must be present.
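As an illustration of how the packed pointer/length string supports constant-time field access, the following Python sketch decodes it. It rests on two assumptions that the text implies but does not state outright: that there are eight pointer/length pairs (one per mandatory field) followed by the TLV start pointer, and that each value occupies four hexadecimal digits (17 values of 4 digits give the stated 68 characters). The names `parse_index` and `slice_field` are hypothetical, and `slice_field` further assumes that pointers are zero-based offsets from the first byte of the record.

```python
MANDATORY_FIELDS = 8  # assumption: one pointer/length pair per mandatory field
HEX_WIDTH = 4         # assumption: 4 hex digits per value, since 17 * 4 == 68


def parse_index(packed: str):
    """Split the packed hex string into (pointer, length) pairs and the TLV pointer."""
    expected = (2 * MANDATORY_FIELDS + 1) * HEX_WIDTH
    if len(packed) != expected:
        raise ValueError(f"expected {expected} characters, got {len(packed)}")
    values = [int(packed[i:i + HEX_WIDTH], 16)
              for i in range(0, expected, HEX_WIDTH)]
    # Even positions hold pointers, odd positions hold lengths; the last value
    # is the TLV start pointer (zero when no TLV entries are present).
    pairs = list(zip(values[0:-1:2], values[1:-1:2]))
    return pairs, values[-1]


def slice_field(record: bytes, pointer: int, length: int) -> bytes:
    """Fetch one value without scanning the record (zero-based offset assumed)."""
    return record[pointer:pointer + length]
```

Because every pointer and length sits at a fixed position in the header, a reader can jump straight to any mandatory field, or past the whole record using the record length, without parsing the bytes in between.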
Fields are separated by a single ASCII Tab character (0x09). Any tab characters present in the data to be written will be replaced by an ASCII space character (0x20) prior to being logged. The mandatory fields are, in order:

   1. The transaction identifier associated with the server transaction. Implementations MAY reuse the server transaction identifier (the topmost branch-id of the incoming request, with or without the magic cookie), or they MAY generate a unique identification string for a server transaction (this identifier needs to be locally unique to the server only). This identifier is used to correlate ACKs and CANCELs to an INVITE transaction; it is also used to aid in forking.

   2. This field is used to associate client transactions with a server transaction for forking proxies or B2BUAs.

   3. In requests, the method from the start line. In responses, the method found in the CSeq header field.

   4. Value of the To header field, possibly with the tag parameter removed. (Whether to remove the tag parameter is left up to the logging entity.)

   5. Value of the To header field tag parameter. If no To header field tag parameter is present, the pointer field is ignored; the length field is set to 0; and the field in the mandatory section is encoded as a single ASCII dash (0x2D).

   6. Value of the From header field, possibly with the tag parameter removed. (Whether to remove the tag parameter is left up to the logging entity.)

   7. Value of the From header field tag parameter.

   8. The value of the Call-ID header field.

After the "Mandatory Fields" section, Tag/Length/Value groups appear zero or more times. The location within the log record is indicated by the "TLV Start Pointer" field. They are used to log information that is not mandatory for all messages (although specific TLVs are mandatory in request logs).

Tag: indicates the type of value coded by this TLV; hexadecimal encoded.
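The rules for writing the mandatory fields (tab replacement, fixed order, and the dash for a missing To tag) can be sketched as follows. This is a minimal illustration rather than a reference implementation: the function and parameter names are hypothetical, and lengths are counted in characters on the assumption that the logged values are ASCII.

```python
def sanitize(value: str) -> str:
    """Replace tab characters with spaces so that tabs only ever act as delimiters."""
    return value.replace("\t", " ")


def encode_mandatory_fields(server_txn, client_txn, method, to, to_tag,
                            from_, from_tag, call_id):
    """Return (area, lengths) for the eight mandatory fields, in the required order.

    `to_tag` may be None or "" when the To header field carries no tag parameter.
    """
    values = [server_txn, client_txn, method, to, to_tag or "",
              from_, from_tag, call_id]
    cleaned = [sanitize(v) for v in values]
    # Lengths exclude the tab delimiters; an absent To tag has length 0.
    lengths = [len(v) for v in cleaned]
    # An absent To tag is still written, as a single ASCII dash.
    cleaned[4] = cleaned[4] or "-"
    return "\t".join(cleaned), lengths
```

The lengths are computed before the dash is substituted, so the header can record a length of 0 for the To tag while the field itself still occupies a position in the tab-separated line.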
Currently defined tags are:

   Contact value (can be repeated): Contains the entire value of the Contact header field.

   Request URI (mandatory in requests): Contains the Request URI from the start line.

   Remote Host (mandatory in requests): The DNS name or IP address from which the message was received (if the Sent/Received flag is "r") or to which the message is being sent (if the Sent/Received flag is "s").

   Authenticated User: Contains the user name by which the user has been authenticated.

   Complete SIP Message (optional, should be omitted by default): Contains the complete SIP message. Can be repeated multiple times to accommodate SIP messages that exceed 65535 bytes in length.

Length: indicates the length of the value coded in this TLV, hexadecimal encoded. This length does NOT include the TLV header.

Value: contains the actual value of this TLV. As with the mandatory fields, ASCII Tab characters (0x09) are replaced with ASCII space characters (0x20).
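A Tag/Length/Value entry can be produced with a short helper like the one below. The layout it writes (a leading tab, then a four-digit hexadecimal tag, a comma, a four-digit hexadecimal length, a comma, and the value) is inferred from the example record and the regular expression given later in this document. The names `encode_tlv` and `split_value` are hypothetical, and lengths are counted in characters on the assumption that values are ASCII.

```python
MAX_TLV_VALUE = 0xFFFF  # largest length that fits in four hexadecimal digits


def encode_tlv(tag: int, value: str) -> str:
    """Encode one TLV entry, including the tab that precedes it."""
    value = value.replace("\t", " ")  # tabs in values become spaces
    if len(value) > MAX_TLV_VALUE:
        raise ValueError("value too long for a single TLV; split it first")
    # The length covers the value only, not the tag/length header.
    return "\t{:04X},{:04X},{}".format(tag, len(value), value)


def split_value(value: str, limit: int = MAX_TLV_VALUE):
    """Cut a long value into chunks that each fit in one TLV entry."""
    return [value[i:i + limit] for i in range(0, len(value), limit)]
```

With these helpers, a SIP message longer than 65535 bytes would be logged by calling `encode_tlv` once per chunk returned by `split_value`, repeating the same tag each time.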
The following demonstrates approximately how a single log record appears in a logging file. Due to internet-draft conventions, this log entry has been split into ten lines, instead of the two lines that actually appear in a log file; and the tab characters have been padded out using spaces to simulate their appearance in a text terminal.
314159 Alice 9fxced76sl 3848276298220188511@atlanta.example.com 0000,0034, 0001,001A,sip:bob@biloxi.example.com 0002,000C,192.168.9.12
A uuencoded version of this log entry (without the changes required to format it for an internet-draft) follows.
C8W:FAY:3DM.0DM"4E.5DE4 M10E";V(@/'-I<#IB;V)`8FEL;WAI+F5X86UP;&4N8V]M/@DS,30Q-3D)06QI M8V4@/'-I<#IA;&EC94!A=&QA;G1A+F5X86UP;&4N8V]M/@DY9GAC960W-G-L M"3,X-#@R-S8R.3@R,C`Q.#@U,3%`871L86YT82YE>&%M<&QE+F-O;0DP,#`P M+#`P,S0L/'-I<#IA;&EC94!C;&EE;G0N871L86YT82YE>&%M<&QE+F-O;3MT M&DN97AA;7!L =92YC;VT),#`P,BPP,#!#+#$Y,BXQ-C@N.2XQ,@H` ` end
This format has been designed to allow text tools to easily process logs without needing to understand the indexing format. Index lines may be rapidly discarded by checking the first character of the line: index lines will always start with an alphabetical character, while field lines will start with a numerical character. Within a field line, script tools can quickly split fields at the tab characters. The first 11 fields are positional, and the meaning of any subsequent fields can be determined by checking the first four characters of the field. Alternatively, these non-positional fields can be located using a regular expression. For example, the "Request URI" in a request can be found by searching for the perl regex /\t0001,....,([^\t]*)/. Note also that requests can be distinguished from responses by checking the third positional field -- for requests, it will always be set to "000"; any other value indicates a response.
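The text-oriented processing described above can be sketched in Python as well as with the Unix tools mentioned earlier. The sketch below is illustrative only: the function names are hypothetical, and it relies solely on the properties stated in this section (index lines start with a letter, field lines with a digit; fields are tab separated; the first 11 are positional; later fields start with a four-character tag).

```python
import re

# Same pattern as the perl regex above: tag 0001, any 4-character length, value.
REQUEST_URI = re.compile(r"\t0001,....,([^\t]*)")


def field_lines(lines):
    """Yield only field lines, discarding index lines by their first character."""
    for line in lines:
        if line[:1].isdigit():
            yield line.rstrip("\n")


def is_request(field_line: str) -> bool:
    """The third positional field is "000" for requests."""
    return field_line.split("\t")[2] == "000"


def request_uri(field_line: str):
    """Return the Request URI value, or None if the entry is absent."""
    match = REQUEST_URI.search(field_line)
    return match.group(1) if match else None


def tlv_fields(field_line: str):
    """Map each four-character tag to the list of values logged under it."""
    result = {}
    for field in field_line.split("\t")[11:]:
        tag, _length, value = field.split(",", 2)
        result.setdefault(tag, []).append(value)
    return result
```

None of these helpers consult the index line, which is the point of the design: simple tools can ignore the pointers entirely, while tools that need speed can use them.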
[draft-gurbani-sipping-clf]
Cullen put me up to this. Tom Taylor suggested the technique of combining the length field structure from the binary format with the human-readable ASCII format to allow both rapid processing by advanced tools, and easy processing by simpler, text-centric tools. Dean Willis suggested the use of tab delimiters as a means to avoid the need to escape values within a field. Vijay Gurbani provided significant feedback, and wrote the original proof-of-concept program which was adapted to produce the examples in this document.