2 * ----------------------------------------------------------------------
6 (**********************************************************************)
7 (* Collection of auxiliary functions to parse MIME headers *)
8 (**********************************************************************)
13 string -> start_pos:int -> end_pos:int ->
14 ((string * string) list * int)
15 (* let params, i2 = scan_header s i0 i1:
19 * Scans the MIME header that begins at position i0 in the string s
20 * and that must end somewhere before position i1. It is intended
21 * that in i1 the character position following the end of the body of the
22 * MIME message is passed.
23 * Returns the parameters of the header as (name,value) pairs (in
24 * params), and in i2 the position of the character following
25 * directly after the header (i.e. after the blank line separating
26 * the header from the body).
27 * The following normalizations have already been applied:
28 * - The names are all in lowercase
29 * - Newline characters (CR and LF) have been removed (unless
30 * ?unfold:false has been passed)
31 * - Whitespace at the beginning and at the end of values has been
32 * removed (unless ?unfold:false is specified)
33 * The rules of RFC 2047 have NOT been applied.
34 * The function fails if the header violates the header format
35 * strongly. (Some minor deviations are tolerated, e.g. it is sufficient
36 * to separate lines by only LF instead of CRLF.)
40 * unfold: If true (the default), folded lines are concatenated and
41 * returned as one line. This means that CR and LF characters are
42 * deleted and that whitespace at the beginning and the end of the
44 * You may set ?unfold:false to locate individual characters in the
45 * parameter value exactly.
47 * ABOUT MIME MESSAGE FORMAT:
49 * This is the modern name for messages in "E-Mail format". Messages
50 * consist of a header and a body; the first empty line separates both
51 * parts. The header contains lines "param-name: param-value" where
52 * the param-name must begin on column 0 of the line, and the ":"
53 * separates the name and the value. So the format is roughly:
55 * param1-name: param1-value
57 * paramN-name: paramN-value
61 * This function wants in i0 the position of the first character of
62 * param1-name in the string, and in i1 the position of the character
63 * following the body. It returns as i2 the position where the body
64 * begins. Furthermore, in 'params' all parameters are returned that
65 * exist in the header.
69 * Note that parameter values are restricted; you cannot represent
70 * arbitrary strings. The following problems can arise:
71 * - Values cannot begin with whitespace characters, because there
72 * may be an arbitrary number of whitespaces between the ':' and the
74 * - Values (and names of parameters, too) must only be formed of
75 * 7 bit ASCII characters. (If this is not enough, the MIME standard
76 * knows the extension RFC 2047 that allows that header values may
77 * be composed of arbitrary characters of arbitrary character sets.)
78 * - Header values may be broken into several lines, the continuation
79 * lines must begin with whitespace characters. This means that values
80 * must not contain line breaks as semantical part of the value.
81 * And it may mean that ONE whitespace character is not distinguishable
82 * from SEVERAL whitespace characters.
83 * - Header lines must not be longer than 76 characters. Values that
84 * would result into longer lines must be broken into several lines.
85 * This means that you cannot represent strings that contain too few
86 * whitespace characters.
87 * - Some gateways pad the lines with spaces at the end of the lines.
89 * This implementation of a MIME scanner tolerates a number of
90 * deviations from the standard: long lines are not rejected; 8 bit
91 * values are accepted; lines may be ended only with LF instead of
93 * Furthermore, header values are transformed:
94 * - leading and trailing spaces are always removed
95 * - CRs and LFs are deleted; it is guaranteed that there is at least
96 * one space or tab where CR/LFs are deleted.
97 * Last but not least, the names of the header values are converted
98 * to lowercase; MIME specifies that they are case-independent.
100 * COMPATIBILITY WITH THE STANDARD
102 * This function can parse all MIME headers that conform to RFC 822.
103 * But there may be still problems, as RFC 822 allows some crazy
104 * representations that are actually not used in practice.
105 * In particular, RFC 822 allows it to use backslashes to "indicate"
106 * that a CRLF sequence is semantically meant as line break. As this
107 * function normally deletes CRLFs, it is not possible to recognize such
108 * indicators in the result of the function.
111 (**********************************************************************)
113 (* The following types and functions allow it to build scanners for
114 * structured MIME values in a highly configurable way.
116 * WHAT ARE STRUCTURED VALUES?
118 * RFC 822 (together with some other RFCs) defines lexical rules
119 * how formal MIME header values should be divided up into tokens. Formal
120 * MIME headers are those headers that are formed according to some
121 * grammar, e.g. mail addresses or MIME types.
122 * Some of the characters separate phrases of the value; these are
123 * the "special" characters. For example, '@' is normally a special
124 * character for mail addresses, because it separates the user name
125 * from the domain name. RFC 822 defines a fixed set of special
126 * characters, but other RFCs use different sets. Because of this,
127 * the following functions allow it to configure the set of special characters.
128 * Every sequence of characters may be embraced by double quotes,
129 * which means that the sequence is meant as literal data item;
130 * special characters are not recognized inside a quoted string. You may
131 * use the backslash to insert any character (including double quotes)
132 * verbatim into the quoted string (e.g. "He said: \"Give it to me!\"").
133 * The sequence of a backslash character and another character is called
135 * Structured values may contain comments. The beginning of a comment
136 * is indicated by '(', and the end by ')'. Comments may be nested.
137 * Comments may contain quoted pairs. A
138 * comment counts as if a space character were written instead of it.
139 * Control characters are the ASCII characters 0 to 31, and 127.
140 * RFC 822 demands that MIME headers are 7 bit ASCII strings. Because
141 * of this, this function also counts the characters 128 to 255 as
142 * control characters.
143 * Domain literals are strings embraced by '[' and ']'; such literals
144 * may contain quoted pairs. Today, domain literals are used to specify
146 * Every character sequence not falling in one of the above categories
147 * is an atom (a sequence of non-special and non-control characters).
148 * When recognized, atoms may be encoded in a character set different than
149 * US-ASCII; such atoms are called encoded words (see RFC 2047).
151 * EXTENDED INTERFACE:
153 * In order to scan a string containing a MIME value, you must first
154 * create a mime_scanner using the function create_mime_scanner.
155 * The scanner contains the reference to the scanned string, and a
156 * specification how the string is to be scanned. The specification
157 * consists of the lists 'specials' and 'scan_options'.
159 * The character list 'specials' specifies the set of special characters.
160 * These characters are returned as Special c token; the following additional
164 * If ' ' in specials: A space character is returned as Special ' '.
165 * Note that there may also be an effect on how comments are returned
167 * If ' ' not in specials: Spaces are ignored.
170 * If '\t' in specials: A tab character is returned as Special '\t'.
171 * If '\t' not in specials: Tabs are ignored.
173 * If '\r' in specials: A CR character is returned as Special '\r'.
174 * If '\r' not in specials: CRs are ignored.
176 * If '\n' in specials: A LF character is returned as Special '\n'.
177 * If '\n' not in specials: LFs are ignored.
180 * If '(' in specials: Comments are not recognized. The character '('
181 * is returned as Special '('.
182 * If '(' not in specials: Comments are recognized. How comments are
183 * returned, depends on the following:
184 * If Return_comments in scan_options: Outer comments are returned as
185 * Comment (note that inner comments count but
186 * are not returned as tokens)
187 * If otherwise ' ' in specials: Outer comments are returned as
189 * Otherwise: Comments are recognized but ignored.
192 * If '"' in specials: Quoted strings are not recognized, and double quotes
193 * are returned as Special '"'.
194 * If '"' not in specials: Quoted strings are returned as QString tokens.
197 * If '[' in specials: Domain literals are not recognized, and left brackets
198 * are returned as Special '['.
199 * If '[' not in specials: Domain literals are returned as DomainLiteral
202 * Note that the rule for domain literals is completely new in netstring-0.9.
203 * It may cause incompatibilities with previous versions if '[' is not
206 * The general rule for special characters: Every special character c is
207 * returned as Special c, and any additional scanning functionality
208 * for this character is turned off.
210 * If recognized, quoted strings are returned as QString s, where
211 * s is the string without the embracing quotes, and with already
212 * decoded quoted pairs.
214 * Control characters c are returned as Control c.
216 * If recognized, comments may either be returned as spaces (in the case
217 * you are not interested in the contents of comments), or as Comment tokens.
218 * The contents of comments are not further scanned; you must start a
219 * subscanner to analyze comments as structured values.
221 * If recognized, domain literals are returned as DomainLiteral s, where
222 * s is the literal without brackets, and with decoded quoted pairs.
224 * Atoms are returned as Atom s where s is a longest sequence of
225 * atomic characters (all characters which are neither special nor control
226 * characters nor delimiters for substructures). If the option
227 * Recognize_encoded_words is on, atoms which look like encoded words
228 * are returned as EncodedWord tokens. (Important note: Neither '?' nor
229 * '=' must be special in order to enable this functionality.)
231 * After the mime_scanner has been created, you can scan the tokens by
232 * invoking scan_token which returns one token at a time, or by invoking
233 * scan_token_list which returns all following tokens.
235 * There are two token types: s_token is the base type and is intended to
236 * be used for pattern matching. s_extended_token is a wrapper that
237 * additionally contains information where the token occurs.
241 * Instead of creating a mime_scanner and calling the scan functions,
242 * you may also invoke scan_structured_value. This function returns the
243 * list of tokens directly; however, it is restricted to s_token.
247 * scan_structured_value "user@domain.com" [ '@'; '.' ] []
248 * = [ Atom "user"; Special '@'; Atom "domain"; Special '.'; Atom "com" ]
250 * scan_structured_value "user @ domain . com" [ '@'; '.' ] []
251 * = [ Atom "user"; Special '@'; Atom "domain"; Special '.'; Atom "com" ]
253 * scan_structured_value "user(Do you know him?)@domain.com" [ '@'; '.' ] []
254 * = [ Atom "user"; Special '@'; Atom "domain"; Special '.'; Atom "com" ]
256 * scan_structured_value "user(Do you know him?)@domain.com" [ '@'; '.' ]
257 * [ Return_comments ]
258 * = [ Atom "user"; Comment; Special '@'; Atom "domain"; Special '.';
261 * scan_structured_value "user (Do you know him?) @ domain . com"
262 * [ '@'; '.'; ' ' ] []
263 * = [ Atom "user"; Special ' '; Special ' '; Special ' '; Special '@';
264 * Special ' '; Atom "domain";
265 * Special ' '; Special '.'; Special ' '; Atom "com" ]
267 * scan_structured_value "user (Do you know him?) @ domain . com"
268 * [ '@'; '.'; ' ' ] [ Return_comments ]
269 * = [ Atom "user"; Special ' '; Comment; Special ' '; Special '@';
270 * Special ' '; Atom "domain";
271 * Special ' '; Special '.'; Special ' '; Atom "com" ]
273 * scan_structured_value "user @ domain . com" [ '@'; '.'; ' ' ] []
274 * = [ Atom "user"; Special ' '; Special '@'; Special ' '; Atom "domain";
275 * Special ' '; Special '.'; Special ' '; Atom "com" ]
277 * scan_structured_value "user(Do you know him?)@domain.com" ['@'; '.'; '(']
279 * = [ Atom "user"; Special '('; Atom "Do"; Atom "you"; Atom "know";
280 * Atom "him?)"; Special '@'; Atom "domain"; Special '.'; Atom "com" ]
282 * scan_structured_value "\"My.name\"@domain.com" [ '@'; '.' ] []
283 * = [ QString "My.name"; Special '@'; Atom "domain"; Special '.';
286 * scan_structured_value "=?ISO-8859-1?Q?Keld_J=F8rn_Simonsen?="
288 * = [ Atom "=?ISO-8859-1?Q?Keld_J=F8rn_Simonsen?=" ]
290 * scan_structured_value "=?ISO-8859-1?Q?Keld_J=F8rn_Simonsen?="
291 * [ ] [ Recognize_encoded_words ]
292 * = [ EncodedWord("ISO-8859-1", "Q", "Keld_J=F8rn_Simonsen") ]
300 | EncodedWord of (string * string * string)
304 | DomainLiteral of string
308 (* - Words are: Atom, EncodedWord, QString.
309 * - Atom s: The character sequence forming the atom is contained in s
310 * - EncodedWord(charset, encoding, encoded_string) means:
311 * * charset is the (uppercase) character set
312 * * encoding is either "Q" or "B"
313 * * encoded_string: contains the text of the word; the text is represented
314 * as octet string following the conventions for character set charset and
315 * then encoded either as "Q" or "B" string.
316 * - QString s: Here, s are the characters inside the double quotes after
317 * decoding any quoted pairs (backslash + character pairs)
318 * - Control c: The control character c
319 * - Special c: The special character c
320 * - DomainLiteral s: s contains the characters inside the brackets after
321 * decoding any quoted pairs
322 * - Comment: if the option Return_comments is specified, this token
323 * represents the whole comment.
324 * - End: Is returned after the last token
329 No_backslash_escaping
330 (* Do not handle backslashes in quoted string and comments as escape
331 * characters; backslashes are handled as normal characters.
332 * For example: "C:\dir\file" will be returned as
333 * QString "C:\dir\file", and not as QString "C:dirfile".
334 * - This is a common error in many MIME implementations.
337 (* Comments are returned as token Comment (unless '(' is included
338 * in the list of special characters, in which case comments are
339 * not recognized at all).
340 * You may get the exact location of the comment by applying
341 * get_pos and get_length to the extended token.
343 | Recognize_encoded_words
344 (* Enables that encoded words are recognized and returned as
345 * EncodedWord(charset,encoding,content) instead of Atom.
348 type s_extended_token
349 (* An opaque type containing s_token plus:
350 * - where the token occurs
351 * - RFC-2047 access functions
354 val get_token : s_extended_token -> s_token
355 (* Return the s_token within the s_extended_token *)
357 val get_decoded_word : s_extended_token -> string
358 val get_charset : s_extended_token -> string
359 (* Return the decoded word (the contents of the word after decoding the
360 * "Q" or "B" representation), and the character set of the decoded word
362 * These functions not only work for EncodedWord:
363 * - Atom: Returns the atom without decoding it
364 * - QString: Returns the characters inside the double quotes, and
365 * decodes any quoted pairs (backslash + character)
366 * - Control: Returns the one-character string
367 * - Special: Returns the one-character string
368 * - DomainLiteral: Returns the characters inside the brackets, and
369 * decodes any quoted pairs
370 * - Comment: Returns ""
371 * The character set is "US-ASCII" for these tokens.
374 val get_pos : s_extended_token -> int
375 (* Return the byte position where the token starts in the string
376 * (the first byte has position 0)
379 val get_line : s_extended_token -> int
380 (* Return the line number where the token starts (numbering begins
384 val get_column : s_extended_token -> int
385 (* Return the column of the line where the token starts (first column
389 val get_length : s_extended_token -> int
390 (* Return the length of the token in bytes *)
392 val separates_adjacent_encoded_words : s_extended_token -> bool
393 (* True iff the current token is white space (Special ' ', Special '\t',
394 * Special '\r' or Special '\n') and the last non-white space token
395 * was EncodedWord and the next non-white space token will be
397 * Such spaces do not count and must be ignored by any application.
403 val create_mime_scanner :
404 specials:char list ->
405 scan_options:s_option list ->
411 (* Creates a new mime_scanner scanning the passed string.
412 * specials: The list of characters recognized as special characters.
413 * scan_options: The list of global options modifying the behaviour
415 * pos: The position of the byte where the scanner starts in the
416 * passed string. Defaults to 0.
417 * line: The line number of this byte. Defaults to 1.
418 * column: The column number of this byte. Default to 0.
420 * The optional parameters pos, line, column are intentionally after
421 * scan_options and before the string argument, so you can specify
422 * scanners by partially applying arguments to create_mime_scanner
423 * which are not yet connected with a particular string:
424 * let my_scanner_spec = create_mime_scanner my_specials my_options in
426 * let my_scanner = my_scanner_spec my_string in
430 val get_pos_of_scanner : mime_scanner -> int
431 val get_line_of_scanner : mime_scanner -> int
432 val get_column_of_scanner : mime_scanner -> int
433 (* Return the current position, line, and column of a mime_scanner.
434 * The primary purpose of these functions is to simplify switching
435 * from one mime_scanner to another within a string:
437 * let scanner1 = create_mime_scanner ... s in
438 * ... now scanning some tokens from s using scanner1 ...
439 * let scanner2 = create_mime_scanner ...
440 * ?pos:(get_pos_of_scanner scanner1)
441 * ?line:(get_line_of_scanner scanner1)
442 * ?column:(get_column_of_scanner scanner1)
444 * ... scanning more tokens from s using scanner2 ...
446 * RESTRICTION: These functions are not available if the option
447 * Recognize_encoded_words is on. The reason is that this option
448 * enables look-ahead scanning; please use the location of the last
449 * scanned token instead.
450 * It is currently not clear whether a better implementation is needed
451 * (costs a bit more time).
453 * Note: To improve the performance of switching, it is recommended to
454 * create scanner specs in advance (see the example my_scanner_spec
458 val scan_token : mime_scanner -> (s_extended_token * s_token)
459 (* Returns the next token, or End if there is no more token. *)
461 val scan_token_list : mime_scanner -> (s_extended_token * s_token) list
462 (* Returns all following tokens as a list (excluding End) *)
464 val scan_structured_value : string -> char list -> s_option list -> s_token list
465 (* This function is included for backwards compatibility, and for all
466 * cases not requiring extended tokens.
468 * It scans the passed string according to the list of special characters
469 * and the list of options, and returns the list of all tokens.
472 val specials_rfc822 : char list
473 val specials_rfc2045 : char list
474 (* The sets of special characters defined by the RFCs 822 and 2045.
476 * CHANGE in netstring-0.9: '[' and ']' are no longer special because
477 * there is now support for domain literals.
478 * '?' and '=' are not special in the rfc2045 version because there is
479 * already support for encoded words.
483 (**********************************************************************)
485 (* Widely used scanners: *)
488 val scan_encoded_text_value : string -> s_extended_token list
489 (* Scans a "text" value. The returned token list contains only
490 * Special, Atom and EncodedWord tokens.
491 * Spaces, TABs, CRs, LFs are returned unless
492 * they occur between adjacent encoded words in which case
497 val scan_value_with_parameters : string -> s_option list ->
498 (string * (string * string) list)
499 (* let name, params = scan_value_with_parameters s options:
501 * name ; p1=v1 ; p2=v2 ; ...
502 * The scan is done with the set of special characters [';', '='].
505 val scan_mime_type : string -> s_option list ->
506 (string * (string * string) list)
507 (* let name, params = scan_mime_type s options:
508 * Scans MIME types like
509 * text/plain; charset=iso-8859-1
510 * The name of the type and the names of the parameters are converted
515 (**********************************************************************)
517 (* Scanners for MIME bodies *)
519 val scan_multipart_body : string -> start_pos:int -> end_pos:int ->
521 ((string * string) list * string) list
522 (* let [params1, value1; params2, value2; ...]
523 * = scan_multipart_body s i0 i1 b
525 * Scans the string s that is the body of a multipart message.
526 * The multipart message begins at position i0 in s and i1 the position
527 * of the character following the message. In b the boundary string
528 * must be passed (this is the "boundary" parameter of the multipart
529 * MIME type, e.g. multipart/mixed;boundary="some string" ).
530 * The return value is the list of the parts, where each part
531 * is returned as pair (params, value). The left component params
532 * is the list of name/value pairs of the header of the part. The
533 * right component is the RAW content of the part, i.e. if the part
534 * is encoded ("content-transfer-encoding"), the content is returned
535 * in the encoded representation. The caller must himself decode
537 * The material before the first boundary and after the last
538 * boundary is not returned.
542 * The MIME standard defines a way to group several message parts to
543 * a larger message (for E-Mails this technique is known as "attaching"
544 * files to messages); these are the so-called multipart messages.
545 * Such messages are recognized by the major type string "multipart",
546 * e.g. multipart/mixed or multipart/form-data. Multipart types MUST
547 * have a boundary parameter because boundaries are essential for the
549 * Multipart messages have a format like
552 * Content-type: multipart/xyz; boundary="abc"
555 * Body begins here ("prologue")
557 * ...Header part 1...
561 * ...Header part 2...
570 * The parts are separated by boundary lines which begin with "--" and
571 * the string passed as boundary parameter. (Note that there may follow
572 * arbitrary text on boundary lines after "--abc".) The boundary is
573 * chosen such that it does not occur as prefix of any line of the
574 * inner parts of the message.
575 * The parts are again MIME messages, with header and body. Note
576 * that it is explicitely allowed that the parts are even multipart
578 * The texts before the first boundary and after the last boundary
580 * Note that multipart messages as a whole MUST NOT be encoded.
581 * Only the PARTS of the messages may be encoded (if they are not
582 * multipart messages themselves).
584 * Please read RFC 2046 if want to know the gory details of this
588 val scan_multipart_body_and_decode : string -> start_pos:int -> end_pos:int ->
590 ((string * string) list * string) list
591 (* Same as scan_multipart_body, but decodes the bodies of the parts
592 * if they are encoded using the methods "base64" or "quoted printable".
593 * Fails, if an unknown encoding is used.
596 val scan_multipart_body_from_netstream
599 create:((string * string) list -> 'a) ->
600 add:('a -> Netstream.t -> int -> int -> unit) ->
603 (* scan_multipart_body_from_netstream s b create add stop:
605 * Reads the MIME message from the netstream s block by block. The
606 * parts are delimited by the boundary b.
608 * Once a new part is detected and begins, the function 'create' is
609 * called with the MIME header as argument. The result p of this function
610 * may be of any type.
612 * For every chunk of the part that is being read, the function 'add'
613 * is invoked: add p s k n.
614 * Here, p is the value returned by the 'create' invocation for the
615 * current part. s is the netstream. The current window of s contains
616 * the read chunk completely; the chunk begins at position k of the
617 * window (relative to the beginning of the window) and has a length
620 * When the part has been fully read, the function 'stop' is
621 * called with p as argument.
623 * That means, for every part the following is executed:
631 * IMPORTANT PRECONDITION:
632 * - The block size of the netstream s must be at least
633 * String.length b + 3
636 * - Exceptions can happen because of ill-formed input, and within
637 * the callbacks of the functions 'create', 'add', 'stop'.
638 * - If the exception happens while part p is being read, and the
639 * 'create' function has already been called (successfully), the
640 * 'stop' function is also called (you have the chance to close files).
645 * The functions are thread-safe as long as the threads do not share
649 (* ======================================================================
653 * Revision 1.1 2000/11/17 09:57:27 lpadovan
656 * Revision 1.8 2000/08/13 00:04:36 gerd
657 * Encoded_word -> EncodedWord
660 * Revision 1.7 2000/08/07 00:25:00 gerd
661 * Major update of the interface for structured field lexing.
663 * Revision 1.6 2000/06/25 22:34:43 gerd
664 * Added labels to arguments.
666 * Revision 1.5 2000/06/25 21:15:48 gerd
667 * Checked thread-safety.
669 * Revision 1.4 2000/05/16 22:29:12 gerd
670 * New "option" arguments specifying the level of MIME
673 * Revision 1.3 2000/04/15 13:09:01 gerd
674 * Implemented uploads to temporary files.
676 * Revision 1.2 2000/03/02 01:15:30 gerd
679 * Revision 1.1 2000/02/25 15:21:12 gerd