X-Git-Url: http://matita.cs.unibo.it/gitweb/?a=blobdiff_plain;f=helm%2FDEVEL%2Fpxp%2Fnetstring%2Fmimestring.mli;fp=helm%2FDEVEL%2Fpxp%2Fnetstring%2Fmimestring.mli;h=0000000000000000000000000000000000000000;hb=c7514aaa249a96c5fdd39b1123fbdb38d92f20b6;hp=39634b59c79200c274785fd17db6103bec9718d9;hpb=1c7fb836e2af4f2f3d18afd0396701f2094265ff;p=helm.git

diff --git a/helm/DEVEL/pxp/netstring/mimestring.mli b/helm/DEVEL/pxp/netstring/mimestring.mli
deleted file mode 100644
index 39634b59c..000000000
--- a/helm/DEVEL/pxp/netstring/mimestring.mli
+++ /dev/null
@@ -1,683 +0,0 @@
-(* $Id$
- * ----------------------------------------------------------------------
- *
- *)
-
-(**********************************************************************)
-(* Collection of auxiliary functions to parse MIME headers *)
-(**********************************************************************)
-
-
-val scan_header :
-       ?unfold:bool ->
-       string -> start_pos:int -> end_pos:int ->
-       ((string * string) list * int)
-      (* let params, i2 = scan_header s i0 i1:
-       *
-       * DESCRIPTION
-       *
-       * Scans the MIME header that begins at position i0 in the string s
-       * and that must end somewhere before position i1. It is intended
-       * that in i1 the character position following the end of the body of the
-       * MIME message is passed.
-       * Returns the parameters of the header as (name,value) pairs (in
-       * params), and in i2 the position of the character following
-       * directly after the header (i.e. after the blank line separating
-       * the header from the body).
-       * The following normalizations have already been applied:
-       * - The names are all in lowercase
-       * - Newline characters (CR and LF) have been removed (unless
-       *   ?unfold:false has been passed)
-       * - Whitespace at the beginning and at the end of values has been
-       *   removed (unless ?unfold:false is specified)
-       * The rules of RFC 2047 have NOT been applied.
-       * The function fails if the header violates the header format
-       * strongly.
-       * (Some minor deviations are tolerated, e.g. it is sufficient
-       * to separate lines by only LF instead of CRLF.)
-       *
-       * OPTIONS:
-       *
-       * unfold: If true (the default), folded lines are concatenated and
-       *   returned as one line. This means that CR and LF characters are
-       *   deleted and that whitespace at the beginning and the end of the
-       *   string is removed.
-       *   You may set ?unfold:false to locate individual characters in the
-       *   parameter value exactly.
-       *
-       * ABOUT MIME MESSAGE FORMAT:
-       *
-       * This is the modern name for messages in "E-Mail format". Messages
-       * consist of a header and a body; the first empty line separates both
-       * parts. The header contains lines "param-name: param-value" where
-       * the param-name must begin on column 0 of the line, and the ":"
-       * separates the name and the value. So the format is roughly:
-       *
-       *   param1-name: param1-value
-       *   ...
-       *   paramN-name: paramN-value
-       *
-       *   body
-       *
-       * This function wants in i0 the position of the first character of
-       * param1-name in the string, and in i1 the position of the character
-       * following the body. It returns as i2 the position where the body
-       * begins. Furthermore, in 'params' all parameters are returned that
-       * exist in the header.
-       *
-       * DETAILS
-       *
-       * Note that parameter values are restricted; you cannot represent
-       * arbitrary strings. The following problems can arise:
-       * - Values cannot begin with whitespace characters, because there
-       *   may be an arbitrary number of whitespaces between the ':' and the
-       *   value.
-       * - Values (and names of parameters, too) must only be formed of
-       *   7 bit ASCII characters. (If this is not enough, the MIME standard
-       *   knows the extension RFC 2047 that allows that header values may
-       *   be composed of arbitrary characters of arbitrary character sets.)
-       * - Header values may be broken into several lines, the continuation
-       *   lines must begin with whitespace characters.
-       *   This means that values
-       *   must not contain line breaks as semantical part of the value.
-       *   And it may mean that ONE whitespace character is not distinguishable
-       *   from SEVERAL whitespace characters.
-       * - Header lines must not be longer than 76 characters. Values that
-       *   would result into longer lines must be broken into several lines.
-       *   This means that you cannot represent strings that contain too few
-       *   whitespace characters.
-       * - Some gateways pad the lines with spaces at the end of the lines.
-       *
-       * This implementation of a MIME scanner tolerates a number of
-       * deviations from the standard: long lines are not rejected; 8 bit
-       * values are accepted; lines may be ended only with LF instead of
-       * CRLF.
-       * Furthermore, header values are transformed:
-       * - leading and trailing spaces are always removed
-       * - CRs and LFs are deleted; it is guaranteed that there is at least
-       *   one space or tab where CR/LFs are deleted.
-       * Last but not least, the names of the header values are converted
-       * to lowercase; MIME specifies that they are case-independent.
-       *
-       * COMPATIBILITY WITH THE STANDARD
-       *
-       * This function can parse all MIME headers that conform to RFC 822.
-       * But there may be still problems, as RFC 822 allows some crazy
-       * representations that are actually not used in practice.
-       * In particular, RFC 822 allows it to use backslashes to "indicate"
-       * that a CRLF sequence is semantically meant as line break. As this
-       * function normally deletes CRLFs, it is not possible to recognize such
-       * indicators in the result of the function.
-       *)
-
-(**********************************************************************)
-
-(* The following types and functions allow it to build scanners for
- * structured MIME values in a highly configurable way.
- *
- * WHAT ARE STRUCTURED VALUES?
- *
- * RFC 822 (together with some other RFCs) defines lexical rules
- * how formal MIME header values should be divided up into tokens.
- * Formal
- * MIME headers are those headers that are formed according to some
- * grammar, e.g. mail addresses or MIME types.
- * Some of the characters separate phrases of the value; these are
- * the "special" characters. For example, '@' is normally a special
- * character for mail addresses, because it separates the user name
- * from the domain name. RFC 822 defines a fixed set of special
- * characters, but other RFCs use different sets. Because of this,
- * the following functions allow it to configure the set of special characters.
- * Every sequence of characters may be embraced by double quotes,
- * which means that the sequence is meant as literal data item;
- * special characters are not recognized inside a quoted string. You may
- * use the backslash to insert any character (including double quotes)
- * verbatim into the quoted string (e.g. "He said: \"Give it to me!\"").
- * The sequence of a backslash character and another character is called
- * a quoted pair.
- * Structured values may contain comments. The beginning of a comment
- * is indicated by '(', and the end by ')'. Comments may be nested.
- * Comments may contain quoted pairs. A
- * comment counts as if a space character were written instead of it.
- * Control characters are the ASCII characters 0 to 31, and 127.
- * RFC 822 demands that MIME headers are 7 bit ASCII strings. Because
- * of this, this function also counts the characters 128 to 255 as
- * control characters.
- * Domain literals are strings embraced by '[' and ']'; such literals
- * may contain quoted pairs. Today, domain literals are used to specify
- * IP addresses.
- * Every character sequence not falling in one of the above categories
- * is an atom (a sequence of non-special and non-control characters).
- * When recognized, atoms may be encoded in a character set different than
- * US-ASCII; such atoms are called encoded words (see RFC 2047).
- *
- * EXTENDED INTERFACE:
- *
- * In order to scan a string containing a MIME value, you must first
- * create a mime_scanner using the function create_mime_scanner.
- * The scanner contains the reference to the scanned string, and a
- * specification how the string is to be scanned. The specification
- * consists of the lists 'specials' and 'scan_options'.
- *
- * The character list 'specials' specifies the set of special characters.
- * These characters are returned as Special c token; the following additional
- * rules apply:
- *
- * - Spaces:
- *   If ' ' in specials: A space character is returned as Special ' '.
- *       Note that there may also be an effect on how comments are returned
- *       (see below).
- *   If ' ' not in specials: Spaces are ignored.
- *
- * - Tabs, CRs, LFs:
- *   If '\t' in specials: A tab character is returned as Special '\t'.
- *   If '\t' not in specials: Tabs are ignored.
- *
- *   If '\r' in specials: A CR character is returned as Special '\r'.
- *   If '\r' not in specials: CRs are ignored.
- *
- *   If '\n' in specials: A LF character is returned as Special '\n'.
- *   If '\n' not in specials: LFs are ignored.
- *
- * - Comments:
- *   If '(' in specials: Comments are not recognized. The character '('
- *       is returned as Special '('.
- *   If '(' not in specials: Comments are recognized. How comments are
- *       returned, depends on the following:
- *       If Return_comments in scan_options: Outer comments are returned as
- *           Comment (note that inner comments count but
- *           are not returned as tokens)
- *       If otherwise ' ' in specials: Outer comments are returned as
- *           Special ' '
- *       Otherwise: Comments are recognized but ignored.
- *
- * - Quoted strings:
- *   If '"' in specials: Quoted strings are not recognized, and double quotes
- *       are returned as Special '"'.
- *   If '"' not in specials: Quoted strings are returned as QString tokens.
- *
- * Domain literals:
- *   If '[' in specials: Domain literals are not recognized, and left brackets
- *       are returned as Special '['.
- *   If '[' not in specials: Domain literals are returned as DomainLiteral
- *       tokens.
- *
- * Note that the rule for domain literals is completely new in netstring-0.9.
- * It may cause incompatibilities with previous versions if '[' is not
- * special.
- *
- * The general rule for special characters: Every special character c is
- * returned as Special c, and any additional scanning functionality
- * for this character is turned off.
- *
- * If recognized, quoted strings are returned as QString s, where
- * s is the string without the embracing quotes, and with already
- * decoded quoted pairs.
- *
- * Control characters c are returned as Control c.
- *
- * If recognized, comments may either be returned as spaces (in the case
- * you are not interested in the contents of comments), or as Comment tokens.
- * The contents of comments are not further scanned; you must start a
- * subscanner to analyze comments as structured values.
- *
- * If recognized, domain literals are returned as DomainLiteral s, where
- * s is the literal without brackets, and with decoded quoted pairs.
- *
- * Atoms are returned as Atom s where s is a longest sequence of
- * atomic characters (all characters which are neither special nor control
- * characters nor delimiters for substructures). If the option
- * Recognize_encoded_words is on, atoms which look like encoded words
- * are returned as EncodedWord tokens. (Important note: Neither '?' nor
- * '=' must be special in order to enable this functionality.)
- *
- * After the mime_scanner has been created, you can scan the tokens by
- * invoking scan_token which returns one token at a time, or by invoking
- * scan_token_list which returns all following tokens.
- *
- * There are two token types: s_token is the base type and is intended to
- * be used for pattern matching.
- * s_extended_token is a wrapper that
- * additionally contains information where the token occurs.
- *
- * SIMPLE INTERFACE
- *
- * Instead of creating a mime_scanner and calling the scan functions,
- * you may also invoke scan_structured_value. This function returns the
- * list of tokens directly; however, it is restricted to s_token.
- *
- * EXAMPLES
- *
- * scan_structured_value "user@domain.com" [ '@'; '.' ] []
- *   = [ Atom "user"; Special '@'; Atom "domain"; Special '.'; Atom "com" ]
- *
- * scan_structured_value "user @ domain . com" [ '@'; '.' ] []
- *   = [ Atom "user"; Special '@'; Atom "domain"; Special '.'; Atom "com" ]
- *
- * scan_structured_value "user(Do you know him?)@domain.com" [ '@'; '.' ] []
- *   = [ Atom "user"; Special '@'; Atom "domain"; Special '.'; Atom "com" ]
- *
- * scan_structured_value "user(Do you know him?)@domain.com" [ '@'; '.' ]
- *     [ Return_comments ]
- *   = [ Atom "user"; Comment; Special '@'; Atom "domain"; Special '.';
- *       Atom "com" ]
- *
- * scan_structured_value "user (Do you know him?) @ domain . com"
- *     [ '@'; '.'; ' ' ] []
- *   = [ Atom "user"; Special ' '; Special ' '; Special ' '; Special '@';
- *       Special ' '; Atom "domain";
- *       Special ' '; Special '.'; Special ' '; Atom "com" ]
- *
- * scan_structured_value "user (Do you know him?) @ domain . com"
- *     [ '@'; '.'; ' ' ] [ Return_comments ]
- *   = [ Atom "user"; Special ' '; Comment; Special ' '; Special '@';
- *       Special ' '; Atom "domain";
- *       Special ' '; Special '.'; Special ' '; Atom "com" ]
- *
- * scan_structured_value "user @ domain . com" [ '@'; '.'; ' ' ] []
- *   = [ Atom "user"; Special ' '; Special '@'; Special ' '; Atom "domain";
- *       Special ' '; Special '.'; Special ' '; Atom "com" ]
- *
- * scan_structured_value "user(Do you know him?)@domain.com" ['@'; '.'; '(']
- *     []
- *   = [ Atom "user"; Special '('; Atom "Do"; Atom "you"; Atom "know";
- *       Atom "him?)"; Special '@'; Atom "domain"; Special '.'; Atom "com" ]
- *
- * scan_structured_value "\"My.name\"@domain.com" [ '@'; '.' ] []
- *   = [ QString "My.name"; Special '@'; Atom "domain"; Special '.';
- *       Atom "com" ]
- *
- * scan_structured_value "=?ISO-8859-1?Q?Keld_J=F8rn_Simonsen?="
- *     [ ] [ ]
- *   = [ Atom "=?ISO-8859-1?Q?Keld_J=F8rn_Simonsen?=" ]
- *
- * scan_structured_value "=?ISO-8859-1?Q?Keld_J=F8rn_Simonsen?="
- *     [ ] [ Recognize_encoded_words ]
- *   = [ EncodedWord("ISO-8859-1", "Q", "Keld_J=F8rn_Simonsen") ]
- *
- *)
-
-
-
-type s_token =
-    Atom of string
-  | EncodedWord of (string * string * string)
-  | QString of string
-  | Control of char
-  | Special of char
-  | DomainLiteral of string
-  | Comment
-  | End
-
-(* - Words are: Atom, EncodedWord, QString.
- * - Atom s: The character sequence forming the atom is contained in s
- * - EncodedWord(charset, encoding, encoded_string) means:
- *   * charset is the (uppercase) character set
- *   * encoding is either "Q" or "B"
- *   * encoded_string: contains the text of the word; the text is represented
- *     as octet string following the conventions for character set charset and
- *     then encoded either as "Q" or "B" string.
- * - QString s: Here, s are the characters inside the double quotes after
- *   decoding any quoted pairs (backslash + character pairs)
- * - Control c: The control character c
- * - Special c: The special character c
- * - DomainLiteral s: s contains the characters inside the brackets after
- *   decoding any quoted pairs
- * - Comment: if the option Return_comments is specified, this token
- *   represents the whole comment.
- * - End: Is returned after the last token
- *)
-
-
-type s_option =
-    No_backslash_escaping
-      (* Do not handle backslashes in quoted string and comments as escape
-       * characters; backslashes are handled as normal characters.
-       * For example: "C:\dir\file" will be returned as
-       * QString "C:\dir\file", and not as QString "C:dirfile".
-       * - This is a common error in many MIME implementations.
-       *)
-  | Return_comments
-      (* Comments are returned as token Comment (unless '(' is included
-       * in the list of special characters, in which case comments are
-       * not recognized at all).
-       * You may get the exact location of the comment by applying
-       * get_pos and get_length to the extended token.
-       *)
-  | Recognize_encoded_words
-      (* Enables that encoded words are recognized and returned as
-       * EncodedWord(charset,encoding,content) instead of Atom.
-       *)
-
-type s_extended_token
-      (* An opaque type containing s_token plus:
-       * - where the token occurs
-       * - RFC-2047 access functions
-       *)
-
-val get_token : s_extended_token -> s_token
-      (* Return the s_token within the s_extended_token *)
-
-val get_decoded_word : s_extended_token -> string
-val get_charset : s_extended_token -> string
-      (* Return the decoded word (the contents of the word after decoding the
-       * "Q" or "B" representation), and the character set of the decoded word
-       * (uppercase).
-       * These functions not only work for EncodedWord:
-       * - Atom: Returns the atom without decoding it
-       * - QString: Returns the characters inside the double quotes, and
-       *   decodes any quoted pairs (backslash + character)
-       * - Control: Returns the one-character string
-       * - Special: Returns the one-character string
-       * - DomainLiteral: Returns the characters inside the brackets, and
-       *   decodes any quoted pairs
-       * - Comment: Returns ""
-       * The character set is "US-ASCII" for these tokens.
-       *)
-
-val get_pos : s_extended_token -> int
-      (* Return the byte position where the token starts in the string
-       * (the first byte has position 0)
-       *)
-
-val get_line : s_extended_token -> int
-      (* Return the line number where the token starts (numbering begins
-       * usually with 1)
-       *)
-
-val get_column : s_extended_token -> int
-      (* Return the column of the line where the token starts (first column
-       * is number 0)
-       *)
-
-val get_length : s_extended_token -> int
-      (* Return the length of the token in bytes *)
-
-val separates_adjacent_encoded_words : s_extended_token -> bool
-      (* True iff the current token is white space (Special ' ', Special '\t',
-       * Special '\r' or Special '\n') and the last non-white space token
-       * was EncodedWord and the next non-white space token will be
-       * EncodedWord.
-       * Such spaces do not count and must be ignored by any application.
-       *)
-
-
-type mime_scanner
-
-val create_mime_scanner :
-       specials:char list ->
-       scan_options:s_option list ->
-       ?pos:int ->
-       ?line:int ->
-       ?column:int ->
-       string ->
-       mime_scanner
-      (* Creates a new mime_scanner scanning the passed string.
-       * specials: The list of characters recognized as special characters.
-       * scan_options: The list of global options modifying the behaviour
-       *   of the scanner
-       * pos: The position of the byte where the scanner starts in the
-       *   passed string. Defaults to 0.
-       * line: The line number of this byte. Defaults to 1.
-       * column: The column number of this byte. Defaults to 0.
-       *
-       * The optional parameters pos, line, column are intentionally after
-       * scan_options and before the string argument, so you can specify
-       * scanners by partially applying arguments to create_mime_scanner
-       * which are not yet connected with a particular string:
-       *   let my_scanner_spec = create_mime_scanner my_specials my_options in
-       *   ...
-       *   let my_scanner = my_scanner_spec my_string in
-       *   ...
-       *)
-
-val get_pos_of_scanner : mime_scanner -> int
-val get_line_of_scanner : mime_scanner -> int
-val get_column_of_scanner : mime_scanner -> int
-      (* Return the current position, line, and column of a mime_scanner.
-       * The primary purpose of these functions is to simplify switching
-       * from one mime_scanner to another within a string:
-       *
-       *   let scanner1 = create_mime_scanner ... s in
-       *   ... now scanning some tokens from s using scanner1 ...
-       *   let scanner2 = create_mime_scanner ...
-       *                    ~pos:(get_pos_of_scanner scanner1)
-       *                    ~line:(get_line_of_scanner scanner1)
-       *                    ~column:(get_column_of_scanner scanner1)
-       *                    s in
-       *   ... scanning more tokens from s using scanner2 ...
-       *
-       * RESTRICTION: These functions are not available if the option
-       * Recognize_encoded_words is on. The reason is that this option
-       * enables look-ahead scanning; please use the location of the last
-       * scanned token instead.
-       * It is currently not clear whether a better implementation is needed
-       * (costs a bit more time).
-       *
-       * Note: To improve the performance of switching, it is recommended to
-       * create scanner specs in advance (see the example my_scanner_spec
-       * above).
-       *)
-
-val scan_token : mime_scanner -> (s_extended_token * s_token)
-      (* Returns the next token, or End if there is no more token. *)
-
-val scan_token_list : mime_scanner -> (s_extended_token * s_token) list
-      (* Returns all following tokens as a list (excluding End) *)
-
-val scan_structured_value : string -> char list -> s_option list -> s_token list
-      (* This function is included for backwards compatibility, and for all
-       * cases not requiring extended tokens.
-       *
-       * It scans the passed string according to the list of special characters
-       * and the list of options, and returns the list of all tokens.
-       *)
-
-val specials_rfc822 : char list
-val specials_rfc2045 : char list
-      (* The sets of special characters defined by the RFCs 822 and 2045.
-       *
-       * CHANGE in netstring-0.9: '[' and ']' are no longer special because
-       * there is now support for domain literals.
-       * '?' and '=' are not special in the rfc2045 version because there is
-       * already support for encoded words.
-       *)
-
-
-(**********************************************************************)
-
-(* Widely used scanners: *)
-
-
-val scan_encoded_text_value : string -> s_extended_token list
-      (* Scans a "text" value. The returned token list contains only
-       * Special, Atom and EncodedWord tokens.
-       * Spaces, TABs, CRs, LFs are returned unless
-       * they occur between adjacent encoded words in which case
-       * they are ignored.
-       *)
-
-
-val scan_value_with_parameters : string -> s_option list ->
-       (string * (string * string) list)
-      (* let name, params = scan_value_with_parameters s options:
-       * Scans phrases like
-       *   name ; p1=v1 ; p2=v2 ; ...
-       * The scan is done with the set of special characters [';', '='].
-       *)
-
-val scan_mime_type : string -> s_option list ->
-       (string * (string * string) list)
-      (* let name, params = scan_mime_type s options:
-       * Scans MIME types like
-       *   text/plain; charset=iso-8859-1
-       * The name of the type and the names of the parameters are converted
-       * to lower case.
-       *)
-
-
-(**********************************************************************)
-
-(* Scanners for MIME bodies *)
-
-val scan_multipart_body : string -> start_pos:int -> end_pos:int ->
-       boundary:string ->
-       ((string * string) list * string) list
-      (* let [params1, value1; params2, value2; ...]
-       *   = scan_multipart_body s i0 i1 b
-       *
-       * Scans the string s that is the body of a multipart message.
-       * The multipart message begins at position i0 in s and i1 the position
-       * of the character following the message. In b the boundary string
-       * must be passed (this is the "boundary" parameter of the multipart
-       * MIME type, e.g. multipart/mixed;boundary="some string" ).
-       * The return value is the list of the parts, where each part
-       * is returned as pair (params, value). The left component params
-       * is the list of name/value pairs of the header of the part. The
-       * right component is the RAW content of the part, i.e. if the part
-       * is encoded ("content-transfer-encoding"), the content is returned
-       * in the encoded representation. The caller is responsible for
-       * decoding the content.
-       * The material before the first boundary and after the last
-       * boundary is not returned.
-       *
-       * MULTIPART MESSAGES
-       *
-       * The MIME standard defines a way to group several message parts to
-       * a larger message (for E-Mails this technique is known as "attaching"
-       * files to messages); these are the so-called multipart messages.
-       * Such messages are recognized by the major type string "multipart",
-       * e.g. multipart/mixed or multipart/form-data. Multipart types MUST
-       * have a boundary parameter because boundaries are essential for the
-       * representation.
-       * Multipart messages have a format like
-       *
-       *   ...Header...
-       *   Content-type: multipart/xyz; boundary="abc"
-       *   ...Header...
-       *
-       *   Body begins here ("prologue")
-       *   --abc
-       *   ...Header part 1...
-       *
-       *   ...Body part 1...
-       *   --abc
-       *   ...Header part 2...
-       *
-       *
-       *   ...Body part 2
-       *   --abc
-       *   ...
-       *   --abc--
-       *   Epilogue
-       *
-       * The parts are separated by boundary lines which begin with "--" and
-       * the string passed as boundary parameter. (Note that there may follow
-       * arbitrary text on boundary lines after "--abc".) The boundary is
-       * chosen such that it does not occur as prefix of any line of the
-       * inner parts of the message.
-       * The parts are again MIME messages, with header and body. Note
-       * that it is explicitly allowed that the parts are even multipart
-       * messages.
-       * The texts before the first boundary and after the last boundary
-       * are ignored.
-       * Note that multipart messages as a whole MUST NOT be encoded.
-       * Only the PARTS of the messages may be encoded (if they are not
-       * multipart messages themselves).
-       *
-       * Please read RFC 2046 if you want to know the gory details of this
-       * brain-dead format.
-       *)
-
-val scan_multipart_body_and_decode : string -> start_pos:int -> end_pos:int ->
-       boundary:string ->
-       ((string * string) list * string) list
-      (* Same as scan_multipart_body, but decodes the bodies of the parts
-       * if they are encoded using the methods "base64" or "quoted printable".
-       * Fails, if an unknown encoding is used.
-       *)
-
-val scan_multipart_body_from_netstream
-    : Netstream.t ->
-      boundary:string ->
-      create:((string * string) list -> 'a) ->
-      add:('a -> Netstream.t -> int -> int -> unit) ->
-      stop:('a -> unit) ->
-      unit
-      (* scan_multipart_body_from_netstream s b create add stop:
-       *
-       * Reads the MIME message from the netstream s block by block. The
-       * parts are delimited by the boundary b.
-       *
-       * Once a new part is detected and begins, the function 'create' is
-       * called with the MIME header as argument. The result p of this function
-       * may be of any type.
-       *
-       * For every chunk of the part that is being read, the function 'add'
-       * is invoked: add p s k n.
-       * Here, p is the value returned by the 'create' invocation for the
-       * current part. s is the netstream. The current window of s contains
-       * the read chunk completely; the chunk begins at position k of the
-       * window (relative to the beginning of the window) and has a length
-       * of n bytes.
-       *
-       * When the part has been fully read, the function 'stop' is
-       * called with p as argument.
-       *
-       * That means, for every part the following is executed:
-       * - let p = create h
-       * - add p s k1 n1
-       * - add p s k2 n2
-       * - ...
-       * - add p s kN nN
-       * - stop p
-       *
-       * IMPORTANT PRECONDITION:
-       * - The block size of the netstream s must be at least
-       *   String.length b + 3
-       *
-       * EXCEPTIONS:
-       * - Exceptions can happen because of ill-formed input, and within
-       *   the callbacks of the functions 'create', 'add', 'stop'.
-       * - If the exception happens while part p is being read, and the
-       *   'create' function has already been called (successfully), the
-       *   'stop' function is also called (you have the chance to close files).
-       *)
-
-
-(* THREAD-SAFETY:
- * The functions are thread-safe as long as the threads do not share
- * values.
- *)
-
-(* ======================================================================
- * History:
- *
- * $Log$
- * Revision 1.1 2000/11/17 09:57:27 lpadovan
- * Initial revision
- *
- * Revision 1.8 2000/08/13 00:04:36 gerd
- * Encoded_word -> EncodedWord
- * Bugfixes.
- *
- * Revision 1.7 2000/08/07 00:25:00 gerd
- * Major update of the interface for structured field lexing.
- *
- * Revision 1.6 2000/06/25 22:34:43 gerd
- * Added labels to arguments.
- *
- * Revision 1.5 2000/06/25 21:15:48 gerd
- * Checked thread-safety.
- *
- * Revision 1.4 2000/05/16 22:29:12 gerd
- * New "option" arguments specifying the level of MIME
- * compatibility.
- *
- * Revision 1.3 2000/04/15 13:09:01 gerd
- * Implemented uploads to temporary files.
- *
- * Revision 1.2 2000/03/02 01:15:30 gerd
- * Updated.
- *
- * Revision 1.1 2000/02/25 15:21:12 gerd
- * Initial revision.
- *
- *
- *)
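
Illustration (not part of the removed file). The header normalizations documented for scan_header above (names lowercased, folded lines joined with CR/LF deleted, values trimmed, position of the body returned) can be sketched with the OCaml standard library alone. This is only a sketch under assumptions: toy_scan_header and read_line are hypothetical names, the code is not the netstring implementation, there is no ?unfold option, and malformed input is simply reported with failwith.

```ocaml
(* Toy imitation of the behaviour documented for scan_header.
   Hypothetical helper names; NOT the netstring implementation. *)

(* Return the text of the line starting at [i] (without LF or CRLF)
   and the index of the first character after that line. *)
let read_line s i stop =
  let j =
    match String.index_from_opt s i '\n' with
    | Some j when j < stop -> j
    | _ -> stop
  in
  let next = if j < stop then j + 1 else stop in
  let last = if j > i && s.[j - 1] = '\r' then j - 1 else j in
  (String.sub s i (last - i), next)

let toy_scan_header s ~start_pos ~end_pos =
  let is_ws c = c = ' ' || c = '\t' in
  (* [acc]: finished (name, value) pairs, newest first.
     [cur]: the field that may still receive continuation lines. *)
  let flush acc = function
    | None -> acc
    | Some (name, value) -> (name, String.trim value) :: acc
  in
  let rec loop i acc cur =
    if i >= end_pos then (List.rev (flush acc cur), end_pos)
    else
      let line, next = read_line s i end_pos in
      if line = "" then
        (* The first empty line separates header and body. *)
        (List.rev (flush acc cur), next)
      else if is_ws line.[0] then
        (* Continuation line: drop the line break, keep its whitespace. *)
        (match cur with
         | Some (name, value) -> loop next acc (Some (name, value ^ line))
         | None -> failwith "toy_scan_header: continuation without a field")
      else
        match String.index_opt line ':' with
        | None -> failwith "toy_scan_header: missing ':'"
        | Some k ->
          let name = String.lowercase_ascii (String.sub line 0 k) in
          let value = String.sub line (k + 1) (String.length line - k - 1) in
          loop next (flush acc cur) (Some (name, value))
  in
  loop start_pos [] None

(* A small folded header followed by a body. *)
let demo_msg =
  "Content-Type: text/plain;\r\n charset=us-ascii\r\nSubject:  Hi \r\n\r\nbody"

let demo_params, demo_body_pos =
  toy_scan_header demo_msg ~start_pos:0 ~end_pos:(String.length demo_msg)
```

Here demo_params holds the two fields with lowercased names and unfolded, trimmed values, and demo_body_pos is the index at which "body" starts. Because the line break of a folded line is dropped while the leading whitespace of the continuation is kept, at least one space or tab remains where the CR/LF was, which matches the guarantee stated in the scan_header description.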