(* ----------------------------------------------------------------------
 * PXP: The polymorphic XML parser for Objective Caml.
 * Copyright by Gerd Stolpmann. See LICENSE for details.
 *)

exception Not_competent;;
  (* Raised by the 'open_in' method if the object does not know how to
   * handle the passed external ID.
   *)

exception Not_resolvable of exn;;
  (* Indicates that a resolver was competent, but an error occurred
   * while resolving the external ID. The passed exception explains the
   * reason; Not_resolvable(Not_found) indicates an unknown reason.
   *)

(* The class type 'resolver' is the official type of all "resolvers".
 * Resolvers take file names (or, more precisely, external identifiers)
 * and return lexbufs from which the file is scanned for tokens.
 * Resolvers may be cloned, and clones can interpret relative file
 * names relative to the resolver they were cloned from.
 *
 * Example of the latter:
 *
 * Resolver r reads from file:/dir/f1.xml
 *
 * <tag>
 *   &e;      -----> Entity e is bound to "subdir/f2.xml"
 * </tag>            Step (1): let r' = "clone of r"
 *                   Step (2): open file "subdir/f2.xml"
 *
 * r' must still know the directory of the file r is reading; otherwise
 * it could not resolve "subdir/f2.xml" to "file:/dir/subdir/f2.xml".
 *
 * This example can be coded as:
 *
 * let r = new resolve_as_file in
 * let lbuf = r # open_in "file:/dir/f1.xml" in
 * ... read from lbuf ...
 * let r' = r # clone in
 * let lbuf' = r' # open_in "subdir/f2.xml" in
 * ... read from lbuf' ...
 * r' # close_in;
 * ... read from lbuf ...
 * r # close_in;
 *)

(* A resolver can open an input source, and returns this source as
 * a Lexing.lexbuf.
 *
 * After creating a resolver, one must invoke the two methods
 * init_rep_encoding and init_warner to set the internal encoding of
 * strings and the warner object, respectively. This is normally
 * done by the parsing functions in Pxp_yacc.
 * It is not necessary to invoke these two methods for a fresh
 * clone.
 *
 * The character encoding of the source and the internal encoding
 * of the parser may differ. To cope with this, one of the tasks of
 * the resolver is to recode the characters of the input source
 * into the internal character encoding.
 *
 * There are several ways of determining the encoding of the input:
 * (1) The transport protocol (e.g. HTTP) may transmit the encoding.
 * (2) The beginning of the file can be inspected and analyzed:
 *     (2.1) The first two bytes indicate whether UTF-16 is used.
 *     (2.2) Otherwise, one can assume that an ASCII-compatible
 *           character set is used. It is then possible to read
 *           the XML declaration <?xml ... encoding="xyz" ...?>.
 *           The encoding found there is the one to use.
 *     (2.3) If the XML declaration is missing, the encoding is
 *           UTF-8.
 * The resolver only needs to distinguish between case (1), case
 * (2.1), and the remaining cases (2.2) and (2.3).
 * The details of analyzing whether (2.2) or (2.3) applies are
 * programmed elsewhere, and the resolver will be told the result
 * (see below).
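 *
 * A minimal sketch of check (2.1), assuming that a UTF-16 source
 * starts with a byte order mark (FE FF or FF FE). The names bom_guess
 * and guess_from_bom are hypothetical and not part of this module:
 *
 *   type bom_guess = Utf16_be | Utf16_le | No_bom
 *
 *   (* Inspect only the first two bytes of the input. *)
 *   let guess_from_bom (s : string) : bom_guess =
 *     if String.length s < 2 then No_bom
 *     else
 *       match s.[0], s.[1] with
 *       | '\xFE', '\xFF' -> Utf16_be
 *       | '\xFF', '\xFE' -> Utf16_le
 *       | _ -> No_bom
 *
 * No_bom corresponds to cases (2.2) and (2.3), where an
 * ASCII-compatible encoding is assumed until the parser reports the
 * XML declaration.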
 *
 * A resolver is like a file: it must be opened before one can work
 * with it, and it should be closed after all operations on it have
 * been done. The method 'open_in' is called with the external ID as
 * argument, and it must return the lexbuf reading from the external
 * resource. The method 'close_in' does not require an argument.
 *
 * A resolver may be re-opened after it has been closed, but it must
 * not be opened again while it is open.
 * A resolver may be closed several times: if 'close_in' is invoked
 * while the resolver is already closed, nothing happens.
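 *
 * The open/close discipline can be wrapped as follows. This is only a
 * sketch: with_resolver is a hypothetical helper, and Fun.protect
 * requires OCaml 4.08 or later.
 *
 *   (* Open r for id, apply f to the lexbuf, and always close r. *)
 *   let with_resolver r id f =
 *     let lexbuf = r # open_in id in
 *     Fun.protect
 *       ~finally:(fun () -> r # close_in)
 *       (fun () -> f lexbuf)
 *
 * Because closing an already closed resolver has no effect, it does
 * no harm if f has called close_in itself.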
 *
 * The method 'open_in' may raise Not_competent to indicate that this
 * resolver is not able to open this type of ID.
 *
 * The method 'change_encoding' is called by the parser after the
 * analysis of case (2) has been done; the argument is either the
 * name of the encoding, or the empty string to indicate that no XML
 * declaration was found. It is guaranteed that 'change_encoding' is
 * invoked after only a few tokens of the file. The resolver should
 * react as follows:
 * - If case (1) applies: Ignore the encoding passed to
 *   'change_encoding'.
 * - If case (2.1) applies: The encoding passed to 'change_encoding'
 *   must be compatible with UTF-16. This should be checked, and
 *   violations should be reported.
 * - Else: If the passed encoding is "", assume UTF-8. Otherwise,
 *   assume the passed encoding.
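 *
 * The three rules can be sketched as a pure function. The type
 * detected and the function effective_encoding are hypothetical;
 * treating "" as compatible with UTF-16 in case (2.1) and reporting
 * violations with failwith are assumptions, not something stated
 * above:
 *
 *   type detected =
 *     | By_transport of string   (* case (1) *)
 *     | Utf16_by_bytes           (* case (2.1) *)
 *     | Undetermined             (* cases (2.2) and (2.3) *)
 *
 *   let effective_encoding detected passed =
 *     match detected with
 *     | By_transport enc -> enc
 *     | Utf16_by_bytes ->
 *         let p = String.uppercase_ascii passed in
 *         if p = "" || p = "UTF-16" then "UTF-16"
 *         else failwith ("Encoding conflicts with UTF-16: " ^ passed)
 *     | Undetermined -> if passed = "" then "UTF-8" else passed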
 *
 * The following rule helps to synchronize the lexbuf with the
 * encoding: If the resolver has been opened, but 'change_encoding'
 * has not yet been invoked, the lexbuf contains at most one character
 * (which may be represented by multiple bytes); i.e. the lexbuf is
 * created by Lexing.from_function, and the function puts only one
 * character into the buffer at a time.
 * After 'change_encoding' has been invoked, there is no longer a limit
 * on the lexbuf size.
 *
 * The reason for this rule is that the exact character at which the
 * input switches to the encoding passed to 'change_encoding' is then
 * known.
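 *
 * A sketch of such a refill function, assuming a single-byte encoding
 * (so that one byte is one character) and a current OCaml in which
 * Lexing.from_function passes a bytes buffer. restricted_lexbuf and
 * the returned unrestrict function are hypothetical names:
 *
 *   let restricted_lexbuf (s : string) =
 *     let pos = ref 0 in
 *     let restricted = ref true in
 *     let refill buf n =
 *       let avail = String.length s - !pos in
 *       (* Deliver one byte at a time while restricted. *)
 *       let k = min avail (if !restricted then 1 else n) in
 *       Bytes.blit_string s !pos buf 0 k;
 *       pos := !pos + k;
 *       k
 *     in
 *     let unrestrict () = restricted := false in
 *     (Lexing.from_function refill, unrestrict)
 *
 * A resolver built this way would call unrestrict from its
 * change_encoding method.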
 *
 * The method 'clone' may be invoked for open or closed resolvers.
 * 'clone' returns a new resolver which is always closed.
 * If the original resolver is closed, the clone is simply a clone.
 * If the original resolver is open at the moment of cloning, and the
 * clone is later opened for a relative system ID (i.e. a relative
 * URL), the clone must interpret this ID relative to the ID of the
 * original resolver.
 *)

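(* A sketch of the relative lookup a clone has to perform, using plain
 * Unix-style paths instead of URLs. resolve_relative is a hypothetical
 * helper, not part of this module:
 *
 *   (* Interpret id relative to the directory of base if id is relative. *)
 *   let resolve_relative ~base id =
 *     if Filename.is_relative id
 *     then Filename.concat (Filename.dirname base) id
 *     else id
 *
 * With base = "/dir/f1.xml" and id = "subdir/f2.xml" this yields
 * "/dir/subdir/f2.xml" on a Unix system.
 *)
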
class type resolver =
  object
    method init_rep_encoding : rep_encoding -> unit
    method init_warner : collect_warnings -> unit

    method rep_encoding : rep_encoding

    method open_in : ext_id -> Lexing.lexbuf
      (* May raise Not_competent if the object does not know how to handle
       * the passed ext_id.
       *)

    method close_in : unit
    method change_encoding : string -> unit

    (* Every resolver can be cloned. The clone does not inherit the connection
     * with the external object, i.e. it is initially closed.
     *)
    method clone : resolver

    method close_all : unit
      (* Closes this resolver and every clone. *)
  end;;

(* Note: resolve_general is no longer exported. In most cases, the classes
 * resolve_read_any_channel or resolve_read_any_string are applicable, too,
 * and much easier to configure.
 *)

(* The next classes are resolvers for concrete input sources. *)

class resolve_read_this_channel :
  ?id:ext_id -> ?fixenc:encoding -> ?auto_close:bool ->
  in_channel -> resolver;;

  (* Reads from the passed channel (it may even be a pipe). If the ~id
   * argument is passed to the object, the created resolver accepts only
   * this ID. Otherwise all IDs are accepted.
   * Once the resolver has been cloned, it does not accept any ID. This
   * means that this resolver cannot handle inner references to external
   * entities. Note that you can combine this resolver with another resolver
   * that can handle inner references (such as resolve_as_file); see
   * class 'combine' below.
   * If you pass the ~fixenc argument, the encoding of the channel is
   * set to the passed value, regardless of any auto-recognition or
   * any XML declaration.
   * If ?auto_close = true (which is the default), the channel is
   * closed after use. If ?auto_close = false, the channel is left open.
   *)

class resolve_read_any_channel :
  ?auto_close:bool ->
  channel_of_id:(ext_id -> (in_channel * encoding option)) ->
  resolver;;

  (* resolve_read_any_channel f_open:
   * This resolver calls the function f_open (the ~channel_of_id argument)
   * to open a new channel for the passed ext_id. This function must either
   * return the channel and the encoding, or it must fail with Not_competent.
   * The function must return None as encoding if the default mechanism to
   * recognize the encoding should be used. It must return Some e if it is
   * already known that the encoding of the channel is e.
   * If ?auto_close = true (which is the default), the channel is
   * closed after use. If ?auto_close = false, the channel is left open.
   *)

class resolve_read_url_channel :
  ?base_url:Neturl.url ->
  ?auto_close:bool ->
  url_of_id:(ext_id -> Neturl.url) ->
  channel_of_url:(Neturl.url -> (in_channel * encoding option)) ->
  resolver;;

  (* resolve_read_url_channel url_of_id channel_of_url:
   *
   * When this resolver gets an ID to read from, it calls the function
   * ~url_of_id to get the corresponding URL. This URL may be a relative
   * URL; however, a URL scheme must be used which contains a path.
   * The resolver converts the URL to an absolute URL if necessary.
   * The second function, ~channel_of_url, is fed with the absolute URL
   * as input. This function opens the resource to read from, and returns
   * the channel and the encoding of the resource.
   *
   * Both functions, ~url_of_id and ~channel_of_url, can raise
   * Not_competent to indicate that the object is not able to read from
   * the specified resource. However, there is a difference: A Not_competent
   * from ~url_of_id is left as it is, but a Not_competent from ~channel_of_url
   * is converted to Not_resolvable. So only ~url_of_id decides which URLs
   * are accepted by the resolver and which are not.
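   *
   * The difference can be sketched as follows. open_url is a
   * hypothetical helper, the exceptions are redeclared only to make
   * the fragment self-contained, and using Not_found as the wrapped
   * exception is an assumption about the payload:
   *
   *   exception Not_competent
   *   exception Not_resolvable of exn
   *
   *   let open_url ~url_of_id ~channel_of_url id =
   *     (* A Not_competent raised here propagates unchanged. *)
   *     let url = url_of_id id in
   *     (* From here on the resolver has accepted the ID. *)
   *     try channel_of_url url
   *     with Not_competent -> raise (Not_resolvable Not_found)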
   *
   * The function ~channel_of_url must return None as encoding if the default
   * mechanism to recognize the encoding should be used. It must return
   * Some e if it is already known that the encoding of the channel is e.
   *
   * If ?auto_close = true (which is the default), the channel is
   * closed after use. If ?auto_close = false, the channel is left open.
   *
   * Objects of this class contain a base URL relative to which relative
   * URLs are interpreted. When creating a new object, you can specify
   * the base URL by passing it as ~base_url argument. When an existing
   * object is cloned, the base URL of the clone is the URL of the original
   * object.
   *
   * Note that the term "base URL" has a strict definition in RFC 1808.
   *)

class resolve_read_this_string :
  ?id:ext_id -> ?fixenc:encoding -> string -> resolver;;

  (* Reads from the passed string. If the ~id
   * argument is passed to the object, the created resolver accepts only
   * this ID. Otherwise all IDs are accepted.
   * Once the resolver has been cloned, it does not accept any ID. This
   * means that this resolver cannot handle inner references to external
   * entities. Note that you can combine this resolver with another resolver
   * that can handle inner references (such as resolve_as_file); see
   * class 'combine' below.
   * If you pass the ~fixenc argument, the encoding of the string is
   * set to the passed value, regardless of any auto-recognition or
   * any XML declaration.
   *)

class resolve_read_any_string :
  string_of_id:(ext_id -> (string * encoding option)) -> resolver;;

  (* resolve_read_any_string f_open:
   * This resolver calls the function f_open (the ~string_of_id argument)
   * to get the string for the passed ext_id. This function must either
   * return the string and the encoding, or it must fail with Not_competent.
   * The function must return None as encoding if the default mechanism to
   * recognize the encoding should be used. It must return Some e if it is
   * already known that the encoding of the string is e.
   *)

class resolve_as_file :
  ?file_prefix:[ `Not_recognized | `Allowed | `Required ] ->
  ?host_prefix:[ `Not_recognized | `Allowed | `Required ] ->
  ?system_encoding:encoding ->
  ?url_of_id:(ext_id -> Neturl.url) ->
  ?channel_of_url:(Neturl.url -> (in_channel * encoding option)) ->
  unit ->
  resolver;;

  (* Reads from the local file system. Every file name is interpreted as
   * a file name of the local file system, and the file referred to is read.
   *
   * The full form of a file URL is: file://host/path, where
   * 'host' specifies the host system where the file identified by 'path'
   * resides. host = "" or host = "localhost" are accepted; other values
   * will raise Not_competent. The standard for file URLs is
   * defined in RFC 1738.
   *
   * Option ~file_prefix: Specifies how the "file:" prefix of file names
   * is handled:
   * `Not_recognized: The prefix is not recognized.
   * `Allowed:        The prefix is allowed but not required (the default).
   * `Required:       The prefix is required.
   *
   * Option ~host_prefix: Specifies how the "//host" phrase of file names
   * is handled:
   * `Not_recognized: The phrase is not recognized.
   * `Allowed:        The phrase is allowed but not required (the default).
   * `Required:       The phrase is required.
   *
   * Option ~system_encoding: Specifies the encoding of file names of
   * the local file system. Default: UTF-8.
   *
   * Options ~url_of_id, ~channel_of_url: Not for the end user!
   *)

class combine : ?prefer:resolver -> resolver list -> resolver;;

  (* Combines several resolver objects. If a concrete entity with an
   * ext_id is to be opened, the combined resolver tries the contained
   * resolvers in turn until a resolver accepts opening the entity
   * (i.e. it does not raise Not_competent on open_in).
   *
   * Clones: If the 'clone' method is invoked before 'open_in', all contained
   * resolvers are cloned and again combined. If the 'clone' method is
   * invoked after 'open_in' (i.e. while the resolver is open), only the
   * active resolver is cloned.
   *)

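  (* A sketch of the search order described for 'combine', with each
   * resolver modelled as a plain function from an ID to a result.
   * open_first is a hypothetical name, the exception is redeclared only
   * to make the fragment self-contained, and raising Not_competent when
   * the list is exhausted is an assumption:
   *
   *   exception Not_competent
   *
   *   let rec open_first resolvers id =
   *     match resolvers with
   *     | [] -> raise Not_competent
   *     | r :: rest ->
   *         (* Only Not_competent moves on to the next resolver. *)
   *         (try r id with Not_competent -> open_first rest id)
   *
   * Any other exception, including Not_resolvable, propagates to the
   * caller and ends the search.
   *)
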
(* EXAMPLES OF RESOLVERS:
 *
 * let r1 = new resolve_as_file
 *   - r1 can open all local files
 *
 * let r2 = new resolve_read_this_channel
 *            ~id:"file:/dir/f.xml"
 *            (open_in "/dir/f.xml")
 *   - r2 can only read /dir/f.xml of the local file system. If this file
 *     contains references to other files, r2 will fail
 *
 * let r3 = new combine [ r2; r1 ]
 *   - r3 reads /dir/f.xml of the local file system by calling r2, and all
 *     other files by calling r1
 *)

(* ======================================================================
 * Revision 1.1  2000/11/17 09:57:29  lpadovan
 *
 * Revision 1.5  2000/07/09 01:05:33  gerd
 *   New method 'close_all' that closes the clones, too.
 *
 * Revision 1.4  2000/07/08 16:24:56  gerd
 *   Introduced the exception 'Not_resolvable' to indicate that
 *   'combine' should not try the next resolver of the list.
 *
 * Revision 1.3  2000/07/06 23:04:46  gerd
 *   Quick fix for 'combine': The active resolver is "preferred",
 *   but the other resolvers are also used.
 *
 * Revision 1.2  2000/07/04 22:06:49  gerd
 *   MAJOR CHANGE: Complete redesign of the reader classes.
 *
 * Revision 1.1  2000/05/29 23:48:38  gerd
 *   Changed module names:
 *     Markup_aux          into Pxp_aux
 *     Markup_codewriter   into Pxp_codewriter
 *     Markup_document     into Pxp_document
 *     Markup_dtd          into Pxp_dtd
 *     Markup_entity       into Pxp_entity
 *     Markup_lexer_types  into Pxp_lexer_types
 *     Markup_reader       into Pxp_reader
 *     Markup_types        into Pxp_types
 *     Markup_yacc         into Pxp_yacc
 *   See directory "compatibility" for (almost) compatible wrappers emulating
 *   Markup_document, Markup_dtd, Markup_reader, Markup_types, and Markup_yacc.
 *
 * ======================================================================
 * Old logs from markup_reader.mli:
 *
 * Revision 1.3  2000/05/29 21:14:57  gerd
 *   Changed the type 'encoding' into a polymorphic variant.
 *
 * Revision 1.2  2000/05/20 20:31:40  gerd
 *   Big change: Added support for various encodings of the
 *   internal representation.
 *
 * Revision 1.1  2000/03/13 23:41:54  gerd
 *)