4 >Resolvers and sources</TITLE
7 CONTENT="Modular DocBook HTML Stylesheet Version 1.46"><LINK
9 TITLE="The PXP user's guide"
10 HREF="index.html"><LINK
12 TITLE="Configuring and calling the parser"
13 HREF="c1567.html"><LINK
15 TITLE="Configuring and calling the parser"
16 HREF="c1567.html"><LINK
18 TITLE="The DTD classes"
19 HREF="x1812.html"><LINK
22 HREF="markup.css"></HEAD
41 >The PXP user's guide</TH
56 >Chapter 4. Configuring and calling the parser</TD
76 >4.2. Resolvers and sources</A
84 >4.2.1. Using the built-in resolvers (called sources)</A
91 possibilities where the document to parse comes from.
94 CLASS="PROGRAMLISTING"
96 Entity of ((dtd -> Pxp_entity.entity) * Pxp_reader.resolver)
97 | ExtID of (ext_id * Pxp_reader.resolver)</PRE
100 You normally need not worry about this type, as there are convenience
101 functions that create <TT
112 STYLE="list-style-type: disc"
117 >: The document is read from
121 >; you may specify absolute or relative path names.
122 The file name must be encoded as UTF-8 string.</P
124 >There is an optional argument <TT
126 >~system_encoding</TT
128 specifying the character encoding which is used for the names of the file
129 system. For example, if this encoding is ISO-8859-1 and <TT
133 also an ISO-8859-1 string, you can form the source:
136 CLASS="PROGRAMLISTING"
137 >let s_utf8 = recode_string ~in_enc:`Enc_iso88591 ~out_enc:`Enc_utf8 s in
138 from_file ~system_encoding:`Enc_iso88591 s_utf8</PRE
144 > has the advantage that
145 it is able to resolve inner external entities; i.e. if your document includes
146 data from another file (using the <TT
150 mode will find that file. However, this mode cannot resolve
154 > identifiers nor <TT
158 other than "file:".</P
161 STYLE="list-style-type: disc"
166 >: The document is read
170 >. In general, this source also supports
171 file URLs found in the document; however, by default only absolute URLs are
172 understood. It is possible to associate an ID with the channel such that the
173 resolver knows how to interpret relative URLs:
176 CLASS="PROGRAMLISTING"
177 >from_channel ~id:(System "file:///dir/dir1/") ch</PRE
180 There is also the ~system_encoding argument specifying how file names are
181 encoded. - The example from above can also be written (but it is no
182 longer possible to interpret relative URLs because there is no ~id argument,
183 and computing this argument is relatively complicated because it must
187 CLASS="PROGRAMLISTING"
188 >let ch = open_in s in
189 let src = from_channel ~system_encoding:`Enc_iso88591 ch in
195 STYLE="list-style-type: disc"
204 > is the document to parse. This mode is not able to
205 interpret file names of <TT
208 > clauses, nor can it look up
214 >Normally, the encoding of the string is detected as usual
215 by analyzing the XML declaration, if any. However, it is also possible to
216 specify the encoding directly:
219 CLASS="PROGRAMLISTING"
220 >let src = from_string ~fixenc:`Enc_iso88592 s</PRE
224 STYLE="list-style-type: disc"
229 >: The document to parse
230 is denoted by the identifier <TT
241 identifier is interpreted by the resolver <TT
245 if you have written your own resolver.</P
247 >Which character sets are possible depends on the passed
254 STYLE="list-style-type: disc"
258 >Entity (get_entity, r)</TT
260 to parse is returned by the function invocation <TT
267 > is the DTD object to use (it may be
268 empty). Inner external references occurring in this entity are resolved using
274 >Which character sets are possible depends on the passed
289 >4.2.2. The resolver API</A
292 >A resolver is an object that can be opened like a file. Instead of
293 a file name, however, you pass the resolver the XML identifier of the entity
294 to read from (either a <TT
301 clause). When opened, the resolver must return the
305 > that reads the characters. The resolver can
306 be closed, and it can be cloned. Furthermore, it is possible to tell the
307 resolver which character set it should assume. The following is from Pxp_reader:
310 CLASS="PROGRAMLISTING"
311 >exception Not_competent
312 exception Not_resolvable of exn
314 class type resolver =
316 method init_rep_encoding : rep_encoding -> unit
317 method init_warner : collect_warnings -> unit
318 method rep_encoding : rep_encoding
319 method open_in : ext_id -> Lexing.lexbuf
320 method close_in : unit
321 method change_encoding : string -> unit
322 method clone : resolver
323 method close_all : unit
327 The resolver object must work as follows:</P
334 STYLE="list-style-type: disc"
336 >When the parser is called, it tells the resolver the
337 warner object and the internal encoding by invoking
343 >init_rep_encoding</TT
345 resolver should store these values. The method <TT
349 should return the internal encoding.</P
352 STYLE="list-style-type: disc"
354 >If the parser wants to read from the resolver, it invokes
358 >. Either the resolver succeeds, in which
362 > reading from the file or stream must
363 be returned, or opening fails. In the latter case the method implementation
364 should raise an exception (see below).</P
367 STYLE="list-style-type: disc"
369 >If the parser finishes reading, it calls the
376 STYLE="list-style-type: disc"
378 >If the parser finds a reference to another external
379 entity in the input stream, it calls <TT
383 resolver which must be initially closed (not yet connected with an input
384 stream). The parser then invokes <TT
388 methods as described.</P
391 STYLE="list-style-type: disc"
393 >If you already know the character set of the input
394 stream, you should recode it to the internal encoding, and define the method
398 > as an empty method.</P
401 STYLE="list-style-type: disc"
403 >If you want to support multiple external character sets,
404 the object must follow a much more complicated protocol. Directly after
408 > has been called, the resolver must return a lexical
409 buffer that only reads one byte at a time. This is only possible if you create
410 the lexical buffer with <TT
412 >Lexing.from_function</TT
414 must then always return 1 if the EOF is not yet reached, and 0 if EOF is
415 reached. Once the parser has read the first line of the document, it will invoke
419 > to tell the resolver which character set to
420 assume. From this moment, the object can return more than one byte at once. The
424 > is either the parameter of the
425 "encoding" attribute of the XML declaration, or the empty string if there is
426 not any XML declaration or if the declaration does not contain an encoding
429 >At the beginning the resolver must only return one
430 character every time something is read from the lexical buffer. The reason for
431 this is that otherwise you would not know exactly at which position in the
432 input stream the character set changes.</P
434 >If you want automatic recognition of the character set,
435 it is up to the resolver object to implement this.</P
438 STYLE="list-style-type: disc"
440 >If an error occurs, the parser calls the method
444 > for the top-level resolver; this method should
445 close itself (if not already done) and all clones.</P
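The byte-by-byte phase of the protocol above can be sketched with the standard library alone. This is a minimal illustration, not PXP's implementation: make_lexbuf, switch and consumed are hypothetical names, the input is a plain string, and switch only stands in for the moment at which change_encoding would be called.

```ocaml
(* Toy model of the "one byte at a time" rule; assumption: the input is
   already in memory as a string, and no recoding is performed. *)
let make_lexbuf (s : string) =
  let pos = ref 0 in              (* index of the next unread byte of s *)
  let single_byte = ref true in   (* true until the encoding is known *)
  let refill buf n =
    let remaining = String.length s - !pos in
    (* Before the switch, hand out at most one byte per call. *)
    let limit = if !single_byte then 1 else n in
    let k = min limit remaining in
    Bytes.blit_string s !pos buf 0 k;
    pos := !pos + k;
    k                             (* 0 signals EOF to the lexer *)
  in
  (* Plays the role of change_encoding: larger chunks are allowed afterwards. *)
  let switch () = single_byte := false in
  (* Number of input bytes handed to the lexical buffer so far. *)
  let consumed () = !pos in
  (Lexing.from_function refill, switch, consumed)
```

Because the buffer never reads ahead before switch is called, consumed () always names the exact input position at which a different decoder could take over.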
454 >It is possible to chain resolvers such that when the first resolver is not able
455 to open the entity, the other resolvers of the chain are tried in turn. The
459 > should raise the exception
463 > to indicate that the next resolver should try
464 to open the entity. If the resolver is able to handle the ID, but some other
465 error occurs, the exception <TT
469 to force the chain to break.
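The chaining rule can be illustrated with a small self-contained model. It is only a sketch under simplifying assumptions: a resolver is reduced to a function from a system ID to the entity text, the two exceptions are redeclared locally with the same shape as in Pxp_reader, and open_with_chain, mem_resolver and broken_file_resolver are hypothetical names.

```ocaml
(* Local stand-ins for the exceptions of Pxp_reader. *)
exception Not_competent
exception Not_resolvable of exn

(* Assumption: a "resolver" is just a function from a system ID to content. *)
type toy_resolver = string -> string

(* Try each resolver in turn. Not_competent passes the ID on to the next
   resolver; any other exception propagates and so breaks the chain. *)
let rec open_with_chain (chain : toy_resolver list) (id : string) : string =
  match chain with
  | [] -> raise Not_competent
  | r :: rest ->
      (match r id with
       | content -> content
       | exception Not_competent -> open_with_chain rest id)

let has_prefix p s =
  String.length s >= String.length p
  && String.sub s 0 (String.length p) = p

(* Accepts only IDs starting with "mem:" and returns a fixed document. *)
let mem_resolver : toy_resolver = fun id ->
  if has_prefix "mem:" id then "<a/>" else raise Not_competent

(* Accepts IDs starting with "file:" but always fails to open them. *)
let broken_file_resolver : toy_resolver = fun id ->
  if has_prefix "file:" id
  then raise (Not_resolvable (Sys_error (id ^ ": cannot open")))
  else raise Not_competent
```

With the chain [broken_file_resolver; mem_resolver], the ID "mem:doc" is declined by the first resolver and served by the second, whereas a "file:" ID stops at the first resolver with Not_resolvable and never reaches the second.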
473 >Example: How to define a resolver that is equivalent to
482 >4.2.3. Predefined resolver components</A
485 >There are some classes in Pxp_reader that define common resolver behaviour.
488 CLASS="PROGRAMLISTING"
489 >class resolve_read_this_channel :
491 ?fixenc:encoding ->
492 ?auto_close:bool ->
497 Reads from the passed channel (it may even be a pipe). If the
501 > argument is passed to the object, the created resolver
502 accepts only this ID. Otherwise all IDs are accepted. - Once the resolver has
503 been cloned, it does not accept any ID. This means that this resolver cannot
504 handle inner references to external entities. Note that you can combine this
505 resolver with another resolver that can handle inner references (such as
506 resolve_as_file); see class 'combine' below. - If you pass the
510 > argument, the encoding of the channel is set to the
511 passed value, regardless of any auto-recognition or any XML declaration. - If
514 >~auto_close = true</TT
515 > (which is the default), the channel is
516 closed after use. If <TT
518 >~auto_close = false</TT
524 CLASS="PROGRAMLISTING"
525 >class resolve_read_any_channel :
526 ?auto_close:bool ->
527 channel_of_id:(ext_id -> (in_channel * encoding option)) ->
531 This resolver calls the function <TT
535 new channel for the passed <TT
538 >. This function must either
539 return the channel and the encoding, or it must fail with Not_competent. The
540 function must return <TT
543 > as encoding if the default
544 mechanism to recognize the encoding should be used. It must return
548 > if it is already known that the encoding of the
554 >~auto_close = true</TT
556 (which is the default), the channel is closed after use. If
559 >~auto_close = false</TT
560 >, the channel is left open.</P
563 CLASS="PROGRAMLISTING"
564 >class resolve_read_url_channel :
565 ?base_url:Neturl.url ->
566 ?auto_close:bool ->
567 url_of_id:(ext_id -> Neturl.url) ->
568 channel_of_url:(Neturl.url -> (in_channel * encoding option)) ->
572 When this resolver gets an ID to read from, it calls the function
576 > to get the corresponding URL. This URL may be a
577 relative URL; however, a URL scheme must be used which contains a path. The
578 resolver converts the URL to an absolute URL if necessary. The second
582 >, is fed with the absolute URL as
583 input. This function opens the resource to read from, and returns the channel
584 and the encoding of the resource.</P
593 >, can raise Not_competent to indicate that
594 the object is not able to read from the specified resource. However, there is a
595 difference: A Not_competent from <TT
599 is, but a Not_competent from <TT
603 Not_resolvable. So only <TT
606 > decides which URLs are
607 accepted by the resolver and which are not.</P
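The different treatment of Not_competent in the two callbacks can be modelled in a few lines. This is a hedged sketch, not the class itself: open_by_url is a hypothetical helper, URLs and channels are left as abstract values, and the payload carried by Not_resolvable is an assumption.

```ocaml
(* Local stand-ins for the exceptions of Pxp_reader. *)
exception Not_competent
exception Not_resolvable of exn

(* Not_competent raised by url_of_id passes through unchanged, so a chain
   can try its next resolver. Not_competent raised by channel_of_url is
   reported as Not_resolvable, because the ID has already been accepted. *)
let open_by_url ~url_of_id ~channel_of_url id =
  let url = url_of_id id in
  try channel_of_url url
  with Not_competent ->
    (* Assumption: the original exception is used as the payload. *)
    raise (Not_resolvable Not_competent)
```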
616 > as encoding if the default mechanism to recognize the
617 encoding should be used. It must return <TT
621 already known that the encoding of the channel is <TT
628 >~auto_close = true</TT
629 > (which is the default), the channel is
630 closed after use. If <TT
632 >~auto_close = false</TT
636 >Objects of this class contain a base URL relative to which relative URLs are
637 interpreted. When creating a new object, you can specify the base URL by
641 > argument. When an existing object is
642 cloned, the base URL of the clone is the URL of the original object. - Note
643 that the term "base URL" has a strict definition in RFC 1808.</P
646 CLASS="PROGRAMLISTING"
647 >class resolve_read_this_string :
649 ?fixenc:encoding ->
654 Reads from the passed string. If the <TT
658 to the object, the created resolver accepts only this ID. Otherwise all IDs are
659 accepted. - Once the resolver has been cloned, it does not accept any ID. This
660 means that this resolver cannot handle inner references to external
661 entities. Note that you can combine this resolver with another resolver that
662 can handle inner references (such as resolve_as_file); see class 'combine'
663 below. - If you pass the <TT
666 > argument, the encoding of
667 the string is set to the passed value, regardless of any auto-recognition or
668 any XML declaration.</P
671 CLASS="PROGRAMLISTING"
672 >class resolve_read_any_string :
673 string_of_id:(ext_id -> (string * encoding option)) ->
677 This resolver calls the function <TT
681 string for the passed <TT
684 >. This function must either
685 return the string and the encoding, or it must fail with Not_competent. The
686 function must return <TT
689 > as encoding if the default
690 mechanism to recognize the encoding should be used. It must return
694 > if it is already known that the encoding of the
701 CLASS="PROGRAMLISTING"
702 >class resolve_as_file :
703 ?file_prefix:[ `Not_recognized | `Allowed | `Required ] ->
704 ?host_prefix:[ `Not_recognized | `Allowed | `Required ] ->
705 ?system_encoding:encoding ->
706 ?url_of_id:(ext_id -> Neturl.url) ->
707 ?channel_of_url: (Neturl.url -> (in_channel * encoding option)) ->
711 Reads from the local file system. Every file name is interpreted as a
712 file name of the local file system, and the file it refers to is read.</P
714 >The full form of a file URL is: file://host/path, where
715 'host' specifies the host system where the file identified by 'path'
716 resides. host = "" or host = "localhost" are accepted; other values
717 will raise Not_competent. The standard for file URLs is
718 defined in RFC 1738.</P
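The host rule for the full URL form can be sketched as follows. The sketch covers only the full form file://host/path and ignores the file_prefix and host_prefix options; path_of_file_url and strip_prefix are hypothetical helpers, not part of Pxp_reader, and no percent-decoding is attempted.

```ocaml
(* Local stand-in for the exception of Pxp_reader. *)
exception Not_competent

(* Returns the rest of s after the prefix p, if s starts with p. *)
let strip_prefix p s =
  let lp = String.length p in
  if String.length s >= lp && String.sub s 0 lp = p
  then Some (String.sub s lp (String.length s - lp))
  else None

(* Extracts the local path from "file://host/path". Only an empty host or
   "localhost" is accepted; everything else raises Not_competent. *)
let path_of_file_url (url : string) : string =
  match strip_prefix "file://" url with
  | None -> raise Not_competent
  | Some rest ->
      (match String.index_opt rest '/' with
       | None -> raise Not_competent            (* no path component *)
       | Some i ->
           let host = String.sub rest 0 i in
           if host = "" || host = "localhost"
           then String.sub rest i (String.length rest - i)
           else raise Not_competent)
```

Under these assumptions, "file:///etc/hosts" and "file://localhost/etc/hosts" both yield "/etc/hosts", while a URL naming any other host is declined.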
723 >: Specifies how the "file:" prefix of
724 file names is handled:
730 STYLE="list-style-type: disc"
734 >`Not_recognized:</TT
739 STYLE="list-style-type: disc"
744 > The prefix is allowed but
745 not required (the default).</P
748 STYLE="list-style-type: disc"
762 > Specifies how the "//host" phrase of
763 file names is handled:
769 STYLE="list-style-type: disc"
773 >`Not_recognized:</TT
778 STYLE="list-style-type: disc"
783 > The prefix is allowed but
784 not required (the default).</P
787 STYLE="list-style-type: disc"
800 >~system_encoding:</TT
801 > Specifies the encoding of file
802 names of the local file system. Default: UTF-8.</P
811 for the casual user!</P
814 CLASS="PROGRAMLISTING"
816 ?prefer:resolver ->
821 Combines several resolver objects. If a concrete entity with an
825 > is to be opened, the combined resolver tries the
826 contained resolvers in turn until a resolver accepts opening the entity
827 (i.e. it does not raise Not_competent on open_in).</P
829 >Clones: If the 'clone' method is invoked before 'open_in', all contained
830 resolvers are cloned separately and again combined. If the 'clone' method is
831 invoked after 'open_in' (i.e. while the resolver is open), additionally the
832 clone of the active resolver is flagged as being preferred, i.e. it is tried
876 >Configuring and calling the parser</TD