The PXP user's guide
Chapter 4. Configuring and calling the parser

4.2. Resolvers and sources

4.2.1. Using the built-in resolvers (called sources)

The type source enumerates the two possibilities where the document to parse comes from:

type source =
    Entity of ((dtd -> Pxp_entity.entity) * Pxp_reader.resolver)
  | ExtID of (ext_id * Pxp_reader.resolver)

Normally you need not worry about this type, as there are convenience functions that create source values:

from_file s: The document is read from file s; you may specify absolute or relative path names. The file name must be encoded as a UTF-8 string.
There is an optional argument ~system_encoding specifying the character encoding that is used for file names in the file system. For example, if this encoding is ISO-8859-1 and s is also an ISO-8859-1 string, you can form the source:
```
(* recode_string is provided by Netconversion (ocamlnet) *)
let s_utf8 = recode_string ~in_enc:`Enc_iso88591 ~out_enc:`Enc_utf8 s in
from_file ~system_encoding:`Enc_iso88591 s_utf8
```
This source has the advantage that it is able to resolve inner external entities; i.e. if your document includes data from another file (using the SYSTEM attribute), this mode will find that file. However, this mode cannot resolve PUBLIC identifiers, nor SYSTEM identifiers other than "file:".
from_channel ch: The document is read from the channel ch. In general, this source also supports file URLs found in the document; however, by default only absolute URLs are understood. It is possible to associate an ID with the channel such that the resolver knows how to interpret relative URLs:
```
from_channel ~id:(System "file:///dir/dir1/") ch
```
There is also the ~system_encoding argument specifying how file names are encoded. The example from above can also be written as follows (but it is no longer possible to interpret relative URLs because there is no ~id argument, and computing this argument is relatively complicated because it must be a valid URL):
```
let ch = open_in s in
let src = from_channel ~system_encoding:`Enc_iso88591 ch in
...;
close_in ch
```
from_string s: The string s is the document to parse. This mode is not able to interpret file names of SYSTEM clauses, nor can it look up PUBLIC identifiers.
Normally, the encoding of the string is detected as usual by analyzing the XML declaration, if any. However, it is also possible to specify the encoding directly:
```
let src = from_string ~fixenc:`Enc_iso88592 s
```
ExtID (id, r): The document to parse is denoted by the identifier id (either a SYSTEM or PUBLIC clause), and this identifier is interpreted by the resolver r. Use this mode if you have written your own resolver.
Which character sets are possible depends on the passed resolver r.
Entity (get_entity, r): The document to parse is returned by the function invocation get_entity dtd, where dtd is the DTD object to use (it may be empty). Inner external references occurring in this entity are resolved using the resolver r.
Which character sets are possible depends on the passed resolver r.

4.2.2. The resolver API

A resolver is an object that can be opened like a file; however, you do not pass a file name to the resolver but the XML identifier of the entity to read from (either a SYSTEM or PUBLIC clause). When opened, the resolver must return the Lexing.lexbuf that reads the characters. The resolver can be closed, and it can be cloned. Furthermore, it is possible to tell the resolver which character set it should assume. The following is from Pxp_reader:

exception Not_competent
exception Not_resolvable of exn

class type resolver =
  object
    method init_rep_encoding : rep_encoding -> unit
    method init_warner : collect_warnings -> unit
    method rep_encoding : rep_encoding
    method open_in : ext_id -> Lexing.lexbuf
    method close_in : unit
    method change_encoding : string -> unit
    method clone : resolver
    method close_all : unit
  end

The resolver object must work as follows:

When the parser is called, it tells the resolver the warner object and the internal encoding by invoking init_warner and init_rep_encoding. The resolver should store these values. The method rep_encoding should return the internal encoding.
If the parser wants to read from the resolver, it invokes the method open_in. Either the resolver succeeds, in which case the Lexing.lexbuf reading from the file or stream must be returned, or opening fails. In the latter case the method implementation should raise an exception (see below).
When the parser finishes reading, it calls the close_in method.
If the parser finds a reference to another external entity in the input stream, it calls clone to get a second resolver which must be initially closed (not yet connected with an input stream). The parser then invokes open_in and the other methods as described.
If you already know the character set of the input stream, you should recode it to the internal encoding, and define the method change_encoding as an empty method.
If you want to support multiple external character sets, the object must follow a much more complicated protocol. Directly after open_in has been called, the resolver must return a lexical buffer that only reads one byte at a time. This is only possible if you create the lexical buffer with Lexing.from_function; the function must then always return 1 if EOF is not yet reached, and 0 if EOF is reached. Once the parser has read the first line of the document, it invokes change_encoding to tell the resolver which character set to assume. From this moment, the object can return more than one byte at once. The argument of change_encoding is either the parameter of the "encoding" attribute of the XML declaration, or the empty string if there is no XML declaration or if the declaration does not contain an encoding attribute.
At the beginning the resolver must only return one character every time something is read from the lexical buffer. The reason for this is that you otherwise would not know exactly at which position in the input stream the character set changes.
If you want automatic recognition of the character set, it is up to the resolver object to implement this.
If an error occurs, the parser calls the method close_all for the top-level resolver; this method should close the resolver itself (if not already done) and all clones.
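
The one-byte-at-a-time rule for resolvers that support several character sets can be sketched with the standard library alone. This is only an illustration, not code from Pxp_reader: the names make_lexbuf, unrestricted and pos are hypothetical, and a real resolver would set the flag from its change_encoding method and recode the remaining input to the internal encoding.

```ocaml
(* Sketch only: [make_lexbuf], [unrestricted] and [pos] are illustrative
   names and not part of Pxp_reader. *)
let make_lexbuf (data : string) =
  let pos = ref 0 in
  (* A real resolver would set this flag inside change_encoding. *)
  let unrestricted = ref false in
  let refill buf n =
    let remaining = String.length data - !pos in
    (* Before the encoding is known, hand out at most one byte. *)
    let want = if !unrestricted then n else 1 in
    let len = min want remaining in
    Bytes.blit_string data !pos buf 0 len;
    pos := !pos + len;
    len  (* 0 tells the lexer that EOF is reached *)
  in
  (Lexing.from_function refill, unrestricted, pos)
```

Because the refill function never delivers more than one byte while the flag is unset, the position in the input at which the character set changes is always known exactly.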

Exceptions. It is possible to chain resolvers such that when the first resolver is not able to open the entity, the other resolvers of the chain are tried in turn. The method open_in should raise the exception Not_competent to indicate that the next resolver should try to open the entity. If the resolver is able to handle the ID, but some other error occurs, the exception Not_resolvable should be raised to force the chain to break.
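
The difference between the two exceptions can be illustrated with a small model of such a chain. This is a sketch under assumptions, not the implementation in Pxp_reader: the exceptions are redeclared locally so that the example is self-contained, plain functions stand in for the open_in methods, and open_chain, only_a, broken and fallback are hypothetical names.

```ocaml
(* Local stand-ins for the Pxp_reader exceptions. *)
exception Not_competent
exception Not_resolvable of exn

(* Try the openers in turn. Not_competent passes control to the next
   opener; any other exception, including Not_resolvable, propagates
   and thereby breaks the chain. *)
let rec open_chain openers id =
  match openers with
  | [] -> raise Not_competent            (* nobody accepted the ID *)
  | f :: rest ->
      match f id with
      | result -> result
      | exception Not_competent -> open_chain rest id

(* Hypothetical openers for demonstration. *)
let only_a id = if id = "a" then "from A" else raise Not_competent
let broken _ = raise (Not_resolvable (Failure "I/O error"))
let fallback id = "fallback for " ^ id
```

With these definitions, open_chain [only_a; fallback] "b" falls through to fallback, whereas open_chain [only_a; broken; fallback] "b" raises Not_resolvable and never reaches fallback.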

Example: How to define a resolver that is equivalent to from_string: ...

4.2.3. Predefined resolver components

There are some classes in Pxp_reader that define common resolver behaviour.

class resolve_read_this_channel :
    ?id:ext_id ->
    ?fixenc:encoding ->
    ?auto_close:bool ->
    in_channel ->
        resolver

Reads from the passed channel (it may even be a pipe). If the ~id argument is passed to the object, the created resolver accepts only this ID. Otherwise all IDs are accepted.

Once the resolver has been cloned, it does not accept any ID. This means that this resolver cannot handle inner references to external entities. Note that you can combine this resolver with another resolver that can handle inner references (such as resolve_as_file); see class 'combine' below.

If you pass the ~fixenc argument, the encoding of the channel is set to the passed value, regardless of any auto-recognition or any XML declaration.

If ~auto_close = true (which is the default), the channel is closed after use. If ~auto_close = false, the channel is left open.

class resolve_read_any_channel :
    ?auto_close:bool ->
    channel_of_id:(ext_id -> (in_channel * encoding option)) ->
        resolver

This resolver calls the function ~channel_of_id to open a new channel for the passed ext_id. This function must either return the channel and the encoding, or it must fail with Not_competent. The function must return None as encoding if the default mechanism to recognize the encoding should be used. It must return Some e if it is already known that the encoding of the channel is e. If ~auto_close = true (which is the default), the channel is closed after use. If ~auto_close = false, the channel is left open.

class resolve_read_url_channel :
    ?base_url:Neturl.url ->
    ?auto_close:bool ->
    url_of_id:(ext_id -> Neturl.url) ->
    channel_of_url:(Neturl.url -> (in_channel * encoding option)) ->
        resolver

When this resolver gets an ID to read from, it calls the function ~url_of_id to get the corresponding URL. This URL may be a relative URL; however, a URL scheme must be used which contains a path. The resolver converts the URL to an absolute URL if necessary. The second function, ~channel_of_url, is fed with the absolute URL as input. This function opens the resource to read from, and returns the channel and the encoding of the resource.

Both functions, ~url_of_id and ~channel_of_url, can raise Not_competent to indicate that the object is not able to read from the specified resource. However, there is a difference: A Not_competent from ~url_of_id is left as it is, but a Not_competent from ~channel_of_url is converted to Not_resolvable. So only ~url_of_id decides which URLs are accepted by the resolver and which are not.
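
The asymmetry between the two callbacks can be modelled as follows. This is a hedged sketch, not the code of resolve_read_url_channel: the exceptions are local stand-ins, open_via_url is a hypothetical name, and the payload carried by Not_resolvable here is an assumption.

```ocaml
(* Local stand-ins for the Pxp_reader exceptions. *)
exception Not_competent
exception Not_resolvable of exn

(* [open_via_url] is an illustrative name for the two-step lookup. *)
let open_via_url ~url_of_id ~channel_of_url id =
  (* A Not_competent raised here propagates unchanged, so the next
     resolver of a chain gets its turn. *)
  let url = url_of_id id in
  (* A Not_competent raised here is converted: the ID was accepted, so
     failing to open the resource is an error. The wrapped exception is
     an assumption of this sketch. *)
  try channel_of_url url
  with Not_competent -> raise (Not_resolvable Not_competent)
```

In a chain of resolvers, the first case lets the remaining resolvers try the ID, while the second case stops the lookup.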

The function ~channel_of_url must return None as encoding if the default mechanism to recognize the encoding should be used. It must return Some e if it is already known that the encoding of the channel is e.

If ~auto_close = true (which is the default), the channel is closed after use. If ~auto_close = false, the channel is left open.

Objects of this class contain a base URL relative to which relative URLs are interpreted. When creating a new object, you can specify the base URL by passing it as ~base_url argument. When an existing object is cloned, the base URL of the clone is the URL of the original object. Note that the term "base URL" has a strict definition in RFC 1808.

class resolve_read_this_string :
    ?id:ext_id ->
    ?fixenc:encoding ->
    string ->
        resolver

Reads from the passed string. If the ~id argument is passed to the object, the created resolver accepts only this ID. Otherwise all IDs are accepted.

Once the resolver has been cloned, it does not accept any ID. This means that this resolver cannot handle inner references to external entities. Note that you can combine this resolver with another resolver that can handle inner references (such as resolve_as_file); see class 'combine' below.

If you pass the ~fixenc argument, the encoding of the string is set to the passed value, regardless of any auto-recognition or any XML declaration.

class resolve_read_any_string :
    string_of_id:(ext_id -> (string * encoding option)) ->
        resolver

This resolver calls the function ~string_of_id to get the string for the passed ext_id. This function must either return the string and the encoding, or it must fail with Not_competent. The function must return None as encoding if the default mechanism to recognize the encoding should be used. It must return Some e if it is already known that the encoding of the string is e.
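
A function suitable as ~string_of_id might look like the sketch below. To keep the example compilable without PXP, it declares simplified stand-ins for ext_id, encoding and Not_competent; in real code these come from the PXP modules. The catalogue and its "mem:" identifiers are hypothetical.

```ocaml
(* Simplified stand-ins so the sketch compiles without PXP. *)
exception Not_competent
type ext_id = System of string | Public of (string * string)
type encoding = [ `Enc_utf8 | `Enc_iso88591 ]

(* Hypothetical in-memory catalogue of entities. *)
let catalogue : (ext_id * (string * encoding option)) list = [
  (* None: let the default mechanism recognize the encoding. *)
  System "mem:greeting", ("<hello/>", None);
  (* Some e: the encoding is already known. *)
  System "mem:latin1", ("<caf\233/>", Some `Enc_iso88591);
]

let string_of_id (id : ext_id) : string * encoding option =
  match List.assoc_opt id catalogue with
  | Some entry -> entry
  | None -> raise Not_competent   (* unknown ID: let another resolver try *)
```

Under these assumptions the function would be passed as new resolve_read_any_string ~string_of_id.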

class resolve_as_file :
    ?file_prefix:[ `Not_recognized | `Allowed | `Required ] ->
    ?host_prefix:[ `Not_recognized | `Allowed | `Required ] ->
    ?system_encoding:encoding ->
    ?url_of_id:(ext_id -> Neturl.url) ->
    ?channel_of_url: (Neturl.url -> (in_channel * encoding option)) ->
    unit ->
        resolver

Reads from the local file system. Every file name is interpreted as a file name of the local file system, and the referenced file is read.

The full form of a file URL is: file://host/path, where 'host' specifies the host system where the file identified by 'path' resides. host = "" or host = "localhost" are accepted; other values will raise Not_competent. The standard for file URLs is defined in RFC 1738.
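
The host rule can be modelled with a few lines of string handling. This sketch covers only the full form file://host/path and is not the parsing code of resolve_as_file; path_of_file_url is a hypothetical name and Not_competent is a local stand-in for the Pxp_reader exception.

```ocaml
(* Local stand-in for Pxp_reader.Not_competent. *)
exception Not_competent

(* Return the path of a full file URL "file://host/path". *)
let path_of_file_url (url : string) : string =
  let prefix = "file://" in
  let plen = String.length prefix in
  if String.length url < plen || String.sub url 0 plen <> prefix then
    raise Not_competent;
  let rest = String.sub url plen (String.length url - plen) in
  match String.index_opt rest '/' with
  | None -> raise Not_competent          (* no path component *)
  | Some i ->
      let host = String.sub rest 0 i in
      (* Only the empty host and "localhost" denote the local system. *)
      if host = "" || host = "localhost" then
        String.sub rest i (String.length rest - i)
      else raise Not_competent
```

For example, both "file:///tmp/a.xml" and "file://localhost/tmp/a.xml" yield the path "/tmp/a.xml", while a URL naming any other host raises Not_competent.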

Option ~file_prefix: Specifies how the "file:" prefix of file names is handled:

`Not_recognized: The prefix is not recognized.
`Allowed: The prefix is allowed but not required (the default).
`Required: The prefix is required.

Option ~host_prefix: Specifies how the "//host" phrase of file names is handled:

`Not_recognized: The prefix is not recognized.
`Allowed: The prefix is allowed but not required (the default).
`Required: The prefix is required.

Option ~system_encoding: Specifies the encoding of file names of the local file system. Default: UTF-8.

Options ~url_of_id, ~channel_of_url: Not for the casual user!

class combine :
    ?prefer:resolver ->
    resolver list ->
        resolver

Combines several resolver objects. If a concrete entity with an ext_id is to be opened, the combined resolver tries the contained resolvers in turn until a resolver accepts opening the entity (i.e. it does not raise Not_competent on open_in).

Clones: If the 'clone' method is invoked before 'open_in', all contained resolvers are cloned separately and again combined. If the 'clone' method is invoked after 'open_in' (i.e. while the resolver is open), additionally the clone of the active resolver is flagged as being preferred, i.e. it is tried first.
