4.2. Resolvers and sources

4.2.1. Using the built-in resolvers (called sources)

The type source enumerates the two ways of specifying where the document to parse comes from.

type source =
    Entity of ((dtd -> Pxp_entity.entity) * Pxp_reader.resolver)
  | ExtID of (ext_id * Pxp_reader.resolver)
You normally need not worry about this type, as there are convenience functions that create source values, such as from_string.

4.2.2. The resolver API

A resolver is an object that can be opened like a file. Instead of a file name, however, you pass the XML identifier of the entity to read from (either a SYSTEM or a PUBLIC clause). When opened, the resolver must return the Lexing.lexbuf that reads the characters. The resolver can be closed, and it can be cloned. Furthermore, it is possible to tell the resolver which character set it should assume. The following definitions are from Pxp_reader:

exception Not_competent
exception Not_resolvable of exn

class type resolver =
  object
    method init_rep_encoding : rep_encoding -> unit
    method init_warner : collect_warnings -> unit
    method rep_encoding : rep_encoding
    method open_in : ext_id -> Lexing.lexbuf
    method close_in : unit
    method change_encoding : string -> unit
    method clone : resolver
    method close_all : unit
  end
The resolver object must work as follows:

Exceptions. It is possible to chain resolvers so that, when the first resolver is not able to open the entity, the other resolvers of the chain are tried in turn. The method open_in should raise the exception Not_competent to indicate that the next resolver should try to open the entity. If the resolver is able to handle the ID but some other error occurs, the exception Not_resolvable should be raised to break the chain.
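The chaining rule can be illustrated with a small self-contained model. This is only a sketch: the exceptions and the ext_id type are redefined locally as stand-ins for the PXP definitions (the exact shape of ext_id is an assumption), and a "resolver" is reduced to a plain function, so the names opener, open_with_chain, public_only, broken_system and fallback are hypothetical and not part of Pxp_reader.

```ocaml
(* Local stand-ins so that the sketch compiles without PXP installed. *)
exception Not_competent
exception Not_resolvable of exn

type ext_id =
  | System of string
  | Public of string * string

(* A toy "resolver": a function from an ID to the text of the entity. *)
type opener = ext_id -> string

(* Try the openers in turn. Not_competent hands over to the next opener;
   any other exception, including Not_resolvable, propagates and thereby
   breaks the chain. *)
let rec open_with_chain (chain : opener list) (id : ext_id) : string =
  match chain with
  | [] -> raise Not_competent
  | first :: rest ->
      (try first id with Not_competent -> open_with_chain rest id)

(* Handles exactly one (made-up) PUBLIC identifier. *)
let public_only : opener = function
  | Public ("-//EXAMPLE//DTD Sample//EN", _) -> "<!ELEMENT sample ANY>"
  | _ -> raise Not_competent

(* Claims every SYSTEM identifier, but then fails while reading it. *)
let broken_system : opener = function
  | System name -> raise (Not_resolvable (Failure ("cannot read " ^ name)))
  | Public _ -> raise Not_competent

(* Accepts everything. *)
let fallback : opener = fun _ -> "<fallback/>"

let chain = [ public_only; broken_system; fallback ]
```

With this chain, an unknown PUBLIC identifier falls through to fallback, whereas any SYSTEM identifier ends with Not_resolvable and never reaches fallback.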

Example: How to define a resolver that is equivalent to from_string: ...

4.2.3. Predefined resolver components

There are some classes in Pxp_reader that define common resolver behaviour.

class resolve_read_this_channel : 
    ?id:ext_id -> 
    ?fixenc:encoding -> 
    ?auto_close:bool -> 
    in_channel -> 
        resolver
Reads from the passed channel (it may even be a pipe). If the ~id argument is passed to the object, the created resolver accepts only this ID. Otherwise all IDs are accepted.

Once the resolver has been cloned, it does not accept any ID. This means that this resolver cannot handle inner references to external entities. Note that you can combine this resolver with another resolver that can handle inner references (such as resolve_as_file); see class 'combine' below.

If you pass the ~fixenc argument, the encoding of the channel is set to the passed value, regardless of any auto-recognition or any XML declaration.

If ~auto_close = true (which is the default), the channel is closed after use. If ~auto_close = false, the channel is left open.

class resolve_read_any_channel : 
    ?auto_close:bool -> 
    channel_of_id:(ext_id -> (in_channel * encoding option)) -> 
        resolver
This resolver calls the function ~channel_of_id to open a new channel for the passed ext_id. This function must either return the channel and the encoding, or it must fail with Not_competent. The function must return None as encoding if the default mechanism to recognize the encoding should be used. It must return Some e if it is already known that the encoding of the channel is e. If ~auto_close = true (which is the default), the channel is closed after use. If ~auto_close = false, the channel is left open.
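As an illustration, a function suitable for ~channel_of_id might look like the following sketch. The types ext_id and encoding and the exception are local stand-ins for the PXP definitions, so that the fragment compiles on its own. The suffix-based dispatch and the suffix ".latin1.txt" are arbitrary assumptions made for the example.

```ocaml
(* Local stand-ins for the PXP definitions. *)
exception Not_competent

type ext_id =
  | System of string
  | Public of string * string

type encoding = [ `Enc_utf8 | `Enc_iso88591 ]

(* Open a channel for SYSTEM identifiers that name local files.
   - "*.xml": the encoding is left to the default recognition (None).
   - "*.latin1.txt": the encoding is known in advance (Some e).
   Everything else is refused with Not_competent. *)
let channel_of_id (id : ext_id) : in_channel * encoding option =
  match id with
  | System name when Filename.check_suffix name ".xml" ->
      (open_in_bin name, None)
  | System name when Filename.check_suffix name ".latin1.txt" ->
      (open_in_bin name, Some `Enc_iso88591)
  | System _ | Public _ -> raise Not_competent
```

Note that open_in_bin raises Sys_error if the file does not exist; the sketch lets this error propagate rather than mapping it to Not_competent.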

class resolve_read_url_channel :
    ?base_url:Neturl.url ->
    ?auto_close:bool -> 
    url_of_id:(ext_id -> Neturl.url) -> 
    channel_of_url:(Neturl.url -> (in_channel * encoding option)) -> 
        resolver
When this resolver gets an ID to read from, it calls the function ~url_of_id to get the corresponding URL. This URL may be a relative URL; however, its URL scheme must be one that includes a path. The resolver converts the URL to an absolute URL if necessary. The second function, ~channel_of_url, is given the absolute URL as input. This function opens the resource to read from, and returns the channel and the encoding of the resource.

Both functions, ~url_of_id and ~channel_of_url, can raise Not_competent to indicate that the object is not able to read from the specified resource. However, there is a difference: a Not_competent from ~url_of_id is left as it is, but a Not_competent from ~channel_of_url is converted to Not_resolvable. So only ~url_of_id decides which URLs are accepted by the resolver and which are not.
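The different treatment of the two functions can be modelled in a few lines. This is a sketch of the rule, not the code of the class: URLs are represented as plain strings instead of Neturl.url, the function open_via_url and the two sample fetch functions are hypothetical, and the exception carried by Not_resolvable is chosen arbitrarily.

```ocaml
(* Local stand-ins for the PXP definitions. *)
exception Not_competent
exception Not_resolvable of exn

type ext_id =
  | System of string
  | Public of string * string

(* A Not_competent from url_of_id propagates unchanged, whereas a
   Not_competent from channel_of_url is converted to Not_resolvable. *)
let open_via_url ~url_of_id ~channel_of_url (id : ext_id) =
  let url = url_of_id id in
  try channel_of_url url
  with Not_competent -> raise (Not_resolvable Not_competent)

(* Accept only SYSTEM identifiers that start with "http:". *)
let url_of_id = function
  | System s when String.length s >= 5 && String.sub s 0 5 = "http:" -> s
  | _ -> raise Not_competent

(* Two hypothetical fetch functions: one refuses, one succeeds. *)
let failing_fetch (_ : string) : string = raise Not_competent
let working_fetch (url : string) : string = "content of " ^ url
```

With failing_fetch, an "http:" identifier ends in Not_resolvable, so no further resolver of a chain is tried. An "ftp:" identifier is refused by url_of_id with Not_competent, so the next resolver gets its turn.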

The function ~channel_of_url must return None as encoding if the default mechanism to recognize the encoding should be used. It must return Some e if it is already known that the encoding of the channel is e.

If ~auto_close = true (which is the default), the channel is closed after use. If ~auto_close = false, the channel is left open.

Objects of this class contain a base URL relative to which relative URLs are interpreted. When creating a new object, you can specify the base URL by passing it as ~base_url argument. When an existing object is cloned, the base URL of the clone is the URL of the original object.

Note that the term "base URL" has a strict definition in RFC 1808.

class resolve_read_this_string : 
    ?id:ext_id -> 
    ?fixenc:encoding -> 
    string -> 
        resolver
Reads from the passed string. If the ~id argument is passed to the object, the created resolver accepts only this ID. Otherwise all IDs are accepted.

Once the resolver has been cloned, it does not accept any ID. This means that this resolver cannot handle inner references to external entities. Note that you can combine this resolver with another resolver that can handle inner references (such as resolve_as_file); see class 'combine' below.

If you pass the ~fixenc argument, the encoding of the string is set to the passed value, regardless of any auto-recognition or any XML declaration.

class resolve_read_any_string : 
    string_of_id:(ext_id -> (string * encoding option)) -> 
        resolver
This resolver calls the function ~string_of_id to get the string for the passed ext_id. This function must either return the string and the encoding, or it must fail with Not_competent. The function must return None as encoding if the default mechanism to recognize the encoding should be used. It must return Some e if it is already known that the encoding of the string is e.
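A function for ~string_of_id can be as simple as a table lookup. The following sketch serves two made-up PUBLIC identifiers from an in-memory catalog; the types and the exception are local stand-ins for the PXP definitions, and the name catalog and its entries are hypothetical.

```ocaml
(* Local stand-ins for the PXP definitions. *)
exception Not_competent

type ext_id =
  | System of string
  | Public of string * string

type encoding = [ `Enc_utf8 | `Enc_iso88591 ]

(* In-memory catalog: PUBLIC identifier -> (text, known encoding). *)
let catalog : (string * (string * encoding option)) list =
  [ ("-//EXAMPLE//ENTITIES Sample//EN",
     ("<!ENTITY greeting \"hello\">", Some `Enc_utf8));
    ("-//EXAMPLE//ENTITIES Other//EN",
     ("<!ENTITY farewell \"bye\">", None)) ]

(* Return the catalog entry for a known PUBLIC identifier and
   refuse everything else with Not_competent. *)
let string_of_id (id : ext_id) : string * encoding option =
  match id with
  | Public (pubid, _) ->
      (match List.assoc_opt pubid catalog with
       | Some entry -> entry
       | None -> raise Not_competent)
  | System _ -> raise Not_competent
```

The second entry returns None as encoding, so the default recognition mechanism would apply to it.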

class resolve_as_file :
    ?file_prefix:[ `Not_recognized | `Allowed | `Required ] ->
    ?host_prefix:[ `Not_recognized | `Allowed | `Required ] ->
    ?system_encoding:encoding ->
    ?url_of_id:(ext_id -> Neturl.url) -> 
    ?channel_of_url: (Neturl.url -> (in_channel * encoding option)) ->
    unit -> 
        resolver
Reads from the local file system. Every file name is interpreted as a file name of the local file system, and the file it refers to is read.

The full form of a file URL is: file://host/path, where 'host' specifies the host system where the file identified by 'path' resides. host = "" or host = "localhost" are accepted; other values will raise Not_competent. The standard for file URLs is defined in RFC 1738.
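The host rule can be sketched with plain string operations. This is a simplified model, not the implementation of resolve_as_file: it handles only the full form file://host/path, ignores the ~file_prefix and ~host_prefix options, and does no decoding of escaped characters. The function name path_of_file_url is hypothetical.

```ocaml
(* Local stand-in for the PXP exception. *)
exception Not_competent

(* Extract the path from a URL of the full form file://host/path.
   Only an empty host or "localhost" is accepted. *)
let path_of_file_url (url : string) : string =
  let prefix = "file://" in
  let plen = String.length prefix in
  if String.length url < plen || String.sub url 0 plen <> prefix then
    raise Not_competent;
  (* [rest] is "host/path"; the path starts at the first slash. *)
  let rest = String.sub url plen (String.length url - plen) in
  match String.index_opt rest '/' with
  | None -> raise Not_competent
  | Some i ->
      let host = String.sub rest 0 i in
      let path = String.sub rest i (String.length rest - i) in
      if host = "" || host = "localhost" then path
      else raise Not_competent
```

For example, both file:///etc/hosts and file://localhost/etc/hosts yield the path /etc/hosts, while a URL naming any other host is refused.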

Option ~file_prefix: Specifies how the "file:" prefix of file names is handled:

Option ~host_prefix: Specifies how the "//host" phrase of file names is handled:

Option ~system_encoding: Specifies the encoding of file names of the local file system. Default: UTF-8.

Options ~url_of_id, ~channel_of_url: Not for the casual user!

class combine : 
    ?prefer:resolver -> 
    resolver list -> 
        resolver
Combines several resolver objects. If a concrete entity with an ext_id is to be opened, the combined resolver tries the contained resolvers in turn until a resolver accepts opening the entity (i.e. it does not raise Not_competent on open_in).

Clones: If the 'clone' method is invoked before 'open_in', all contained resolvers are cloned separately and then combined again. If the 'clone' method is invoked after 'open_in' (i.e. while the resolver is open), the clone of the active resolver is additionally flagged as preferred, i.e. it is tried first.
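The effect of the preferred resolver on the search order can be shown with a toy model. The record type toy and the function open_combined are hypothetical and stand in for real resolver objects; identifiers are plain strings. Whether the real class tries the preferred resolver a second time as part of the list is not stated here, so the model simply puts it in front of the list.

```ocaml
(* Local stand-in for the PXP exception. *)
exception Not_competent

(* A toy resolver: a name plus a function that opens an identifier. *)
type toy = { name : string; try_open : string -> string }

(* Try [prefer] first (if given), then the resolvers of the list in turn.
   Returns the name of the resolver that accepted, and its result. *)
let open_combined ?prefer (resolvers : toy list) (id : string)
    : string * string =
  let candidates =
    match prefer with
    | None -> resolvers
    | Some p -> p :: resolvers
  in
  let rec go = function
    | [] -> raise Not_competent
    | r :: rest ->
        (try (r.name, r.try_open id) with Not_competent -> go rest)
  in
  go candidates

(* Two resolvers that both accept every identifier. *)
let a = { name = "a"; try_open = (fun id -> "a:" ^ id) }
let b = { name = "b"; try_open = (fun id -> "b:" ^ id) }

(* A resolver that refuses everything. *)
let refuse = { name = "refuse"; try_open = (fun _ -> raise Not_competent) }
```

Without ~prefer, open_combined [a; b] "x" is answered by a. With ~prefer:b the same call is answered by b, which corresponds to the clone of the active resolver being tried first.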