X-Git-Url: http://matita.cs.unibo.it/gitweb/?a=blobdiff_plain;f=helm%2FDEVEL%2Fpxp%2Fpxp%2Fdoc%2Fmanual%2Fhtml%2Fx1629.html;fp=helm%2FDEVEL%2Fpxp%2Fpxp%2Fdoc%2Fmanual%2Fhtml%2Fx1629.html;h=06b1e60ea5caac67fb51a5aabe19d8341e6a6735;hb=c03d2c1fdab8d228cb88aaba5ca0f556318bebc5;hp=0000000000000000000000000000000000000000;hpb=758057e85325f94cd88583feb1fdf6b038e35055;p=helm.git diff --git a/helm/DEVEL/pxp/pxp/doc/manual/html/x1629.html b/helm/DEVEL/pxp/pxp/doc/manual/html/x1629.html new file mode 100644 index 000000000..06b1e60ea --- /dev/null +++ b/helm/DEVEL/pxp/pxp/doc/manual/html/x1629.html @@ -0,0 +1,895 @@ +
The type source enumerates the two +possible origins of the document to be parsed. + +
type source = + Entity of ((dtd -> Pxp_entity.entity) * Pxp_reader.resolver) + | ExtID of (ext_id * Pxp_reader.resolver)+ +You normally need not worry about this type as there are convenience +functions that create source values: + + +
from_file s: The document is read from +file s; you may specify absolute or relative path names. +The file name must be encoded as a UTF-8 string.
There is an optional argument ~system_encoding +specifying the character encoding which is used for the names of the file +system. For example, if this encoding is ISO-8859-1 and s is +also an ISO-8859-1 string, you can form the source: + +
let s_utf8 = recode_string ~in_enc:`Enc_iso88591 ~out_enc:`Enc_utf8 s in +from_file ~system_encoding:`Enc_iso88591 s_utf8
This source has the advantage that +it is able to resolve inner external entities; i.e. if your document includes +data from another file (using the SYSTEM attribute), this +mode will find that file. However, this mode cannot resolve +PUBLIC identifiers nor SYSTEM identifiers +other than "file:".
from_channel ch: The document is read +from the channel ch. In general, this source also supports +file URLs found in the document; however, by default only absolute URLs are +understood. It is possible to associate an ID with the channel such that the +resolver knows how to interpret relative URLs: + +
from_channel ~id:(System "file:///dir/dir1/") ch+ +There is also the ~system_encoding argument specifying how file names are +encoded. - The example from above can also be written (but it is no +longer possible to interpret relative URLs because there is no ~id argument, +and computing this argument is relatively complicated because it must +be a valid URL): + +
let ch = open_in s in +let src = from_channel ~system_encoding:`Enc_iso88591 ch in +...; +close_in ch
from_string s: The string +s is the document to parse. This mode is not able to +interpret file names of SYSTEM clauses, nor can it look up +PUBLIC identifiers.
Normally, the encoding of the string is detected as usual +by analyzing the XML declaration, if any. However, it is also possible to +specify the encoding directly: + +
let src = from_string ~fixenc:`Enc_iso88592 s
ExtID (id, r): The document to parse +is denoted by the identifier id (either a +SYSTEM or PUBLIC clause), and this +identifier is interpreted by the resolver r. Use this mode +if you have written your own resolver.
Which character sets are possible depends on the passed +resolver r.
Entity (get_entity, r): The document +to parse is returned by the function invocation get_entity +dtd, where dtd is the DTD object to use (it may be +empty). Inner external references occurring in this entity are resolved using +the resolver r.
Which character sets are possible depends on the passed +resolver r.
A resolver is an object that can be opened like a file; however, instead of +a file name you pass the XML identifier of the entity +to read from (either a SYSTEM or PUBLIC +clause). When opened, the resolver must return the +Lexing.lexbuf that reads the characters. The resolver can +be closed, and it can be cloned. Furthermore, it is possible to tell the +resolver which character set it should assume. - The following is from Pxp_reader: + +
exception Not_competent +exception Not_resolvable of exn + +class type resolver = + object + method init_rep_encoding : rep_encoding -> unit + method init_warner : collect_warnings -> unit + method rep_encoding : rep_encoding + method open_in : ext_id -> Lexing.lexbuf + method close_in : unit + method change_encoding : string -> unit + method clone : resolver + method close_all : unit + end+ +The resolver object must work as follows:
When the parser is called, it tells the resolver the +warner object and the internal encoding by invoking +init_warner and init_rep_encoding. The +resolver should store these values. The method rep_encoding +should return the internal encoding.
If the parser wants to read from the resolver, it invokes +the method open_in. Either the resolver succeeds, in which +case the Lexing.lexbuf reading from the file or stream must +be returned, or opening fails. In the latter case the method implementation +should raise an exception (see below).
If the parser finishes reading, it calls the +close_in method.
If the parser finds a reference to another external +entity in the input stream, it calls clone to get a second +resolver which must be initially closed (not yet connected with an input +stream). The parser then invokes open_in and the other +methods as described.
If you already know the character set of the input +stream, you should recode it to the internal encoding, and define the method +change_encoding as an empty method.
If you want to support multiple external character sets, +the object must follow a much more complicated protocol. Directly after +open_in has been called, the resolver must return a lexical +buffer that only reads one byte at a time. This is only possible if you create +the lexical buffer with Lexing.from_function; the function +must then always return 1 if the EOF is not yet reached, and 0 if EOF is +reached. If the parser has read the first line of the document, it will invoke +change_encoding to tell the resolver which character set to +assume. From this moment, the object can return more than one byte at once. The +argument of change_encoding is either the parameter of the +"encoding" attribute of the XML declaration, or the empty string if there is +not any XML declaration or if the declaration does not contain an encoding +attribute.
At the beginning the resolver must only return one +character every time something is read from the lexical buffer. The reason for +this is that you otherwise would not exactly know at which position in the +input stream the character set changes.
If you want automatic recognition of the character set, +it is up to the resolver object to implement this.
If an error occurs, the parser calls the method +close_all for the top-level resolver; this method should +close itself (if not already done) and all clones.
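The one-byte-at-a-time rule described above can be illustrated with the standard library alone. The following is only a sketch under simplifying assumptions: make_lexbuf and unlock are hypothetical names (they are not part of Pxp_reader), the input is an in-memory string rather than a file or stream, and no recoding is performed. The unlock function stands in for the moment at which change_encoding is invoked.

```ocaml
(* Hypothetical helper, not part of Pxp_reader.  It builds a lexbuf with
   Lexing.from_function whose refill function hands out at most one byte
   per call until [unlock] has been called. *)
let make_lexbuf (data : string) : Lexing.lexbuf * (unit -> unit) =
  let pos = ref 0 in
  let single_byte = ref true in
  let refill (buf : bytes) (n : int) : int =
    let remaining = String.length data - !pos in
    (* Before the encoding is known: return 1, or 0 at EOF. *)
    let limit = if !single_byte then 1 else n in
    let len = min remaining (min limit n) in
    Bytes.blit_string data !pos buf 0 len;
    pos := !pos + len;
    len
  in
  let unlock () = single_byte := false in
  (Lexing.from_function refill, unlock)
```

A real resolver would additionally recode the bytes to the internal encoding once the character set is known; the sketch only shows the change in refill granularity.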
Exceptions. It is possible to chain resolvers such that when the first resolver is not able +to open the entity, the other resolvers of the chain are tried in turn. The +method open_in should raise the exception +Not_competent to indicate that the next resolver should try +to open the entity. If the resolver is able to handle the ID, but some other +error occurs, the exception Not_resolvable should be raised +to break the chain. +
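The chaining rule can be modelled in a few lines. This is a simplified sketch, not the Pxp_reader implementation: the two exceptions are local stand-ins that mirror the declarations quoted above, and the opener type, open_with_chain, and the three example openers are hypothetical names that replace real resolver objects with plain functions from an identifier string to the entity text.

```ocaml
(* Local stand-ins mirroring the exceptions quoted from Pxp_reader. *)
exception Not_competent
exception Not_resolvable of exn

(* Simplified model of a resolver: identifier in, entity text out. *)
type opener = string -> string

(* Try each opener in turn.  Only Not_competent moves on to the next
   opener; any other exception, including Not_resolvable, propagates
   and thereby breaks the chain. *)
let rec open_with_chain (chain : opener list) (id : string) : string =
  match chain with
  | [] -> raise Not_competent
  | first :: rest ->
    (try first id with Not_competent -> open_with_chain rest id)

(* Accepts exactly one identifier. *)
let only_mem : opener = fun id ->
  if id = "mem:greeting" then "<hello/>" else raise Not_competent

(* Feels responsible for "file:" identifiers but always fails on them. *)
let broken_file : opener = fun id ->
  if String.length id >= 5 && String.sub id 0 5 = "file:" then
    raise (Not_resolvable (Sys_error "cannot open"))
  else raise Not_competent

(* Accepts everything. *)
let fallback : opener = fun _ -> "<fallback/>"
```

With the chain [only_mem; fallback], an unknown identifier falls through to fallback, whereas with [broken_file; fallback] a "file:" identifier never reaches fallback because Not_resolvable stops the search.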
Example: How to define a resolver that is equivalent to +from_string: ...
There are some classes in Pxp_reader that define common resolver behaviour. + +
class resolve_read_this_channel : + ?id:ext_id -> + ?fixenc:encoding -> + ?auto_close:bool -> + in_channel -> + resolver+ +Reads from the passed channel (it may even be a pipe). If the +~id argument is passed to the object, the created resolver +accepts only this ID. Otherwise all IDs are accepted. - Once the resolver has +been cloned, it does not accept any ID. This means that this resolver cannot +handle inner references to external entities. Note that you can combine this +resolver with another resolver that can handle inner references (such as +resolve_as_file); see class 'combine' below. - If you pass the +~fixenc argument, the encoding of the channel is set to the +passed value, regardless of any auto-recognition or any XML declaration. - If +~auto_close = true (which is the default), the channel is +closed after use. If ~auto_close = false, the channel is +left open. +
class resolve_read_any_channel : + ?auto_close:bool -> + channel_of_id:(ext_id -> (in_channel * encoding option)) -> + resolver+ +This resolver calls the function ~channel_of_id to open a +new channel for the passed ext_id. This function must either +return the channel and the encoding, or it must fail with Not_competent. The +function must return None as encoding if the default +mechanism to recognize the encoding should be used. It must return +Some e if it is already known that the encoding of the +channel is e. If ~auto_close = true +(which is the default), the channel is closed after use. If +~auto_close = false, the channel is left open.
class resolve_read_url_channel : + ?base_url:Neturl.url -> + ?auto_close:bool -> + url_of_id:(ext_id -> Neturl.url) -> + channel_of_url:(Neturl.url -> (in_channel * encoding option)) -> + resolver+ +When this resolver gets an ID to read from, it calls the function +~url_of_id to get the corresponding URL. This URL may be a +relative URL; however, a URL scheme must be used which contains a path. The +resolver converts the URL to an absolute URL if necessary. The second +function, ~channel_of_url, is fed with the absolute URL as +input. This function opens the resource to read from, and returns the channel +and the encoding of the resource.
Both functions, ~url_of_id and +~channel_of_url, can raise Not_competent to indicate that +the object is not able to read from the specified resource. However, there is a +difference: A Not_competent from ~url_of_id is left as it +is, but a Not_competent from ~channel_of_url is converted to +Not_resolvable. So only ~url_of_id decides which URLs are +accepted by the resolver and which are not.
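The different treatment of the two callbacks can be pictured as follows. This is a sketch of the rule only, not the code of resolve_read_url_channel: the exceptions are local stand-ins, open_via_url is a hypothetical name, URLs are plain values of any type instead of Neturl.url, and the payload wrapped in Not_resolvable is an arbitrary choice because the text above does not specify it.

```ocaml
(* Local stand-ins mirroring the exceptions quoted from Pxp_reader. *)
exception Not_competent
exception Not_resolvable of exn

(* Hypothetical helper modelling the rule for the two callbacks. *)
let open_via_url ~url_of_id ~channel_of_url id =
  (* A Not_competent raised by url_of_id propagates unchanged, so a
     surrounding chain may still try another resolver. *)
  let url = url_of_id id in
  (* A Not_competent raised by channel_of_url is converted, so the ID
     counts as accepted but unreadable.  The payload is an assumption. *)
  try channel_of_url url
  with Not_competent -> raise (Not_resolvable Not_competent)
```

The consequence is the one stated above: rejecting an ID is the job of ~url_of_id alone, and a failure in ~channel_of_url ends the search instead of passing the ID on.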
The function ~channel_of_url must return +None as encoding if the default mechanism to recognize the +encoding should be used. It must return Some e if it is +already known that the encoding of the channel is e.
If ~auto_close = true (which is the default), the channel is +closed after use. If ~auto_close = false, the channel is +left open.
Objects of this class contain a base URL relative to which relative URLs are +interpreted. When creating a new object, you can specify the base URL by +passing it as ~base_url argument. When an existing object is +cloned, the base URL of the clone is the URL of the original object. - Note +that the term "base URL" has a strict definition in RFC 1808.
class resolve_read_this_string : + ?id:ext_id -> + ?fixenc:encoding -> + string -> + resolver+ +Reads from the passed string. If the ~id argument is passed +to the object, the created resolver accepts only this ID. Otherwise all IDs are +accepted. - Once the resolver has been cloned, it does not accept any ID. This +means that this resolver cannot handle inner references to external +entities. Note that you can combine this resolver with another resolver that +can handle inner references (such as resolve_as_file); see class 'combine' +below. - If you pass the ~fixenc argument, the encoding of +the string is set to the passed value, regardless of any auto-recognition or +any XML declaration.
class resolve_read_any_string : + string_of_id:(ext_id -> (string * encoding option)) -> + resolver+ +This resolver calls the function ~string_of_id to get the +string for the passed ext_id. This function must either +return the string and the encoding, or it must fail with Not_competent. The +function must return None as encoding if the default +mechanism to recognize the encoding should be used. It must return +Some e if it is already known that the encoding of the +string is e.
class resolve_as_file : + ?file_prefix:[ `Not_recognized | `Allowed | `Required ] -> + ?host_prefix:[ `Not_recognized | `Allowed | `Required ] -> + ?system_encoding:encoding -> + ?url_of_id:(ext_id -> Neturl.url) -> + ?channel_of_url: (Neturl.url -> (in_channel * encoding option)) -> + unit -> + resolver+Reads from the local file system. Every file name is interpreted as +file name of the local file system, and the referred file is read.
The full form of a file URL is: file://host/path, where +'host' specifies the host system where the file identified by 'path' +resides. host = "" or host = "localhost" are accepted; other values +will raise Not_competent. The standard for file URLs is +defined in RFC 1738.
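The host rule for the full form can be sketched with plain string functions. The function path_of_file_url is a hypothetical name and Not_competent is a local stand-in; the sketch handles only the full form file://host/path, ignores the ~file_prefix and ~host_prefix options, and does not decode percent escapes, so it is not a substitute for resolve_as_file.

```ocaml
(* Local stand-in mirroring the exception quoted from Pxp_reader. *)
exception Not_competent

(* Hypothetical helper: extract the path from a full-form file URL,
   accepting only an empty host or "localhost". *)
let path_of_file_url (url : string) : string =
  let prefix = "file://" in
  let plen = String.length prefix in
  if String.length url < plen || String.sub url 0 plen <> prefix then
    raise Not_competent;
  let rest = String.sub url plen (String.length url - plen) in
  (* The host is everything up to the first slash after "file://". *)
  match String.index_opt rest '/' with
  | None -> raise Not_competent
  | Some i ->
    let host = String.sub rest 0 i in
    if host = "" || host = "localhost" then
      String.sub rest i (String.length rest - i)
    else
      raise Not_competent
```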
Option ~file_prefix: Specifies how the "file:" prefix of +file names is handled: +
`Not_recognized: The prefix is not +recognized.
`Allowed: The prefix is allowed but +not required (the default).
`Required: The prefix is +required.
Option ~host_prefix: Specifies how the "//host" phrase of +file names is handled: +
`Not_recognized: The prefix is not +recognized.
`Allowed: The prefix is allowed but +not required (the default).
`Required: The prefix is +required.
Option ~system_encoding: Specifies the encoding of file +names of the local file system. Default: UTF-8.
Options ~url_of_id, ~channel_of_url: Not +for the casual user!
class combine : + ?prefer:resolver -> + resolver list -> + resolver+ +Combines several resolver objects. If a concrete entity with an +ext_id is to be opened, the combined resolver tries the +contained resolvers in turn until a resolver accepts opening the entity +(i.e. it does not raise Not_competent on open_in).
Clones: If the 'clone' method is invoked before 'open_in', all contained +resolvers are cloned separately and again combined. If the 'clone' method is +invoked after 'open_in' (i.e. while the resolver is open), additionally the +clone of the active resolver is flagged as being preferred, i.e. it is tried +first.