The PXP user's guide
Chapter 4. Configuring and calling the parser

4.4. Invoking the parser

This section describes the module Pxp_yacc.

4.4.1. Defaults

The following defaults are available:

val default_config : config
val default_extension : ('a node extension) as 'a
val default_spec : ('a node extension as 'a) spec

4.4.2. Parsing functions

In the following, the term "closed document" refers to an XML structure like

<!DOCTYPE ... [ declarations ] >
<root>
...
</root>

The term "fragment" refers to an XML structure like

<root>
...
</root>

i.e. only to one isolated element instance.

val parse_dtd_entity : config -> source -> dtd

Parses the declarations contained in the entity and returns them as a dtd object.

val extract_dtd_from_document_entity : config -> source -> dtd

Extracts the DTD from a closed document. Both the internal and the external subsets are extracted and combined into one dtd object. This function does not parse the whole document, but only the parts that are necessary to extract the DTD.

val parse_document_entity :
    ?transform_dtd:(dtd -> dtd) ->
    ?id_index:('ext index) ->
    config ->
    source ->
    'ext spec ->
        'ext document

Parses a closed document and validates it against the DTD that is contained in the document (internal and external subsets). The option ~transform_dtd can be used to transform the DTD in the document, and to use the transformed DTD for validation. If ~id_index is specified, an index of all ID attributes is created.

val parse_wfdocument_entity :
    config ->
    source ->
    'ext spec ->
        'ext document

Parses a closed document, but checks it only for well-formedness.

val parse_content_entity :
    ?id_index:('ext index) ->
    config ->
    source ->
    dtd ->
    'ext spec ->
        'ext node

Parses a fragment and validates the element against the DTD passed as argument.

val parse_wfcontent_entity :
    config ->
    source ->
    'ext spec ->
        'ext node

Parses a fragment, but checks it only for well-formedness.
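As a hedged illustration of how these functions are called, the following sketch parses a fragment in well-formedness mode and reads the name of the resulting element. It assumes that the PXP library is installed, that Pxp_yacc provides from_string for turning a string into a source (a function not described in this section), and that element nodes report the type Pxp_document.T_element; check these assumptions against your PXP version.

```ocaml
(* Sketch only: requires the PXP library. from_string and
   Pxp_document.T_element are assumptions, see the text above. *)
open Pxp_yacc

let root_name : string =
  (* A fragment: one isolated element, no DOCTYPE *)
  let source = from_string "<root><item>text</item></root>" in
  (* Well-formedness mode: no DTD is needed *)
  let root = parse_wfcontent_entity default_config source default_spec in
  match root # node_type with
  | Pxp_document.T_element name -> name
  | _ -> "(not an element)"

let () = print_endline root_name
```

The validating functions are called in the same way; parse_content_entity additionally takes the dtd object before the spec.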

4.4.3. Configuration options

type config =
    { warner : collect_warnings;
      errors_with_line_numbers : bool;
      enable_pinstr_nodes : bool;
      enable_super_root_node : bool;
      enable_comment_nodes : bool;
      encoding : rep_encoding;
      recognize_standalone_declaration : bool;
      store_element_positions : bool;
      idref_pass : bool;
      validate_by_dfa : bool;
      accept_only_deterministic_models : bool;
      ...
    }

warner: The parser prints warnings by invoking the method warn of this warner object. (Default: all warnings are dropped)
errors_with_line_numbers: If true, errors contain line numbers; if false, errors contain only byte positions. The latter mode is faster. (Default: true)
enable_pinstr_nodes: If true, the parser creates extra nodes for processing instructions. If false, processing instructions are simply added to the element or document surrounding the instructions. (Default: false)
enable_super_root_node: If true, the parser creates an extra node which is the parent of the root of the document tree. This node is called the super root; it is an element with type T_super_root. If there are processing instructions outside the root element and outside the DTD, they are added to the super root instead of the document. If false, the super root node is not created. (Default: false)
enable_comment_nodes: If true, the parser creates nodes for comments with type T_comment; if false, such nodes are not created. (Default: false)
encoding: Specifies the internal encoding of the parser. Most strings are then represented according to this encoding; however, there are some exceptions (especially ext_id values, which are always UTF-8 encoded). (Default: `Enc_iso88591)
recognize_standalone_declaration: If true and if the parser is validating, a standalone="yes" declaration makes the parser check whether the document really is a standalone document. If false, or if the parser is in well-formedness mode, such declarations are ignored. (Default: true)
store_element_positions: If true, the source position is stored for every non-data node. If false, the position information is lost. If available, you can get the positions of nodes by invoking the position method. (Default: true)
idref_pass: If true and if there is an ID index, the parser checks whether every IDREF or IDREFS attribute refers to an existing node; this requires that the parser traverses the whole document tree. If false, this check is left out. (Default: false)
validate_by_dfa: If true and if the content model for an element type is deterministic, a deterministic finite automaton is used to validate whether the element contents match the content model of the type. If false, or if a DFA is not available, a backtracking algorithm is used for validation. (Default: true)
accept_only_deterministic_models: If true, only deterministic content models are accepted; if false, any syntactically correct content model can be processed. (Default: true)

4.4.4. Which configuration should I use?

First, I recommend varying the default configuration instead of creating a new configuration record. For instance, to set idref_pass to true, change the default as in:

let config = { default_config with idref_pass = true }

The reason is that I can add more options to the record in future versions of the parser without breaking your programs.

Do I need extra nodes for processing instructions? By default, such nodes are not created. This does not mean that the processing instructions are lost; however, you cannot find out the exact location where they occur. For example, the following XML text

<x><?pi1?><y/><?pi2?></x>

will normally create one element node for x containing one subnode for y. The processing instructions are attached to x in a separate hash table; you can access them using x # pinstr "pi1" and x # pinstr "pi2", respectively. The information about where the instructions occur within x is lost.

If the option enable_pinstr_nodes is turned on, the parser creates extra nodes pi1 and pi2 such that the subnodes of x are now:

x # sub_nodes = [ pi1; y; pi2 ]

The extra nodes contain the processing instructions in the usual way, i.e. you can access them using pi1 # pinstr "pi1" and pi2 # pinstr "pi2", respectively.

Note that you will need an exemplar for the PI nodes (see make_spec_from_alist).

Do I need a super root node? By default, there is no super root node. The document object refers directly to the node representing the root element of the document, i.e.

doc # root = r

if r is the root node. This is sometimes inconvenient: (1) Some algorithms become simpler if every node has a parent, even the root node. (2) Some standards such as XPath call the "root node" the node whose child represents the root of the document. (3) The super root node can serve as a container for processing instructions outside the root element. For these reasons, it is possible to create an extra super root node, whose child is the root node:

doc # root = sr         &&
sr # sub_nodes = [ r ]

When extra nodes are also created for processing instructions, these nodes can be added to the super root node if they occur outside the root element (reason (3)), and the order reflects the order in the source text.

Note that you will need an exemplar for the super root node (see make_spec_from_alist).

What is the effect of the UTF-8 encoding? By default, the parser represents strings (with few exceptions) as ISO-8859-1 strings. These are well-known, and there are tools and fonts for this encoding.

However, internationalization may require that you switch over to UTF-8 encoding. In most environments, the immediate effect will be that you cannot read strings with character codes >= 160 any longer; your terminal will only show funny glyph combinations. It is strongly recommended to install Unicode fonts (GNU Unifont, Markus Kuhn's fonts) and terminal emulators that can handle UTF-8 byte sequences. Furthermore, a Unicode editor may be helpful (such as Yudit). There is also a FAQ by Markus Kuhn.

By setting encoding to `Enc_utf8, all strings originating from the parsed XML document are represented as UTF-8 strings. This includes not only character data and attribute values but also element names, attribute names and so on, as it is possible to use any Unicode letter to form such names. Strictly speaking, PXP is only XML-compliant if the UTF-8 mode is used; otherwise it will have difficulties when validating documents containing non-ISO-8859-1 names.

This mode does not have any impact on the external representation of documents. The character set assumed when reading a document is set in the XML declaration, and the character set used when writing a document must be passed to the write method.
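To see why a terminal set up for ISO-8859-1 shows "funny glyph combinations", it helps to look at the bytes. The following sketch uses only the OCaml standard library (OCaml 4.06 or later for Buffer.add_utf_8_uchar), not PXP; utf8_of_latin1 is a hypothetical helper written for this illustration. It re-encodes an ISO-8859-1 string as UTF-8 and shows that a character above the ASCII range turns into a two-byte sequence.

```ocaml
(* Hypothetical helper, not part of PXP: re-encode ISO-8859-1 as UTF-8.
   Every ISO-8859-1 byte denotes the Unicode code point of the same number. *)
let utf8_of_latin1 (s : string) : string =
  let buf = Buffer.create (2 * String.length s) in
  String.iter
    (fun c -> Buffer.add_utf_8_uchar buf (Uchar.of_int (Char.code c)))
    s;
  Buffer.contents buf

let () =
  (* "caf" followed by e-acute, which is the single byte 0xE9 in ISO-8859-1 *)
  let latin1 = "caf\xE9" in
  let utf8 = utf8_of_latin1 latin1 in
  Printf.printf "%d bytes in ISO-8859-1, %d bytes in UTF-8\n"
    (String.length latin1) (String.length utf8)
```

A terminal that interprets the two UTF-8 bytes 0xC3 0xA9 as ISO-8859-1 displays two unrelated characters instead of one.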

How do I check that the nodes referred to by IDREF attributes exist? First, you must create an index of all occurring ID attributes:

let index = new hash_index

This index must be passed to the parsing function:

parse_document_entity
  ~id_index:(index :> index)
  config source spec

In addition, the config passed to the parsing function must have the idref_pass mode turned on:

let config = { default_config with idref_pass = true }

Note that now the whole document tree will be traversed, and every node will be checked for IDREF and IDREFS attributes. If the tree is big, this may take some time.
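The reason for the extra traversal can be shown with a toy model. The sketch below does not use PXP; the node record, build_index, dangling and check are hypothetical names chosen for this illustration. It first collects all IDs in a hash table (the role of the ID index) and then walks the tree a second time to find references without a target. The second walk cannot be merged into the first because a reference may point to a node that occurs later in the document.

```ocaml
(* Hypothetical miniature tree; PXP's real node objects are much richer. *)
type node = {
  id : string option;     (* value of an ID attribute, if any *)
  idrefs : string list;   (* values of IDREF and IDREFS attributes *)
  children : node list;
}

(* Pass 1: record every ID, as the ID index does. *)
let rec build_index (tbl : (string, node) Hashtbl.t) (n : node) : unit =
  (match n.id with
   | Some i -> Hashtbl.replace tbl i n
   | None -> ());
  List.iter (build_index tbl) n.children

(* Pass 2: visit every node again and collect references without a target. *)
let rec dangling (tbl : (string, node) Hashtbl.t) (n : node) : string list =
  List.filter (fun r -> not (Hashtbl.mem tbl r)) n.idrefs
  @ List.concat (List.map (dangling tbl) n.children)

let check (root : node) : string list =
  let tbl = Hashtbl.create 16 in
  build_index tbl root;
  dangling tbl root

let () =
  let tree =
    { id = None; idrefs = [ "late"; "missing" ]; children =
        [ { id = Some "late"; idrefs = []; children = [] } ] }
  in
  List.iter (Printf.printf "dangling reference: %s\n") (check tree)
```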

What are deterministic content models? Models of this type can speed up the validation checks; furthermore they ensure SGML-compatibility. Specifically, a content model is deterministic if the parser can determine the alternative actually used by inspecting only the current token. For example, this element has non-deterministic contents:

<!ELEMENT x ((u,v) | (u,y+) | v)>

If the first element in x is u, the parser does not know which of the alternatives (u,v) or (u,y+) will work; the parser must also inspect the second element to be able to distinguish between the alternatives. Because such look-ahead (or "guessing") is required, this example is non-deterministic.
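The look-ahead argument can be checked mechanically. The sketch below shows one textbook way to do so: give every occurrence of an element name its own position, compute which positions can come first and which can follow each other, and report a conflict when two different positions with the same name compete. It is not taken from PXP; the model type and is_deterministic are hypothetical names, and only the standard library is used.

```ocaml
(* Hypothetical miniature of a content model; not PXP's own type. *)
type model =
  | Elem of string      (* a child element name *)
  | Seq of model list   (* (a, b, ...) *)
  | Alt of model list   (* (a | b | ...) *)
  | Opt of model        (* m? *)
  | Star of model       (* m* *)
  | Plus of model       (* m+ *)

(* A position is one occurrence of an element name in the model. *)
type pos = int * string

let is_deterministic (m : model) : bool =
  let counter = ref 0 in
  let follow : (int, pos) Hashtbl.t = Hashtbl.create 16 in
  (* Record that each position in firsts may follow each one in lasts. *)
  let link lasts firsts =
    List.iter
      (fun (i, _) -> List.iter (fun q -> Hashtbl.add follow i q) firsts)
      lasts
  in
  (* Returns (nullable, first positions, last positions) of a submodel. *)
  let rec go = function
    | Elem name ->
        incr counter;
        let p = (!counter, name) in
        (false, [ p ], [ p ])
    | Seq ms ->
        List.fold_left
          (fun (n1, f1, l1) sub ->
            let n2, f2, l2 = go sub in
            link l1 f2;
            (n1 && n2,
             (if n1 then f1 @ f2 else f1),
             (if n2 then l1 @ l2 else l2)))
          (true, [], []) ms
    | Alt ms ->
        List.fold_left
          (fun (n1, f1, l1) sub ->
            let n2, f2, l2 = go sub in
            (n1 || n2, f1 @ f2, l1 @ l2))
          (false, [], []) ms
    | Opt sub ->
        let _, f, l = go sub in
        (true, f, l)
    | Star sub ->
        let _, f, l = go sub in
        link l f;
        (true, f, l)
    | Plus sub ->
        let n, f, l = go sub in
        link l f;
        (n, f, l)
  in
  (* Two different positions carrying the same name cannot be told apart
     by looking at the current token only. *)
  let conflict ps =
    List.exists
      (fun (i, a) -> List.exists (fun (j, b) -> a = b && i <> j) ps)
      ps
  in
  let _, first, _ = go m in
  let keys = List.init !counter (fun i -> i + 1) in
  not (conflict first)
  && List.for_all (fun i -> not (conflict (Hashtbl.find_all follow i))) keys

(* <!ELEMENT x ((u,v) | (u,y+) | v)> from the text *)
let ambiguous =
  Alt [ Seq [ Elem "u"; Elem "v" ]; Seq [ Elem "u"; Plus (Elem "y") ]; Elem "v" ]

(* The same language with the common prefix u factored out *)
let factored =
  Alt [ Seq [ Elem "u"; Alt [ Elem "v"; Plus (Elem "y") ] ]; Elem "v" ]

let () =
  Printf.printf "ambiguous: %b, factored: %b\n"
    (is_deterministic ambiguous) (is_deterministic factored)
```

In the model from the text, both occurrences of u can come first, which is exactly the conflict described above. Rewriting the model as ((u,(v|y+)) | v) accepts the same contents and passes the check.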

The XML standard demands that content models be deterministic. It is therefore recommended to turn the option accept_only_deterministic_models on; however, PXP can also process non-deterministic models using a backtracking algorithm.
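For comparison, a backtracking validator can be sketched in a few lines. This is a generic illustration and makes no claim about how PXP implements its algorithm; the model type, matches and valid are hypothetical names. The matcher tries the alternatives in order and falls back to the next one whenever the rest of the input does not fit.

```ocaml
(* Hypothetical miniature of a content model; not PXP's own type. *)
type model =
  | Elem of string
  | Seq of model list
  | Alt of model list
  | Opt of model
  | Star of model
  | Plus of model

(* matches m input k consumes a prefix of input with m and then calls the
   continuation k on the rest. When k returns false, the matcher falls
   back to the next untried alternative. *)
let rec matches (m : model) (input : string list)
    (k : string list -> bool) : bool =
  match m with
  | Elem name ->
      (match input with
       | x :: rest when x = name -> k rest
       | _ -> false)
  | Seq [] -> k input
  | Seq (first :: others) ->
      matches first input (fun rest -> matches (Seq others) rest k)
  | Alt ms -> List.exists (fun sub -> matches sub input k) ms
  | Opt sub -> matches sub input k || k input
  | Plus sub -> matches sub input (fun rest -> matches (Star sub) rest k)
  | Star sub ->
      k input
      || matches sub input (fun rest ->
             (* insist on progress so that nullable submodels terminate *)
             List.length rest < List.length input
             && matches (Star sub) rest k)

(* The children are valid if the model consumes all of them. *)
let valid (m : model) (children : string list) : bool =
  matches m children (fun rest -> rest = [])

(* <!ELEMENT x ((u,v) | (u,y+) | v)> *)
let x_model =
  Alt [ Seq [ Elem "u"; Elem "v" ]; Seq [ Elem "u"; Plus (Elem "y") ]; Elem "v" ]

let () =
  (* The first alternative (u,v) fails at y; the matcher backtracks and
     succeeds with (u,y+). *)
  Printf.printf "%b\n" (valid x_model [ "u"; "y"; "y" ])
```

Because an alternative may be abandoned after part of the input has already been examined, the same children can be inspected more than once; this is the cost that deterministic models avoid.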

Deterministic models ensure that validation can be performed in linear time. In order to get the maximum benefit, PXP also implements a special validator that profits from deterministic models; this is the deterministic finite automaton (DFA). This validator is enabled per element type if the element type has a deterministic model and if the option validate_by_dfa is turned on.

In general, I expect that the DFA method is faster than the backtracking method; especially in the worst case the DFA takes only linear time. However, if the content model has only a few alternatives and the alternatives do not nest, the backtracking algorithm may be better.
