Initial revision

[helm.git] / helm / DEVEL / pxp / pxp / doc / manual / html / x1818.html
diff --git a/helm/DEVEL/pxp/pxp/doc/manual/html/x1818.html b/helm/DEVEL/pxp/pxp/doc/manual/html/x1818.html

new file mode 100644 (file)

index 0000000..b289a36
--- /dev/null
+++ b/helm/DEVEL/pxp/pxp/doc/manual/html/x1818.html
@@ -0,0 +1,779 @@
+<HTML
+><HEAD
+><TITLE
+>Invoking the parser</TITLE
+><META
+NAME="GENERATOR"
+CONTENT="Modular DocBook HTML Stylesheet Version 1.46"><LINK
+REL="HOME"
+TITLE="The PXP user's guide"
+HREF="index.html"><LINK
+REL="UP"
+TITLE="Configuring and calling the parser"
+HREF="c1567.html"><LINK
+REL="PREVIOUS"
+TITLE="The DTD classes"
+HREF="x1812.html"><LINK
+REL="NEXT"
+TITLE="Updates"
+HREF="x1965.html"><LINK
+REL="STYLESHEET"
+TYPE="text/css"
+HREF="markup.css"></HEAD
+><BODY
+CLASS="SECT1"
+BGCOLOR="#FFFFFF"
+TEXT="#000000"
+LINK="#0000FF"
+VLINK="#840084"
+ALINK="#0000FF"
+><DIV
+CLASS="NAVHEADER"
+><TABLE
+WIDTH="100%"
+BORDER="0"
+CELLPADDING="0"
+CELLSPACING="0"
+><TR
+><TH
+COLSPAN="3"
+ALIGN="center"
+>The PXP user's guide</TH
+></TR
+><TR
+><TD
+WIDTH="10%"
+ALIGN="left"
+VALIGN="bottom"
+><A
+HREF="x1812.html"
+>Prev</A
+></TD
+><TD
+WIDTH="80%"
+ALIGN="center"
+VALIGN="bottom"
+>Chapter 4. Configuring and calling the parser</TD
+><TD
+WIDTH="10%"
+ALIGN="right"
+VALIGN="bottom"
+><A
+HREF="x1965.html"
+>Next</A
+></TD
+></TR
+></TABLE
+><HR
+ALIGN="LEFT"
+WIDTH="100%"></DIV
+><DIV
+CLASS="SECT1"
+><H1
+CLASS="SECT1"
+><A
+NAME="AEN1818"
+>4.4. Invoking the parser</A
+></H1
+><P
+>Here a description of Pxp_yacc.</P
+><DIV
+CLASS="SECT2"
+><H2
+CLASS="SECT2"
+><A
+NAME="AEN1821"
+>4.4.1. Defaults</A
+></H2
+><P
+>The following defaults are available:
+
+<PRE
+CLASS="PROGRAMLISTING"
+>val default_config : config
+val default_extension : ('a node extension) as 'a
+val default_spec : ('a node extension as 'a) spec</PRE
+></P
+></DIV
+><DIV
+CLASS="SECT2"
+><H2
+CLASS="SECT2"
+><A
+NAME="AEN1825"
+>4.4.2. Parsing functions</A
+></H2
+><P
+>In the following, the term "closed document" refers to
+an XML structure like
+
+<PRE
+CLASS="PROGRAMLISTING"
+>&lt;!DOCTYPE ... [ <TT
+CLASS="REPLACEABLE"
+><I
+>declarations</I
+></TT
+> ] &gt;
+&lt;<TT
+CLASS="REPLACEABLE"
+><I
+>root</I
+></TT
+>&gt;
+...
+&lt;/<TT
+CLASS="REPLACEABLE"
+><I
+>root</I
+></TT
+>&gt;</PRE
+>
+
+The term "fragment" refers to an XML structure like
+
+<PRE
+CLASS="PROGRAMLISTING"
+>&lt;<TT
+CLASS="REPLACEABLE"
+><I
+>root</I
+></TT
+>&gt;
+...
+&lt;/<TT
+CLASS="REPLACEABLE"
+><I
+>root</I
+></TT
+>&gt;</PRE
+>
+
+i.e. only to one isolated element instance.</P
+><P
+><PRE
+CLASS="PROGRAMLISTING"
+>val parse_dtd_entity : config -&#62; source -&#62; dtd</PRE
+>
+
+Parses the declarations which are contained in the entity, and returns them as
+<TT
+CLASS="LITERAL"
+>dtd</TT
+> object.</P
+><P
+><PRE
+CLASS="PROGRAMLISTING"
+>val extract_dtd_from_document_entity : config -&#62; source -&#62; dtd</PRE
+>
+
+Extracts the DTD from a closed document. Both the internal and the external
+subsets are extracted and combined to one <TT
+CLASS="LITERAL"
+>dtd</TT
+> object. This
+function does not parse the whole document, but only the parts that are
+necessary to extract the DTD.</P
+><P
+><PRE
+CLASS="PROGRAMLISTING"
+>val parse_document_entity : 
+    ?transform_dtd:(dtd -&#62; dtd) -&#62;
+    ?id_index:('ext index) -&#62;
+    config -&#62; 
+    source -&#62; 
+    'ext spec -&#62; 
+        'ext document</PRE
+>
+
+Parses a closed document and validates it against the DTD that is contained in
+the document (internal and external subsets). The option
+<TT
+CLASS="LITERAL"
+>~transform_dtd</TT
+> can be used to transform the DTD in the
+document, and to use the transformed DTD for validation. If
+<TT
+CLASS="LITERAL"
+>~id_index</TT
+> is specified, an index of all ID attributes is
+created.</P
+><P
+><PRE
+CLASS="PROGRAMLISTING"
+>val parse_wfdocument_entity : 
+    config -&#62; 
+    source -&#62; 
+    'ext spec -&#62; 
+        'ext document</PRE
+>
+
+Parses a closed document, but checks it only on well-formedness.</P
+><P
+><PRE
+CLASS="PROGRAMLISTING"
+>val parse_content_entity  : 
+    ?id_index:('ext index) -&#62;
+    config -&#62;  
+    source -&#62; 
+    dtd -&#62; 
+    'ext spec -&#62; 
+        'ext node</PRE
+>
+
+Parses a fragment, and validates the element.</P
+><P
+><PRE
+CLASS="PROGRAMLISTING"
+>val parse_wfcontent_entity : 
+    config -&#62; 
+    source -&#62; 
+    'ext spec -&#62; 
+        'ext node</PRE
+>
+
+Parses a fragment, but checks it only on well-formedness.</P
+></DIV
+><DIV
+CLASS="SECT2"
+><H2
+CLASS="SECT2"
+><A
+NAME="AEN1851"
+>4.4.3. Configuration options</A
+></H2
+><P
+>&#13;<PRE
+CLASS="PROGRAMLISTING"
+>type config =
+    { warner : collect_warnings;
+      errors_with_line_numbers : bool;
+      enable_pinstr_nodes : bool;
+      enable_super_root_node : bool;
+      enable_comment_nodes : bool;
+      encoding : rep_encoding;
+      recognize_standalone_declaration : bool;
+      store_element_positions : bool;
+      idref_pass : bool;
+      validate_by_dfa : bool;
+      accept_only_deterministic_models : bool;
+      ...
+    }</PRE
+>
+
+<P
+></P
+><UL
+COMPACT="COMPACT"
+><LI
+STYLE="list-style-type: disc"
+><P
+><TT
+CLASS="LITERAL"
+>warner:</TT
+>The parser prints
+warnings by invoking the method <TT
+CLASS="LITERAL"
+>warn</TT
+> for this warner
+object. (Default: all warnings are dropped)</P
+></LI
+><LI
+STYLE="list-style-type: disc"
+><P
+><TT
+CLASS="LITERAL"
+>errors_with_line_numbers:</TT
+>If
+true, errors contain line numbers; if false, errors contain only byte
+positions. The latter mode is faster. (Default: true)</P
+></LI
+><LI
+STYLE="list-style-type: disc"
+><P
+><TT
+CLASS="LITERAL"
+>enable_pinstr_nodes:</TT
+>If true,
+the parser creates extra nodes for processing instructions. If false,
+processing instructions are simply added to the element or document surrounding
+the instructions. (Default: false)</P
+></LI
+><LI
+STYLE="list-style-type: disc"
+><P
+><TT
+CLASS="LITERAL"
+>enable_super_root_node:</TT
+>If
+true, the parser creates an extra node which is the parent of the root of the
+document tree. This node is called super root; it is an element with type
+<TT
+CLASS="LITERAL"
+>T_super_root</TT
+>. - If there are processing instructions outside
+the root element and outside the DTD, they are added to the super root instead
+of the document. - If false, the super root node is not created. (Default:
+false)</P
+></LI
+><LI
+STYLE="list-style-type: disc"
+><P
+><TT
+CLASS="LITERAL"
+>enable_comment_nodes:</TT
+>If true,
+the parser creates nodes for comments with type <TT
+CLASS="LITERAL"
+>T_comment</TT
+>;
+if false, such nodes are not created. (Default: false)</P
+></LI
+><LI
+STYLE="list-style-type: disc"
+><P
+><TT
+CLASS="LITERAL"
+>encoding:</TT
+>Specifies the
+internal encoding of the parser. Most strings are then represented according to
+this encoding; however there are some exceptions (especially
+<TT
+CLASS="LITERAL"
+>ext_id</TT
+> values which are always UTF-8 encoded).
+(Default: `Enc_iso88591)</P
+></LI
+><LI
+STYLE="list-style-type: disc"
+><P
+><TT
+CLASS="LITERAL"
+>recognize_standalone_declaration:</TT
+> If true and if the parser is
+validating, the <TT
+CLASS="LITERAL"
+>standalone="yes"</TT
+> declaration forces that it
+is checked whether the document is a standalone document. - If false, or if the
+parser is in well-formedness mode, such declarations are ignored.
+(Default: true)</P
+></LI
+><LI
+STYLE="list-style-type: disc"
+><P
+><TT
+CLASS="LITERAL"
+>store_element_positions:</TT
+> If
+true, for every non-data node the source position is stored. If false, the
+position information is lost. If available, you can get the positions of nodes
+by invoking the <TT
+CLASS="LITERAL"
+>position</TT
+> method.
+(Default: true)</P
+></LI
+><LI
+STYLE="list-style-type: disc"
+><P
+><TT
+CLASS="LITERAL"
+>idref_pass:</TT
+>If true and if
+there is an ID index, the parser checks whether every IDREF or IDREFS attribute
+refer to an existing node; this requires that the parser traverses the whole
+doument tree. If false, this check is left out. (Default: false)</P
+></LI
+><LI
+STYLE="list-style-type: disc"
+><P
+><TT
+CLASS="LITERAL"
+>validate_by_dfa:</TT
+>If true and if
+the content model for an element type is deterministic, a deterministic finite
+automaton is used to validate whether the element contents match the content
+model of the type. If false, or if a DFA is not available, a backtracking
+algorithm is used for validation. (Default: true)</P
+></LI
+><LI
+STYLE="list-style-type: disc"
+><P
+><TT
+CLASS="LITERAL"
+>accept_only_deterministic_models:</TT
+> If true, only deterministic content
+models are accepted; if false, any syntactically correct content models can be
+processed. (Default: true)</P
+></LI
+></UL
+></P
+></DIV
+><DIV
+CLASS="SECT2"
+><H2
+CLASS="SECT2"
+><A
+NAME="AEN1895"
+>4.4.4. Which configuration should I use?</A
+></H2
+><P
+>First, I recommend to vary the default configuration instead of
+creating a new configuration record. For instance, to set
+<TT
+CLASS="LITERAL"
+>idref_pass</TT
+> to <TT
+CLASS="LITERAL"
+>true</TT
+>, change the default
+as in:
+<PRE
+CLASS="PROGRAMLISTING"
+>let config = { default_config with idref_pass = true }</PRE
+>
+The background is that I can add more options to the record in future versions
+of the parser without breaking your programs.</P
+><DIV
+CLASS="FORMALPARA"
+><P
+><B
+>Do I need extra nodes for processing instructions? </B
+>By default, such nodes are not created. This does not mean that the
+processing instructions are lost; however, you cannot find out the exact
+location where they occur. For example, the following XML text
+
+<PRE
+CLASS="PROGRAMLISTING"
+>&#60;x&#62;&#60;?pi1?&#62;&#60;y/&#62;&#60;?pi2?&#62;&#60;/x&#62; </PRE
+> 
+
+will normally create one element node for <TT
+CLASS="LITERAL"
+>x</TT
+> containing
+<I
+CLASS="EMPHASIS"
+>one</I
+> subnode for <TT
+CLASS="LITERAL"
+>y</TT
+>. The processing
+instructions are attached to <TT
+CLASS="LITERAL"
+>x</TT
+> in a separate hash table; you
+can access them using <TT
+CLASS="LITERAL"
+>x # pinstr "pi1"</TT
+> and <TT
+CLASS="LITERAL"
+>x #
+pinstr "pi2"</TT
+>, respectively. The information is lost where the
+instructions occur within <TT
+CLASS="LITERAL"
+>x</TT
+>.</P
+></DIV
+><P
+>If the option <TT
+CLASS="LITERAL"
+>enable_pinstr_nodes</TT
+> is
+turned on, the parser creates extra nodes <TT
+CLASS="LITERAL"
+>pi1</TT
+> and
+<TT
+CLASS="LITERAL"
+>pi2</TT
+> such that the subnodes of <TT
+CLASS="LITERAL"
+>x</TT
+> are now: 
+
+<PRE
+CLASS="PROGRAMLISTING"
+>x # sub_nodes = [ pi1; y; pi2 ]</PRE
+>
+
+The extra nodes contain the processing instructions in the usual way, i.e. you
+can access them using <TT
+CLASS="LITERAL"
+>pi1 # pinstr "pi1"</TT
+> and <TT
+CLASS="LITERAL"
+>pi2 #
+pinstr "pi2"</TT
+>, respectively.</P
+><P
+>Note that you will need an exemplar for the PI nodes (see
+<TT
+CLASS="LITERAL"
+>make_spec_from_alist</TT
+>).</P
+><DIV
+CLASS="FORMALPARA"
+><P
+><B
+>Do I need a super root node? </B
+>By default, there is no super root node. The
+<TT
+CLASS="LITERAL"
+>document</TT
+> object refers directly to the node representing the
+root element of the document, i.e.
+
+<PRE
+CLASS="PROGRAMLISTING"
+>doc # root = r</PRE
+>
+
+if <TT
+CLASS="LITERAL"
+>r</TT
+> is the root node. This is sometimes inconvenient: (1)
+Some algorithms become simpler if every node has a parent, even the root
+node. (2) Some standards such as XPath call the "root node" the node whose
+child represents the root of the document. (3) The super root node can serve
+as a container for processing instructions outside the root element. Because of
+these reasons, it is possible to create an extra super root node, whose child
+is the root node:
+
+<PRE
+CLASS="PROGRAMLISTING"
+>doc # root = sr         &#38;&#38;
+sr # sub_nodes = [ r ]</PRE
+>
+
+When extra nodes are also created for processing instructions, these nodes can
+be added to the super root node if they occur outside the root element (reason
+(3)), and the order reflects the order in the source text.</P
+></DIV
+><P
+>Note that you will need an exemplar for the super root node
+(see <TT
+CLASS="LITERAL"
+>make_spec_from_alist</TT
+>).</P
+><DIV
+CLASS="FORMALPARA"
+><P
+><B
+>What is the effect of the UTF-8 encoding? </B
+>By default, the parser represents strings (with few
+exceptions) as ISO-8859-1 strings. These are well-known, and there are tools
+and fonts for this encoding.</P
+></DIV
+><P
+>However, internationalization may require that you switch over
+to UTF-8 encoding. In most environments, the immediate effect will be that you
+cannot read strings with character codes &#62;= 160 any longer; your terminal will
+only show funny glyph combinations. It is strongly recommended to install
+Unicode fonts (<A
+HREF="http://czyborra.com/unifont/"
+TARGET="_top"
+>GNU Unifont</A
+>, 
+<A
+HREF="http://www.cl.cam.ac.uk/~mgk25/download/ucs-fonts.tar.gz"
+TARGET="_top"
+>Markus Kuhn's fonts</A
+>) and <A
+HREF="http://myweb.clark.net/pub/dickey/xterm/xterm.html"
+TARGET="_top"
+>terminal emulators
+that can handle UTF-8 byte sequences</A
+>. Furthermore, a Unicode editor may
+be helpful (such as <A
+HREF="ftp://metalab.unc.edu/pub/Linux/apps/editors/X/"
+TARGET="_top"
+>Yudit</A
+>). There are
+also <A
+HREF="http://www.cl.cam.ac.uk/~mgk25/unicode.html"
+TARGET="_top"
+>FAQ</A
+> by
+Markus Kuhn.</P
+><P
+>By setting <TT
+CLASS="LITERAL"
+>encoding</TT
+> to
+<TT
+CLASS="LITERAL"
+>`Enc_utf8</TT
+> all strings originating from the parsed XML
+document are represented as UTF-8 strings. This includes not only character
+data and attribute values but also element names, attribute names and so on, as
+it is possible to use any Unicode letter to form such names.  Strictly
+speaking, PXP is only XML-compliant if the UTF-8 mode is used; otherwise it
+will have difficulties when validating documents containing
+non-ISO-8859-1-names.</P
+><P
+>This mode does not have any impact on the external
+representation of documents. The character set assumed when reading a document
+is set in the XML declaration, and character set when writing a document must
+be passed to the <TT
+CLASS="LITERAL"
+>write</TT
+> method.</P
+><DIV
+CLASS="FORMALPARA"
+><P
+><B
+>How do I check that nodes exist which are referred by IDREF attributes? </B
+>First, you must create an index of all occurring ID
+attributes:
+
+<PRE
+CLASS="PROGRAMLISTING"
+>let index = new hash_index</PRE
+>
+
+This index must be passed to the parsing function:
+
+<PRE
+CLASS="PROGRAMLISTING"
+>parse_document_entity
+  ~id_index:(index :&#62; index)
+  config source spec</PRE
+>
+
+Next, you must turn on the <TT
+CLASS="LITERAL"
+>idref_pass</TT
+> mode:
+
+<PRE
+CLASS="PROGRAMLISTING"
+>let config = { default_config with idref_pass = true }</PRE
+>
+
+Note that now the whole document tree will be traversed, and every node will be
+checked for IDREF and IDREFS attributes. If the tree is big, this may take some
+time.</P
+></DIV
+><DIV
+CLASS="FORMALPARA"
+><P
+><B
+>What are deterministic content models? </B
+>These type of models can speed up the validation checks;
+furthermore they ensure SGML-compatibility. In particular, a content model is
+deterministic if the parser can determine the actually used alternative by
+inspecting only the current token. For example, this element has
+non-deterministic contents:
+
+<PRE
+CLASS="PROGRAMLISTING"
+>&#60;!ELEMENT x ((u,v) | (u,y+) | v)&#62;</PRE
+>
+
+If the first element in <TT
+CLASS="LITERAL"
+>x</TT
+> is <TT
+CLASS="LITERAL"
+>u</TT
+>, the
+parser does not know which of the alternatives <TT
+CLASS="LITERAL"
+>(u,v)</TT
+> or
+<TT
+CLASS="LITERAL"
+>(u,y+)</TT
+> will work; the parser must also inspect the second
+element to be able to distinguish between the alternatives. Because such
+look-ahead (or "guessing") is required, this example is
+non-deterministic.</P
+></DIV
+><P
+>The XML standard demands that content models must be
+deterministic. So it is recommended to turn the option
+<TT
+CLASS="LITERAL"
+>accept_only_deterministic_models</TT
+> on; however, PXP can also
+process non-deterministic models using a backtracking algorithm.</P
+><P
+>Deterministic models ensure that validation can be performed in
+linear time. In order to get the maximum benefits, PXP also implements a
+special validator that profits from deterministic models; this is the
+deterministic finite automaton (DFA). This validator is enabled per element
+type if the element type has a deterministic model and if the option
+<TT
+CLASS="LITERAL"
+>validate_by_dfa</TT
+> is turned on.</P
+><P
+>In general, I expect that the DFA method is faster than the
+backtracking method; especially in the worst case the DFA takes only linear
+time. However, if the content model has only few alternatives and the
+alternatives do not nest, the backtracking algorithm may be better.</P
+></DIV
+></DIV
+><DIV
+CLASS="NAVFOOTER"
+><HR
+ALIGN="LEFT"
+WIDTH="100%"><TABLE
+WIDTH="100%"
+BORDER="0"
+CELLPADDING="0"
+CELLSPACING="0"
+><TR
+><TD
+WIDTH="33%"
+ALIGN="left"
+VALIGN="top"
+><A
+HREF="x1812.html"
+>Prev</A
+></TD
+><TD
+WIDTH="34%"
+ALIGN="center"
+VALIGN="top"
+><A
+HREF="index.html"
+>Home</A
+></TD
+><TD
+WIDTH="33%"
+ALIGN="right"
+VALIGN="top"
+><A
+HREF="x1965.html"
+>Next</A
+></TD
+></TR
+><TR
+><TD
+WIDTH="33%"
+ALIGN="left"
+VALIGN="top"
+>The DTD classes</TD
+><TD
+WIDTH="34%"
+ALIGN="center"
+VALIGN="top"
+><A
+HREF="c1567.html"
+>Up</A
+></TD
+><TD
+WIDTH="33%"
+ALIGN="right"
+VALIGN="top"
+>Updates</TD
+></TR
+></TABLE
+></DIV
+></BODY
+></HTML
+>
+\ No newline at end of file