4 >Invoking the parser</TITLE
7 CONTENT="Modular DocBook HTML Stylesheet Version 1.46"><LINK
9 TITLE="The PXP user's guide"
10 HREF="index.html"><LINK
12 TITLE="Configuring and calling the parser"
13 HREF="c1567.html"><LINK
15 TITLE="The DTD classes"
16 HREF="x1812.html"><LINK
19 HREF="x1965.html"><LINK
22 HREF="markup.css"></HEAD
41 >The PXP user's guide</TH
56 >Chapter 4. Configuring and calling the parser</TD
76 >4.4. Invoking the parser</A
>This section describes Pxp_yacc.</P
89 >The following defaults are available:
92 CLASS="PROGRAMLISTING"
>val default_config : config
val default_extension : ('a node extension) as 'a
val default_spec : ('a node extension as 'a) spec</PRE
104 >4.4.2. Parsing functions</A
107 >In the following, the term "closed document" refers to
108 an XML structure like
111 CLASS="PROGRAMLISTING"
112 ><!DOCTYPE ... [ <TT
133 The term "fragment" refers to an XML structure like
136 CLASS="PROGRAMLISTING"
152 i.e. only to one isolated element instance.</P
155 CLASS="PROGRAMLISTING"
>val parse_dtd_entity : config -> source -> dtd</PRE
159 Parses the declarations which are contained in the entity, and returns them as
166 CLASS="PROGRAMLISTING"
>val extract_dtd_from_document_entity : config -> source -> dtd</PRE
170 Extracts the DTD from a closed document. Both the internal and the external
subsets are extracted and combined into one <TT
175 function does not parse the whole document, but only the parts that are
176 necessary to extract the DTD.</P
179 CLASS="PROGRAMLISTING"
>val parse_document_entity :
  ?transform_dtd:(dtd -> dtd) ->
  ?id_index:('ext index) ->
189 Parses a closed document and validates it against the DTD that is contained in
190 the document (internal and external subsets). The option
194 > can be used to transform the DTD in the
195 document, and to use the transformed DTD for validation. If
199 > is specified, an index of all ID attributes is
203 CLASS="PROGRAMLISTING"
204 >val parse_wfdocument_entity :
Parses a closed document, but checks it only for well-formedness.</P
214 CLASS="PROGRAMLISTING"
>val parse_content_entity :
  ?id_index:('ext index) ->
224 Parses a fragment, and validates the element.</P
227 CLASS="PROGRAMLISTING"
228 >val parse_wfcontent_entity :
Parses a fragment, but checks it only for well-formedness.</P
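As a minimal sketch of how these functions are called, the following program parses a small closed document in well-formedness mode and counts the child nodes of the root element. It assumes that the source is built with from_string, a Pxp_yacc function that is not described in this section, and that parse_wfdocument_entity takes its arguments in the same order (configuration, source, spec) as parse_document_entity. Both points are assumptions to check against your PXP version.

```ocaml
(* Sketch only: requires the PXP library; from_string is assumed to be
   exported by Pxp_yacc and is not documented in this section. *)
open Pxp_yacc

let () =
  (* A closed document: DOCTYPE with internal subset, then the root element. *)
  let text =
    "<!DOCTYPE x [ <!ELEMENT x (y*)> <!ELEMENT y EMPTY> ]><x><y/><y/></x>" in
  let source = from_string text in
  (* Well-formedness mode: the document is parsed but not validated. *)
  let doc = parse_wfdocument_entity default_config source default_spec in
  let children = doc # root # sub_nodes in
  Printf.printf "root has %d child nodes\n" (List.length children)
```

Replacing parse_wfdocument_entity by parse_document_entity in this sketch would additionally validate the document against the DTD it contains.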
243 >4.4.3. Configuration options</A
247 CLASS="PROGRAMLISTING"
  { warner : collect_warnings;
    errors_with_line_numbers : bool;
    enable_pinstr_nodes : bool;
    enable_super_root_node : bool;
    enable_comment_nodes : bool;
    encoding : rep_encoding;
    recognize_standalone_declaration : bool;
    store_element_positions : bool;
    idref_pass : bool;
    validate_by_dfa : bool;
    accept_only_deterministic_models : bool;
269 STYLE="list-style-type: disc"
275 warnings by invoking the method <TT
279 object. (Default: all warnings are dropped)</P
282 STYLE="list-style-type: disc"
286 >errors_with_line_numbers:</TT
288 true, errors contain line numbers; if false, errors contain only byte
289 positions. The latter mode is faster. (Default: true)</P
292 STYLE="list-style-type: disc"
296 >enable_pinstr_nodes:</TT
298 the parser creates extra nodes for processing instructions. If false,
299 processing instructions are simply added to the element or document surrounding
300 the instructions. (Default: false)</P
303 STYLE="list-style-type: disc"
307 >enable_super_root_node:</TT
309 true, the parser creates an extra node which is the parent of the root of the
310 document tree. This node is called super root; it is an element with type
314 >. - If there are processing instructions outside
315 the root element and outside the DTD, they are added to the super root instead
316 of the document. - If false, the super root node is not created. (Default:
320 STYLE="list-style-type: disc"
324 >enable_comment_nodes:</TT
326 the parser creates nodes for comments with type <TT
330 if false, such nodes are not created. (Default: false)</P
333 STYLE="list-style-type: disc"
339 internal encoding of the parser. Most strings are then represented according to
340 this encoding; however there are some exceptions (especially
344 > values which are always UTF-8 encoded).
345 (Default: `Enc_iso88591)</P
348 STYLE="list-style-type: disc"
352 >recognize_standalone_declaration:</TT
353 > If true and if the parser is
356 >standalone="yes"</TT
> declaration makes the parser check
whether the document really is a standalone document. - If false, or if the
parser is in well-formedness mode, such declarations are ignored.
363 STYLE="list-style-type: disc"
367 >store_element_positions:</TT
369 true, for every non-data node the source position is stored. If false, the
370 position information is lost. If available, you can get the positions of nodes
378 STYLE="list-style-type: disc"
there is an ID index, the parser checks whether every IDREF or IDREFS attribute
refers to an existing node; this requires that the parser traverses the whole
document tree. If false, this check is left out. (Default: false)</P
389 STYLE="list-style-type: disc"
393 >validate_by_dfa:</TT
395 the content model for an element type is deterministic, a deterministic finite
396 automaton is used to validate whether the element contents match the content
397 model of the type. If false, or if a DFA is not available, a backtracking
398 algorithm is used for validation. (Default: true)</P
401 STYLE="list-style-type: disc"
405 >accept_only_deterministic_models:</TT
406 > If true, only deterministic content
407 models are accepted; if false, any syntactically correct content models can be
408 processed. (Default: true)</P
419 >4.4.4. Which configuration should I use?</A
>First, I recommend varying the default configuration instead of
423 creating a new configuration record. For instance, to set
430 >, change the default
433 CLASS="PROGRAMLISTING"
>let config = { default_config with idref_pass = true }</PRE
The reason is that I can then add more options to the record in future versions
of the parser without breaking your programs.</P
442 >Do I need extra nodes for processing instructions? </B
443 >By default, such nodes are not created. This does not mean that the
444 processing instructions are lost; however, you cannot find out the exact
445 location where they occur. For example, the following XML text
448 CLASS="PROGRAMLISTING"
><x><?pi1?><y/><?pi2?></x> </PRE
452 will normally create one element node for <TT
463 instructions are attached to <TT
466 > in a separate hash table; you
467 can access them using <TT
469 >x # pinstr "pi1"</TT
>, respectively. What is lost is the position at which the
instructions occur within <TT
483 >enable_pinstr_nodes</TT
485 turned on, the parser creates extra nodes <TT
492 > such that the subnodes of <TT
498 CLASS="PROGRAMLISTING"
>x # sub_nodes = [ pi1; y; pi2 ]</PRE
502 The extra nodes contain the processing instructions in the usual way, i.e. you
503 can access them using <TT
505 >pi1 # pinstr "pi1"</TT
512 >Note that you will need an exemplar for the PI nodes (see
515 >make_spec_from_alist</TT
521 >Do I need a super root node? </B
522 >By default, there is no super root node. The
526 > object refers directly to the node representing the
527 root element of the document, i.e.
530 CLASS="PROGRAMLISTING"
> is the root node. This is sometimes inconvenient: (1)
Some algorithms become simpler if every node has a parent, even the root
node. (2) Some standards such as XPath use the term "root node" for the node
whose child represents the root of the document. (3) The super root node can serve
as a container for processing instructions outside the root element. For
these reasons, it is possible to create an extra super root node, whose child
546 CLASS="PROGRAMLISTING"
>doc # root = sr &&
sr # sub_nodes = [ r ]</PRE
551 When extra nodes are also created for processing instructions, these nodes can
552 be added to the super root node if they occur outside the root element (reason
553 (3)), and the order reflects the order in the source text.</P
556 >Note that you will need an exemplar for the super root node
559 >make_spec_from_alist</TT
565 >What is the effect of the UTF-8 encoding? </B
566 >By default, the parser represents strings (with few
567 exceptions) as ISO-8859-1 strings. These are well-known, and there are tools
568 and fonts for this encoding.</P
571 >However, internationalization may require that you switch over
572 to UTF-8 encoding. In most environments, the immediate effect will be that you
573 cannot read strings with character codes >= 160 any longer; your terminal will
574 only show funny glyph combinations. It is strongly recommended to install
576 HREF="http://czyborra.com/unifont/"
581 HREF="http://www.cl.cam.ac.uk/~mgk25/download/ucs-fonts.tar.gz"
583 >Markus Kuhn's fonts</A
585 HREF="http://myweb.clark.net/pub/dickey/xterm/xterm.html"
588 that can handle UTF-8 byte sequences</A
589 >. Furthermore, a Unicode editor may
590 be helpful (such as <A
591 HREF="ftp://metalab.unc.edu/pub/Linux/apps/editors/X/"
596 HREF="http://www.cl.cam.ac.uk/~mgk25/unicode.html"
609 > all strings originating from the parsed XML
610 document are represented as UTF-8 strings. This includes not only character
611 data and attribute values but also element names, attribute names and so on, as
612 it is possible to use any Unicode letter to form such names. Strictly
613 speaking, PXP is only XML-compliant if the UTF-8 mode is used; otherwise it
614 will have difficulties when validating documents containing
615 non-ISO-8859-1-names.</P
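The difference between the two representations can be seen without PXP. The following sketch re-encodes an ISO-8859-1 string as UTF-8 using only the OCaml standard library (Buffer.add_utf_8_uchar requires OCaml 4.06 or later). The function name utf8_of_latin1 is made up for this illustration and is not part of PXP.

```ocaml
(* Re-encode an ISO-8859-1 string as UTF-8. Every ISO-8859-1 byte has the
   same number as its Unicode code point, so each byte maps directly. *)
let utf8_of_latin1 s =
  let buf = Buffer.create (2 * String.length s) in
  String.iter
    (fun c -> Buffer.add_utf_8_uchar buf (Uchar.of_int (Char.code c)))
    s;
  Buffer.contents buf

(* "cafe" with e-acute (code 0xE9) takes 4 bytes in ISO-8859-1 ... *)
let latin1 = "caf\xE9"

(* ... and 5 bytes in UTF-8, because 0xE9 becomes the pair 0xC3 0xA9. *)
let utf8 = utf8_of_latin1 latin1
```

A terminal that expects ISO-8859-1 displays the two bytes 0xC3 0xA9 as two separate glyphs, which is the effect described above for character codes >= 160.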
617 >This mode does not have any impact on the external
618 representation of documents. The character set assumed when reading a document
is set in the XML declaration, and the character set used when writing a document must
>How do I check that the nodes referred to by IDREF attributes exist? </B
629 >First, you must create an index of all occurring ID
633 CLASS="PROGRAMLISTING"
>let index = new hash_index</PRE
637 This index must be passed to the parsing function:
640 CLASS="PROGRAMLISTING"
>parse_document_entity
  ~id_index:(index :> index)
  config source spec</PRE
646 Next, you must turn on the <TT
652 CLASS="PROGRAMLISTING"
>let config = { default_config with idref_pass = true }</PRE
Note that the whole document tree will now be traversed, and every node will be
checked for IDREF and IDREFS attributes. If the tree is big, this may take some
time.</P
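What such a check has to do can be illustrated with a toy model that does not use PXP at all. In the sketch below, the record type node and the functions index_ids, dangling and check are hypothetical names invented for the illustration. PXP's own node class and index class are different, and PXP fills the index while parsing rather than in a separate pass.

```ocaml
(* Toy document tree; this is not PXP's node class. *)
type node = {
  id : string option;      (* value of the ID attribute, if any *)
  idrefs : string list;    (* values of all IDREF/IDREFS attributes *)
  children : node list;
}

(* Build the ID index: map every ID value to the node carrying it. *)
let rec index_ids tbl n =
  (match n.id with
   | Some i -> Hashtbl.replace tbl i n
   | None -> ());
  List.iter (index_ids tbl) n.children

(* Traverse the whole tree and collect references without a target. *)
let rec dangling tbl n =
  List.filter (fun r -> not (Hashtbl.mem tbl r)) n.idrefs
  @ List.concat (List.map (dangling tbl) n.children)

(* Returns the list of IDREF values that refer to no existing node. *)
let check root =
  let tbl = Hashtbl.create 16 in
  index_ids tbl root;
  dangling tbl root

let sample =
  { id = Some "a"; idrefs = [];
    children = [ { id = Some "b"; idrefs = [ "a"; "c" ]; children = [] } ] }
```

Every node is visited, so the cost of the check grows with the size of the tree, which matches the remark above.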
664 >What are deterministic content models? </B
>Models of this type can speed up the validation checks;
furthermore, they ensure SGML compatibility. A content model is
deterministic if the parser can determine which alternative is actually used by
inspecting only the current token. For example, this element has
non-deterministic contents:
672 CLASS="PROGRAMLISTING"
><!ELEMENT x ((u,v) | (u,y+) | v)></PRE
676 If the first element in <TT
683 parser does not know which of the alternatives <TT
690 > will work; the parser must also inspect the second
691 element to be able to distinguish between the alternatives. Because such
692 look-ahead (or "guessing") is required, this example is
693 non-deterministic.</P
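The "guessing" can be made concrete with a small backtracking matcher for this content model. This is a minimal sketch under stated assumptions: the type model and the functions matches and valid are invented for the illustration and do not reflect PXP's internal representation or algorithm.

```ocaml
(* Toy content models over element names; not PXP's data structure. *)
type model =
  | Elem of string        (* a single child element *)
  | Seq of model list     (* (a, b, ...) *)
  | Alt of model list     (* (a | b | ...) *)
  | Plus of model         (* a+ *)

(* [matches m input k] matches a prefix of [input] against [m] and passes
   the remaining children to the continuation [k]. When [k] fails, the
   next alternative is tried: this is the backtracking step. *)
let rec matches m input k =
  match m with
  | Elem name ->
      (match input with
       | x :: rest when x = name -> k rest
       | _ -> false)
  | Seq [] -> k input
  | Seq (m1 :: ms) ->
      matches m1 input (fun rest -> matches (Seq ms) rest k)
  | Alt ms -> List.exists (fun alt -> matches alt input k) ms
  | Plus body ->
      matches body input (fun rest ->
        k rest
        (* repeat only if the body consumed something, to avoid looping *)
        || (List.compare_lengths rest input < 0 && matches m rest k))

(* The children are valid if the model consumes all of them. *)
let valid m children = matches m children (fun rest -> rest = [])

(* The model ((u,v) | (u,y+) | v) from the example above. *)
let x_model =
  Alt [ Seq [ Elem "u"; Elem "v" ];
        Seq [ Elem "u"; Plus (Elem "y") ];
        Elem "v" ]
```

For the children u, y, y the matcher first commits to the alternative (u,v), fails at the second element, and only then tries (u,y+). A deterministic model never needs such a second attempt.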
>The XML standard requires content models to be
deterministic, so it is recommended to turn the option
700 >accept_only_deterministic_models</TT
701 > on; however, PXP can also
702 process non-deterministic models using a backtracking algorithm.</P
704 >Deterministic models ensure that validation can be performed in
705 linear time. In order to get the maximum benefits, PXP also implements a
706 special validator that profits from deterministic models; this is the
707 deterministic finite automaton (DFA). This validator is enabled per element
708 type if the element type has a deterministic model and if the option
714 >In general, I expect that the DFA method is faster than the
715 backtracking method; especially in the worst case the DFA takes only linear
time. However, if the content model has only a few alternatives and the
717 alternatives do not nest, the backtracking algorithm may be better.</P
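To illustrate why a DFA needs only linear time, here is a hand-written automaton for the model ((u,(v|y+))|v). This model accepts the same child sequences as the non-deterministic example ((u,v) | (u,y+) | v), but it is deterministic. The state names and the functions step, accepting and dfa_valid are assumptions made for this sketch; PXP derives its automata from the DTD and does not use this code.

```ocaml
(* States of a hand-built DFA for the model ((u,(v|y+))|v). *)
type state =
  | Start      (* nothing read yet *)
  | After_u    (* read u; expecting v or the first y *)
  | In_ys      (* read u and at least one y *)
  | Done       (* read "u v" or "v"; no further child allowed *)

(* Exactly one transition per state and element name: no guessing. *)
let step state child =
  match state, child with
  | Start, "u" -> Some After_u
  | Start, "v" -> Some Done
  | After_u, "v" -> Some Done
  | After_u, "y" -> Some In_ys
  | In_ys, "y" -> Some In_ys
  | _ -> None

let accepting = function
  | In_ys | Done -> true
  | Start | After_u -> false

(* Each child is inspected once, so the run is linear in the input. *)
let dfa_valid children =
  let rec run state = function
    | [] -> accepting state
    | c :: rest ->
        (match step state c with
         | Some next -> run next rest
         | None -> false)
  in
  run Start children
```

The automaton never revisits a child element, whereas a backtracking matcher may scan the same children again for each alternative it tries.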