4 >How to parse a document from an application</TITLE
7 CONTENT="Modular DocBook HTML Stylesheet Version 1.46"><LINK
9 TITLE="The PXP user's guide"
10 HREF="index.html"><LINK
13 HREF="c533.html"><LINK
16 HREF="c533.html"><LINK
18 TITLE="Class-based processing of the node tree"
19 HREF="x675.html"><LINK
22 HREF="markup.css"></HEAD
41 >The PXP user's guide</TH
56 >Chapter 2. Using <SPAN
79 >2.2. How to parse a document from an application</A
82 >Let me first give a rough overview of the object model of the parser. The
83 following items are represented by objects:
90 STYLE="list-style-type: disc"
95 > The document representation is more or less the
96 anchor for the application; all accesses to the parsed entities start here. It
97 is described by the class <TT
100 > contained in the module
104 >. You can get some global information, such
105 as the XML declaration the document begins with, the DTD of the document,
106 global processing instructions, and, most importantly, the document tree.</P
109 STYLE="list-style-type: disc"
113 >The contents of documents:</I
114 > The contents have the structure
115 of a tree: Elements contain other elements and text<A
121 The common type to represent both kinds of content is <TT
125 which is a class type that unifies the properties of elements and character
126 data. Every node has a list of children (which is empty if the element is empty
127 or the node represents text); nodes may have attributes; nodes always have text
128 contents. There are two implementations of <TT
135 > for elements, and the class
139 > for text data. You find these classes and class
140 types in the module <TT
145 >Note that attribute lists are represented by non-class values.</P
148 STYLE="list-style-type: disc"
152 >The node extension:</I
153 > For advanced usage, every node of the
154 document may have an associated <I
158 a second object. This object must have the three methods
169 > as a bare minimum, but you are free to add methods as
170 you want. This is the preferred way to add functionality to the document
175 >. The class type <TT
185 STYLE="list-style-type: disc"
190 > Sometimes it is necessary to access the DTD of a
191 document; the average application does not need this feature. The class
195 > describes DTDs, and makes it possible to get
196 representations of element, entity, and notation declarations as well as
197 processing instructions contained in the DTD. This class, and
207 >proc_instruction</TT
208 > can be found in the module
212 >. There are a couple of classes representing
213 different kinds of entities; these can be found in the module
222 Additionally, the following modules play a role:
229 STYLE="list-style-type: disc"
234 > Here the main parsing functions such as
237 >parse_document_entity</TT
238 > are located. Some additional types and
239 functions allow the parser to be configured in a non-standard way.</P
242 STYLE="list-style-type: disc"
247 > This is a collection of basic types and
253 There are some further modules that are needed internally but are not part of
256 >Let the document to be parsed be stored in a file called
260 >. The parsing process is started by calling the
264 CLASS="PROGRAMLISTING"
265 >val parse_document_entity : config -> source -> 'ext spec -> 'ext document</PRE
268 defined in the module <TT
271 >. The first argument
272 specifies some global properties of the parser; it is recommended to start with
276 >. The second argument determines where the
277 document to be parsed comes from; this may be a file, a channel, or an entity
281 >, it is sufficient to pass
284 >from_file "doc.xml"</TT
287 >The third argument passes the object specification to use. Roughly
288 speaking, it determines which classes implement the node objects of which
289 element types, and which extensions are to be used. The <TT
293 polymorphic variable is the type of the extension. For the moment, let us
297 > as this argument, and ignore it.</P
299 >So the following expression parses <TT
305 CLASS="PROGRAMLISTING"
307 let d = parse_document_entity default_config (from_file "doc.xml") default_spec</PRE
313 > implies that warnings are collected
314 but not printed. Errors raise one of the exceptions defined in
318 >; to get readable errors and warnings, catch the
319 exceptions as follows:
322 CLASS="PROGRAMLISTING"
326 print_endline ("WARNING: " ^ w)
331 let config = { default_config with warner = new warner } in
332 let d = parse_document_entity config (from_file "doc.xml") default_spec
337 print_endline (Pxp_types.string_of_exn e)</PRE
343 > is an object of the <TT
347 class. If you want the node tree, you can get the root element by
350 CLASS="PROGRAMLISTING"
351 >let root = d # root</PRE
354 and if you would rather access the DTD, obtain it by
357 CLASS="PROGRAMLISTING"
358 >let dtd = d # dtd</PRE
361 As it is more interesting, let us investigate the node tree now. Given the root
362 element, it is possible to recursively traverse the whole tree. The children of
366 > are returned by the method
370 >, and the type of a node is returned by
374 >. This function traverses the tree, and prints the
378 CLASS="PROGRAMLISTING"
379 >let rec print_structure n =
380 let ntype = n # node_type in
382 T_element name ->
383 print_endline ("Element of type " ^ name);
384 let children = n # sub_nodes in
385 List.iter print_structure children
389 (* Other node types are not possible unless the parser is configured
395 You can call this function by
398 CLASS="PROGRAMLISTING"
399 >print_structure root</PRE
402 The type returned by <TT
416 element type is the string included in the angle brackets. Note that only
417 elements have children; data nodes are always leaves of the tree.</P
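To experiment with this traversal without a parsed document at hand, the following sketch models the two node kinds with a small class type of its own. The names `node`, `element_node`, `data_node`, `structure` and `sample` are stand-ins invented for this illustration and are not the PXP classes; only the methods `node_type` and `sub_nodes` and the constructors `T_element` and `T_data` mirror what the text describes. The function collects the lines instead of printing them directly, so the result can be inspected.

```ocaml
(* Toy stand-in for the node tree; NOT the real PXP classes. *)
type node_type =
  | T_element of string   (* element node, carrying the element type *)
  | T_data                (* character data node *)

class type node = object
  method node_type : node_type
  method sub_nodes : node list
end

(* Hypothetical helper classes for building a tree by hand. *)
class element_node (name : string) (children : node list) : node = object
  method node_type = T_element name
  method sub_nodes = children
end

class data_node : node = object
  method node_type = T_data
  method sub_nodes : node list = []   (* data nodes are always leaves *)
end

(* Same recursion as print_structure, but collecting the lines. *)
let rec structure (n : node) : string list =
  match n # node_type with
  | T_element name ->
      ("Element of type " ^ name)
      :: List.concat (List.map structure (n # sub_nodes))
  | T_data -> [ "Data" ]

(* A hand-built tree corresponding roughly to <a><b>text</b>text</a>. *)
let sample : node =
  new element_node "a"
    [ new element_node "b" [ new data_node ];
      new data_node ]

let () = List.iter print_endline (structure sample)
```

Because both kinds of node share one class type, the recursion needs no casts: the match on `node_type` is the only place where elements and data are told apart.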
419 >Several further methods give access to a parsed node tree:
426 STYLE="list-style-type: disc"
431 >: Returns the parent node, or raises
435 > if the node is already the root</P
438 STYLE="list-style-type: disc"
443 >: Returns the root of the node tree. </P
446 STYLE="list-style-type: disc"
451 >: Returns the value of the attribute with
455 >. The method returns a value for every
459 > attribute, independently of whether the attribute
460 instance is defined or not. If the attribute is not declared,
464 > will be raised. (In well-formedness mode, every
465 attribute is considered as being implicitly declared with type
471 >The following return values are possible: <TT
482 The first two value types indicate that the attribute value is available,
483 either because there is a definition
498 in the XML text, or because there is a default value (declared in the
499 DTD). Only if both the instance definition and the default declaration are
500 missing will the last value <TT
503 > be returned.</P
505 >In the DTD, every attribute is typed. There are single-value types (CDATA, ID,
506 IDREF, ENTITY, NMTOKEN, enumerations), in which case the method passes
514 string value of the attribute. The other types (IDREFS, ENTITIES, NMTOKENS)
515 represent list values, and the parser splits the XML literal into several
516 tokens and returns these tokens as <TT
521 >Normalization means that entity references (the
541 by the text they represent, and that white space characters are converted into
545 STYLE="list-style-type: disc"
550 >: Returns the character data contained in the
551 node. For data nodes, the meaning is obvious as this is the main content of
552 data nodes. For element nodes, this method returns the concatenated contents of
553 all inner data nodes.</P
555 >Note that entity references included in the text are resolved while they are
556 being parsed; for example the text "a &lt;&gt; b" will be returned
557 as "a <> b" by this method. Spaces of data nodes are always
558 preserved. Newlines are preserved, but always converted to \n characters even
559 if newlines are encoded as \r\n or \r. Normally you will never see two adjacent
560 data nodes because the parser collapses all data material at one location into
561 one node. (However, if you create your own tree or transform the parsed tree,
562 it is possible to have adjacent data nodes.)</P
564 >Note that elements that do <I
567 > allow #PCDATA as content
568 will not have data nodes as children. This means that spaces and newlines, the
569 only character material allowed for such elements, are silently dropped.</P
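Returning to the values produced by the attribute method described above, the three cases can be handled with a single match. The sketch below redeclares the value type locally so that it compiles on its own; the constructor names are assumed to mirror the type defined in the PXP types module and should be checked against the reference. The helpers `tokens_of_att_value` and `split_tokens` are hypothetical names introduced only for this illustration, and the splitting shown is a simplification of what the parser does for list-valued attributes.

```ocaml
(* Local copy of the attribute value type so the sketch is self-contained;
   assumed to mirror the type used by PXP. *)
type att_value =
  | Value of string            (* single-value types: CDATA, ID, NMTOKEN, ... *)
  | Valuelist of string list   (* list types: IDREFS, ENTITIES, NMTOKENS *)
  | Implied_value              (* neither an instance value nor a default *)

(* Hypothetical helper: flatten an attribute value into a list of tokens. *)
let tokens_of_att_value (v : att_value) : string list =
  match v with
  | Value s -> [ s ]
  | Valuelist l -> l
  | Implied_value -> []

(* Hypothetical helper: split a list-valued literal at white space, roughly
   what happens to an IDREFS or NMTOKENS literal. *)
let split_tokens (literal : string) : string list =
  String.map
    (fun c -> if c = '\t' || c = '\n' || c = '\r' then ' ' else c)
    literal
  |> String.split_on_char ' '
  |> List.filter (fun tok -> tok <> "")

let () =
  let v = Valuelist (split_tokens "  id1 id2\n id3 ") in
  List.iter print_endline (tokens_of_att_value v)
```

Matching on all three constructors makes the compiler flag any case the application forgets, which is the main benefit of representing attribute values as a variant rather than as a plain string.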
574 For example, if the task is to print all contents of elements with type
575 "valuable" whose attribute "priority" is "1", this function can help:
578 CLASS="PROGRAMLISTING"
579 >let rec print_valuable_prio1 n =
580 let ntype = n # node_type in
582 T_element "valuable" when n # attribute "priority" = Value "1" ->
583 print_endline "Valuable node with priority 1 found:";
584 print_endline (n # data)
585 | (T_element _ | T_data) ->
586 let children = n # sub_nodes in
587 List.iter print_valuable_prio1 children
592 You can call this function by:
595 CLASS="PROGRAMLISTING"
596 >print_valuable_prio1 root</PRE
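The same search can be tried on a hand-built tree. The sketch below is a toy model, not the PXP classes: `element_node`, `data_node`, `valuable_prio1` and `sample` are hypothetical names, and the model assumes every attribute behaves as if it were declared #IMPLIED, so a missing instance value yields `Implied_value` instead of raising an exception. The `data` method of the element model concatenates the contents of the inner data nodes, as described for element nodes above.

```ocaml
(* Toy stand-in for nodes with attributes and data; NOT the real PXP classes. *)
type node_type = T_element of string | T_data
type att_value = Value of string | Valuelist of string list | Implied_value

class type node = object
  method node_type : node_type
  method sub_nodes : node list
  method attribute : string -> att_value
  method data : string
end

(* Hypothetical constructor for a data node. *)
let data_node (text : string) : node = object
  method node_type = T_data
  method sub_nodes = []
  method attribute _ = Implied_value
  method data = text
end

(* Hypothetical constructor for an element node. *)
let element_node name atts (children : node list) : node = object
  method node_type = T_element name
  method sub_nodes = children
  (* Assumption: attributes are treated as declared #IMPLIED. *)
  method attribute a = try List.assoc a atts with Not_found -> Implied_value
  (* For elements: concatenated contents of all inner data nodes. *)
  method data = String.concat "" (List.map (fun c -> c # data) children)
end

(* Collect the contents of "valuable" elements whose priority is "1". *)
let rec valuable_prio1 (n : node) : string list =
  match n # node_type with
  | T_element "valuable" when n # attribute "priority" = Value "1" ->
      [ n # data ]
  | T_element _ | T_data ->
      List.concat (List.map valuable_prio1 (n # sub_nodes))

let sample =
  element_node "list" []
    [ element_node "valuable" [ ("priority", Value "1") ] [ data_node "gold" ];
      element_node "valuable" [ ("priority", Value "2") ] [ data_node "tin" ];
      element_node "box" []
        [ element_node "valuable" [ ("priority", Value "1") ]
            [ data_node "sil"; data_node "ver" ] ] ]

let () = List.iter print_endline (valuable_prio1 sample)
```

As in the function above, the search does not descend into a matching element; it only recurses through elements that do not match.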
599 If you like a DSSSL-like style, you can make the function
602 >process_children</TT
606 CLASS="PROGRAMLISTING"
607 >let rec print_valuable_prio1 n =
609 let process_children n =
610 let children = n # sub_nodes in
611 List.iter print_valuable_prio1 children
614 let ntype = n # node_type in
616 T_element "valuable" when n # attribute "priority" = Value "1" ->
617 print_endline "Valuable node with priority 1 found:";
618 print_endline (n # data)
619 | (T_element _ | T_data) ->
625 Used this way, O'Caml serves as a simple "style-sheet language": you can form a big
626 "match" expression to distinguish between all significant cases, and
627 react differently to each condition. But this technique has
628 limitations; the "match" expression tends to get larger and larger, and it is
629 difficult to store intermediate values as there is only one big
630 recursion. Alternatively, it is also possible to represent the various cases as
631 classes, and to use dynamic method lookup to find the appropriate class. The
632 next section explains this technique in detail. </P
648 HREF="x550.html#AEN562"
657 also contain processing instructions. Unlike other document models, <SPAN
661 separates processing instructions from the rest of the text and provides a
662 second interface to access them (method <TT
666 there is a parser option (<TT
668 >enable_pinstr_nodes</TT
670 the behaviour of the parser such that extra nodes for processing instructions
671 are included in the tree.</P
673 >Furthermore, the tree normally does not contain nodes for XML comments;
674 they are ignored by default. Again, there is an option
677 >enable_comment_nodes</TT
688 HREF="x550.html#AEN582"
696 >Due to the type system it is more or less impossible to
697 derive recursive classes in O'Caml. To get around this, it is common practice
698 to put the modifiable or extensible part of recursive objects into parallel
759 >Class-based processing of the node tree</TD