2.2. How to parse a document from an application

Let me first give a rough overview of the object model of the parser. The following items are represented by objects:

Additionally, the following modules play a role:

There are some further modules that are needed internally but are not part of the API.

Let the document to be parsed be stored in a file called doc.xml. The parsing process is started by calling the function

val parse_document_entity : config -> source -> 'ext spec -> 'ext document
defined in the module Pxp_yacc. The first argument specifies some global properties of the parser; it is recommended to start with the default_config. The second argument determines where the document to be parsed comes from; this may be a file, a channel, or an entity ID. To parse doc.xml, it is sufficient to pass from_file "doc.xml".

The third argument passes the object specification to use. Roughly speaking, it determines which classes implement the node objects of which element types, and which extensions are to be used. The type variable 'ext stands for the type of the extension. For the moment, let us simply pass default_spec as this argument, and ignore it.

So the following expression parses doc.xml:

open Pxp_yacc
let d = parse_document_entity default_config (from_file "doc.xml") default_spec
Note that default_config implies that warnings are collected but not printed. Errors raise one of the exceptions defined in Pxp_types. To get readable errors and warnings, install a warner object and catch the exceptions as follows:
class warner =
  object 
    method warn w =
      print_endline ("WARNING: " ^ w)
  end
;;

try
  let config = { default_config with warner = new warner } in
  let d = parse_document_entity config (from_file "doc.xml") default_spec
  in
    ...
with
   e ->
     print_endline (Pxp_types.string_of_exn e)
Now d is an object of the document class. If you want the node tree, you can get the root element by
let root = d # root
and if you would rather access the DTD, you can get it by
let dtd = d # dtd
As it is more interesting, let us investigate the node tree now. Given the root element, it is possible to recursively traverse the whole tree. The children of a node n are returned by the method sub_nodes, and the type of a node is returned by node_type. This function traverses the tree, and prints the type of each node:
let rec print_structure n =
  let ntype = n # node_type in
  match ntype with
    T_element name ->
      print_endline ("Element of type " ^ name);
      let children = n # sub_nodes in
      List.iter print_structure children
  | T_data ->
      print_endline "Data"
  | _ ->
      (* Other node types are not possible unless the parser is configured
         differently.
       *)
      assert false
You can call this function by
print_structure root
The type returned by node_type is either T_element name or T_data. The name of the element type is the string included in the angle brackets. Note that only elements have children; data nodes are always leaves of the tree.
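To experiment with this kind of traversal without a parsed document, the following minimal sketch models a node as a plain O'Caml object offering only node_type and sub_nodes. The types node_type and node and the helper make_node are stand-ins invented for this illustration, not PXP's own definitions. Instead of printing, element_names returns the element type names in document order:

```ocaml
(* Stand-in for the two node types discussed above; not PXP's definition. *)
type node_type =
  | T_element of string
  | T_data

(* Hypothetical, minimal node interface: only the two methods used here. *)
type node = < node_type : node_type; sub_nodes : node list >

(* Hypothetical helper that builds a mock node from its type and children. *)
let make_node (ntype : node_type) (children : node list) : node =
  object
    method node_type = ntype
    method sub_nodes = children
  end

(* Return the names of n and of all elements below it, in document order. *)
let rec element_names (n : node) : string list =
  match n # node_type with
  | T_element name ->
      name :: List.concat (List.map element_names (n # sub_nodes))
  | T_data ->
      (* Data nodes are leaves and contribute no element name. *)
      []

(* Mock tree corresponding to <a><b>text</b><c/></a> *)
let root =
  make_node (T_element "a")
    [ make_node (T_element "b") [ make_node T_data [] ];
      make_node (T_element "c") [] ]

let () = List.iter print_endline (element_names root)
```

Returning a list instead of printing makes the result easy to check or to process further; with the mock tree above, the names are a, b, and c in that order.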

There are some more methods in order to access a parsed node tree:

For example, if the task is to print all contents of elements with type "valuable" whose attribute "priority" is "1", this function can help:
let rec print_valuable_prio1 n =
  let ntype = n # node_type in
  match ntype with
    T_element "valuable" when n # attribute "priority" = Value "1" ->
      print_endline "Valuable node with priority 1 found:";
      print_endline (n # data)
  | (T_element _ | T_data) ->
      let children = n # sub_nodes in
      List.iter print_valuable_prio1 children
  | _ ->
      assert false
You can call this function by:
print_valuable_prio1 root
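The same selection logic can be tried out on a mock tree. The sketch below is only an illustration under stated assumptions: the types node_type, att_value and node, and the helpers make_data and make_element, are hypothetical stand-ins rather than PXP's classes. In particular, the mock assumes that data on an element yields the concatenated character data of its children, and that a missing attribute is reported as Implied_value. The function collects the matching contents in a list instead of printing them:

```ocaml
(* Stand-ins for the types used in the text; not PXP's own definitions. *)
type node_type = T_element of string | T_data
type att_value = Value of string | Implied_value

(* Hypothetical node interface with just the four methods needed here. *)
type node =
  < node_type : node_type;
    sub_nodes : node list;
    attribute : string -> att_value;
    data : string >

(* Mock data node: a leaf carrying a string. *)
let make_data (s : string) : node =
  object
    method node_type = T_data
    method sub_nodes : node list = []
    method attribute (_ : string) : att_value = Implied_value
    method data = s
  end

(* Mock element node with an attribute list and children. *)
let make_element name (atts : (string * string) list)
    (children : node list) : node =
  object
    method node_type = T_element name
    method sub_nodes = children
    method attribute a =
      match List.assoc_opt a atts with
      | Some v -> Value v
      | None -> Implied_value
    (* Assumption of this mock: element data is the children's data joined. *)
    method data = String.concat "" (List.map (fun c -> c # data) children)
  end

(* Collect the contents of all "valuable" elements with priority "1". *)
let rec valuable_prio1 (n : node) : string list =
  match n # node_type with
  | T_element "valuable" when n # attribute "priority" = Value "1" ->
      [ n # data ]
  | T_element _ | T_data ->
      List.concat (List.map valuable_prio1 (n # sub_nodes))

let root =
  make_element "list" []
    [ make_element "valuable" [ ("priority", "1") ] [ make_data "gold" ];
      make_element "valuable" [ ("priority", "2") ] [ make_data "tin" ];
      make_element "item" []
        [ make_element "valuable" [ ("priority", "1") ] [ make_data "silver" ] ] ]

let () = List.iter print_endline (valuable_prio1 root)
```

As in the function above, a matching element is not searched further, so "valuable" elements nested inside a match would not be reported separately.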
If you like a DSSSL-like style, you can make the function process_children explicit:
let rec print_valuable_prio1 n =

  let process_children n =
    let children = n # sub_nodes in
    List.iter print_valuable_prio1 children 
  in

  let ntype = n # node_type in
  match ntype with
    T_element "valuable" when n # attribute "priority" = Value "1" ->
      print_endline "Valuable node with priority 1 found:";
      print_endline (n # data)
  | (T_element _ | T_data) ->
      process_children n
  | _ ->
      assert false
Used this way, O'Caml already serves as a simple "style-sheet language": you can form a big "match" expression to distinguish between all significant cases, and react differently to each condition. But this technique has limitations; the "match" expression tends to get larger and larger, and it is difficult to store intermediate values as there is only one big recursion. Alternatively, it is also possible to represent the various cases as classes, and to use dynamic method lookup to find the appropriate class. The next section explains this technique in detail.

Notes

[1]

Elements may also contain processing instructions. Unlike other document models, PXP separates processing instructions from the rest of the text and provides a second interface to access them (method pinstr). However, there is a parser option (enable_pinstr_nodes) which changes the behaviour of the parser such that extra nodes for processing instructions are inserted into the tree.

Furthermore, the tree normally does not contain nodes for XML comments; they are ignored by default. Again, there is an option (enable_comment_nodes) changing this.

[2]

Because of the type system, it is more or less impossible to derive recursive classes in O'Caml. To get around this, it is common practice to put the modifiable or extensible part of recursive objects into parallel objects.