The PXP user's guide
Chapter 2. Using PXP

2.2. How to parse a document from an application

Let me first give a rough overview of the object model of the parser. The following items are represented by objects:

Documents: The document representation is more or less the anchor for the application; all accesses to the parsed entities start here. It is described by the class document contained in the module Pxp_document. From it you can get some global information, such as the XML declaration the document begins with, the DTD of the document, global processing instructions, and, most importantly, the document tree.
The contents of documents: The contents have the structure of a tree: Elements contain other elements and text[1]. The common type to represent both kinds of content is node, which is a class type that unifies the properties of elements and character data. Every node has a list of children (which is empty if the element is empty or the node represents text); nodes may have attributes; nodes always have text contents. There are two implementations of node: the class element_impl for elements, and the class data_impl for text data. You find these classes and class types in the module Pxp_document, too.
Note that attribute lists are represented by non-class values.
The node extension: For advanced usage, every node of the document may have an associated extension, which is simply a second object. This object must have at least the three methods clone, node, and set_node, but you are free to add further methods. This is the preferred way to add functionality to the document tree[2]. The class type extension is defined in Pxp_document, too.
The DTD: Sometimes it is necessary to access the DTD of a document; the average application does not need this feature. The class dtd describes DTDs, and makes it possible to get representations of element, entity, and notation declarations as well as processing instructions contained in the DTD. This class, and dtd_element, dtd_notation, and proc_instruction can be found in the module Pxp_dtd. There are a couple of classes representing different kinds of entities; these can be found in the module Pxp_entity.

Additionally, the following modules play a role:

Pxp_yacc: Here the main parsing functions such as parse_document_entity are located. Some additional types and functions allow the parser to be configured in a non-standard way.
Pxp_types: This is a collection of basic types and exceptions.

There are some further modules that are needed internally but are not part of the API.

Let the document to be parsed be stored in a file called doc.xml. The parsing process is started by calling the function

val parse_document_entity : config -> source -> 'ext spec -> 'ext document

defined in the module Pxp_yacc. The first argument specifies some global properties of the parser; it is recommended to start with default_config. The second argument determines where the document to be parsed comes from; this may be a file, a channel, or an entity ID. To parse doc.xml, it is sufficient to pass from_file "doc.xml".

The third argument passes the object specification to use. Roughly speaking, it determines which classes implement the node objects of which element types, and which extensions are to be used. The type variable 'ext is the type of the extension. For the moment, let us simply pass default_spec as this argument, and ignore it.

So the following expression parses doc.xml:

open Pxp_yacc
let d = parse_document_entity default_config (from_file "doc.xml") default_spec

Note that default_config implies that warnings are collected but not printed. Errors raise one of the exceptions defined in Pxp_types; to get readable errors and warnings, install a warner object and catch the exceptions as follows:

class warner =
  object
    (* Called by the parser for every warning *)
    method warn w =
      print_endline ("WARNING: " ^ w)
  end
;;

try
  let config = { default_config with warner = new warner } in
  let d = parse_document_entity config (from_file "doc.xml") default_spec
  in
    ...
with
   e ->
     (* Convert any parser exception into a readable message *)
     print_endline (Pxp_types.string_of_exn e)

Now d is an object of the document class. If you want the node tree, you can get the root element by

let root = d # root

and if you would rather access the DTD, you can get it by

let dtd = d # dtd

As it is more interesting, let us investigate the node tree now. Given the root element, it is possible to recursively traverse the whole tree. The children of a node n are returned by the method sub_nodes, and the type of a node is returned by node_type. This function traverses the tree, and prints the type of each node:

let rec print_structure n =
  let ntype = n # node_type in
  match ntype with
    T_element name ->
      print_endline ("Element of type " ^ name);
      let children = n # sub_nodes in
      List.iter print_structure children
  | T_data ->
      print_endline "Data"
  | _ ->
      (* Other node types are not possible unless the parser is configured
         differently.
       *)
      assert false

You can call this function by

print_structure root

The value returned by node_type is either T_element name or T_data. The name of the element type is the string included in the angle brackets of the tag. Note that only elements have children; data nodes are always leaves of the tree.
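The same traversal can also show the nesting depth. The following is only a sketch: the type node_type and the class mock_node are local stand-ins that I introduce so that the example compiles without PXP; they are not part of the library. With PXP installed, you would delete the stand-ins and pass real nodes, whose node_type may have further cases. The names structure_lines and print_structure_indented are made up for this illustration.

```ocaml
(* Stand-ins for the PXP definitions, so that this sketch compiles on its
   own. They only mimic the two methods used here. *)
type node_type = T_element of string | T_data

class mock_node (t : node_type) (children : mock_node list) =
  object
    method node_type = t
    method sub_nodes = children
  end

(* Returns one line per node, indented by two spaces per tree level. *)
let rec structure_lines depth n =
  let indent = String.make (2 * depth) ' ' in
  match n # node_type with
    T_element name ->
      (indent ^ "Element of type " ^ name)
      :: List.concat (List.map (structure_lines (depth + 1)) (n # sub_nodes))
  | T_data ->
      [ indent ^ "Data" ]

let print_structure_indented n =
  List.iter print_endline (structure_lines 0 n)

(* A small hand-built tree: <a><b>text</b>text</a> *)
let sample =
  new mock_node (T_element "a")
    [ new mock_node (T_element "b") [ new mock_node T_data [] ];
      new mock_node T_data [] ]

let () = print_structure_indented sample
```

Building the lines first and printing them afterwards keeps the traversal free of side effects, which makes it easier to reuse the result elsewhere.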

There are some more methods to access a parsed node tree:

n # parent: Returns the parent node, or raises Not_found if the node is already the root.
n # root: Returns the root of the node tree.
n # attribute a: Returns the value of the attribute with name a. The method returns a value for every declared attribute, independently of whether the attribute instance is defined or not. If the attribute is not declared, Not_found will be raised. (In well-formedness mode, every attribute is considered as being implicitly declared with type CDATA.)
The following return values are possible: Value s, Valuelist sl, and Implied_value. The first two value types indicate that the attribute value is available, either because there is a definition a="value" in the XML text, or because there is a default value (declared in the DTD). Only if both the instance definition and the default declaration are missing will the last value, Implied_value, be returned.
In the DTD, every attribute is typed. There are single-value types (CDATA, ID, IDREF, ENTITY, NMTOKEN, enumerations), in which case the method passes Value s back, where s is the normalized string value of the attribute. The other types (IDREFS, ENTITIES, NMTOKENS) represent list values, and the parser splits the XML literal into several tokens and returns these tokens as Valuelist sl.
Normalization means that entity references (the &name; tokens) and character references (&#number;) are replaced by the text they represent, and that white space characters are converted into plain spaces.
n # data: Returns the character data contained in the node. For data nodes, the meaning is obvious as this is the main content of data nodes. For element nodes, this method returns the concatenated contents of all inner data nodes.
Note that entity references included in the text are resolved while they are being parsed; for example, the text "a &lt;&gt; b" will be returned as "a <> b" by this method. Spaces in data nodes are always preserved. Newlines are preserved, but always converted to \n characters, even if they are encoded as \r\n or \r. Normally you will never see two adjacent data nodes because the parser collapses all data material at one location into one node. (However, if you create your own tree or transform the parsed tree, it is possible to have adjacent data nodes.)
Note that elements that do not allow #PCDATA as content will not have data nodes as children. This means that spaces and newlines, the only character material allowed for such elements, are silently dropped.
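To illustrate the three possible results of n # attribute, here is a minimal sketch of a helper that turns any of them into a plain string. The type att_value below is a local stand-in that mirrors the three constructors named above, so that the example compiles without PXP; with the library installed you would use its own definition instead. The function string_of_att_value and its optional argument default are my own names, not part of PXP.

```ocaml
(* Stand-in mirroring the three attribute values described in the text. *)
type att_value =
    Value of string
  | Valuelist of string list
  | Implied_value

(* Single values are returned as they are, list values are joined with
   spaces, and a missing value is replaced by the given default. *)
let string_of_att_value ?(default = "") v =
  match v with
    Value s -> s
  | Valuelist sl -> String.concat " " sl
  | Implied_value -> default

let () =
  print_endline (string_of_att_value (Valuelist ["id1"; "id2"]))
```

With a real node you would write string_of_att_value (n # attribute "priority"). Remember that the attribute method itself still raises Not_found for undeclared attributes, which this helper does not catch.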

For example, if the task is to print all contents of elements with type "valuable" whose attribute "priority" is "1", this function can help:

let rec print_valuable_prio1 n =
  let ntype = n # node_type in
  match ntype with
    T_element "valuable" when n # attribute "priority" = Value "1" ->
      print_endline "Valuable node with priority 1 found:";
      print_endline (n # data)
  | (T_element _ | T_data) ->
      (* Not a match: descend into the children (data nodes have none) *)
      let children = n # sub_nodes in
      List.iter print_valuable_prio1 children
  | _ ->
      assert false

You can call this function by:

print_valuable_prio1 root
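If you would rather collect the matching contents than print them, the same recursion can return a list. This is a sketch under assumptions: node_type, att_value, and mock_node are local stand-ins so that the example compiles without PXP, and the mock's data method simply returns a stored string instead of concatenating inner data nodes. The name collect_valuable_prio1 is hypothetical.

```ocaml
(* Stand-ins for the PXP definitions; they mimic only what is used here. *)
type node_type = T_element of string | T_data
type att_value = Value of string | Valuelist of string list | Implied_value

class mock_node (t : node_type) (atts : (string * att_value) list)
                (text : string) (children : mock_node list) =
  object
    method node_type = t
    method sub_nodes = children
    (* Raises Not_found for unknown names, like an undeclared attribute *)
    method attribute name = List.assoc name atts
    method data = text
  end

(* Returns the data of every "valuable" element with priority "1",
   in document order. Matching elements are not searched further. *)
let rec collect_valuable_prio1 n =
  match n # node_type with
    T_element "valuable" when n # attribute "priority" = Value "1" ->
      [ n # data ]
  | T_element _ | T_data ->
      List.concat (List.map collect_valuable_prio1 (n # sub_nodes))

let sample =
  new mock_node (T_element "list") [] ""
    [ new mock_node (T_element "valuable") ["priority", Value "1"] "gold" [];
      new mock_node (T_element "valuable") ["priority", Value "2"] "tin" [];
      new mock_node (T_element "box") [] ""
        [ new mock_node (T_element "valuable") ["priority", Value "1"]
            "silver" [] ] ]

let () = List.iter print_endline (collect_valuable_prio1 sample)
```

As in the printing version, the guard calls the attribute method, so every "valuable" element must have a declared "priority" attribute; otherwise Not_found escapes from the traversal.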

If you like a DSSSL-like style, you can make the function process_children explicit:

let rec print_valuable_prio1 n =

  let process_children n =
    let children = n # sub_nodes in
    List.iter print_valuable_prio1 children
  in

  let ntype = n # node_type in
  match ntype with
    T_element "valuable" when n # attribute "priority" = Value "1" ->
      print_endline "Valuable node with priority 1 found:";
      print_endline (n # data)
  | (T_element _ | T_data) ->
      process_children n
  | _ ->
      assert false

Used this way, O'Caml serves as a simple "style-sheet language": You can form a big "match" expression to distinguish between all significant cases, and provide different reactions to different conditions. But this technique has limitations; the "match" expression tends to get larger and larger, and it is difficult to store intermediate values as there is only one big recursion. Alternatively, it is also possible to represent the various cases as classes, and to use dynamic method lookup to find the appropriate class. The next section explains this technique in detail.


Notes