3.2. The class type node

From Pxp_document:

type node_type =
  T_data
| T_element of string
| T_super_root
| T_pinstr of string
| T_comment
and some other, reserved types
;;

class type [ 'ext ] node =
  object ('self)
    constraint 'ext = 'ext node #extension

    (* General observers *)

    method extension : 'ext
    method dtd : dtd
    method parent : 'ext node
    method root : 'ext node
    method sub_nodes : 'ext node list
    method iter_nodes : ('ext node -> unit) -> unit
    method iter_nodes_sibl : 
           ('ext node option -> 'ext node -> 'ext node option -> unit) -> unit
    method node_type : node_type
    method encoding : Pxp_types.rep_encoding
    method data : string
    method position : (string * int * int)
    method comment : string option
    method pinstr : string -> proc_instruction list
    method pinstr_names : string list
    method write : Pxp_types.output_stream -> Pxp_types.encoding -> unit

    (* Attribute observers *)

    method attribute : string -> Pxp_types.att_value
    method required_string_attribute : string -> string
    method optional_string_attribute : string -> string option
    method required_list_attribute : string -> string list
    method optional_list_attribute : string -> string list
    method attribute_names : string list
    method attribute_type : string -> Pxp_types.att_type
    method attributes : (string * Pxp_types.att_value) list
    method id_attribute_name : string
    method id_attribute_value : string
    method idref_attribute_names : string

    (* Modifying methods *)

    method add_node : ?force:bool -> 'ext node -> unit
    method add_pinstr : proc_instruction -> unit
    method delete : unit
    method set_nodes : 'ext node list -> unit
    method quick_set_attributes : (string * Pxp_types.att_value) list -> unit
    method set_comment : string option -> unit

    (* Cloning methods *)

    method orphaned_clone : 'self
    method orphaned_flat_clone : 'self
    method create_element : 
              ?position:(string * int * int) ->
              dtd -> node_type -> (string * string) list ->
                  'ext node
    method create_data : dtd -> string -> 'ext node
    method keep_always_whitespace_mode : unit

    (* Validating methods *)

    method local_validate : ?use_dfa:bool -> unit -> unit

    (* ... Internal methods are undocumented. *)

  end
;;
In the module Pxp_types you can find another type definition that is important in this context:
type Pxp_types.att_value =
    Value     of string
  | Valuelist of string list
  | Implied_value
;;

3.2.1. The structure of document trees

A node represents either an element or a character data section. There are two classes implementing the two aspects of nodes: element_impl and data_impl. The latter class does not implement all methods because some methods do not make sense for data nodes.

(Note: PXP also supports a mode which forces that processing instructions and comments are represented as nodes of the document tree. However, these nodes are instances of element_impl with node types T_pinstr and T_comment, respectively. This mode must be explicitly configured; the basic representation knows only element and data nodes.)

The following figure (A tree with element nodes, data nodes, and attributes) shows an example how a tree is constructed from element and data nodes. The circular areas represent element nodes whereas the ovals denote data nodes. Only elements may have subnodes; data nodes are always leaves of the tree. The subnodes of an element can be either element or data nodes; in both cases the O'Caml objects storing the nodes have the class type node.

Attributes (the clouds in the picture) are not directly integrated into the tree; there is always an extra link to the attribute list. This is also true for processing instructions (not shown in the picture). This means that there are separated access methods for attributes and processing instructions.

Figure 3-1. A tree with element nodes, data nodes, and attributes

Only elements, data sections, attributes and processing instructions (and comments, if configured) can, directly or indirectly, occur in the document tree. It is impossible to add entity references to the tree; if the parser finds such a reference, not the reference as such but the referenced text (i.e. the tree representing the structured text) is included in the tree.

Note that the parser collapses as much data material into one data node as possible such that there are normally never two adjacent data nodes. This invariant is enforced even if data material is included by entity references or CDATA sections, or if a data sequence is interrupted by comments. So a &amp; b <-- comment --> c <![CDATA[ <> d]]> is represented by only one data node, for instance. However, you can create document trees manually which break this invariant; it is only the way the parser forms the tree.

Figure 3-2. Nodes are doubly linked trees

The node tree has links in both directions: Every node has a link to its parent (if any), and it has links to the subnodes (see figure Nodes are doubly linked trees). Obviously, this doubly-linked structure simplifies the navigation in the tree; but has also some consequences for the possible operations on trees.

Because every node must have at most one parent node, operations are illegal if they violate this condition. The following figure (A node can only be added if it is a root) shows on the left side that node y is added to x as new subnode which is allowed because y does not have a parent yet. The right side of the picture illustrates what would happen if y had a parent node; this is illegal because y would have two parents after the operation.

Figure 3-3. A node can only be added if it is a root

The "delete" operation simply removes the links between two nodes. In the picture (A deleted node becomes the root of the subtree) the node x is deleted from the list of subnodes of y. After that, x becomes the root of the subtree starting at this node.

Figure 3-4. A deleted node becomes the root of the subtree

It is also possible to make a clone of a subtree; illustrated in The clone of a subtree. In this case, the clone is a copy of the original subtree except that it is no longer a subnode. Because cloning never keeps the connection to the parent, the clones are called orphaned.

Figure 3-5. The clone of a subtree

3.2.2. The methods of the class type node

General observers .

  • extension: The reference to the extension object which belongs to this node (see ...).

  • dtd: Returns a reference to the global DTD. All nodes of a tree must share the same DTD.

  • parent: Get the father node. Raises Not_found in the case the node does not have a parent, i.e. the node is the root.

  • root: Gets the reference to the root node of the tree. Every node is contained in a tree with a root, so this method always succeeds. Note that this method searches the root, which costs time proportional to the length of the path to the root.

  • sub_nodes: Returns references to the children. The returned list reflects the order of the children. For data nodes, this method returns the empty list.

  • iter_nodes f: Iterates over the children, and calls f for every child in turn.

  • iter_nodes_sibl f: Iterates over the children, and calls f for every child in turn. f gets as arguments the previous node, the current node, and the next node.

  • node_type: Returns either T_data which means that the node is a data node, or T_element n which means that the node is an element of type n. If configured, possible node types are also T_pinstr t indicating that the node represents a processing instruction with target t, and T_comment in which case the node is a comment.

  • encoding: Returns the encoding of the strings.

  • data: Returns the character data of this node and all children, concatenated as one string. The encoding of the string is what the method encoding returns. - For data nodes, this method simply returns the represented characters. For elements, the meaning of the method has been extended such that it returns something useful, i.e. the effectively contained characters, without markup. (For T_pinstr and T_comment nodes, the method returns the empty string.)

  • position: If configured, this method returns the position of the element as triple (entity, line, byteposition). For data nodes, the position is not stored. If the position is not available the triple "?", 0, 0 is returned.

  • comment: Returns Some text for comment nodes, and None for other nodes. The text is everything between the comment delimiters <-- and -->.

  • pinstr n: Returns all processing instructions that are directly contained in this element and that have a target specification of n. The target is the first word after the <?.

  • pinstr_names: Returns the list of all targets of processing instructions directly contained in this element.

  • write s enc: Prints the node and all subnodes to the passed output stream as valid XML text, using the passed external encoding.

Attribute observers .

  • attribute n: Returns the value of the attribute with name n. This method returns a value for every declared attribute, and it raises Not_found for any undeclared attribute. Note that it even returns a value if the attribute is actually missing but is declared as #IMPLIED or has a default value. - Possible values are:

    • Implied_value: The attribute has been declared with the keyword #IMPLIED, and the attribute is missing in the attribute list of this element.

    • Value s: The attribute has been declared as type CDATA, as ID, as IDREF, as ENTITY, or as NMTOKEN, or as enumeration or notation, and one of the two conditions holds: (1) The attribute value is present in the attribute list in which case the value is returned in the string s. (2) The attribute has been omitted, and the DTD declared the attribute with a default value. The default value is returned in s. - Summarized, Value s is returned for non-implied, non-list attribute values.

    • Valuelist l: The attribute has been declared as type IDREFS, as ENTITIES, or as NMTOKENS, and one of the two conditions holds: (1) The attribute value is present in the attribute list in which case the space-separated tokens of the value are returned in the string list l. (2) The attribute has been omitted, and the DTD declared the attribute with a default value. The default value is returned in l. - Summarized, Valuelist l is returned for all list-type attribute values.

    Note that before the attribute value is returned, the value is normalized. This means that newlines are converted to spaces, and that references to character entities (i.e. &#n;) and general entities (i.e. &name;) are expanded; if necessary, expansion is performed recursively.

    In well-formedness mode, there is no DTD which could declare an attribute. Because of this, every occuring attribute is considered as a CDATA attribute.

  • required_string_attribute n: returns the Value attribute called n, or the Valuelist attribute as a string where the list elements are separated by spaces. If the attribute value is implied, or if the attribute does not exists, the method will fail. - This method is convenient if you expect a non-implied and non-list attribute value.

  • optional_string_attribute n: returns the Value attribute called n, or the Valuelist attribute as a string where the list elements are separated by spaces. If the attribute value is implied, or if the attribute does not exists, the method returns None. - This method is convenient if you expect a non-list attribute value including the implied value.

  • required_list_attribute n: returns the Valuelist attribute called n, or the Value attribute as a list with a single element. If the attribute value is implied, or if the attribute does not exists, the method will fail. - This method is convenient if you expect a list attribute value.

  • optional_list_attribute n: returns the Valuelist attribute called n, or the Value attribute as a list with a single element. If the attribute value is implied, or if the attribute does not exists, an empty list will be returned. - This method is convenient if you expect a list attribute value or the implied value.

  • attribute_names: returns the list of all attribute names of this element. As this is a validating parser, this list is equal to the list of declared attributes.

  • attribute_type n: returns the type of the attribute called n. See the module Pxp_types for a description of the encoding of the types.

  • attributes: returns the list of pairs of names and values for all attributes of this element.

  • id_attribute_name: returns the name of the attribute that is declared with type ID. There is at most one such attribute. The method raises Not_found if there is no declared ID attribute for the element type.

  • id_attribute_value: returns the value of the attribute that is declared with type ID. There is at most one such attribute. The method raises Not_found if there is no declared ID attribute for the element type.

  • idref_attribute_names: returns the list of attribute names that are declared as IDREF or IDREFS.

Modifying methods . The following methods are only defined for element nodes (more exactly: the methods are defined for data nodes, too, but fail always).

  • add_node sn: Adds sub node sn to the list of children. This operation is illustrated in the picture A node can only be added if it is a root. This method expects that sn is a root, and it requires that sn and the current object share the same DTD.

    Because add_node is the method the parser itself uses to add new nodes to the tree, it performs by default some simple validation checks: If the content model is a regular expression, it is not allowed to add data nodes to this node unless the new nodes consist only of whitespace. In this case, the new data nodes are silently dropped (you can change this by invoking keep_always_whitespace_mode).

    If the document is flagged as stand-alone, these data nodes only containing whitespace are even forbidden if the element declaration is contained in an external entity. This case is detected and rejected.

    If the content model is EMPTY, it is not allowed to add any data node unless the data node is empty. In this case, the new data node is silently dropped.

    These checks only apply if there is a DTD. In well-formedness mode, it is assumed that every element is declared with content model ANY which prohibits any validation check. Furthermore, you turn these checks off by passing ~force:true as first argument.

  • add_pinstr pi: Adds the processing instruction pi to the list of processing instructions.

  • delete: Deletes this node from the tree. After this operation, this node is no longer the child of the former father node; and the node loses the connection to the father as well. This operation is illustrated by the figure A deleted node becomes the root of the subtree.

  • set_nodes nl: Sets the list of children to nl. It is required that every member of nl is a root, and that all members and the current object share the same DTD. Unlike add_node, no validation checks are performed.

  • quick_set_attributes atts: sets the attributes of this element to atts. It is not checked whether atts matches the DTD or not; it is up to the caller of this method to ensure this. (This method may be useful to transform the attribute values, i.e. apply a mapping to every attribute.)

  • set_comment text: This method is only applicable to T_comment nodes; it sets the comment text contained by such nodes.

Cloning methods .

  • orphaned_clone: Returns a clone of the node and the complete tree below this node (deep clone). The clone does not have a parent (i.e. the reference to the parent node is not cloned). While copying the subtree, strings are skipped; it is likely that the original tree and the copy tree share strings. Extension objects are cloned by invoking the clone method on the original objects; how much of the extension objects is cloned depends on the implemention of this method.

    This operation is illustrated by the figure The clone of a subtree.

  • orphaned_flat_clone: Returns a clone of the node, but sets the list of sub nodes to [], i.e. the sub nodes are not cloned.

  • create_element dtd nt al: Returns a flat copy of this node (which must be an element) with the following modifications: The DTD is set to dtd; the node type is set to nt, and the new attribute list is set to al (given as list of (name,value) pairs). The copy does not have children nor a parent. It does not contain processing instructions. See the example below.

    Note that you can specify the position of the new node by the optional argument ~position.

  • create_data dtd cdata: Returns a flat copy of this node (which must be a data node) with the following modifications: The DTD is set to dtd; the node type is set to T_data; the attribute list is empty (data nodes never have attributes); the list of children and PIs is empty, too (same reason). The new node does not have a parent. The value cdata is the new character content of the node. See the example below.

  • keep_always_whitespace_mode: Even data nodes which are normally dropped because they only contain ignorable whitespace, can added to this node once this mode is turned on. (This mode is useful to produce canonical XML.)

Validating methods . There is one method which locally validates the node, i.e. checks whether the subnodes match the content model of this node.

  • local_validate: Checks that this node conforms to the DTD by comparing the type of the subnodes with the content model for this node. (Applications need not call this method unless they add new nodes themselves to the tree.)

3.2.3. The class element_impl

This class is an implementation of node which realizes element nodes:

class [ 'ext ] element_impl : 'ext -> [ 'ext ] node

Constructor. You can create a new instance by

new element_impl extension_object
which creates a special form of empty element which already contains a reference to the extension_object, but is otherwise empty. This special form is called an exemplar. The purpose of exemplars is that they serve as patterns that can be duplicated and filled with data. The method create_element is designed to perform this action.

Example. First, create an exemplar by

let exemplar_ext = ... in
let exemplar     = new element_impl exemplar_ext in
The exemplar is not used in node trees, but only as a pattern when the element nodes are created:
let element = exemplar # create_element dtd (T_element name) attlist 
The element is a copy of exemplar (even the extension exemplar_ext has been copied) which ensures that element and its extension are objects of the same class as the exemplars; note that you need not to pass a class name or other meta information. The copy is initially connected with the dtd, it gets a node type, and the attribute list is filled. The element is now fully functional; it can be added to another element as child, and it can contain references to subnodes.

3.2.4. The class data_impl

This class is an implementation of node which should be used for all character data nodes:

class [ 'ext ] data_impl : 'ext -> [ 'ext ] node

Constructor. You can create a new instance by

new data_impl extension_object
which creates an empty exemplar node which is connected to extension_object. The node does not contain a reference to any DTD, and because of this it cannot be added to node trees.

To get a fully working data node, apply the method create_data to the exemplar (see example).

Example. First, create an exemplar by

let exemplar_ext = ... in
let exemplar     = new exemplar_ext data_impl in
The exemplar is not used in node trees, but only as a pattern when the data nodes are created:
let data_node = exemplar # create_data dtd "The characters contained in the data node" 
The data_node is a copy of exemplar. The copy is initially connected with the dtd, and it is filled with character material. The data_node is now fully functional; it can be added to an element as child.

3.2.5. The type spec

The type spec defines a way to handle the details of creating nodes from exemplars.

type 'ext spec
constraint 'ext = 'ext node #extension

val make_spec_from_mapping :
      ?super_root_exemplar : 'ext node ->
      ?comment_exemplar : 'ext node ->
      ?default_pinstr_exemplar : 'ext node ->
      ?pinstr_mapping : (string, 'ext node) Hashtbl.t ->
      data_exemplar: 'ext node ->
      default_element_exemplar: 'ext node ->
      element_mapping: (string, 'ext node) Hashtbl.t -> 
      unit -> 
        'ext spec

val make_spec_from_alist :
      ?super_root_exemplar : 'ext node ->
      ?comment_exemplar : 'ext node ->
      ?default_pinstr_exemplar : 'ext node ->
      ?pinstr_alist : (string * 'ext node) list ->
      data_exemplar: 'ext node ->
      default_element_exemplar: 'ext node ->
      element_alist: (string * 'ext node) list -> 
      unit -> 
        'ext spec
The two functions make_spec_from_mapping and make_spec_from_alist create spec values. Both functions are functionally equivalent and the only difference is that the first function prefers hashtables and the latter associative lists to describe mappings from names to exemplars.

You can specify exemplars for the various kinds of nodes that need to be generated when an XML document is parsed:

In most cases, you only want to create spec values to pass them to the parser functions found in Pxp_yacc. However, it might be useful to apply spec values directly.

The following functions create various types of nodes by selecting the corresponding exemplar from the passed spec value, and by calling create_element or create_data on the exemplar.

val create_data_node : 
      'ext spec -> 
      dtd -> 
      (* data material: *) string -> 
          'ext node

val create_element_node : 
      ?position:(string * int * int) ->
      'ext spec -> 
      dtd -> 
      (* element type: *) string -> 
      (* attributes: *) (string * string) list -> 
          'ext node

val create_super_root_node :
      ?position:(string * int * int) ->
      'ext spec -> 
       dtd -> 
           'ext node

val create_comment_node :
      ?position:(string * int * int) ->
      'ext spec -> 
      dtd -> 
      (* comment text: *) string -> 
          'ext node

val create_pinstr_node :
      ?position:(string * int * int) ->
      'ext spec -> 
      dtd -> 
      proc_instruction -> 
          'ext node

3.2.6. Examples

Building trees. Here is the piece of code that creates the tree of the figure A tree with element nodes, data nodes, and attributes. The extension object and the DTD are beyond the scope of this example.

let exemplar_ext = ... (* some extension *) in
let dtd = ... (* some DTD *) in

let element_exemplar = new element_impl exemplar_ext in
let data_exemplar    = new data_impl    exemplar_ext in

let a1 = element_exemplar # create_element dtd (T_element "a") ["att", "apple"]
and b1 = element_exemplar # create_element dtd (T_element "b") []
and c1 = element_exemplar # create_element dtd (T_element "c") []
and a2 = element_exemplar # create_element dtd (T_element "a") ["att", "orange"]
in

let cherries = data_exemplar # create_data dtd "Cherries" in
let orange   = data_exemplar # create_data dtd "An orange" in

a1 # add_node b1;
a1 # add_node c1;
b1 # add_node a2;
b1 # add_node cherries;
a2 # add_node orange;
Alternatively, the last block of statements could also be written as:
a1 # set_nodes [b1; c1];
b1 # set_nodes [a2; cherries];
a2 # set_nodes [orange];
The root of the tree is a1, i.e. it is true that
x # root == a1
for every x from { a1, a2, b1, c1, cherries, orange }.

Furthermore, the following properties hold:

  a1 # attribute "att" = Value "apple"
& a2 # attribute "att" = Value "orange"

& cherries # data = "Cherries"
&   orange # data = "An orange"
&       a1 # data = "CherriesAn orange"

&       a1 # node_type = T_element "a"
&       a2 # node_type = T_element "a"
&       b1 # node_type = T_element "b"
&       c1 # node_type = T_element "c"
& cherries # node_type = T_data
&   orange # node_type = T_data

&       a1 # sub_nodes = [ b1; c1 ]
&       a2 # sub_nodes = [ orange ]
&       b1 # sub_nodes = [ a2; cherries ]
&       c1 # sub_nodes = []
& cherries # sub_nodes = []
&   orange # sub_nodes = []

&       a2 # parent == a1
&       b1 # parent == b1
&       c1 # parent == a1
& cherries # parent == b1
&   orange # parent == a2

Searching nodes. The following function searches all nodes of a tree for which a certain condition holds:

let rec search p t =
  if p t then
    t :: search_list p (t # sub_nodes)
  else
    search_list p (t # sub_nodes)

and search_list p l =
  match l with
    []      -> []
  | t :: l' -> (search p t) @ (search_list p l')
;;

For example, if you want to search all elements of a certain type et, the function search can be applied as follows:

let search_element_type et t =
  search (fun x -> x # node_type = T_element et) t
;;

Getting attribute values. Suppose we have the declaration:

<!ATTLIST e a CDATA #REQUIRED
            b CDATA #IMPLIED
            c CDATA "12345">
In this case, every element e must have an attribute a, otherwise the parser would indicate an error. If the O'Caml variable n holds the node of the tree corresponding to the element, you can get the value of the attribute a by
let value_of_a = n # required_string_attribute "a"
which is more or less an abbreviation for
let value_of_a = 
  match n # attribute "a" with
    Value s -> s
  | _       -> assert false
- as the attribute is required, the attribute method always returns a Value.

In contrast to this, the attribute b can be omitted. In this case, the method required_string_attribute works only if the attribute is there, and the method will fail if the attribute is missing. To get the value, you can apply the method optional_string_attribute:

let value_of_b = n # optional_string_attribute "b"
Now, value_of_b is of type string option, and None represents the omitted attribute. Alternatively, you could also use attribute:
let value_of_b = 
  match n # attribute "b" with
    Value s       -> Some s
  | Implied_value -> None
  | _             -> assert false

The attribute c behaves much like a, because it has always a value. If the attribute is omitted, the default, here "12345", will be returned instead. Because of this, you can again use required_string_attribute to get the value.

The type CDATA is the most general string type. The types NMTOKEN, ID, IDREF, ENTITY, and all enumerators and notations are special forms of string types that restrict the possible values. From O'Caml, they behave like CDATA, i.e. you can use the methods required_string_attribute and optional_string_attribute, too.

In contrast to this, the types NMTOKENS, IDREFS, and ENTITIES mean lists of strings. Suppose we have the declaration:

<!ATTLIST f d NMTOKENS #REQUIRED
            e NMTOKENS #IMPLIED>
The type NMTOKENS stands for lists of space-separated tokens; for example the value "1 abc 23ef" means the list ["1"; "abc"; "23ef"]. (Again, IDREFS and ENTITIES have more restricted values.) To get the value of attribute d, one can use
let value_of_d = n # required_list_attribute "d"
or
let value_of_d = 
  match n # attribute "d" with
    Valuelist l -> l
  | _           -> assert false
As d is required, the attribute cannot be omitted, and the attribute method returns always a Valuelist.

For optional attributes like e, apply

let value_of_e = n # optional_list_attribute "e"
or
let value_of_e = 
  match n # attribute "e" with
    Valuelist l   -> l
  | Implied_value -> []
  | _             -> assert false
Here, the case that the attribute is missing counts like the empty list.

3.2.7. Iterators

There are also several iterators in Pxp_document; please see the mli file for details. You can find examples for them in the "simple_transformation" directory.

val find : ?deeply:bool -> 
           f:('ext node -> bool) -> 'ext node -> 'ext node

val find_all : ?deeply:bool ->
               f:('ext node -> bool) -> 'ext node -> 'ext node list

val find_element : ?deeply:bool ->
                   string -> 'ext node -> 'ext node

val find_all_elements : ?deeply:bool ->
                        string -> 'ext node -> 'ext node list

exception Skip
val map_tree :  pre:('exta node -> 'extb node) ->
               ?post:('extb node -> 'extb node) ->
               'exta node -> 
                   'extb node


val map_tree_sibl : 
        pre: ('exta node option -> 'exta node -> 'exta node option -> 
                  'extb node) ->
       ?post:('extb node option -> 'extb node -> 'extb node option -> 
                  'extb node) ->
       'exta node -> 
           'extb node

val iter_tree : ?pre:('ext node -> unit) ->
                ?post:('ext node -> unit) ->
                'ext node -> 
                    unit

val iter_tree_sibl :
       ?pre: ('ext node option -> 'ext node -> 'ext node option -> unit) ->
       ?post:('ext node option -> 'ext node -> 'ext node option -> unit) ->
       'ext node -> 
           unit