The PXP user's guide
Chapter 3. The objects representing the document

3.4. Details of the mapping from XML text to the tree representation

3.4.1. The representation of character-free elements

If an element declaration does not allow the element to contain character data, the following rules apply.

If the element must be empty, i.e. it is declared with the keyword EMPTY, the element instance must be effectively empty (it must not even contain whitespace characters). The parser guarantees that a declared EMPTY element never contains a data node, not even one that represents the empty string.

If the element declaration permits only other elements to occur within that element, but not character data, it is still possible to insert whitespace characters between the subelements. The parser ignores these characters and does not create data nodes for them.

Example. Consider the following element types:

<!ELEMENT x ( #PCDATA | z )* >
<!ELEMENT y ( z )* >
<!ELEMENT z EMPTY>

Only x may contain character data, as the keyword #PCDATA indicates. The other types are character-free.

The XML term

<x><z/> <z/></x>

will be internally represented by an element node for x with three subnodes: the first z element, a data node containing the space character, and the second z element. In contrast to this, the term

<y><z/> <z/></y>

is represented by an element node for y with only two subnodes, the two z elements. There is no data node for the space character because spaces are ignored in the character-free element y.
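The difference between x and y in the example can be sketched with a small stand-alone model. This is not PXP code: the `node` type, the `allows_pcdata` lookup (a stand-in for the DTD), and the `strip` function are hypothetical names chosen for illustration. The real parser never creates the whitespace nodes in the first place, and it rejects invalid content, whereas this sketch only filters an already built tree.

```ocaml
(* Hypothetical miniature tree type; PXP's real node objects are richer. *)
type node =
  | Element of string * node list
  | Data of string

(* Hypothetical stand-in for the DTD: does the element type admit #PCDATA?
   This mirrors the declarations of x, y and z in the example. *)
let allows_pcdata = function
  | "x" -> true
  | _ -> false

let is_xml_space c = c = ' ' || c = '\t' || c = '\n' || c = '\r'

(* True if the string consists of XML whitespace only (or is empty). *)
let is_whitespace_only s =
  let rec go i = i >= String.length s || (is_xml_space s.[i] && go (i + 1)) in
  go 0

(* Remove whitespace-only data nodes below character-free elements. *)
let rec strip node =
  match node with
  | Data _ -> node
  | Element (name, children) ->
      let children = List.map strip children in
      let keep = function
        | Data s when not (allows_pcdata name) -> not (is_whitespace_only s)
        | _ -> true
      in
      Element (name, List.filter keep children)
```

Applied to the two terms above, `strip` leaves the three subnodes of x untouched and reduces y to its two z elements.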

3.4.2. The representation of character data

The XML specification allows all Unicode characters in XML texts. This parser can be configured such that UTF-8 is used to represent the characters internally; however, the default character encoding is ISO-8859-1. (Currently, no other encodings are possible for the internal string representation; the type Pxp_types.rep_encoding enumerates the possible encodings. In principle, the parser could use any encoding that is ASCII-compatible, but there are currently only lexical analyzers for UTF-8 and ISO-8859-1. It is currently impossible to use UTF-16 or UCS-4 as internal encodings (or other multibyte encodings which are not ASCII-compatible) unless major parts of the parser are rewritten, which is unlikely.)

The internal encoding may be different from the external encoding (specified in the XML declaration <?xml ... encoding="..."?>); in this case the strings are automatically converted to the internal encoding.

If the internal encoding is ISO-8859-1, the document may contain characters that cannot be represented. In this case, the parser ignores such characters and reports a warning (to the collect_warning object that must be passed when the parser is called).
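The dropping of unrepresentable characters can be pictured with the following sketch. It is an illustration only, not the parser's conversion routine: `to_latin1` is a hypothetical function that takes already decoded Unicode code points, and the `warn` callback is an assumed stand-in for the warning object mentioned above.

```ocaml
(* Sketch: build an ISO-8859-1 string from Unicode code points.
   Code points above 0xFF have no ISO-8859-1 representation; they are
   dropped and reported through the (hypothetical) warn callback. *)
let to_latin1 ~warn code_points =
  let buf = Buffer.create (List.length code_points) in
  List.iter
    (fun cp ->
      if cp >= 0 && cp <= 0xFF then Buffer.add_char buf (Char.chr cp)
      else
        warn
          (Printf.sprintf
             "code point U+%04X cannot be represented in ISO-8859-1" cp))
    code_points;
  Buffer.contents buf
```

For example, with the code points for "A", the euro sign (U+20AC) and "é" (U+00E9), the result contains only "A" and "é", and one warning is issued for the euro sign.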

The XML specification allows lines to be separated by single LF characters, by CR LF character sequences, or by single CR characters. Internally, these separators are always converted to single LF characters.
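A minimal sketch of this line separator conversion, assuming the text is available as an OCaml string in an ASCII-compatible encoding. The function name `normalize_newlines` is hypothetical; the parser performs this step internally while reading.

```ocaml
(* Convert CR LF pairs and single CR characters to single LF characters. *)
let normalize_newlines s =
  let n = String.length s in
  let buf = Buffer.create n in
  let rec go i =
    if i < n then
      match s.[i] with
      | '\r' ->
          Buffer.add_char buf '\n';
          (* Skip the LF of a CR LF pair so the pair yields one LF only. *)
          if i + 1 < n && s.[i + 1] = '\n' then go (i + 2) else go (i + 1)
      | c ->
          Buffer.add_char buf c;
          go (i + 1)
  in
  go 0;
  Buffer.contents buf
```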

The parser guarantees that there are never two adjacent data nodes; if necessary, data material that would otherwise be represented by several nodes is collapsed into one node. Note that you can still construct node trees with adjacent data nodes yourself; the parser, however, never returns such trees.

Note that CDATA sections are not represented specially; the contents of such sections are added to the data material that is currently being collected for the next data node.
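The collapsing of adjacent data material, including text that originally came from CDATA sections, can be sketched as follows. The `node` type and `merge_data` are hypothetical and only model the guarantee; the parser itself collects the material before it creates a node, rather than merging nodes afterwards.

```ocaml
(* Hypothetical miniature tree type for illustration. *)
type node =
  | Element of string * node list
  | Data of string

(* Collapse runs of adjacent data nodes into a single data node,
   recursively in all subelements. *)
let rec merge_data nodes =
  match nodes with
  | Data a :: Data b :: rest -> merge_data (Data (a ^ b) :: rest)
  | Element (name, children) :: rest ->
      Element (name, merge_data children) :: merge_data rest
  | (Data _ as d) :: rest -> d :: merge_data rest
  | [] -> []
```

A hand-built tree with adjacent data nodes can be brought into the form the parser would return by passing its node list through `merge_data`.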

3.4.3. The representation of entities within documents

Entities are not represented within documents! If the parser finds an entity reference in the document content, the reference is immediately expanded, and the parser reads the expansion text instead of the reference.

3.4.4. The representation of attributes

Because attribute values are also composed of Unicode characters, the same character encoding problems arise as for character data. Attribute values are likewise converted to the internal encoding; characters that cannot be represented are dropped, and a warning is printed.

Attribute values are normalized before they are returned by methods like attribute. First, any remaining entity references are expanded; if necessary, expansion is performed recursively. Second, newline characters (any of LF, CR LF, or CR characters) are converted to single space characters. Note that especially the latter action is prescribed by the XML standard (but the character reference &#10; is not converted, so that it is still possible to include line feeds in attribute values).
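The second normalization step can be sketched in isolation. The function name `normalize_attribute_newlines` is hypothetical; the sketch handles only literal line breaks in the attribute text and leaves entity expansion aside.

```ocaml
(* Replace each literal line break (LF, CR LF, or CR) in an attribute
   value by a single space character. *)
let normalize_attribute_newlines s =
  let n = String.length s in
  let buf = Buffer.create n in
  let rec go i =
    if i < n then
      match s.[i] with
      | '\r' ->
          Buffer.add_char buf ' ';
          (* A CR LF pair counts as one line break, hence one space. *)
          if i + 1 < n && s.[i + 1] = '\n' then go (i + 2) else go (i + 1)
      | '\n' ->
          Buffer.add_char buf ' ';
          go (i + 1)
      | c ->
          Buffer.add_char buf c;
          go (i + 1)
  in
  go 0;
  Buffer.contents buf
```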

3.4.5. The representation of processing instructions

Processing instructions are parsed to some extent: the first word of the PI is called the target, and it is stored separately from the rest of the PI:

<?target rest?>

The exact location where a PI occurs is not represented (by default). The parser puts the PI into the object that represents the enclosing construct (an element, a DTD, or the whole document); that means you can find out which PIs occur in a certain element, in the DTD, or in the whole document, but you cannot look up the exact position within the construct.
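The split into target and rest can be illustrated with a small stand-alone function. `split_pi` is a hypothetical helper, not part of PXP; it assumes it receives the text between `<?` and `?>`.

```ocaml
let is_xml_space c = c = ' ' || c = '\t' || c = '\n' || c = '\r'

(* Split the content of a processing instruction into (target, rest).
   The target is the first word; the whitespace after it is skipped. *)
let split_pi content =
  let n = String.length content in
  let rec find_space i =
    if i < n && not (is_xml_space content.[i]) then find_space (i + 1) else i
  in
  let rec skip_space i =
    if i < n && is_xml_space content.[i] then skip_space (i + 1) else i
  in
  let target_end = find_space 0 in
  let rest_start = skip_space target_end in
  ( String.sub content 0 target_end,
    String.sub content rest_start (n - rest_start) )
```

For the PI shown above, `split_pi "target rest"` yields the pair `("target", "rest")`.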

If you require the exact location of PIs, it is possible to create extra nodes for them. This mode is controlled by the option enable_pinstr_nodes. The additional nodes have the node type T_pinstr target, and are created from special exemplars contained in the spec (see pxp_document.mli).

3.4.6. The representation of comments

By default, comments are not represented; they are dropped. However, if you require them, it is possible to create T_comment nodes for them. This mode is enabled by the option enable_comment_nodes. Comment nodes are created from special exemplars contained in the spec (see pxp_document.mli). You can access the contents of comments through the method comment.

3.4.7. The attributes `xml:lang` and `xml:space`

These attributes are not supported specially; they are handled like any other attribute.

3.4.8. And what about namespaces?

Currently, there is no special support for namespaces. However, the parser allows the colon to occur in names, so it is possible to implement namespaces on top of the current API.

Some future release of PXP will support namespaces as a built-in feature.
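As a rough idea of what building namespaces on top of colon-containing names could look like, the following sketch splits a name at its first colon and looks the prefix up in a table. Both `split_qname` and `expand` are hypothetical helpers, and the `bindings` association list is an assumption: a real implementation would derive it from the `xmlns` attributes in scope.

```ocaml
(* Split a name like "xsl:template" into (Some "xsl", "template").
   Names without a colon have no prefix. *)
let split_qname name =
  match String.index_opt name ':' with
  | Some i ->
      ( Some (String.sub name 0 i),
        String.sub name (i + 1) (String.length name - i - 1) )
  | None -> (None, name)

(* Resolve the prefix against a list of (prefix, namespace URI) pairs.
   Returns (Some uri, local name) if the prefix is bound. *)
let expand bindings name =
  match split_qname name with
  | Some prefix, local -> (List.assoc_opt prefix bindings, local)
  | None, local -> (None, local)
```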
