The PXP user's guide
Chapter 3. The objects representing the document

3.4. Details of the mapping from XML text to the tree representation

3.4.1. The representation of character-free elements

If an element declaration does not allow the element to contain character data, the following rules apply.

If the element must be empty, i.e. it is declared with the keyword EMPTY, the element instance must be effectively empty (it must not even contain whitespace characters). The parser guarantees that a declared EMPTY element never contains a data node, not even a data node representing the empty string.

If the element declaration permits only other elements within the element, but no character data, it is still possible to insert whitespace characters between the subelements. The parser ignores these characters, too, and does not create data nodes for them.

Example. Consider the following element types:

<!ELEMENT x ( #PCDATA | z )* >
<!ELEMENT y ( z )* >
<!ELEMENT z EMPTY>

Only x may contain character data; the keyword #PCDATA indicates this. The other types are character-free.

The XML term

<x><z/> <z/></x>

will be internally represented by an element node for x with three subnodes: the first z element, a data node containing the space character, and the second z element. In contrast to this, the term

<y><z/> <z/></y>

is represented by an element node for y with only two subnodes, the two z elements. There is no data node for the space character because spaces are ignored in the character-free element y.
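The difference between x and y can be illustrated with a small model. The following is only a sketch in plain OCaml: the node type, the character_free list, and make_element are hypothetical names invented for this illustration and are not part of the PXP API.

```ocaml
(* Hypothetical miniature tree model; these are not PXP's node classes. *)
type node =
  | Element of string * node list
  | Data of string

(* True if the string consists only of XML whitespace (or is empty). *)
let is_whitespace s =
  let n = String.length s in
  let rec check i =
    i >= n
    || (match s.[i] with
        | ' ' | '\t' | '\n' | '\r' -> check (i + 1)
        | _ -> false)
  in
  check 0

(* Element types declared without #PCDATA in the example DTD. *)
let character_free = [ "y"; "z" ]

(* Build an element node; whitespace-only data children are dropped when
   the element type is character-free. Non-whitespace data is kept here;
   this sketch does not model validation errors. *)
let make_element name children =
  let keep = function
    | Data s when List.mem name character_free && is_whitespace s -> false
    | _ -> true
  in
  Element (name, List.filter keep children)

let z = Element ("z", [])
let x_node = make_element "x" [ z; Data " "; z ]
let y_node = make_element "y" [ z; Data " "; z ]

(* Number of direct subnodes of a node. *)
let child_count = function
  | Element (_, children) -> List.length children
  | Data _ -> 0
```

In this model, child_count x_node is 3 and child_count y_node is 2, matching the two representations described above.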

3.4.2. The representation of character data

The XML specification allows all Unicode characters in XML texts. This parser can be configured such that UTF-8 is used to represent the characters internally; however, the default character encoding is ISO-8859-1. (Currently, no other encodings are possible for the internal string representation; the type Pxp_types.rep_encoding enumerates the possible encodings. In principle, the parser could use any encoding that is ASCII-compatible, but there are currently only lexical analyzers for UTF-8 and ISO-8859-1. It is currently impossible to use UTF-16 or UCS-4 as internal encodings (or other multibyte encodings that are not ASCII-compatible) unless major parts of the parser are rewritten, which is unlikely.)

The internal encoding may differ from the external encoding (specified in the XML declaration <?xml ... encoding="..."?>); in this case the strings are automatically converted to the internal encoding.

If the internal encoding is ISO-8859-1, the document may contain characters that cannot be represented. In this case, the parser drops such characters and reports a warning (to the collect_warning object that must be passed when the parser is called).
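To make the effect of a restricted internal encoding concrete, here is a minimal sketch in plain OCaml. It assumes the input is already available as a list of Unicode code points; the function to_latin1 and the wording of the warning messages are hypothetical and do not reflect how PXP implements the conversion or what its warnings look like.

```ocaml
(* Convert a list of Unicode code points to an ISO-8859-1 string.
   Code points above 0xFF have no ISO-8859-1 representation: they are
   dropped, and a message is recorded for each of them. The result is
   the converted string together with the messages in input order. *)
let to_latin1 code_points =
  let buf = Buffer.create (List.length code_points) in
  let warnings = ref [] in
  List.iter
    (fun cp ->
      if cp >= 0 && cp <= 0xFF then Buffer.add_char buf (Char.chr cp)
      else
        warnings :=
          Printf.sprintf "code point U+%04X cannot be represented" cp
          :: !warnings)
    code_points;
  (Buffer.contents buf, List.rev !warnings)
```

For example, to_latin1 [0x41; 0x20AC; 0xE9] keeps "A" and "é" as single bytes, drops the euro sign, and returns one message.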

The XML specification allows lines to be separated by single LF characters, by CR LF character sequences, or by single CR characters. Internally, these separators are always converted to single LF characters.
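The conversion of line separators can be sketched as follows. This is an illustration in plain OCaml with a hypothetical function name, not the code PXP uses in its lexical analyzers.

```ocaml
(* Convert CR LF pairs and single CR characters to single LF characters.
   Existing LF characters are copied unchanged. *)
let normalize_line_ends s =
  let n = String.length s in
  let buf = Buffer.create n in
  let rec loop i =
    if i < n then begin
      match s.[i] with
      | '\r' ->
          Buffer.add_char buf '\n';
          (* Skip the LF of a CR LF pair so the pair yields one LF only. *)
          if i + 1 < n && s.[i + 1] = '\n' then loop (i + 2)
          else loop (i + 1)
      | c ->
          Buffer.add_char buf c;
          loop (i + 1)
    end
  in
  loop 0;
  Buffer.contents buf
```

Applied to "a\r\nb\rc\nd", the function returns the same four letters separated by single LF characters.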

The parser guarantees that there are never two adjacent data nodes; if necessary, data material that would otherwise be represented by several nodes is collapsed into one node. Note that you can still create node trees with adjacent data nodes; however, the parser does not return such trees.
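If you build trees yourself and want the same invariant, adjacent data nodes can be merged in a separate pass. The following sketch uses a hypothetical variant type rather than PXP's node objects, so it only illustrates the idea.

```ocaml
(* Hypothetical miniature tree model; these are not PXP's node classes. *)
type node =
  | Element of string * node list
  | Data of string

(* Collapse every run of adjacent data nodes into a single data node,
   recursing into the children of element nodes. *)
let rec merge_data nodes =
  match nodes with
  | Data a :: Data b :: rest -> merge_data (Data (a ^ b) :: rest)
  | Element (name, children) :: rest ->
      Element (name, merge_data children) :: merge_data rest
  | node :: rest -> node :: merge_data rest
  | [] -> []
```

For instance, merge_data [Data "a"; Data "b"; Element ("z", []); Data "c"] yields a list with the single data node "ab", the z element, and the data node "c".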

Note that CDATA sections are not represented specially; the contents of such sections are added to the current data material that is being collected for the next data node.

3.4.3. The representation of entities within documents

Entities are not represented within documents! If the parser finds an entity reference in the document content, the reference is immediately expanded, and the parser reads the expansion text instead of the reference.
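The effect of immediate expansion can be sketched on plain strings. This is a deliberately simplified illustration: the entity table and the function expand are hypothetical, the replacement text is treated as plain text rather than parsed as markup, and there is no protection against recursive entity definitions.

```ocaml
(* Hypothetical entity table: entity name -> replacement text. *)
let entities = [ ("author", "Jane &surname;"); ("surname", "Doe") ]

(* Replace every reference &name; whose name is in the table by its
   replacement text, expanding nested references recursively.
   References to unknown names are copied unchanged. *)
let rec expand table text =
  let n = String.length text in
  let buf = Buffer.create n in
  let rec loop i =
    if i < n then begin
      if text.[i] = '&' then begin
        match String.index_from_opt text i ';' with
        | Some j ->
            let name = String.sub text (i + 1) (j - i - 1) in
            (match List.assoc_opt name table with
             | Some replacement ->
                 Buffer.add_string buf (expand table replacement)
             | None -> Buffer.add_string buf (String.sub text i (j - i + 1)));
            loop (j + 1)
        | None ->
            (* No terminating semicolon: copy the remainder verbatim. *)
            Buffer.add_string buf (String.sub text i (n - i))
      end
      else begin
        Buffer.add_char buf text.[i];
        loop (i + 1)
      end
    end
  in
  loop 0;
  Buffer.contents buf
```

With this table, expand entities "By &author;." produces "By Jane Doe."; the resulting text contains no trace of the references, which mirrors the fact that the tree does not record them.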

3.4.4. The representation of attributes

Attribute values are composed of Unicode characters, too, so the same character-encoding problems arise as for character data. Attribute values are also converted to the internal encoding; if there are characters that cannot be represented, they are dropped and a warning is printed.

Attribute values are normalized before they are returned by methods like attribute. First, any remaining entity references are expanded; if necessary, expansion is performed recursively. Second, newline characters (any of LF, CR LF, or CR) are converted to single space characters. Note that the latter action in particular is prescribed by the XML standard (but the character reference &#10; is not converted, so it is still possible to include line feeds in attribute values).
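The newline rule can be sketched on the raw attribute text. Under the XML standard, literal line breaks become spaces, whereas a character reference to LF (&#10;) yields a real line feed. The function below is a hypothetical illustration in plain OCaml; it handles only these two cases and ignores all other references.

```ocaml
(* True if [prefix] occurs in [s] starting at index [i]. *)
let starts_with_at s i prefix =
  let m = String.length prefix in
  i + m <= String.length s && String.sub s i m = prefix

(* Normalize the raw text of an attribute value:
   - literal CR LF, CR, and LF each become one space;
   - the character reference &#10; becomes a literal LF. *)
let normalize_attribute raw =
  let n = String.length raw in
  let buf = Buffer.create n in
  let rec loop i =
    if i >= n then ()
    else if starts_with_at raw i "&#10;" then begin
      Buffer.add_char buf '\n';
      loop (i + 5)
    end
    else if starts_with_at raw i "\r\n" then begin
      Buffer.add_char buf ' ';
      loop (i + 2)
    end
    else if raw.[i] = '\r' || raw.[i] = '\n' then begin
      Buffer.add_char buf ' ';
      loop (i + 1)
    end
    else begin
      Buffer.add_char buf raw.[i];
      loop (i + 1)
    end
  in
  loop 0;
  Buffer.contents buf
```

For the raw text "a\r\nb\nc&#10;d", the two literal line breaks become spaces, and only the character reference produces a line feed in the result.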

3.4.5. The representation of processing instructions

Processing instructions are parsed to some extent: The first word of the PI is called the target, and it is stored separately from the rest of the PI:

<?target rest?>

The exact location where a PI occurs is not represented (by default). The parser puts the PI into the object that represents the enclosing construct (an element, a DTD, or the whole document); that means you can find out which PIs occur in a certain element, in the DTD, or in the whole document, but you cannot look up the exact position within the construct.
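The split into target and rest can be sketched as follows. The function split_pi is a hypothetical helper written for this illustration; it operates on the text between <? and ?> and says nothing about how PXP stores processing instructions internally.

```ocaml
(* Split the text of a processing instruction (without the enclosing
   <? and ?>) into the target, i.e. the first word, and the rest.
   Whitespace between the target and the rest is discarded. *)
let split_pi content =
  let n = String.length content in
  let is_space c = c = ' ' || c = '\t' || c = '\n' || c = '\r' in
  (* Index of the first whitespace character at or after i, or n. *)
  let rec find_end i =
    if i < n && not (is_space content.[i]) then find_end (i + 1) else i
  in
  (* Index of the first non-whitespace character at or after i, or n. *)
  let rec skip i =
    if i < n && is_space content.[i] then skip (i + 1) else i
  in
  let target_end = find_end 0 in
  let rest_start = skip target_end in
  ( String.sub content 0 target_end,
    String.sub content rest_start (n - rest_start) )
```

For example, split_pi "target rest" returns the pair ("target", "rest"), and a PI that consists of a target only has the empty string as its rest.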

If you require the exact location of PIs, it is possible to create extra nodes for them. This mode is controlled by the option enable_pinstr_nodes. The additional nodes have the node type T_pinstr target, and are created from special exemplars contained in the spec (see pxp_document.mli).

3.4.6. The representation of comments

Normally, comments are not represented; they are dropped by default. However, if you require them, it is possible to create T_comment nodes for them. This mode can be specified by the option enable_comment_nodes. Comment nodes are created from special exemplars contained in the spec (see pxp_document.mli). You can access the contents of comments through the method comment.

3.4.7. The attributes `xml:lang` and `xml:space`

These attributes are not supported specially; they are handled like any other attribute.

3.4.8. And what about namespaces?

Currently, there is no special support for namespaces. However, the parser allows the colon to occur in names, so it is possible to implement namespaces on top of the current API.
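A first building block for such a layer is splitting a name at the colon. The helper below is a hypothetical sketch in plain OCaml and is not part of PXP; mapping prefixes to namespace URIs would have to be added on top of it.

```ocaml
(* Split a name at its first colon into an optional prefix and a local
   name. Names without a colon have no prefix. *)
let split_qname name =
  match String.index_opt name ':' with
  | Some i ->
      ( Some (String.sub name 0 i),
        String.sub name (i + 1) (String.length name - i - 1) )
  | None -> (None, name)
```

For example, split_qname "xsl:template" returns (Some "xsl", "template"), while split_qname "para" returns (None, "para").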

Some future release of PXP will support namespaces as a built-in feature.
