4 >Details of the mapping from XML text to the tree representation</TITLE
7 CONTENT="Modular DocBook HTML Stylesheet Version 1.46"><LINK
9 TITLE="The PXP user's guide"
10 HREF="index.html"><LINK
12 TITLE="The objects representing the document"
13 HREF="c893.html"><LINK
15 TITLE="The class type extension"
16 HREF="x1439.html"><LINK
18 TITLE="Configuring and calling the parser"
19 HREF="c1567.html"><LINK
22 HREF="markup.css"></HEAD
41 >The PXP user's guide</TH
56 >Chapter 3. The objects representing the document</TD
76 >3.4. Details of the mapping from XML text to the tree representation</A
84 >3.4.1. The representation of character-free elements</A
87 >If an element declaration does not allow the element to
88 contain character data, the following rules apply.</P
90 >If the element must be empty, i.e. it is declared with the
94 >, the element instance must be effectively
95 empty (it must not even contain whitespace characters). The parser guarantees
99 > element never contains a data
100 node, not even one that represents the empty string.</P
102 >If the element declaration only permits other elements to occur
103 within that element but not character data, it is still possible to insert
104 whitespace characters between the subelements. The parser ignores these
105 characters, too, and does not create data nodes for them.</P
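The whitespace rule for character-free elements can be illustrated with a small, self-contained sketch. It does not use PXP's API: the `node` type, `is_xml_whitespace`, and `make_element` are hypothetical names introduced only for this illustration, and the flag `character_free` stands in for the information the parser takes from the element declaration.

```ocaml
(* Hypothetical miniature tree type; PXP's real nodes are objects. *)
type node =
  | Element of string * node list
  | Data of string

(* True if s consists only of XML whitespace (space, tab, CR, LF). *)
let is_xml_whitespace s =
  let ok = ref true in
  String.iter
    (fun c ->
      if not (c = ' ' || c = '\t' || c = '\r' || c = '\n') then ok := false)
    s;
  !ok

(* Build an element node. [character_free] says whether the declaration
   forbids character data; if so, whitespace-only data is discarded. *)
let make_element ~character_free name children =
  let keep = function
    | Data s when character_free && is_xml_whitespace s -> false
    | _ -> true
  in
  Element (name, List.filter keep children)
```

With `~character_free:true`, a whitespace-only data child between two subelements disappears; with `~character_free:false`, it is kept as a data node.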
111 >Consider the following element types:
114 CLASS="PROGRAMLISTING"
115 ><!ELEMENT x ( #PCDATA | z )* >
116 <!ELEMENT y ( z )* >
117 <!ELEMENT z EMPTY></PRE
123 > may contain character data, the keyword
127 > indicates this. The other types are character-free. </P
133 CLASS="PROGRAMLISTING"
134 ><x><z/> <z/></x></PRE
137 will be internally represented by an element node for <TT
141 with three subnodes: the first <TT
144 > element, a data node
145 containing the space character, and the second <TT
149 In contrast to this, the term
152 CLASS="PROGRAMLISTING"
153 ><y><z/> <z/></y></PRE
156 is represented by an element node for <TT
163 > subnodes, the two <TT
167 is no data node for the space character because spaces are ignored in the
168 character-free element <TT
179 >3.4.2. The representation of character data</A
182 >The XML specification allows all Unicode characters in XML
183 texts. This parser can be configured such that UTF-8 is used to represent the
184 characters internally; however, the default character encoding is
185 ISO-8859-1. (Currently, no other encodings are possible for the internal string
186 representation; the type <TT
188 >Pxp_types.rep_encoding</TT
190 the possible encodings. In principle, the parser could use any encoding that is
191 ASCII-compatible, but there are currently only lexical analyzers for UTF-8 and
192 ISO-8859-1. It is currently impossible to use UTF-16 or UCS-4 as internal
193 encodings (or other multibyte encodings which are not ASCII-compatible) unless
194 major parts of the parser are rewritten, which is unlikely.)</P
196 >The internal encoding may be different from the external encoding (specified
197 in the XML declaration <TT
199 ><?xml ... encoding="..."?></TT
201 this case the strings are automatically converted to the internal encoding.</P
203 >If the internal encoding is ISO-8859-1, it is possible that there are
204 characters that cannot be represented. In this case, the parser ignores such
205 characters and prints a warning (to the <TT
209 object that must be passed when the parser is called).</P
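The behaviour described above (unrepresentable characters are skipped and a warning is issued) might look roughly like the following sketch. This is not PXP's conversion code: `to_latin1` is a hypothetical helper, the input is assumed to be a list of Unicode code points, and the `warn` callback is only a stand-in for the warning object that is passed to the parser.

```ocaml
(* Convert a list of Unicode code points to an ISO-8859-1 string.
   Code points above 0xFF cannot be represented: they are skipped and
   reported through [warn], a stand-in for the parser's warning object. *)
let to_latin1 ~warn code_points =
  let buf = Buffer.create (List.length code_points) in
  List.iter
    (fun cp ->
      if cp >= 0 && cp <= 0xFF then Buffer.add_char buf (Char.chr cp)
      else warn (Printf.sprintf "Code point U+%04X cannot be represented" cp))
    code_points;
  Buffer.contents buf
```

For example, converting the code points of "A", the euro sign (U+20AC), and "é" keeps the first and the last character and produces one warning for the euro sign.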
211 >The XML specification allows lines to be separated by single LF
212 characters, by CR LF character sequences, or by single CR
213 characters. Internally, these separators are always converted to single LF
216 >The parser guarantees that there are never two adjacent data
217 nodes; if necessary, data material that would otherwise be represented by
218 several nodes is collapsed into one node. Note that you can still create node
219 trees with adjacent data nodes; however, the parser does not return such trees.</P
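If you build trees yourself and want the same guarantee as the parser gives, adjacent data nodes can be collapsed in a separate pass. The sketch below works on a hypothetical miniature `node` type rather than on PXP's node objects, and `merge_data` is a name chosen for this illustration only.

```ocaml
(* Hypothetical miniature tree type; PXP's real nodes are objects. *)
type node =
  | Element of string * node list
  | Data of string

(* Collapse every run of adjacent Data nodes into a single Data node,
   descending recursively into elements. *)
let rec merge_data nodes =
  match nodes with
  | Data a :: Data b :: rest -> merge_data (Data (a ^ b) :: rest)
  | Element (name, children) :: rest ->
      Element (name, merge_data children) :: merge_data rest
  | n :: rest -> n :: merge_data rest
  | [] -> []
```

Applied to the list `[Data "a"; Data "b"; Element ("z", [])]`, the function yields a single data node containing "ab" followed by the unchanged element.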
221 >Note that CDATA sections are not represented specially; such
222 sections are added to the current data material that is being collected for the
231 >3.4.3. The representation of entities within documents</A
236 >Entities are not represented within
238 > If the parser finds an entity reference in the document
239 content, the reference is immediately expanded, and the parser reads the
240 expansion text instead of the reference.</P
248 >3.4.4. The representation of attributes</A
252 values are composed of Unicode characters, too, the same problems with the
253 character encoding arise as for character material. Attribute values are
254 converted to the internal encoding, too; and if there are characters that
255 cannot be represented, these are dropped, and a warning is printed.</P
257 >Attribute values are normalized before they are returned by
261 >. First, any remaining entity
262 references are expanded; if necessary, expansion is performed recursively.
263 Second, newline characters (any of LF, CR LF, or CR characters) are converted
264 to single space characters. Note that the latter action in particular is
265 prescribed by the XML standard (but <TT
269 so that it is still possible to include line feeds in attributes).</P
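The second normalization step, turning line breaks into spaces, can be sketched as follows. This is an illustration of the rule, not the parser's code; `normalize_newlines` is a hypothetical name, and the sketch assumes that entity and character references have already been handled, so it only sees literal line breaks.

```ocaml
(* Replace each line break (CR LF, single CR, or single LF) by one space. *)
let normalize_newlines s =
  let n = String.length s in
  let buf = Buffer.create n in
  let i = ref 0 in
  while !i < n do
    (match s.[!i] with
     | '\r' ->
         Buffer.add_char buf ' ';
         (* Treat CR LF as a single line break. *)
         if !i + 1 < n && s.[!i + 1] = '\n' then incr i
     | '\n' -> Buffer.add_char buf ' '
     | c -> Buffer.add_char buf c);
    incr i
  done;
  Buffer.contents buf
```

Note that a CR LF pair produces one space, while two consecutive LF characters produce two.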
277 >3.4.5. The representation of processing instructions</A
280 >Processing instructions are parsed to some extent: the first word of the
281 PI is called the target, and it is stored separately from the rest of the PI:
284 CLASS="PROGRAMLISTING"
285 ><?target rest?></PRE
288 The exact location where a PI occurs is not represented (by default). The
289 parser puts the PI into the object that represents the enclosing construct (an
290 element, a DTD, or the whole document); that means you can find out which PIs
291 occur in a certain element, in the DTD, or in the whole document, but you
292 cannot look up the exact position within the construct.</P
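The split of a processing instruction into target and rest can be sketched as below. The function `split_pi` is hypothetical and not part of PXP; it assumes that its argument is the text between the opening and closing PI delimiters and that the target is separated from the rest by whitespace.

```ocaml
(* Split the text of a processing instruction into (target, rest). *)
let split_pi text =
  let is_space c = c = ' ' || c = '\t' || c = '\r' || c = '\n' in
  let n = String.length text in
  (* The target ends at the first whitespace character. *)
  let rec target_end i =
    if i < n && not (is_space text.[i]) then target_end (i + 1) else i
  in
  (* Skip the whitespace that separates the target from the rest. *)
  let rec skip_space i =
    if i < n && is_space text.[i] then skip_space (i + 1) else i
  in
  let e = target_end 0 in
  let r = skip_space e in
  (String.sub text 0 e, String.sub text r (n - r))
```

A PI that consists of a target only yields the empty string as its rest.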
294 >If you require the exact location of PIs, it is possible to
295 create extra nodes for them. This mode is controlled by the option
298 >enable_pinstr_nodes</TT
299 >. The additional nodes have the node type
309 from special exemplars contained in the <TT
313 pxp_document.mli).</P
321 >3.4.6. The representation of comments</A
324 >Normally, comments are not represented; they are dropped by
325 default. However, if you require them, it is possible to create
329 > nodes for them. This mode can be specified by the
332 >enable_comment_nodes</TT
333 >. Comment nodes are created from
334 special exemplars contained in the <TT
338 pxp_document.mli). You can access the contents of comments through the
350 >3.4.7. The attributes <TT
360 >These attributes are not supported specially; they are handled
361 like any other attribute.</P
369 >3.4.8. And what about namespaces?</A
372 >Currently, there is no special support for namespaces.
373 However, the parser allows colons in names, so it is
374 possible to implement namespaces on top of the current API.</P
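As a starting point for such a layer, a name containing a colon can be split into a prefix and a local part. The helper below is a hypothetical sketch, not part of PXP; resolving the prefix to a namespace URI (for example by inspecting the relevant attributes of the enclosing elements) is left out.

```ocaml
(* Split a name at its first colon into (Some prefix, local_name);
   names without a colon have no prefix. *)
let split_qname name =
  match String.index_opt name ':' with
  | None -> (None, name)
  | Some i ->
      (Some (String.sub name 0 i),
       String.sub name (i + 1) (String.length name - i - 1))
```

The function could be applied to the element and attribute names of a parsed tree.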
376 >Some future release of PXP will support namespaces as built-in
436 >Configuring and calling the parser</TD