helm/DEVEL/pxp/pxp/doc/manual/html/x1496.html

   1 <HTML
   2 ><HEAD
   3 ><TITLE
   4 >Details of the mapping from XML text to the tree representation</TITLE
   5 ><META
   6 NAME="GENERATOR"
   7 CONTENT="Modular DocBook HTML Stylesheet Version 1.46"><LINK
   8 REL="HOME"
   9 TITLE="The PXP user's guide"
  10 HREF="index.html"><LINK
  11 REL="UP"
  12 TITLE="The objects representing the document"
  13 HREF="c893.html"><LINK
  14 REL="PREVIOUS"
  15 TITLE="The class type extension"
  16 HREF="x1439.html"><LINK
  17 REL="NEXT"
  18 TITLE="Configuring and calling the parser"
  19 HREF="c1567.html"><LINK
  20 REL="STYLESHEET"
  21 TYPE="text/css"
  22 HREF="markup.css"></HEAD
  23 ><BODY
  24 CLASS="SECT1"
  25 BGCOLOR="#FFFFFF"
  26 TEXT="#000000"
  27 LINK="#0000FF"
  28 VLINK="#840084"
  29 ALINK="#0000FF"
  30 ><DIV
  31 CLASS="NAVHEADER"
  32 ><TABLE
  33 WIDTH="100%"
  34 BORDER="0"
  35 CELLPADDING="0"
  36 CELLSPACING="0"
  37 ><TR
  38 ><TH
  39 COLSPAN="3"
  40 ALIGN="center"
  41 >The PXP user's guide</TH
  42 ></TR
  43 ><TR
  44 ><TD
  45 WIDTH="10%"
  46 ALIGN="left"
  47 VALIGN="bottom"
  48 ><A
  49 HREF="x1439.html"
  50 >Prev</A
  51 ></TD
  52 ><TD
  53 WIDTH="80%"
  54 ALIGN="center"
  55 VALIGN="bottom"
  56 >Chapter 3. The objects representing the document</TD
  57 ><TD
  58 WIDTH="10%"
  59 ALIGN="right"
  60 VALIGN="bottom"
  61 ><A
  62 HREF="c1567.html"
  63 >Next</A
  64 ></TD
  65 ></TR
  66 ></TABLE
  67 ><HR
  68 ALIGN="LEFT"
  69 WIDTH="100%"></DIV
  70 ><DIV
  71 CLASS="SECT1"
  72 ><H1
  73 CLASS="SECT1"
  74 ><A
  75 NAME="AEN1496"
  76 >3.4. Details of the mapping from XML text to the tree representation</A
  77 ></H1
  78 ><DIV
  79 CLASS="SECT2"
  80 ><H2
  81 CLASS="SECT2"
  82 ><A
  83 NAME="AEN1498"
  84 >3.4.1. The representation of character-free elements</A
  85 ></H2
  86 ><P
  87 >If an element declaration does not allow the element to
  88 contain character data, the following rules apply.</P
  89 ><P
  90 >If the element must be empty, i.e. it is declared with the
  91 keyword <TT
  92 CLASS="LITERAL"
  93 >EMPTY</TT
  94 >, the element instance must be effectively
  95 empty (it must not even contain whitespace characters). The parser guarantees
  96 that a declared <TT
  97 CLASS="LITERAL"
  98 >EMPTY</TT
  99 > element never contains a data
 100 node, not even one that represents the empty string.</P
 101 ><P
 102 >If the element declaration only permits other elements to occur
 103 within that element but not character data, it is still possible to insert
 104 whitespace characters between the subelements. The parser ignores these
 105 characters and does not create data nodes for them.</P
 106 ><DIV
 107 CLASS="FORMALPARA"
 108 ><P
 109 ><B
 110 >Example. </B
 111 >Consider the following element types:
 112
 113 <PRE
 114 CLASS="PROGRAMLISTING"
 115 >&#60;!ELEMENT x ( #PCDATA | z )* &#62;
 116 &#60;!ELEMENT y ( z )* &#62;
 117 &#60;!ELEMENT z EMPTY&#62;</PRE
 118 >
 119
 120 Only <TT
 121 CLASS="LITERAL"
 122 >x</TT
 123 > may contain character data; the keyword
 124 <TT
 125 CLASS="LITERAL"
 126 >#PCDATA</TT
 127 > indicates this. The other types are character-free. </P
 128 ></DIV
 129 ><P
 130 >The XML term
 131
 132 <PRE
 133 CLASS="PROGRAMLISTING"
 134 >&#60;x&#62;&#60;z/&#62; &#60;z/&#62;&#60;/x&#62;</PRE
 135 >
 136
 137 will be internally represented by an element node for <TT
 138 CLASS="LITERAL"
 139 >x</TT
 140 >
 141 with three subnodes: the first <TT
 142 CLASS="LITERAL"
 143 >z</TT
 144 > element, a data node
 145 containing the space character, and the second <TT
 146 CLASS="LITERAL"
 147 >z</TT
 148 > element.
 149 In contrast to this, the term
 150
 151 <PRE
 152 CLASS="PROGRAMLISTING"
 153 >&#60;y&#62;&#60;z/&#62; &#60;z/&#62;&#60;/y&#62;</PRE
 154 >
 155
 156 is represented by an element node for <TT
 157 CLASS="LITERAL"
 158 >y</TT
 159 > with only
 160 <I
 161 CLASS="EMPHASIS"
 162 >two</I
 163 > subnodes, the two <TT
 164 CLASS="LITERAL"
 165 >z</TT
 166 > elements. There
 167 is no data node for the space character because spaces are ignored in the
 168 character-free element <TT
 169 CLASS="LITERAL"
 170 >y</TT
 171 >.</P
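><P
>The difference between the two cases can be modelled with a small OCaml
sketch. The node type and the function names below are illustrative
assumptions, not part of the PXP API; the sketch only mimics which subnodes
remain after parsing. A validating parser would additionally reject
non-whitespace text inside a character-free element, which is not checked
here.</P
><PRE
CLASS="PROGRAMLISTING"
>(* Simplified stand-in for a node tree; these are not PXP's own types. *)
type node =
  | Element of string * node list
  | Data of string

(* True if s consists only of XML whitespace (space, tab, CR, LF). *)
let is_whitespace s =
  let ok = ref true in
  String.iter
    (fun c -&#62;
       if not (c = ' ' || c = '\t' || c = '\r' || c = '\n') then ok := false)
    s;
  !ok

(* Whitespace-only data is dropped for character-free elements and kept
   for elements that may contain character data. *)
let children ~character_free raw =
  if character_free then
    List.filter
      (function Data s -&#62; not (is_whitespace s) | Element _ -&#62; true)
      raw
  else raw

(* The same raw content, once inside x and once inside y. *)
let raw = [ Element ("z", []); Data " "; Element ("z", []) ]
let x = Element ("x", children ~character_free:false raw)  (* three subnodes *)
let y = Element ("y", children ~character_free:true raw)   (* two subnodes *)</PRE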
 172 ></DIV
 173 ><DIV
 174 CLASS="SECT2"
 175 ><H2
 176 CLASS="SECT2"
 177 ><A
 178 NAME="AEN1521"
 179 >3.4.2. The representation of character data</A
 180 ></H2
 181 ><P
 182 >The XML specification allows all Unicode characters in XML
 183 texts. This parser can be configured such that UTF-8 is used to represent the
 184 characters internally; however, the default character encoding is
 185 ISO-8859-1. (Currently, no other encodings are possible for the internal string
 186 representation; the type <TT
 187 CLASS="LITERAL"
 188 >Pxp_types.rep_encoding</TT
 189 > enumerates
 190 the possible encodings. In principle, the parser could use any encoding that is
 191 ASCII-compatible, but there are currently only lexical analyzers for UTF-8 and
 192 ISO-8859-1. It is currently impossible to use UTF-16 or UCS-4 as internal
 193 encodings (or other multibyte encodings which are not ASCII-compatible) unless
 194 major parts of the parser are rewritten, which is unlikely.)</P
 195 ><P
 196 >The internal encoding may be different from the external encoding (specified
 197 in the XML declaration <TT
 198 CLASS="LITERAL"
 199 >&lt;?xml ... encoding="..."?&gt;</TT
 200 >); in
 201 this case the strings are automatically converted to the internal encoding.</P
 202 ><P
 203 >If the internal encoding is ISO-8859-1, it is possible that there are
 204 characters that cannot be represented. In this case, the parser ignores such
 205 characters and prints a warning (to the <TT
 206 CLASS="LITERAL"
 207 >collect_warning</TT
 208 >
 209 object that must be passed when the parser is called).</P
 210 ><P
 211 >The XML specification allows lines to be separated by single LF
 212 characters, by CR LF character sequences, or by single CR
 213 characters. Internally, these separators are always converted to single LF
 214 characters.</P
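><P
>As an illustration of this normalization, the following OCaml function
converts all three separator forms to a single LF. It is a standalone sketch
under the assumption that the text is already held in an ASCII-compatible
string; the name normalize_line_ends is hypothetical and this is not the code
PXP itself uses.</P
><PRE
CLASS="PROGRAMLISTING"
>(* Convert CR LF sequences and lone CR characters to single LF characters. *)
let normalize_line_ends s =
  let n = String.length s in
  let b = Buffer.create n in
  let i = ref 0 in
  while !i &#60; n do
    (match s.[!i] with
     | '\r' -&#62;
         Buffer.add_char b '\n';
         (* Skip the LF of a CR LF pair so the pair yields one LF only. *)
         if !i + 1 &#60; n &amp;&amp; s.[!i + 1] = '\n' then incr i
     | c -&#62; Buffer.add_char b c);
    incr i
  done;
  Buffer.contents b</PRE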
 215 ><P
 216 >The parser guarantees that there are never two adjacent data
 217 nodes; if necessary, data material that would otherwise be represented by
 218 several nodes is collapsed into one node. Note that you can still create node
 219 trees with adjacent data nodes; however, the parser does not return such trees.</P
 220 ><P
 221 >Note that CDATA sections are not represented specially; such
 222 sections are added to the current data material that is being collected for the
 223 next data node.</P
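><P
>The guarantee about adjacent data nodes can be pictured with the following
OCaml sketch. The node type and the function merge_data are assumptions made
for this illustration and do not belong to the PXP API; the function shows how
a hand-built tree with adjacent data nodes could be brought into the form the
parser returns.</P
><PRE
CLASS="PROGRAMLISTING"
>(* Simplified stand-in for a node tree; these are not PXP's own types. *)
type node =
  | Element of string * node list
  | Data of string

(* Collapse runs of adjacent data nodes into one node, at every level. *)
let rec merge_data = function
  | Data a :: Data b :: rest -&#62; merge_data (Data (a ^ b) :: rest)
  | Element (name, sub) :: rest -&#62;
      Element (name, merge_data sub) :: merge_data rest
  | n :: rest -&#62; n :: merge_data rest
  | [] -&#62; []

(* Text, the content of a CDATA section, and more text become one node. *)
let merged = merge_data [ Data "a"; Data "b"; Data "c"; Element ("z", []) ]</PRE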
 224 ></DIV
 225 ><DIV
 226 CLASS="SECT2"
 227 ><H2
 228 CLASS="SECT2"
 229 ><A
 230 NAME="AEN1532"
 231 >3.4.3. The representation of entities within documents</A
 232 ></H2
 233 ><P
 234 ><I
 235 CLASS="EMPHASIS"
 236 >Entities are not represented within
 237 documents!</I
 238 > If the parser finds an entity reference in the document
 239 content, the reference is immediately expanded, and the parser reads the
 240 expansion text instead of the reference.</P
 241 ></DIV
 242 ><DIV
 243 CLASS="SECT2"
 244 ><H2
 245 CLASS="SECT2"
 246 ><A
 247 NAME="AEN1536"
 248 >3.4.4. The representation of attributes</A
 249 ></H2
 250 ><P
 251 >Because attribute
 252 values are composed of Unicode characters as well, the same character-encoding
 253 problems arise as for character data. Attribute values are also
 254 converted to the internal encoding; characters that
 255 cannot be represented are dropped, and a warning is issued.</P
 256 ><P
 257 >Attribute values are normalized before they are returned by
 258 methods like <TT
 259 CLASS="LITERAL"
 260 >attribute</TT
 261 >. First, any remaining entity
 262 references are expanded; if necessary, expansion is performed recursively.
 263 Second, newline characters (any of LF, CR LF, or CR characters) are converted
 264 to single space characters. Note that the latter action in particular is
 265 prescribed by the XML standard (but the character reference <TT
 266 CLASS="LITERAL"
 267 >&amp;#10;</TT
 268 > is not converted,
 269 so it is still possible to include line feeds in attributes).</P
 270 ></DIV
 271 ><DIV
 272 CLASS="SECT2"
 273 ><H2
 274 CLASS="SECT2"
 275 ><A
 276 NAME="AEN1542"
 277 >3.4.5. The representation of processing instructions</A
 278 ></H2
 279 ><P
 280 >Processing instructions are parsed to some extent: the first word of the
 281 PI is called the target, and it is stored separately from the rest of the PI:
 282
 283 <PRE
 284 CLASS="PROGRAMLISTING"
 285 >&#60;?target rest?&#62;</PRE
 286 >
 287
 288 The exact location where a PI occurs is not represented (by default). The
 289 parser puts the PI into the object that represents the enclosing construct (an
 290 element, a DTD, or the whole document); that means you can find out which PIs
 291 occur in a certain element, in the DTD, or in the whole document, but you
 292 cannot look up the exact position within the construct.</P
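><P
>The split into target and rest can be sketched as follows. The function
split_pi is a hypothetical helper written for this illustration, not a PXP
function; it assumes that its argument is the text between the PI delimiters
and that the target ends at the first whitespace character.</P
><PRE
CLASS="PROGRAMLISTING"
>(* Split the text of a PI into (target, rest) at the first whitespace. *)
let split_pi s =
  let n = String.length s in
  let is_ws c = c = ' ' || c = '\t' || c = '\r' || c = '\n' in
  let i = ref 0 in
  (* The target is everything up to the first whitespace character. *)
  while !i &#60; n &amp;&amp; not (is_ws s.[!i]) do incr i done;
  let target = String.sub s 0 !i in
  (* Skip the whitespace separating the target from the rest. *)
  while !i &#60; n &amp;&amp; is_ws s.[!i] do incr i done;
  (target, String.sub s !i (n - !i))

let (target, rest) = split_pi "target rest"</PRE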
 293 ><P
 294 >If you require the exact location of PIs, it is possible to
 295 create extra nodes for them. This mode is controlled by the option
 296 <TT
 297 CLASS="LITERAL"
 298 >enable_pinstr_nodes</TT
 299 >. The additional nodes have the node type
 300 <TT
 301 CLASS="LITERAL"
 302 >T_pinstr <TT
 303 CLASS="REPLACEABLE"
 304 ><I
 305 >target</I
 306 ></TT
 307 ></TT
 308 >, and are created
 309 from special exemplars contained in the <TT
 310 CLASS="LITERAL"
 311 >spec</TT
 312 > (see
 313 pxp_document.mli).</P
 314 ></DIV
 315 ><DIV
 316 CLASS="SECT2"
 317 ><H2
 318 CLASS="SECT2"
 319 ><A
 320 NAME="AEN1551"
 321 >3.4.6. The representation of comments</A
 322 ></H2
 323 ><P
 324 >Normally, comments are not represented; they are dropped by
 325 default. However, if you require them, it is possible to create
 326 <TT
 327 CLASS="LITERAL"
 328 >T_comment</TT
 329 > nodes for them. This mode can be specified by the
 330 option <TT
 331 CLASS="LITERAL"
 332 >enable_comment_nodes</TT
 333 >. Comment nodes are created from
 334 special exemplars contained in the <TT
 335 CLASS="LITERAL"
 336 >spec</TT
 337 > (see
 338 pxp_document.mli). You can access the contents of comments through the
 339 method <TT
 340 CLASS="LITERAL"
 341 >comment</TT
 342 >.</P
 343 ></DIV
 344 ><DIV
 345 CLASS="SECT2"
 346 ><H2
 347 CLASS="SECT2"
 348 ><A
 349 NAME="AEN1558"
 350 >3.4.7. The attributes <TT
 351 CLASS="LITERAL"
 352 >xml:lang</TT
 353 > and
 354 <TT
 355 CLASS="LITERAL"
 356 >xml:space</TT
 357 ></A
 358 ></H2
 359 ><P
 360 >These attributes are not supported specially; they are handled
 361 like any other attribute.</P
 362 ></DIV
 363 ><DIV
 364 CLASS="SECT2"
 365 ><H2
 366 CLASS="SECT2"
 367 ><A
 368 NAME="AEN1563"
 369 >3.4.8. And what about namespaces?</A
 370 ></H2
 371 ><P
 372 >Currently, there is no special support for namespaces.
 373 However, the parser allows the colon to occur in names, so it is
 374 possible to implement namespaces on top of the current API.</P
 375 ><P
 376 >Some future release of PXP will support namespaces as a built-in
 377 feature.</P
 378 ></DIV
 379 ></DIV
 380 ><DIV
 381 CLASS="NAVFOOTER"
 382 ><HR
 383 ALIGN="LEFT"
 384 WIDTH="100%"><TABLE
 385 WIDTH="100%"
 386 BORDER="0"
 387 CELLPADDING="0"
 388 CELLSPACING="0"
 389 ><TR
 390 ><TD
 391 WIDTH="33%"
 392 ALIGN="left"
 393 VALIGN="top"
 394 ><A
 395 HREF="x1439.html"
 396 >Prev</A
 397 ></TD
 398 ><TD
 399 WIDTH="34%"
 400 ALIGN="center"
 401 VALIGN="top"
 402 ><A
 403 HREF="index.html"
 404 >Home</A
 405 ></TD
 406 ><TD
 407 WIDTH="33%"
 408 ALIGN="right"
 409 VALIGN="top"
 410 ><A
 411 HREF="c1567.html"
 412 >Next</A
 413 ></TD
 414 ></TR
 415 ><TR
 416 ><TD
 417 WIDTH="33%"
 418 ALIGN="left"
 419 VALIGN="top"
 420 >The class type <TT
 421 CLASS="LITERAL"
 422 >extension</TT
 423 ></TD
 424 ><TD
 425 WIDTH="34%"
 426 ALIGN="center"
 427 VALIGN="top"
 428 ><A
 429 HREF="c893.html"
 430 >Up</A
 431 ></TD
 432 ><TD
 433 WIDTH="33%"
 434 ALIGN="right"
 435 VALIGN="top"
 436 >Configuring and calling the parser</TD
 437 ></TR
 438 ></TABLE
 439 ></DIV
 440 ></BODY
 441 ></HTML
 442 >