helm/DEVEL/pxp/pxp/doc/manual/src/markup.sgml

   1 <!DOCTYPE book PUBLIC "-//Davenport//DTD DocBook V3.0//EN" [
   2 <!ENTITY markup "<acronym>PXP</acronym>">
   3 <!ENTITY pxp "<acronym>PXP</acronym>">
   4 <!ENTITY % readme.code.to-html SYSTEM "readme.ent">
   5 <!ENTITY apos "&#39;">
   6 <!ENTITY percent "&#37;">
   7 <!ENTITY % get.markup-yacc.mli SYSTEM "yacc.mli.ent">
   8 <!ENTITY % get.markup-dtd.mli SYSTEM "dtd.mli.ent">
   9 %readme.code.to-html;
  10 %get.markup-yacc.mli;
  11 %get.markup-dtd.mli;
  12
  13 <!ENTITY fun "-&gt;">                       <!-- function type operator -->
  14
  15 ]>
  16
  17
  18 <book>
  19
  20   <title>The PXP user's guide</title>
  21   <bookinfo>
  22     <!-- <bookbiblio> -->
  23     <authorgroup>
  24       <author>
  25         <firstname>Gerd</firstname>
  26         <surname>Stolpmann</surname>
  27         <authorblurb>
  28           <para>
  29         <address>
  30           <email>gerd@gerd-stolpmann.de</email>
  31         </address>
  32       </para>
  33         </authorblurb>
  34       </author>
  35     </authorgroup>
  36
  37     <copyright>
  38       <year>1999, 2000</year><holder>Gerd Stolpmann</holder>
  39     </copyright>
  40     <!-- </bookbiblio> -->
  41
  42     <abstract>
  43       <para>
  44 &markup; is a validating parser for XML-1.0 which has been
  45 written entirely in Objective Caml.
  46 </para>
  47       <formalpara>
  48         <title>Download &markup;: </title>
  49         <para>
  50 The free &markup; library can be downloaded at
  51 <ulink URL="http://www.ocaml-programming.de/packages/">
  52 http://www.ocaml-programming.de/packages/
  53 </ulink>. This user's guide is included.
  54 New releases of &markup; will be announced in
  55 <ulink URL="http://www.npc.de/ocaml/linkdb/">The OCaml Link
  56 Database</ulink>.
  57 </para>
  58       </formalpara>
  59     </abstract>
  60
  61     <legalnotice>
  62       <title>License</title>
  63       <para>
  64 This document, and the described software, "&markup;", are copyright by
  65 Gerd Stolpmann.
  66 </para>
  67
  68 <para>
  69 Permission is hereby granted, free of charge, to any person obtaining
  70 a copy of this document and the "&markup;" software (the
  71 "Software"), to deal in the Software without restriction, including
  72 without limitation the rights to use, copy, modify, merge, publish,
  73 distribute, sublicense, and/or sell copies of the Software, and to
  74 permit persons to whom the Software is furnished to do so, subject to
  75 the following conditions:
  76 </para>
  77       <para>
  78 The above copyright notice and this permission notice shall be included
  79 in all copies or substantial portions of the Software.
  80 </para>
  81       <para>
  82 The Software is provided ``as is'', without warranty of any kind, express
  83 or implied, including but not limited to the warranties of
  84 merchantability, fitness for a particular purpose and noninfringement.
  85 In no event shall Gerd Stolpmann be liable for any claim, damages or
  86 other liability, whether in an action of contract, tort or otherwise,
  87 arising from, out of or in connection with the Software or the use or
  88 other dealings in the software.
  89 </para>
  90     </legalnotice>
  91
  92   </bookinfo>
  93
  94
  95 <!-- ********************************************************************** -->
  96
  97   <part>
  98     <title>User's guide</title>
  99
 100     <chapter>
 101       <title>What is XML?</title>
 102
 103       <sect1>
 104         <title>Introduction</title>
 105
 106         <para>XML (short for <emphasis>Extensible Markup Language</emphasis>)
 107 generalizes the idea that text documents are typically structured in sections,
 108 sub-sections, paragraphs, and so on. The format of the document is not fixed
 109 (as, for example, in HTML), but can be declared by a so-called DTD (document
 110 type definition). The DTD describes only the rules for how the document may be
 111 structured, but not how the document is to be processed. For example, if you want
 112 to publish a book that uses XML markup, you will need a processor that converts
 113 the XML file into a printable format such as PostScript. On the one hand, the
 114 structure of XML documents is configurable; on the other hand, there is no
 115 longer a canonical interpretation of the elements of the document; for example,
 116 one XML DTD might require that paragraphs be delimited by
 117 <literal>para</literal> tags, while another DTD expects <literal>p</literal> tags
 118 for the same purpose. As a result, every DTD requires its own processor.
 119 </para>
 120
 121         <para>
 122 Although XML can be used to express structured text documents it is not limited
 123 to this kind of application. For example, XML can also be used to exchange
 124 structured data over a network, or to simply store structured data in
 125 files. Note that XML documents cannot contain arbitrary binary data because
 126 some characters are forbidden; for some applications you need to encode binary
 127 data as text (e.g. using the Base64 encoding).
 128 </para>
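        <para>
As a minimal sketch of such a data-oriented use, the following instance stores
a record whose binary payload is carried as Base64 text. The element names
(<literal>record</literal>, <literal>name</literal>, <literal>payload</literal>)
and the file name are assumptions made up for this illustration; the payload is
the Base64 form of the five bytes of the ASCII string "hello":

<programlisting>
<![CDATA[<?xml version="1.0" encoding="ISO-8859-1"?>
<record>
  <name>greeting.bin</name>
  <payload>aGVsbG8=</payload>
</record>
]]>
</programlisting>

The application, not the XML parser, is responsible for decoding the payload.
</para>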
 129
 130
 131         <sect2>
 132           <title>The "hello world" example</title>
 133         <para>
 134 The following example shows a very simple DTD, and a corresponding document
 135 instance. The document is structured such that it consists of sections, and
 136 that sections consist of paragraphs, and that paragraphs contain plain text:
 137 </para>
 138
 139         <programlisting>
 140 <![CDATA[<!ELEMENT document (section)+>
 141 <!ELEMENT section (paragraph)+>
 142 <!ELEMENT paragraph (#PCDATA)>
 143 ]]>
 144 </programlisting>
 145
 146         <para>The following document is an instance of this DTD:</para>
 147
 148         <programlisting>
 149 <![CDATA[<?xml version="1.0" encoding="ISO-8859-1"?>
 150 <!DOCTYPE document SYSTEM "simple.dtd">
 151 <document>
 152   <section>
 153     <paragraph>This is a paragraph of the first section.</paragraph>
 154     <paragraph>This is another paragraph of the first section.</paragraph>
 155   </section>
 156   <section>
 157     <paragraph>This is the only paragraph of the second section.</paragraph>
 158   </section>
 159 </document>
 160 ]]>
 161 </programlisting>
 162
 163         <para>As in HTML (and, of course, in its ancestor SGML), the "pieces" of
 164 the document are delimited by tags, i.e. such a piece begins with
 165 <literal>&lt;name-of-the-type-of-the-piece&gt;</literal> and ends with
 166 <literal>&lt;/name-of-the-type-of-the-piece&gt;</literal>, and the pieces are
 167 called <emphasis>elements</emphasis>. Unlike in HTML and SGML, start tags and
 168 end tags (i.e. the delimiters written in angle brackets) can never be left
 169 out. For example, HTML simply calls paragraphs <literal>p</literal>, and
 170 because paragraphs never contain paragraphs, a sequence of several paragraphs
 171 can be written as:
 172
 173 <programlisting><![CDATA[<p>First paragraph
 174 <p>Second paragraph]]></programlisting>
 175
 176 This is not possible in XML; continuing our example above we must always write
 177
 178 <programlisting><![CDATA[<paragraph>First paragraph</paragraph>
 179 <paragraph>Second paragraph</paragraph>]]></programlisting>
 180
 181 The rationale is to (1) simplify the development of XML parsers
 182 (you need not convert the DTD into a deterministic finite automaton, which is
 183 required to detect omitted tags), and to (2) make it possible to parse the
 184 document independently of whether the DTD is known.
 185 </para>
 186
 187 <para>
 188 The first line of our sample document,
 189
 190 <programlisting>
 191 <![CDATA[<?xml version="1.0" encoding="ISO-8859-1"?>]]>
 192 </programlisting>
 193
 194 is the so-called <emphasis>XML declaration</emphasis>. It expresses that the
 195 document follows the conventions of XML version 1.0, and that the document is
 196 encoded using characters from the ISO-8859-1 character set (often known as
 197 "Latin 1", mostly used in Western Europe). Although the XML declaration is not
 198 mandatory, it is good style to include it; readers then see at first glance
 199 that the document uses XML markup and not the similar-looking HTML or SGML
 200 markup. If you omit the XML declaration, the parser will assume
 201 that the document is encoded as UTF-8 or UTF-16 (there is a rule that makes
 202 it possible to distinguish between UTF-8 and UTF-16 automatically); these
 203 are encodings of Unicode's universal character set. (Note that &pxp;, unlike its
 204 predecessor "Markup", fully supports Unicode.)
 205 </para>
 206
 207 <para>
 208 The second line,
 209
 210 <programlisting>
 211 <![CDATA[<!DOCTYPE document SYSTEM "simple.dtd">]]>
 212 </programlisting>
 213
 214 names the DTD that is going to be used for the rest of the document. In
 215 general, it is possible that the DTD consists of two parts, the so-called
 216 external and the internal subset. "External" means that the DTD exists as a
 217 second file; "internal" means that the DTD is included in the same file. In
 218 this example, there is only an external subset, and the system identifier
 219 "simple.dtd" specifies where the DTD file can be found. System identifiers are
 220 interpreted as URLs; for instance this would be legal:
 221
 222 <programlisting>
 223 <![CDATA[<!DOCTYPE document SYSTEM "http://host/location/simple.dtd">]]>
 224 </programlisting>
 225
 226 Please note that &pxp; cannot interpret HTTP identifiers by default, but it is
 227 possible to change the interpretation of system identifiers.
 228 </para>
 229
 230         <para>
 231 The word immediately following <literal>DOCTYPE</literal> determines which of
 232 the declared element types (here "document", "section", and "paragraph") is
 233 used for the outermost element, the <emphasis>root element</emphasis>. In this
 234 example it is <literal>document</literal> because the outermost element is
 235 delimited by <literal>&lt;document&gt;</literal> and
 236 <literal>&lt;/document&gt;</literal>.
 237 </para>
 238
 239         <para>
 240 The DTD consists of three declarations for element types:
 241 <literal>document</literal>, <literal>section</literal>, and
 242 <literal>paragraph</literal>. Such a declaration has two parts:
 243
 244 <programlisting>
 245 &lt;!ELEMENT <replaceable>name</replaceable> <replaceable>content-model</replaceable>&gt;
 246 </programlisting>
 247
 248 The content model is a regular expression which describes the possible inner
 249 structure of the element. Here, <literal>document</literal> contains one or
 250 more sections, and a <literal>section</literal> contains one or more
 251 paragraphs. Note that these two element types are not allowed to contain
 252 arbitrary text. Only the <literal>paragraph</literal> element type is declared
 253 such that parsed character data (indicated by the symbol
 254 <literal>#PCDATA</literal>) is permitted.
 255 </para>
 256
 257         <para>
 258 See below for a detailed discussion of content models.
 259 </para>
 260         </sect2>
 261
 262         <sect2>
 263           <title>XML parsers and processors</title>
 264           <para>
 265 XML documents are human-readable, but this is not the main purpose of this
 266 language. XML has been designed such that documents can be read by a program
 267 called an <emphasis>XML parser</emphasis>. The parser checks that the document
 268 is correctly formatted, and it represents the document as objects of the programming
 269 language. Checking the document has two aspects: First, the document
 270 must follow some basic syntactic rules, such as that tags are written in angle
 271 brackets, that for every start tag there must be a corresponding end tag, and so
 272 on. A document respecting these rules is
 273 <emphasis>well-formed</emphasis>. Second, the document must match the DTD, in
 274 which case the document is <emphasis>valid</emphasis>. Many parsers check only
 275 well-formedness and ignore the DTD; &pxp; is designed such that it can
 276 also validate the document.
 277 </para>
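          <para>
The difference can be illustrated with the DTD of the "hello world" example
above. The following fragments are only a sketch; how a particular parser
reports the errors is not shown. This instance is not well-formed, because the
<literal>paragraph</literal> start tag has no corresponding end tag:

<programlisting>
<![CDATA[<document>
  <section>
    <paragraph>This paragraph is never closed.
  </section>
</document>
]]>
</programlisting>

The next instance is well-formed but not valid with respect to that DTD,
because a <literal>document</literal> may contain only
<literal>section</literal> elements, not paragraphs directly:

<programlisting>
<![CDATA[<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE document SYSTEM "simple.dtd">
<document>
  <paragraph>This paragraph is not inside a section.</paragraph>
</document>
]]>
</programlisting>
</para>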
 278
 279           <para>
 280 A parser alone does not make a useful application; it only reads XML
 281 documents. The whole application working with XML-formatted data is called an
 282 <emphasis>XML processor</emphasis>. Often XML processors convert documents into
 283 another format, such as HTML or PostScript. Sometimes processors extract data
 284 from the documents and output the processed data in XML format again. The parser
 285 can help the application process the document; for example, it can provide
 286 means to access the document in a specific manner. In particular, &pxp; provides an
 287 object-oriented access layer.
 288 </para>
 289         </sect2>
 290
 291         <sect2>
 292           <title>Discussion</title>
 293           <para>
 294 As we have seen, there are two levels of description: On the one hand, XML can
 295 define rules about the format of a document (the DTD), on the other hand, XML
 296 expresses structured documents. There are a number of possible applications:
 297 </para>
 298
 299           <itemizedlist mark="bullet" spacing="compact">
 300             <listitem>
 301               <para>
 302 XML can be used to express structured texts. Unlike HTML, there is no canonical
 303 interpretation; one would have to write a backend for the DTD that translates
 304 the structured texts into a format that existing browsers, printers
 305 etc. understand. The advantage of a self-defined document format is that it is
 306 possible to design the format in a more problem-oriented way. For example, if
 307 the task is to extract reports from a database, one can use a DTD that reflects
 308 the structure of the report or the database. A possible approach would be to
 309 have an element type for every database table and for every column. Once the
 310 DTD has been designed, the report procedure can be split into a part that
 311 selects the database rows and outputs them as an XML document according to the
 312 DTD, and a part that translates the document into other formats. Of course,
 313 the latter part can be solved in a generic way, e.g. there may be configurable
 314 backends for all DTDs that follow the approach and have element types for
 315 tables and columns.
 316 </para>
 317
 318               <para>
 319 XML plays the role of a configurable intermediate format. The database
 320 extraction function can be written without having to know the details of
 321 typesetting; the backends can be written without having to know the details of
 322 the database.
 323 </para>
 324
 325               <para>
 326 Of course, there are traditional solutions. One can define an ad hoc
 327 intermediate text file format. The disadvantage is that there are no names for
 328 the pieces of the format, and that such formats usually lack documentation
 329 because of this. Another solution would be to have a binary representation,
 330 either as a language-dependent or a language-independent structure (examples of the
 331 latter can be found in RPC implementations). The disadvantage is that it is
 332 harder to view such representations; one has to write pretty printers for this
 333 purpose. It is also more difficult to enter test data; XML is plain text that
 334 can be written using an arbitrary editor (Emacs even has a good XML mode,
 335 PSGML). All these alternatives suffer from a missing structure checker,
 336 i.e. the programs processing these formats usually do not check the input file
 337 or input object in detail; XML parsers check the syntax of the input (the
 338 so-called well-formedness check), and advanced parsers like &markup; even
 339 verify that the structure matches the DTD (the so-called validation).
 340 </para>
 341
 342             </listitem>
 343
 344             <listitem>
 345               <para>
 346 XML can be used as a configurable communication language. A fundamental problem
 347 of every communication is that sender and receiver must follow the same
 348 conventions about the language. For data exchange, the question is usually
 349 which data records and fields are available, how they are syntactically
 350 composed, and which values are possible for the various fields. Similar
 351 questions arise for text document exchange. XML does not answer these questions
 352 completely, but it reduces the number of ambiguities for such conventions: The
 353 outlines of the syntax are specified by the DTD (but not necessarily the
 354 details), and XML introduces canonical names for the components of documents
 355 such that it is simpler to describe the rest of the syntax and the semantics
 356 informally.
 357 </para>
 358             </listitem>
 359
 360             <listitem>
 361               <para>
 362 XML is a data storage format. Currently, every software product tends to use
 363 its own way to store data; commercial software often does not describe such
 364 formats, and it is a pain to integrate such software into a bigger project.
 365 XML can help to improve this situation when several applications share the same
 366 syntax of data files. DTDs are then neutral instances that check the format of
 367 data files independent of applications.
 368 </para>
 369             </listitem>
 370
 371           </itemizedlist>
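          <para>
The database report approach described in the first item can be sketched as
follows. The table name <literal>customer</literal> and its columns
<literal>id</literal>, <literal>name</literal>, and <literal>city</literal>
are hypothetical, as is the wrapping <literal>report</literal> element; there
is one element type per table and one per column, and every
<literal>customer</literal> element represents one row:

<programlisting>
<![CDATA[<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE report [
  <!ELEMENT report   (customer)*>
  <!ELEMENT customer (id, name, city)>
  <!ELEMENT id       (#PCDATA)>
  <!ELEMENT name     (#PCDATA)>
  <!ELEMENT city     (#PCDATA)>
]>
<report>
  <customer>
    <id>1</id>
    <name>Example Ltd.</name>
    <city>Springfield</city>
  </customer>
</report>
]]>
</programlisting>

A backend that knows only this convention could, for instance, render every
row element as a table row without knowing anything about the database.
</para>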
 372         </sect2>
 373       </sect1>
 374
 375
 376       <!-- ================================================== -->
 377
 378
 379       <sect1>
 380         <title>Highlights of XML</title>
 381
 382         <para>
 383 This section explains many of the features of XML, but not all, and some
 384 features not in detail. For a complete description, see the <ulink
 385 url="http://www.w3.org/TR/1998/REC-xml-19980210.html">XML
 386 specification</ulink>.
 387 </para>
 388
 389         <sect2>
 390           <title>The DTD and the instance</title>
 391           <para>
 392 The DTD contains various declarations; in general you can only use a feature if
 393 you have previously declared it. The document instance file may contain the
 394 full DTD, but it is also possible to split the DTD into an internal and an
 395 external subset. A document must begin as follows if the full DTD is included:
 396
 397 <programlisting>
 398 &lt;?xml version="1.0" encoding="<replaceable>Your encoding</replaceable>"?&gt;
 399 &lt;!DOCTYPE <replaceable>root</replaceable> [
 400   <replaceable>Declarations</replaceable>
 401 ]&gt;
 402 </programlisting>
 403
 404 These declarations are called the <emphasis>internal subset</emphasis>. Note
 405 that the usage of entities and conditional sections is restricted within the
 406 internal subset.
 407 </para>
 408           <para>
 409 If the declarations are located in a different file, you can refer to this file
 410 as follows:
 411
 412 <programlisting>
 413 &lt;?xml version="1.0" encoding="<replaceable>Your encoding</replaceable>"?&gt;
 414 &lt;!DOCTYPE <replaceable>root</replaceable> SYSTEM "<replaceable>file name</replaceable>"&gt;
 415 </programlisting>
 416
 417 The declarations in the file are called the <emphasis>external
 418 subset</emphasis>. The file name is called the <emphasis>system
 419 identifier</emphasis>.
 420 It is also possible to refer to the file by a so-called
 421 <emphasis>public identifier</emphasis>, but most XML applications won't use
 422 this feature.
 423 </para>
 424           <para>
 425 You can also specify both internal and external subsets. In this case, the
 426 declarations of both subsets are combined, and if there are conflicts, the
 427 declarations of the internal subset override those of the external subset with
 428 the same name. This looks as follows:
 429
 430 <programlisting>
 431 &lt;?xml version="1.0" encoding="<replaceable>Your encoding</replaceable>"?&gt;
 432 &lt;!DOCTYPE <replaceable>root</replaceable>  SYSTEM "<replaceable>file name</replaceable>" [
 433   <replaceable>Declarations</replaceable>
 434 ]&gt;
 435 </programlisting>
 436 </para>
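          <para>
The overriding rule can be sketched with an entity that is declared in both
subsets. The file name <literal>greeting.dtd</literal> and the names
<literal>message</literal> and <literal>greeting</literal> are invented for
this illustration. Assume the external subset in
<literal>greeting.dtd</literal> reads:

<programlisting>
<![CDATA[<!ELEMENT message (#PCDATA)>
<!ENTITY greeting "Hello from the external subset">
]]>
</programlisting>

and the document instance is:

<programlisting>
<![CDATA[<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE message SYSTEM "greeting.dtd" [
  <!ENTITY greeting "Hello from the internal subset">
]>
<message>&greeting;</message>
]]>
</programlisting>

Under the rule stated above, the reference <literal>&amp;greeting;</literal>
should expand to the text declared in the internal subset.
</para>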
 437
 438           <para>
 439 The XML declaration (the string beginning with <literal>&lt;?xml</literal> and
 440 ending at <literal>?&gt;</literal>) should specify the encoding of the
 441 file. Common values are UTF-8, and the ISO-8859 series of character sets. Note
 442 that every file parsed by the XML processor can begin with an XML declaration
 443 and that every file may have its own encoding.
 444 </para>
 445
 446           <para>
 447 The name of the root element must be mentioned directly after the
 448 <literal>DOCTYPE</literal> string. This means that a full document instance
 449 looks like
 450
 451 <programlisting>
 452 &lt;?xml version="1.0" encoding="<replaceable>Your encoding</replaceable>"?&gt;
 453 &lt;!DOCTYPE <replaceable>root</replaceable>  SYSTEM "<replaceable>file name</replaceable>" [
 454   <replaceable>Declarations</replaceable>
 455 ]&gt;
 456
 457 &lt;<replaceable>root</replaceable>&gt;
 458   <replaceable>inner contents</replaceable>
 459 &lt;/<replaceable>root</replaceable>&gt;
 460 </programlisting>
 461 </para>
 462         </sect2>
 463
 464         <!-- ======================================== -->
 465
 466         <sect2>
 467           <title>Reserved characters</title>
 468           <para>
 469 Some characters are generally reserved to indicate markup such that they cannot
 470 be used for character data. These characters are &lt;, &gt;, and
 471 &amp;. Furthermore, single and double quotes are sometimes reserved. If you
 472 want to include such a character as character data, write it as follows:
 473
 474 <itemizedlist mark="bullet" spacing="compact">
 475               <listitem>
 476                 <para>
 477 <literal>&amp;lt;</literal> instead of &lt;
 478 </para>
 479               </listitem>
 480               <listitem>
 481                 <para>
 482 <literal>&amp;gt;</literal> instead of &gt;
 483 </para>
 484               </listitem>
 485               <listitem>
 486                 <para>
 487 <literal>&amp;amp;</literal> instead of &amp;
 488 </para>
 489               </listitem>
 490               <listitem>
 491                 <para>
 492 <literal>&amp;apos;</literal> instead of '
 493 </para>
 494               </listitem>
 495               <listitem>
 496                 <para>
 497 <literal>&amp;quot;</literal> instead of "
 498 </para>
 499               </listitem>
 500             </itemizedlist>
 501
 502 All other characters are free in the document instance. It is possible to
 503 include a character by its code position in the Unicode character set:
 504
 505 <programlisting>
 506 &amp;#<replaceable>n</replaceable>;
 507 </programlisting>
 508
 509 where <replaceable>n</replaceable> is the decimal number of the
 510 character. Alternatively, you can specify the character by its hexadecimal
 511 number:
 512
 513 <programlisting>
 514 &amp;#x<replaceable>n</replaceable>;
 515 </programlisting>
 516
 517 In the scope of declarations, the character % is no longer free. To include it
 518 as a character, you must use the notations <literal>&amp;#37;</literal> or
 519 <literal>&amp;#x25;</literal>.
 520 </para>
 521
 522           <para>Note that besides &amp;lt;, &amp;gt;, &amp;amp;,
 523 &amp;apos;, and &amp;quot; there are no predefined character entities. This is
 524 different from HTML which defines a list of characters that can be referenced
 525 by name (e.g. &amp;auml; for ä); however, if you prefer named characters, you
 526 can declare such entities yourself (see below).</para>
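          <para>
As a short illustration of these notations (the element name
<literal>paragraph</literal> is borrowed from the "hello world" example), the
first element below contains the character data
<literal>if a &lt; b &amp;&amp; c &gt; d</literal>, and the second shows three
ways of writing the character ä: literally, as a decimal character reference,
and as a hexadecimal character reference:

<programlisting>
<![CDATA[<paragraph>if a &lt; b &amp;&amp; c &gt; d</paragraph>
<paragraph>ä &#228; &#xE4;</paragraph>
]]>
</programlisting>

The literal form presupposes that the encoding of the file can represent the
character, as ISO-8859-1 does for ä.
</para>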
 527         </sect2>
 528
 529
 530         <!-- ======================================== -->
 531
 532         <sect2>
 533           <title>Elements and ELEMENT declarations</title>
 534
 535           <para>
 536 Elements structure the document instance in a hierarchical way. There is a
 537 top-level element, the <emphasis>root element</emphasis>, which contains a
 538 sequence of inner elements and character sections. The inner elements are
 539 structured in the same way. Every element has an <emphasis>element
 540 type</emphasis>. The beginning of the element is indicated by a <emphasis>start
 541 tag</emphasis>, written
 542
 543 <programlisting>
 544 &lt;<replaceable>element-type</replaceable>&gt;
 545 </programlisting>
 546
 547 and the element continues until the corresponding <emphasis>end tag</emphasis>
 548 is reached:
 549
 550 <programlisting>
 551 &lt;/<replaceable>element-type</replaceable>&gt;
 552 </programlisting>
 553
 554 In XML, it is not allowed to omit start or end tags, even if the DTD would
 555 permit this. Note that there are no special rules for how to interpret spaces or
 556 newlines near start or end tags; all spaces and newlines count.
 557 </para>
 558
 559           <para>
 560 Every element type must be declared before it can be used. The declaration
 561 consists of two parts: the ELEMENT declaration describes the content model,
 562 i.e. which inner elements are allowed; the ATTLIST declaration describes the
 563 attributes of the element.
 564 </para>
 565
 566           <para>
 567 An element can simply allow everything as content. This is written:
 568
 569 <programlisting>
 570 &lt;!ELEMENT <replaceable>name</replaceable> ANY&gt;
 571 </programlisting>
 572
 573 Conversely, an element can be forced to be empty; this is declared by:
 574
 575 <programlisting>
 576 &lt;!ELEMENT <replaceable>name</replaceable> EMPTY&gt;
 577 </programlisting>
 578
 579 Note that there is an abbreviated notation for empty element instances:
 580 <literal>&lt;<replaceable>name</replaceable>/&gt;</literal>.
 581 </para>
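          <para>
A small sketch with hypothetical element types: <literal>pagebreak</literal>
is declared empty, and <literal>note</literal> accepts any content.

<programlisting>
<![CDATA[<!ELEMENT pagebreak EMPTY>
<!ELEMENT note ANY>
]]>
</programlisting>

With these declarations, the following instance is valid; the two
<literal>pagebreak</literal> elements show the long and the abbreviated
notation:

<programlisting>
<![CDATA[<note>Some text<pagebreak></pagebreak>more text<pagebreak/>the end</note>]]>
</programlisting>
</para>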
 582
 583           <para>
 584 There are two more sophisticated forms of declarations: so-called
 585 <emphasis>mixed declarations</emphasis>, and <emphasis>regular
 586 expressions</emphasis>. An element with mixed content contains character data
 587 interspersed with inner elements, and the set of allowed inner elements can be
 588 specified. In contrast to this, a regular expression declaration does not allow
 589 character data, but the inner elements can be described by the more powerful
 590 means of regular expressions.
 591 </para>
 592
 593           <para>
 594 A declaration for mixed content looks as follows:
 595
 596 <programlisting>
 597 &lt;!ELEMENT <replaceable>name</replaceable> (#PCDATA | <replaceable>element<subscript>1</subscript></replaceable> | ... | <replaceable>element<subscript>n</subscript></replaceable> )*&gt;
 598 </programlisting>
 599
 600 or if you do not want to allow any inner element, simply
 601
 602 <programlisting>
 603 &lt;!ELEMENT <replaceable>name</replaceable> (#PCDATA)&gt;
 604 </programlisting>
 605 </para>
 606
 607
 608 <blockquote>
 609               <title>Example</title>
 610               <para>
 611 If element type <literal>q</literal> is declared as
 612
 613 <programlisting>
 614 <![CDATA[<!ELEMENT q (#PCDATA | r | s)*>]]>
 615 </programlisting>
 616
 617 this is a legal instance:
 618
 619 <programlisting>
 620 <![CDATA[<q>This is character data<r></r>with <s></s>inner elements</q>]]>
 621 </programlisting>
 622
 623 But this is illegal because <literal>t</literal> has not been enumerated in the
 624 declaration:
 625
 626 <programlisting>
 627 <![CDATA[<q>This is character data<r></r>with <t></t>inner elements</q>]]>
 628 </programlisting>
 629 </para>
 630             </blockquote>
 631
 632           <para>
 633 The other form uses a regular expression to describe the possible contents:
 634
 635 <programlisting>
 636 &lt;!ELEMENT <replaceable>name</replaceable> <replaceable>regexp</replaceable>&gt;
 637 </programlisting>
 638
 639 The following well-known regexp operators are allowed:
 640
 641 <itemizedlist mark="bullet" spacing="compact">
 642               <listitem>
 643                 <para>
 644 <literal><replaceable>element-name</replaceable></literal>
 645 </para>
 646               </listitem>
 647
 648               <listitem>
 649                 <para>
 650 <literal>(<replaceable>subexpr<subscript>1</subscript></replaceable> ,</literal> ... <literal>, <replaceable>subexpr<subscript>n</subscript></replaceable> )</literal>
 651 </para>
 652               </listitem>
 653
 654               <listitem>
 655                 <para>
 656 <literal>(<replaceable>subexpr<subscript>1</subscript></replaceable> |</literal> ... <literal>| <replaceable>subexpr<subscript>n</subscript></replaceable> )</literal>
 657 </para>
 658               </listitem>
 659
 660               <listitem>
 661                 <para>
 662 <literal><replaceable>subexpr</replaceable>*</literal>
 663 </para>
 664               </listitem>
 665
 666               <listitem>
 667                 <para>
 668 <literal><replaceable>subexpr</replaceable>+</literal>
 669 </para>
 670               </listitem>
 671
 672               <listitem>
 673                 <para>
 674 <literal><replaceable>subexpr</replaceable>?</literal>
 675 </para>
 676               </listitem>
 677             </itemizedlist>
 678
 679 The <literal>,</literal> operator indicates a sequence of sub-models, the
 680 <literal>|</literal> operator describes alternative sub-models. The
 681 <literal>*</literal> indicates zero or more repetitions, and
 682 <literal>+</literal> one or more repetitions. Finally, <literal>?</literal> can
 683 be used for optional sub-models. As atoms the regexp can contain names of
 684 elements; note that it is not allowed to include <literal>#PCDATA</literal>.
 685 </para>
 686
 687           <para>
 688 The exact syntax of the regular expressions is rather strange. This can be
 689 explained best by a list of constraints:
 690
 691 <itemizedlist mark="bullet" spacing="compact">
 692               <listitem>
 693                 <para>
 694 The outermost expression must not be
 695 <literal><replaceable>element-name</replaceable></literal>.
 696 </para>
 697                 <para><emphasis>Illegal:</emphasis>
 698 <literal><![CDATA[<!ELEMENT x y>]]></literal>; this must be written as
 699 <literal><![CDATA[<!ELEMENT x (y)>]]></literal>.</para>
 700               </listitem>
 701               <listitem>
 702                 <para>
 703 For the unary operators <literal><replaceable>subexpr</replaceable>*</literal>,
 704 <literal><replaceable>subexpr</replaceable>+</literal>, and
 705 <literal><replaceable>subexpr</replaceable>?</literal>, the
 706 <literal><replaceable>subexpr</replaceable></literal> must not itself be an
 707 expression with a unary operator.
 708 </para>
 709                 <para><emphasis>Illegal:</emphasis>
 710 <literal><![CDATA[<!ELEMENT x y**>]]></literal>; this must be written as
 711 <literal><![CDATA[<!ELEMENT x (y*)*>]]></literal>.</para>
 712       </listitem>
 713               <listitem>
 714                 <para>
 715 Between <literal>)</literal> and one of the unary operators
 716 <literal>*</literal>, <literal>+</literal>, or <literal>?</literal>, there must
 717 not be whitespace.</para>
 718                 <para><emphasis>Illegal:</emphasis>
 719 <literal><![CDATA[<!ELEMENT x (y|z) *>]]></literal>; this must be written as
 720 <literal><![CDATA[<!ELEMENT x (y|z)*>]]></literal>.</para>
 721               </listitem>
 722               <listitem><para>There is the additional constraint that the
 723 right parenthesis must be contained in the same entity as the left parenthesis;
 724 see the section about parsed entities below.</para>
 725               </listitem>
 726             </itemizedlist>
 727
 728 </para>
 729
 730 <para>
 731 Note that there is another restriction: regular expressions must be
 732 deterministic. This means that the parser must be able to see by looking at the
 733 next token which alternative is actually used, or whether the repetition
 734 stops. The reason for this is simply compatibility with SGML (there is no
 735 intrinsic reason for this rule; XML can live without this restriction).
 736 </para>
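
          <blockquote>
            <title>Example</title>
            <para>
As an illustration of the determinism rule (the element names
<literal>x</literal>, <literal>a</literal>, <literal>b</literal>, and
<literal>c</literal> are chosen only for this sketch), the following content
model is not deterministic: after reading an <literal>a</literal>, the parser
cannot tell which of the two alternatives applies without looking further
ahead.

<programlisting>
<![CDATA[<!ELEMENT x ((a, b) | (a, c))>
]]></programlisting>

The same set of documents can be described deterministically by factoring out
the common beginning:

<programlisting>
<![CDATA[<!ELEMENT x (a, (b | c))>
]]></programlisting>
</para>
          </blockquote>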
 737
 738           <blockquote>
 739             <title>Example</title>
 740             <para>
 741 The elements are declared as follows:
 742
 743 <programlisting>
 744 <![CDATA[<!ELEMENT q (r?, (s | t)+)>
 745 <!ELEMENT r (#PCDATA)>
 746 <!ELEMENT s EMPTY>
 747 <!ELEMENT t (q | r)>
 748 ]]></programlisting>
 749
 750 This is a legal instance:
 751
 752 <programlisting>
 753 <![CDATA[<q><r>Some characters</r><s/></q>]]>
 754 </programlisting>
 755
 756 (Note: <literal>&lt;s/&gt;</literal> is an abbreviation for
 757 <literal>&lt;s&gt;&lt;/s&gt;</literal>.)
 758
 759 It would be illegal to leave <literal><![CDATA[<s/>]]></literal> out because at
 760 least one instance of <literal>s</literal> or <literal>t</literal> must be
 761 present. It would be illegal, too, if characters existed outside the
 762 <literal>r</literal> element; the only exception is white space. The following
 763 instance is legal, too:
 764
 765 <programlisting>
 766 <![CDATA[<q><s/><t><q><s/></q></t></q>]]>
 767 </programlisting>
 768 </para>
 769           </blockquote>
 770
 771         </sect2>
 772
 773         <!-- ======================================== -->
 774
 775         <sect2>
 776           <title>Attribute lists and ATTLIST declarations</title>
 777           <para>
 778 Elements may have attributes. These are put into the start tag of an element as
 779 follows:
 780
 781 <programlisting>
 782 &lt;<replaceable>element-name</replaceable> <replaceable>attribute<subscript>1</subscript></replaceable>="<replaceable>value<subscript>1</subscript></replaceable>" ... <replaceable>attribute<subscript>n</subscript></replaceable>="<replaceable>value<subscript>n</subscript></replaceable>"&gt;
 783 </programlisting>
 784
 785 Instead of
 786 <literal>"<replaceable>value<subscript>k</subscript></replaceable>"</literal>
 787 it is also possible to use single quotes as in
 788 <literal>'<replaceable>value<subscript>k</subscript></replaceable>'</literal>.
 789 Note that you cannot use double quotes literally within the value of the
 790 attribute if double quotes are the delimiters; the same applies to single
 791 quotes. In general, you cannot use &lt; and &amp; as literal characters in
 792 attribute values. It is possible to include the entity references &amp;lt;,
 793 &amp;gt;, &amp;amp;, &amp;apos;, and &amp;quot; (and any other reference to a
 794 general entity as long as the entity is not defined by an external file) as
 795 well as character references of the form &amp;#<replaceable>n</replaceable>;.
 796 </para>
 797
 798           <para>
 799 Before you can use an attribute you must declare it. An ATTLIST declaration
 800 looks as follows:
 801
 802 <programlisting>
 803 &lt;!ATTLIST <replaceable>element-name</replaceable>
 804           <replaceable>attribute-name</replaceable> <replaceable>attribute-type</replaceable> <replaceable>attribute-default</replaceable>
 805           ...
 806           <replaceable>attribute-name</replaceable> <replaceable>attribute-type</replaceable> <replaceable>attribute-default</replaceable>
 807 &gt;
 808 </programlisting>
 809
 810 There are many attribute types; the most important are:
 811
 812 <itemizedlist mark="bullet" spacing="compact">
 813               <listitem>
 814                 <para>
 815 <literal>CDATA</literal>: Every string is allowed as attribute value.
 816 </para>
 817               </listitem>
 818               <listitem>
 819                 <para>
 820 <literal>NMTOKEN</literal>: Every nametoken is allowed as attribute
 821 value. Nametokens consist (mainly) of letters, digits, ., :, -, _ in arbitrary
 822 order.
 823 </para>
 824               </listitem>
 825               <listitem>
 826                 <para>
 827 <literal>NMTOKENS</literal>: A space-separated list of nametokens is allowed as
 828 attribute value.
 829 </para>
 830               </listitem>
 831             </itemizedlist>
 832
 833 The most interesting default declarations are:
 834
 835 <itemizedlist mark="bullet" spacing="compact">
 836               <listitem>
 837                 <para>
 838 <literal>#REQUIRED</literal>: The attribute must be specified.
 839 </para>
 840               </listitem>
 841               <listitem>
 842                 <para>
 843 <literal>#IMPLIED</literal>: The attribute can be specified but also can be
 844 left out. The application can find out whether the attribute was present or
 845 not.
 846 </para>
 847               </listitem>
 848               <listitem>
 849                 <para>
 850 <literal>"<replaceable>value</replaceable>"</literal> or
 851 <literal>'<replaceable>value</replaceable>'</literal>: This particular value is
 852 used as default if the attribute is omitted in the element.
 853 </para>
 854               </listitem>
 855             </itemizedlist>
 856 </para>
 857
 858           <blockquote>
 859             <title>Example</title>
 860             <para>
 861 This is a valid attribute declaration for element type <literal>r</literal>:
 862
 863 <programlisting>
 864 <![CDATA[<!ATTLIST r
 865           x CDATA    #REQUIRED
 866           y NMTOKEN  #IMPLIED
 867           z NMTOKENS "one two three">
 868 ]]></programlisting>
 869
 870 This means that <literal>x</literal> is a required attribute that cannot be
 871 left out, while <literal>y</literal> and <literal>z</literal> are optional. The
 872 XML parser tells the application whether <literal>y</literal> is present or
 873 not, but if <literal>z</literal> is missing the default value
 874 "one two three" is returned automatically.
 875 </para>
 876
 877             <para>
 878 This is a valid example of these attributes:
 879
 880 <programlisting>
 881 <![CDATA[<r x="He said: &quot;I don't like quotes!&quot;" y='1'>]]>
 882 </programlisting>
 883 </para>
 884           </blockquote>
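
          <blockquote>
            <title>Example</title>
            <para>
A further sketch, assuming the declaration of <literal>r</literal> shown above:
because <literal>y</literal> is implied and <literal>z</literal> has a default
value, both may be omitted. The first start tag below leaves them out; for
<literal>z</literal> the application should then see the same value as if the
second start tag had been written, while <literal>y</literal> is reported as
absent in both cases.

<programlisting>
<![CDATA[<r x="abc">
<r x="abc" z="one two three">
]]></programlisting>
</para>
          </blockquote>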
 885
 886         </sect2>
 887
 888         <sect2>
 889           <title>Parsed entities</title>
 890           <para>
 891 Elements describe the logical structure of the document, while
 892 <emphasis>entities</emphasis> determine the physical structure. Entities are
 893 the pieces of text the parser operates on, mostly files and macros. Entities
 894 may be <emphasis>parsed</emphasis> in which case the parser reads the text and
 895 interprets it as XML markup, or <emphasis>unparsed</emphasis> which simply
 896 means that the data of the entity has a foreign format (e.g. a GIF icon).
 897 </para>
 898
 899           <para>If the parsed entity is going to be used as part of the DTD, it
 900 is called a <emphasis>parameter entity</emphasis>. You can declare a parameter
 901 entity with a fixed text as content by:
 902
 903 <programlisting>
 904 &lt;!ENTITY % <replaceable>name</replaceable> "<replaceable>value</replaceable>"&gt;
 905 </programlisting>
 906
 907 Within the DTD, you can <emphasis>refer to</emphasis> this entity, i.e. read
 908 the text of the entity, by:
 909
 910 <programlisting>
 911 %<replaceable>name</replaceable>;
 912 </programlisting>
 913
 914 Such entities behave like macros, i.e. when they are referred to, the
 915 macro text is inserted and read instead of the original text.
 916
 917 <blockquote>
 918               <title>Example</title>
 919               <para>
 920 For example, you can declare two elements with the same content model by:
 921
 922 <programlisting>
 923 <![CDATA[
 924 <!ENTITY % model "a | b | c">
 925 <!ELEMENT x (%model;)>
 926 <!ELEMENT y (%model;)>
 927 ]]>
 928 </programlisting>
 929
 930 </para>
 931             </blockquote>
 932
 933 If the contents of the entity are given as a string constant, the entity is
 934 called an <emphasis>internal</emphasis> entity. It is also possible to name a
 935 file to be used as content (an <emphasis>external</emphasis> entity):
 936
 937 <programlisting>
 938 &lt;!ENTITY % <replaceable>name</replaceable> SYSTEM "<replaceable>file name</replaceable>"&gt;
 939 </programlisting>
 940
 941 There are some restrictions for parameter entities:
 942
 943 <itemizedlist mark="bullet" spacing="compact">
 944               <listitem>
 945                 <para>
 946 If the internal parameter entity contains the first token of a declaration
 947 (i.e. <literal>&lt;!</literal>), it must also contain the last token of the
 948 declaration, i.e. the <literal>&gt;</literal>. This means that the entity
 949 either contains a whole number of complete declarations, or some text from the
 950 middle of one declaration.
 951 </para>
 952 <para><emphasis>Illegal:</emphasis>
 953 <programlisting>
 954 <![CDATA[
 955 <!ENTITY % e "(a | b | c)>">
 956 <!ELEMENT x %e;
 957 ]]></programlisting> Because <literal>&lt;!</literal> is contained in the main
 958 entity, and the corresponding <literal>&gt;</literal> is contained in the
 959 entity <literal>e</literal>.</para>
 960               </listitem>
 961               <listitem>
 962                 <para>
 963 If the internal parameter entity contains a left parenthesis, it must also
 964 contain the corresponding right parenthesis.
 965 </para>
 966 <para><emphasis>Illegal:</emphasis>
 967 <programlisting>
 968 <![CDATA[
 969 <!ENTITY % e "(a | b | c">
 970 <!ELEMENT x %e;)>
 971 ]]></programlisting> Because <literal>(</literal> is contained in the entity
 972 <literal>e</literal>, and the corresponding <literal>)</literal> is
 973 contained in the main entity.</para>
 974               </listitem>
 975               <listitem>
 976                 <para>
 977 When reading text from an entity, the parser automatically inserts one space
 978 character before the entity text and one space character after the entity
 979 text. However, this rule is not applied within the definition of another
 980 entity.</para>
 981 <para><emphasis>Legal:</emphasis>
 982 <programlisting>
 983 <![CDATA[
 984 <!ENTITY % suffix "gif">
 985 <!ENTITY iconfile 'icon.%suffix;'>
 986 ]]></programlisting> Because <literal>%suffix;</literal> is referenced within
 987 the definition text for <literal>iconfile</literal>, no additional spaces are
 988 added.
 989 </para>
 990 <para><emphasis>Illegal:</emphasis>
 991 <programlisting>
 992 <![CDATA[
 993 <!ENTITY % suffix "test">
 994 <!ELEMENT x.%suffix; ANY>
 995 ]]></programlisting>
 996 Because <literal>%suffix;</literal> is referenced outside the definition
 997 text of another entity, the parser replaces <literal>%suffix;</literal> by
 998 <literal><replaceable>space</replaceable>test<replaceable>space</replaceable></literal>. </para>
 999 <para><emphasis>Illegal:</emphasis>
1000 <programlisting>
1001 <![CDATA[
1002 <!ENTITY % e "(a | b | c)">
1003 <!ELEMENT x %e;*>
1004 ]]></programlisting> Because the parser inserts whitespace between <literal>)</literal>
1005 and <literal>*</literal>, which is illegal.</para>
1006               </listitem>
1007               <listitem>
1008                 <para>
1009 An external parameter entity must always consist of a whole number of complete
1010 declarations.
1011 </para>
1012               </listitem>
1013               <listitem>
1014                 <para>
1015 In the internal subset of the DTD, a reference to a parameter entity (internal
1016 or external) is only allowed at positions where a new declaration can start.
1017 </para>
1018               </listitem>
1019             </itemizedlist>
1020 </para>
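
          <blockquote>
            <title>Example</title>
            <para>
The following sketch shows how an external parameter entity might be used to
share declarations between several DTDs. The file name
<literal>common.dtd</literal>, the entity name <literal>common</literal>, and
the element names are assumptions made only for this illustration. The
reference stands at a position where a new declaration can start, and the file
is expected to contain only complete declarations.

<programlisting>
<![CDATA[<!ENTITY % common SYSTEM "common.dtd">
%common;
<!ELEMENT doc (para+)>
]]></programlisting>

The file <literal>common.dtd</literal> could then contain, for instance:

<programlisting>
<![CDATA[<!ELEMENT para (#PCDATA)>
<!ATTLIST para
          lang NMTOKEN #IMPLIED>
]]></programlisting>
</para>
          </blockquote>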
1021
1022           <para>
1023 If the parsed entity is going to be used in the document instance, it is called
1024 a <emphasis>general entity</emphasis>. Such entities can be used as
1025 abbreviations for frequent phrases, or to include external files. Internal
1026 general entities are declared as follows:
1027
1028 <programlisting>
1029 &lt;!ENTITY <replaceable>name</replaceable> "<replaceable>value</replaceable>"&gt;
1030 </programlisting>
1031
1032 External general entities are declared this way:
1033
1034 <programlisting>
1035 &lt;!ENTITY <replaceable>name</replaceable> SYSTEM "<replaceable>file name</replaceable>"&gt;
1036 </programlisting>
1037
1038 References to general entities are written as:
1039
1040 <programlisting>
1041 &<replaceable>name</replaceable>;
1042 </programlisting>
1043
1044 The main difference between parameter and general entities is that the former
1045 are only recognized in the DTD and that the latter are only recognized in the
1046 document instance. As the DTD is parsed before the document, the parameter
1047 entities are expanded first; for example it is possible to use the content of a
1048 parameter entity as the name of a general entity:
1049 <literal>&amp;#38;%name;;</literal><footnote><para>This construct is only
1050 allowed within the definition of another entity; otherwise extra spaces would
1051 be added (as explained above). Such indirection is not recommended.
1052 </para>
1053 <para>Complete example:
1054 <programlisting>
1055 <![CDATA[
1056 <!ENTITY % variant "a">      <!-- or "b" -->
1057 <!ENTITY text-a "This is text A.">
1058 <!ENTITY text-b "This is text B.">
1059 <!ENTITY text "&#38;text-%variant;;">
1060 ]]></programlisting>
1061 You can now write <literal>&amp;text;</literal> in the document instance, and
1062 depending on the value of <literal>variant</literal> either
1063 <literal>text-a</literal> or <literal>text-b</literal> is inserted.</para>
1064 </footnote>.
1065 </para>
1066           <para>
1067 General entities must respect the element hierarchy. This means that there must
1068 be an end tag for every start tag in the entity value, and that end tags
1069 without corresponding start tags are not allowed.
1070 </para>
1071
1072           <blockquote>
1073             <title>Example</title>
1074             <para>
1075 If the author of a document changes sometimes, it is worthwhile to set up a
1076 general entity containing the names of the authors. If the author changes, you
1077 need only to change the definition of the entity, and do not need to check all
1078 occurrences of authors' names:
1079
1080 <programlisting>
1081 <![CDATA[
1082 <!ENTITY authors "Gerd Stolpmann">
1083 ]]>
1084 </programlisting>
1085
1086 In the document text, you can now refer to the author names by writing
1087 <literal>&amp;authors;</literal>.
1088 </para>
1089
1090             <para>
1091 <emphasis>Illegal:</emphasis>
1092 The following two entities are illegal because the elements in the definition
1093 do not nest properly:
1094
1095 <programlisting>
1096 <![CDATA[
1097 <!ENTITY lengthy-tag "<section textcolor='white' background='graphic'>">
1098 <!ENTITY nonsense    "<a></b>">
1099 ]]></programlisting>
1100 </para>
1101           </blockquote>
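
          <blockquote>
            <title>Example</title>
            <para>
By contrast, an entity whose value contains only complete, properly nested
elements respects the element hierarchy. In the following sketch the entity
name <literal>warning</literal> and the element type <literal>em</literal> are
illustrative assumptions:

<programlisting>
<![CDATA[<!ENTITY warning "<em>Warning:</em> Use at your own risk.">
]]></programlisting>

A reference <literal>&amp;warning;</literal> in the document instance then
expands to an <literal>em</literal> element followed by text. Whether the
result is valid still depends on the content model of the element in which the
reference occurs.
</para>
          </blockquote>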
1102
1103           <para>
1104 Earlier in this introduction we explained that there are substitutes for
1105 reserved characters: &amp;lt;, &amp;gt;, &amp;amp;, &amp;apos;, and
1106 &amp;quot;. These are simply predefined general entities; note that they are
1107 the only predefined entities. It is allowed to define these entities again
1108 as long as the meaning is unchanged.
1109 </para>
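
          <blockquote>
            <title>Example</title>
            <para>
For illustration, the XML 1.0 recommendation gives the following declarations
for the five predefined entities. The <literal>lt</literal> and
<literal>amp</literal> entities are escaped twice so that their replacement
text is itself a character reference. Whether &markup; accepts other,
equivalent spellings is not covered here.

<programlisting>
<![CDATA[<!ENTITY lt     "&#38;#60;">
<!ENTITY gt     "&#62;">
<!ENTITY amp    "&#38;#38;">
<!ENTITY apos   "&#39;">
<!ENTITY quot   "&#34;">
]]></programlisting>
</para>
          </blockquote>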
1110         </sect2>
1111
1112         <sect2>
1113           <title>Notations and unparsed entities</title>
1114           <para>
1115 Unparsed entities have a foreign format and can thus not be read by the XML
1116 parser. Unparsed entities are always external. The format of an unparsed entity
1117 must have been declared; such a declared format is called a
1118 <emphasis>notation</emphasis>. The entity can then be declared by referring to
1119 this notation. As unparsed entities do not contain XML text, it is not possible
1120 to include them directly into the document; you can only declare attributes
1121 such that names of unparsed entities are acceptable values.
1122 </para>
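
          <blockquote>
            <title>Example</title>
            <para>
A minimal sketch of the declarations involved; the notation name
<literal>gif</literal>, its system identifier, the entity name
<literal>logo</literal>, and the element type <literal>icon</literal> are all
assumptions chosen for this illustration:

<programlisting>
<![CDATA[<!NOTATION gif SYSTEM "image/gif">
<!ENTITY logo SYSTEM "logo.gif" NDATA gif>
<!ELEMENT icon EMPTY>
<!ATTLIST icon
          src ENTITY #REQUIRED>
]]></programlisting>

In the document instance, the attribute value is the <emphasis>name</emphasis>
of the unparsed entity, not the file name:

<programlisting>
<![CDATA[<icon src="logo"/>
]]></programlisting>
</para>
          </blockquote>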
1123
1124           <para>
1125 As you can see, unparsed entities are too complicated to be of much practical
1126 use. It is almost always better to simply pass the name of the data file as a
1127 normal attribute value, and let the application recognize and process the
1128 foreign format.
1129 </para>
1130         </sect2>
1131
1132       </sect1>
1133
1134
1135       <!-- ================================================== -->
1136
1137
1138       <sect1 id="sect.readme.dtd">
1139         <title>A complete example: The <emphasis>readme</emphasis> DTD</title>
1140         <para>
1141 The reason for <emphasis>readme</emphasis> was that I often wrote two versions
1142 of files such as README and INSTALL which explain aspects of a distributed
1143 software archive; one version was ASCII-formatted, the other was written in
1144 HTML. Maintaining both versions means twice the work, and changes
1145 made to one version may be forgotten in the other version. To improve this situation
1146 I invented the <emphasis>readme</emphasis> DTD which allows me to maintain only
1147 one source written as XML document, and to generate the ASCII and the HTML
1148 version from it.
1149 </para>
1150
1151         <para>
1152 In this section, I explain only the DTD. The <emphasis>readme</emphasis> DTD is
1153 contained in the &markup; distribution together with the two converters to
1154 produce ASCII and HTML. Another <link
1155 linkend="sect.readme.to-html">section</link> of this manual describes the HTML
1156 converter.
1157 </para>
1158
1159         <para>
1160 The documents have a simple structure: There are up to three levels of nested
1161 sections, paragraphs, item lists, footnotes, hyperlinks, and text emphasis. The
1162 outermost element usually has the type <literal>readme</literal>; it is
1163 declared by
1164
1165 <programlisting>
1166 <![CDATA[<!ELEMENT readme (sect1+)>
1167 <!ATTLIST readme
1168           title CDATA #REQUIRED>
1169 ]]></programlisting>
1170
1171 This means that this element contains one or more sections of the first level
1172 (element type <literal>sect1</literal>), and that the element has a required
1173 attribute <literal>title</literal> containing character data (CDATA). Note that
1174 <literal>readme</literal> elements must not contain text data.
1175 </para>
1176
1177         <para>
1178 The three levels of sections are declared as follows:
1179
1180 <programlisting>
1181 <![CDATA[<!ELEMENT sect1 (title,(sect2|p|ul)+)>
1182
1183 <!ELEMENT sect2 (title,(sect3|p|ul)+)>
1184
1185 <!ELEMENT sect3 (title,(p|ul)+)>
1186 ]]></programlisting>
1187
1188 Every section has a <literal>title</literal> element as first subelement. After
1189 the title an arbitrary but non-empty sequence of inner sections, paragraphs and
1190 item lists follows. Note that the inner sections must belong to the next higher
1191 section level; <literal>sect3</literal> elements must not contain inner
1192 sections because there is no next higher level.
1193 </para>
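
        <blockquote>
          <title>Example</title>
          <para>
As a sketch of the nesting rule, the following fragment should be accepted by
the three declarations above, provided that <literal>title</literal> and
<literal>p</literal> may contain plain text, as declared further below. The
section titles and texts are made up for this illustration.

<programlisting>
<![CDATA[<sect1>
  <title>Installation</title>
  <p>Overview of the installation.</p>
  <sect2>
    <title>Prerequisites</title>
    <p>What is needed first.</p>
  </sect2>
</sect1>
]]></programlisting>

Placing a <literal>sect3</literal> element directly inside
<literal>sect1</literal> would be rejected, because the content model of
<literal>sect1</literal> only admits <literal>sect2</literal>,
<literal>p</literal>, and <literal>ul</literal> after the title.
</para>
        </blockquote>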
1194
1195         <para>
1196 Obviously, all three declarations allow paragraphs (<literal>p</literal>) and
1197 item lists (<literal>ul</literal>). The definition can be simplified at this
1198 point by using a parameter entity:
1199
1200 <programlisting>
1201 <![CDATA[<!ENTITY % p.like "p|ul">
1202
1203 <!ELEMENT sect1 (title,(sect2|%p.like;)+)>
1204
1205 <!ELEMENT sect2 (title,(sect3|%p.like;)+)>
1206
1207 <!ELEMENT sect3 (title,(%p.like;)+)>
1208 ]]></programlisting>
1209
1210 Here, the entity <literal>p.like</literal> is nothing but a macro abbreviating
1211 the same list of alternatives; if new elements on the same level as
1212 <literal>p</literal> and <literal>ul</literal> are later added, it is
1213 sufficient to change only the entity definition. Note that there are some
1214 restrictions on the usage of entities in this context; most importantly, entities
1215 containing a left parenthesis must also contain the corresponding right
1216 parenthesis.
1217 </para>
1218
1219         <para>
1220 Note that the entity <literal>p.like</literal> is a
1221 <emphasis>parameter</emphasis> entity, i.e. the ENTITY declaration contains a
1222 percent sign, and the entity is referred to by
1223 <literal>%p.like;</literal>. This kind of entity must be used to abbreviate
1224 parts of the DTD; the <emphasis>general</emphasis> entities declared without
1225 percent sign and referred to as <literal>&amp;name;</literal> are not allowed
1226 in this context.
1227 </para>
1228
1229         <para>
1230 The <literal>title</literal> element specifies the title of the section in
1231 which it occurs. The title is given as character data, optionally interspersed
1232 with line breaks (<literal>br</literal>):
1233
1234 <programlisting>
1235 <![CDATA[<!ELEMENT title (#PCDATA|br)*>
1236 ]]></programlisting>
1237
1238 Compared with the <literal>title</literal> <emphasis>attribute</emphasis> of
1239 the <literal>readme</literal> element, this element allows inner markup
1240 (i.e. <literal>br</literal>) while attribute values do not: It is an error if
1241 an attribute value contains the left angle bracket &lt; literally, so it
1242 is impossible to include inner elements.
1243 </para>
1244
1245         <para>
1246 The paragraph element <literal>p</literal> has a structure similar to
1247 <literal>title</literal>, but it allows more inner elements:
1248
1249 <programlisting>
1250 <![CDATA[<!ENTITY % text "br|code|em|footnote|a">
1251
1252 <!ELEMENT p (#PCDATA|%text;)*>
1253 ]]></programlisting>
1254
1255 Line breaks do not have inner structure, so they are declared as being empty:
1256
1257 <programlisting>
1258 <![CDATA[<!ELEMENT br EMPTY>
1259 ]]></programlisting>
1260
1261 This means that really nothing is allowed within <literal>br</literal>; you
1262 must always write <literal><![CDATA[<br></br>]]></literal> or abbreviated
1263 <literal><![CDATA[<br/>]]></literal>.
1264 </para>
1265
1266         <para>
1267 Code samples should be marked up by the <literal>code</literal> tag; emphasized
1268 text can be indicated by <literal>em</literal>:
1269
1270 <programlisting>
1271 <![CDATA[<!ELEMENT code (#PCDATA)>
1272
1273 <!ELEMENT em (#PCDATA|%text;)*>
1274 ]]></programlisting>
1275
1276 It is a design decision by the author of the DTD that <literal>code</literal>
1277 elements must not contain further markup while <literal>em</literal> elements
1278 may.
1279 </para>
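
        <blockquote>
          <title>Example</title>
          <para>
To illustrate the difference, here is a paragraph that should be accepted by
the declarations of <literal>p</literal>, <literal>em</literal>,
<literal>code</literal>, and <literal>br</literal> given above (the text
itself is made up for this sketch). Note that <literal>code</literal> may
occur inside <literal>em</literal>:

<programlisting>
<![CDATA[<p>Run <code>make install</code> as root.<br/>
This step is <em>not <code>optional</code></em>.</p>
]]></programlisting>

The reverse nesting is rejected, because <literal>code</literal> admits only
character data:

<programlisting>
<![CDATA[<p><code>make <em>install</em></code></p>
]]></programlisting>
</para>
        </blockquote>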
1280
1281         <para>
1282 Unordered lists simply consist of one or more list items, and a list item may
1283 contain paragraph-level material:
1284
1285 <programlisting>
1286 <![CDATA[<!ELEMENT ul (li+)>
1287
1288 <!ELEMENT li (%p.like;)*>
1289 ]]></programlisting>
1290
1291 Footnotes are described by the text of the note; this text may contain
1292 text-level markup. There is no mechanism to describe the numbering scheme of
1293 footnotes, or to specify how footnote references are printed.
1294
1295 <programlisting>
1296 <![CDATA[<!ELEMENT footnote (#PCDATA|%text;)*>
1297 ]]></programlisting>
1298
1299 Hyperlinks are written as in HTML. The anchor tag contains the text describing
1300 where the link points to, and the <literal>href</literal> attribute is the
1301 pointer (as URL). There is no way to describe locations of "hash marks". If the
1302 link refers to another <emphasis>readme</emphasis> document, the attribute
1303 <literal>readmeref</literal> should be used instead of <literal>href</literal>.
1304 The reason is that the converted document usually has a different system
1305 identifier (file name), and the link to a converted document must be
1306 converted, too.
1307
1308 <programlisting>
1309 <![CDATA[<!ELEMENT a (#PCDATA)*>
1310 <!ATTLIST a
1311           href      CDATA #IMPLIED
1312           readmeref CDATA #IMPLIED
1313 >
1314 ]]></programlisting>
1315
1316 Note that although it is only sensible to specify one of the two attributes,
1317 the DTD has no means to express this restriction.
1318 </para>
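
        <blockquote>
          <title>Example</title>
          <para>
A short sketch of the two kinds of links inside a paragraph. The file name
<literal>INSTALL.xml</literal> is an assumption made for this illustration; it
stands for another <emphasis>readme</emphasis> document that is converted
together with the current one.

<programlisting>
<![CDATA[<p>
  See the <a readmeref="INSTALL.xml">installation notes</a> for details, or
  visit the <a href="http://www.ocaml-programming.de/packages/">download
  page</a>.
</p>
]]></programlisting>
</para>
        </blockquote>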
1319
1320 <para>
1321 This completes the DTD. Finally, here is a document that conforms to it:
1322
1323 <programlisting>
1324 <![CDATA[
1325 <?xml version="1.0" encoding="ISO-8859-1"?>
1326 <!DOCTYPE readme SYSTEM "readme.dtd">
1327 <readme title="How to use the readme converters">
1328 <sect1>
1329   <title>Usage</title>
1330   <p>
1331     The <em>readme</em> converter is invoked on the command line by:
1332   </p>
1333   <p>
1334     <code>readme [ -text | -html ] input.xml</code>
1335   </p>
1336   <p>
1337     Here is a list of options:
1338   </p>
1339   <ul>
1340     <li>
1341       <p><code>-text</code>: specifies that ASCII output should be produced</p>
1342     </li>
1343     <li>
1344       <p><code>-html</code>: specifies that HTML output should be produced</p>
1345     </li>
1346   </ul>
1347   <p>
1348     The input file must be given on the command line. The converted output is
1349     printed to <em>stdout</em>.
1350   </p>
1351 </sect1>
1352 <sect1>
1353   <title>Author</title>
1354   <p>
1355     The program has been written by
1356     <a href="mailto:Gerd.Stolpmann@darmstadt.netsurf.de">Gerd Stolpmann</a>.
1357   </p>
1358 </sect1>
1359 </readme>
1360 ]]></programlisting>
1361
1362 </para>
1363
1364
1365       </sect1>
1366     </chapter>
1367
1368 <!-- ********************************************************************** -->
1369
1370     <chapter>
1371       <title>Using &markup;</title>
1372
1373       <sect1>
1374         <title>Validation</title>
1375         <para>
1376 The parser can be used to <emphasis>validate</emphasis> a document. This means
1377 that all the constraints that must hold for a valid document are actually
1378 checked. Validation is the default mode of &markup;, i.e. every document is
1379 validated while it is being parsed.
1380 </para>
1381
1382         <para>
1383 In the <literal>examples</literal> directory of the distribution you find the
1384 <literal>pxpvalidate</literal> application. It is invoked in the following way:
1385
1386 <programlisting>
1387 pxpvalidate [ -wf ] <replaceable>file</replaceable>...
1388 </programlisting>
1389
1390 The files mentioned on the command line are validated, and all warnings and
1391 error messages are printed to stderr.
1392 </para>
1393
1394         <para>
1395 The -wf switch modifies the behaviour such that a well-formedness parser is
1396 simulated. In this mode, the ELEMENT, ATTLIST, and NOTATION declarations of the
1397 DTD are ignored, and only the ENTITY declarations will take effect. This mode
1398 is intended for documents lacking a DTD. Please note that the parser still
1399 scans the DTD fully and will report all errors in the DTD; such checks are not
1400 required by a well-formedness parser.
1401 </para>
1402
1403         <para>
1404 The <literal>pxpvalidate</literal> application is the simplest sensible program
1405 using &markup;. You may consider it the "hello world" program.
1406 </para>
1407       </sect1>
1408
1409
1410       <!-- ================================================== -->
1411
1412
1413       <sect1>
1414         <title>How to parse a document from an application</title>
1415         <para>
1416 Let me first give a rough overview of the object model of the parser. The
1417 following items are represented by objects:
1418
1419 <itemizedlist mark="bullet" spacing="compact">
1420             <listitem>
1421               <para>
1422 <emphasis>Documents:</emphasis> The document representation is more or less the
1423 anchor for the application; all accesses to the parsed entities start here. It
1424 is described by the class <literal>document</literal> contained in the module
1425 <literal>Pxp_document</literal>. You can get some global information, such
1426 as the XML declaration the document begins with, the DTD of the document,
1427 global processing instructions, and most important, the document tree.
1428 </para>
1429             </listitem>
1430
1431             <listitem>
1432               <para>
1433 <emphasis>The contents of documents:</emphasis> The contents have the structure
1434 of a tree: Elements contain other elements and text<footnote><para>Elements may
1435 also contain processing instructions. Unlike other document models, &markup;
1436 separates processing instructions from the rest of the text and provides a
1437 second interface to access them (method <literal>pinstr</literal>). However,
1438 there is a parser option (<literal>enable_pinstr_nodes</literal>) which changes
1439 the behaviour of the parser such that extra nodes for processing instructions
1440 are included into the tree.</para>
1441 <para>Furthermore, the tree normally does not contain nodes for XML comments;
1442 they are ignored by default. Again, there is an option
1443 (<literal>enable_comment_nodes</literal>) changing this.</para>
1444 </footnote>.
1445
1446 The common type to represent both kinds of content is <literal>node</literal>
1447 which is a class type that unifies the properties of elements and character
1448 data. Every node has a list of children (which is empty if the element is empty
1449 or the node represents text); nodes may have attributes; nodes always have text
1450 contents. There are two implementations of <literal>node</literal>, the class
1451 <literal>element_impl</literal> for elements, and the class
1452 <literal>data_impl</literal> for text data. You find these classes and class
1453 types in the module <literal>Pxp_document</literal>, too.
1454 </para>
1455
1456               <para>
1457 Note that attribute lists are represented by non-class values.
1458 </para>
1459             </listitem>
1460
1461             <listitem>
1462               <para>
1463 <emphasis>The node extension:</emphasis> For advanced usage, every node of the
1464 document may have an associated <emphasis>extension</emphasis> which is simply
1465 a second object. This object must have the three methods
1466 <literal>clone</literal>, <literal>node</literal>, and
1467 <literal>set_node</literal> as bare minimum, but you are free to add methods as
1468 you want. This is the preferred way to add functionality to the document
1469 tree<footnote><para>Due to the typing system it is more or less impossible to
1470 derive recursive classes in O'Caml. To get around this, it is common practice
1471 to put the modifiable or extensible part of recursive objects into parallel
1472 objects.</para> </footnote>. The class type <literal>extension</literal> is
1473 defined in <literal>Pxp_document</literal>, too.
1474 </para>
1475             </listitem>
1476
1477             <listitem>
1478               <para>
1479 <emphasis>The DTD:</emphasis> Sometimes it is necessary to access the DTD of a
1480 document; the average application does not need this feature. The class
1481 <literal>dtd</literal> describes DTDs, and makes it possible to get
1482 representations of element, entity, and notation declarations as well as
1483 processing instructions contained in the DTD. This class, and
1484 <literal>dtd_element</literal>, <literal>dtd_notation</literal>, and
1485 <literal>proc_instruction</literal> can be found in the module
1486 <literal>Pxp_dtd</literal>. There are a couple of classes representing
1487 different kinds of entities; these can be found in the module
1488 <literal>Pxp_entity</literal>.
1489 </para>
1490             </listitem>
1491           </itemizedlist>
1492
1493 Additionally, the following modules play a role:
1494
1495 <itemizedlist mark="bullet" spacing="compact">
1496             <listitem>
1497               <para>
1498 <emphasis>Pxp_yacc:</emphasis> Here the main parsing functions such as
1499 <literal>parse_document_entity</literal> are located. Some additional types and
1500 functions allow the parser to be configured in a non-standard way.
1501 </para>
1502             </listitem>
1503
1504             <listitem>
1505               <para>
1506 <emphasis>Pxp_types:</emphasis> This is a collection of basic types and
1507 exceptions.
1508 </para>
1509             </listitem>
1510           </itemizedlist>
1511
1512 There are some further modules that are needed internally but are not part of
1513 the API.
1514 </para>
1515
1516         <para>
1517 Let the document to be parsed be stored in a file called
1518 <literal>doc.xml</literal>. The parsing process is started by calling the
1519 function
1520
1521 <programlisting>
1522 val parse_document_entity : config -> source -> 'ext spec -> 'ext document
1523 </programlisting>
1524
1525 defined in the module <literal>Pxp_yacc</literal>. The first argument
1526 specifies some global properties of the parser; it is recommended to start with
1527 the <literal>default_config</literal>. The second argument determines where the
1528 document to be parsed comes from; this may be a file, a channel, or an entity
1529 ID. To parse <literal>doc.xml</literal>, it is sufficient to pass
1530 <literal>from_file "doc.xml"</literal>.
1531 </para>
1532
1533         <para>
1534 The third argument passes the object specification to use. Roughly
1535 speaking, it determines which classes implement the node objects of which
1536 element types, and which extensions are to be used. The <literal>'ext</literal>
1537 polymorphic variable is the type of the extension. For the moment, let us
1538 simply pass <literal>default_spec</literal> as this argument, and ignore it.
1539 </para>
1540
1541         <para>
1542 So the following expression parses <literal>doc.xml</literal>:
1543
1544 <programlisting>
1545 open Pxp_yacc
1546 let d = parse_document_entity default_config (from_file "doc.xml") default_spec
1547 </programlisting>
1548
1549 Note that <literal>default_config</literal> implies that warnings are collected
but not printed. Errors raise one of the exceptions defined in
<literal>Pxp_types</literal>; to get readable errors and warnings, catch the
exceptions as follows:
1553
1554 <programlisting>
1555 <![CDATA[class warner =
1556   object
1557     method warn w =
1558       print_endline ("WARNING: " ^ w)
1559   end
1560 ;;
1561
1562 try
1563   let config = { default_config with warner = new warner } in
1564   let d = parse_document_entity config (from_file "doc.xml") default_spec
1565   in
1566     ...
1567 with
1568    e ->
1569      print_endline (Pxp_types.string_of_exn e)
1570 ]]></programlisting>
1571
1572 Now <literal>d</literal> is an object of the <literal>document</literal>
1573 class. If you want the node tree, you can get the root element by
1574
1575 <programlisting>
1576 let root = d # root
1577 </programlisting>
1578
and if you would rather access the DTD, obtain it by
1580
1581 <programlisting>
1582 let dtd = d # dtd
1583 </programlisting>
1584
1585 As it is more interesting, let us investigate the node tree now. Given the root
1586 element, it is possible to recursively traverse the whole tree. The children of
1587 a node <literal>n</literal> are returned by the method
1588 <literal>sub_nodes</literal>, and the type of a node is returned by
1589 <literal>node_type</literal>. This function traverses the tree, and prints the
1590 type of each node:
1591
1592 <programlisting>
1593 <![CDATA[let rec print_structure n =
1594   let ntype = n # node_type in
1595   match ntype with
1596     T_element name ->
1597       print_endline ("Element of type " ^ name);
1598       let children = n # sub_nodes in
1599       List.iter print_structure children
1600   | T_data ->
1601       print_endline "Data"
1602   | _ ->
1603       (* Other node types are not possible unless the parser is configured
1604          differently.
1605        *)
1606       assert false
1607 ]]></programlisting>
1608
1609 You can call this function by
1610
1611 <programlisting>
1612 print_structure root
1613 </programlisting>
1614
1615 The type returned by <literal>node_type</literal> is either <literal>T_element
1616 name</literal> or <literal>T_data</literal>. The <literal>name</literal> of the
1617 element type is the string included in the angle brackets. Note that only
1618 elements have children; data nodes are always leaves of the tree.
1619 </para>
1620
1621         <para>
1622 There are some more methods in order to access a parsed node tree:
1623
1624 <itemizedlist mark="bullet" spacing="compact">
1625             <listitem>
1626               <para>
<literal>n # parent</literal>: Returns the parent node, or raises
<literal>Not_found</literal> if the node is already the root.
1629 </para>
1630             </listitem>
1631             <listitem>
1632               <para>
1633 <literal>n # root</literal>: Returns the root of the node tree.
1634 </para>
1635             </listitem>
1636             <listitem>
1637               <para>
1638 <literal>n # attribute a</literal>: Returns the value of the attribute with
1639 name <literal>a</literal>. The method returns a value for every
1640 <emphasis>declared</emphasis> attribute, independently of whether the attribute
1641 instance is defined or not. If the attribute is not declared,
1642 <literal>Not_found</literal> will be raised. (In well-formedness mode, every
1643 attribute is considered as being implicitly declared with type
1644 <literal>CDATA</literal>.)
1645 </para>
1646
1647 <para>
The following return values are possible: <literal>Value s</literal>,
<literal>Valuelist sl</literal>, and <literal>Implied_value</literal>.
The first two value types indicate that the attribute value is available,
either because there is a definition
<literal><replaceable>a</replaceable>="<replaceable>value</replaceable>"</literal>
in the XML text, or because there is a default value (declared in the
DTD). Only if both the instance definition and the default declaration are
missing is <literal>Implied_value</literal> returned.
1656 </para>
1657
1658 <para>
1659 In the DTD, every attribute is typed. There are single-value types (CDATA, ID,
1660 IDREF, ENTITY, NMTOKEN, enumerations), in which case the method passes
1661 <literal>Value s</literal> back, where <literal>s</literal> is the normalized
1662 string value of the attribute. The other types (IDREFS, ENTITIES, NMTOKENS)
1663 represent list values, and the parser splits the XML literal into several
1664 tokens and returns these tokens as <literal>Valuelist sl</literal>.
1665 </para>
1666
1667 <para>
1668 Normalization means that entity references (the
1669 <literal>&amp;<replaceable>name</replaceable>;</literal> tokens) and
1670 character references
1671 (<literal>&amp;#<replaceable>number</replaceable>;</literal>) are replaced
1672 by the text they represent, and that white space characters are converted into
1673 plain spaces.
1674 </para>
1675             </listitem>
1676             <listitem>
1677               <para>
1678 <literal>n # data</literal>: Returns the character data contained in the
1679 node. For data nodes, the meaning is obvious as this is the main content of
1680 data nodes. For element nodes, this method returns the concatenated contents of
1681 all inner data nodes.
1682 </para>
1683               <para>
1684 Note that entity references included in the text are resolved while they are
1685 being parsed; for example the text <![CDATA["a &lt;&gt; b"]]> will be returned
1686 as <![CDATA["a <> b"]]> by this method. Spaces of data nodes are always
1687 preserved. Newlines are preserved, but always converted to \n characters even
1688 if newlines are encoded as \r\n or \r. Normally you will never see two adjacent
1689 data nodes because the parser collapses all data material at one location into
1690 one node. (However, if you create your own tree or transform the parsed tree,
1691 it is possible to have adjacent data nodes.)
1692 </para>
1693               <para>
1694 Note that elements that do <emphasis>not</emphasis> allow #PCDATA as content
1695 will not have data nodes as children. This means that spaces and newlines, the
1696 only character material allowed for such elements, are silently dropped.
1697 </para>
1698             </listitem>
1699           </itemizedlist>
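
As a minimal sketch of how the three kinds of return values might be handled in
one place, the following helper turns an attribute value into a plain string.
The type is a local mirror of the constructors described above so that the
fragment stands alone (in a real program they come from
<literal>Pxp_types</literal>); the names <literal>string_of_att_value</literal>
and <literal>normalize_whitespace</literal> are hypothetical and not part of
&markup;:

<programlisting>
<![CDATA[(* Local mirror of the attribute value constructors; an assumption made
   only to keep this sketch self-contained. *)
type att_value =
    Value of string
  | Valuelist of string list
  | Implied_value

(* Illustrates the white space part of normalization: tabs and line
   breaks become plain spaces. Reference expansion is not modelled. *)
let normalize_whitespace s =
  String.map (fun c -> match c with '\t' | '\n' | '\r' -> ' ' | _ -> c) s

(* Return the attribute as one string; list values are joined with
   spaces, and a missing value falls back to ~default. *)
let string_of_att_value ?(default = "") v =
  match v with
    Value s       -> s
  | Valuelist sl  -> String.concat " " sl
  | Implied_value -> default
]]></programlisting>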
1700
1701 For example, if the task is to print all contents of elements with type
1702 "valuable" whose attribute "priority" is "1", this function can help:
1703
1704 <programlisting>
1705 <![CDATA[let rec print_valuable_prio1 n =
1706   let ntype = n # node_type in
1707   match ntype with
1708     T_element "valuable" when n # attribute "priority" = Value "1" ->
      print_endline "Valuable node with priority 1 found:";
1710       print_endline (n # data)
1711   | (T_element _ | T_data) ->
1712       let children = n # sub_nodes in
1713       List.iter print_valuable_prio1 children
1714   | _ ->
1715       assert false
1716 ]]></programlisting>
1717
1718 You can call this function by:
1719
1720 <programlisting>
1721 print_valuable_prio1 root
1722 </programlisting>
1723
1724 If you like a DSSSL-like style, you can make the function
1725 <literal>process_children</literal> explicit:
1726
1727 <programlisting>
1728 <![CDATA[let rec print_valuable_prio1 n =
1729
1730   let process_children n =
1731     let children = n # sub_nodes in
1732     List.iter print_valuable_prio1 children
1733   in
1734
1735   let ntype = n # node_type in
1736   match ntype with
1737     T_element "valuable" when n # attribute "priority" = Value "1" ->
1738       print_endline "Valuable node with priority 1 found:";
1739       print_endline (n # data)
1740   | (T_element _ | T_data) ->
1741       process_children n
1742   | _ ->
1743       assert false
1744 ]]></programlisting>
1745
Used in this way, O'Caml serves as a simple "style-sheet language": You can
form a big "match" expression to distinguish between all significant cases, and
provide different reactions to different conditions. But this technique has
limitations; the "match" expression tends to get larger and larger, and it is
difficult to store intermediate values as there is only one big
recursion. Alternatively, it is also possible to represent the various cases as
classes, and to use dynamic method lookup to find the appropriate class. The
next section explains this technique in detail.
1754
1755 </para>
1756       </sect1>
1757
1758
1759       <!-- ================================================== -->
1760
1761
1762       <sect1>
1763         <title>Class-based processing of the node tree</title>
1764         <para>
1765 By default, the parsed node tree consists of objects of the same class; this is
a good design as long as you only want to access selected parts of the
1767 document. For complex transformations, it may be better to use different
1768 classes for objects describing different element types.
1769 </para>
1770
1771         <para>
1772 For example, if the DTD declares the element types <literal>a</literal>,
1773 <literal>b</literal>, and <literal>c</literal>, and if the task is to convert
1774 an arbitrary document into a printable format, the idea is to define for every
1775 element type a separate class that has a method <literal>print</literal>. The
1776 classes are <literal>eltype_a</literal>, <literal>eltype_b</literal>, and
1777 <literal>eltype_c</literal>, and every class implements
1778 <literal>print</literal> such that elements of the type corresponding to the
1779 class are converted to the output format.
1780 </para>
1781
1782         <para>
1783 The parser supports such a design directly. As it is impossible to derive
1784 recursive classes in O'Caml<footnote><para>The problem is that the subclass is
1785 usually not a subtype in this case because O'Caml has a contravariant subtyping
1786 rule. </para> </footnote>, the specialized element classes cannot be formed by
1787 simply inheriting from the built-in classes of the parser and adding methods
1788 for customized functionality. To get around this limitation, every node of the
1789 document tree is represented by <emphasis>two</emphasis> objects, one called
1790 "the node" and containing the recursive definition of the tree, one called "the
1791 extension". Every node object has a reference to the extension, and the
1792 extension has a reference to the node. The advantage of this model is that it
1793 is now possible to customize the extension without affecting the typing
1794 constraints of the recursive node definition.
1795 </para>
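
        <para>
The mutual references between node and extension can be sketched in plain
O'Caml. This is a toy model only: the record type <literal>toy_node</literal>,
the class <literal>toy_extension</literal>, and the function
<literal>make_node</literal> are hypothetical names and are much simpler than
the classes of &markup;.

<programlisting>
<![CDATA[(* The tree structure lives in the node; 'ext is the extension type. *)
type 'ext toy_node =
  { name : string;
    extension : 'ext;
    mutable children : 'ext toy_node list;
  }

(* The extension can be customized freely; it only has to remember
   the node it belongs to. *)
class toy_extension =
  object
    val mutable node = (None : toy_extension toy_node option)
    method set_node n = node <- Some n
    method node =
      match node with
          None -> failwith "extension is not attached to a node"
        | Some n -> n
    method describe =
      match node with
          None -> "(detached)"
        | Some n -> "extension of " ^ n.name
  end

(* Create a node and tell the extension which node it belongs to. *)
let make_node name ext =
  let n = { name = name; extension = ext; children = [] } in
  (ext # set_node n : unit);
  n
]]></programlisting>

After <literal>let e = new toy_extension</literal> and
<literal>let n = make_node "a" e</literal>, both
<literal>n.extension == e</literal> and <literal>e # node == n</literal> hold.
</para>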
1796
1797         <para>
1798 Every extension must have the three methods <literal>clone</literal>,
1799 <literal>node</literal>, and <literal>set_node</literal>. The method
1800 <literal>clone</literal> creates a deep copy of the extension object and
1801 returns it; <literal>node</literal> returns the node object for this extension
1802 object; and <literal>set_node</literal> is used to tell the extension object
1803 which node is associated with it, this method is automatically called when the
1804 node tree is initialized. The following definition is a good starting point
1805 for these methods; usually <literal>clone</literal> must be further refined
1806 when instance variables are added to the class:
1807
1808 <programlisting>
1809 <![CDATA[class custom_extension =
1810   object (self)
1811
1812     val mutable node = (None : custom_extension node option)
1813
1814     method clone = {< >}
1815     method node =
1816       match node with
1817           None ->
1818             assert false
1819         | Some n -> n
1820     method set_node n =
1821       node <- Some n
1822
1823   end
1824 ]]>
1825 </programlisting>
1826
1827 This part of the extension is usually the same for all classes, so it is a good
1828 idea to consider <literal>custom_extension</literal> as the super-class of the
further class definitions. Continuing the example above, we can define the
1830 element type classes as follows:
1831
1832 <programlisting>
1833 <![CDATA[class virtual custom_extension =
1834   object (self)
1835     ... clone, node, set_node defined as above ...
1836
1837     method virtual print : out_channel -> unit
1838   end
1839
1840 class eltype_a =
1841   object (self)
1842     inherit custom_extension
1843     method print ch = ...
1844   end
1845
1846 class eltype_b =
1847   object (self)
1848     inherit custom_extension
1849     method print ch = ...
1850   end
1851
1852 class eltype_c =
1853   object (self)
1854     inherit custom_extension
1855     method print ch = ...
1856   end
1857 ]]></programlisting>
1858
1859 The method <literal>print</literal> can now be implemented for every element
1860 type separately. Note that you get the associated node by invoking
1861
1862 <programlisting>
1863 self # node
1864 </programlisting>
1865
1866 and you get the extension object of a node <literal>n</literal> by writing
1867
1868 <programlisting>
1869 n # extension
1870 </programlisting>
1871
1872 It is guaranteed that
1873
1874 <programlisting>
1875 self # node # extension == self
1876 </programlisting>
1877
1878 always holds.
1879 </para>
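
        <para>
The remark that <literal>clone</literal> usually needs refinement once
instance variables are added can be illustrated without the parser. In the
following sketch (the class <literal>counting_extension</literal> and its
methods are hypothetical), the plain copy <literal>{&lt; &gt;}</literal> shares
the reference cell with the original, while the refined
<literal>clone</literal> gives the copy its own cell:

<programlisting>
<![CDATA[class counting_extension =
  object
    (* Mutable state held behind a reference cell. *)
    val count = ref 0
    method hit = incr count
    method value = !count

    (* Copies the object, but the copy points to the same cell. *)
    method shallow_clone = {< >}

    (* Refined clone: the copy starts with a private cell. *)
    method clone = {< count = ref !count >}
  end
]]></programlisting>

If the original is hit once, copied in both ways, and then hit again, the
shallow copy reports 2 while the refined clone still reports 1.
</para>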
1880
1881         <para>Here are sample definitions of the <literal>print</literal>
1882 methods:
1883
1884 <programlisting><![CDATA[
1885 class eltype_a =
1886   object (self)
1887     inherit custom_extension
1888     method print ch =
1889       (* Nodes <a>...</a> are only containers: *)
1890       output_string ch "(";
1891       List.iter
1892         (fun n -> n # extension # print ch)
1893         (self # node # sub_nodes);
1894       output_string ch ")";
1895   end
1896
1897 class eltype_b =
1898   object (self)
1899     inherit custom_extension
1900     method print ch =
1901       (* Print the value of the CDATA attribute "print": *)
1902       match self # node # attribute "print" with
1903         Value s       -> output_string ch s
1904       | Implied_value -> output_string ch "<missing>"
1905       | Valuelist l   -> assert false
1906                          (* not possible because the att is CDATA *)
1907   end
1908
1909 class eltype_c =
1910   object (self)
1911     inherit custom_extension
1912     method print ch =
1913       (* Print the contents of this element: *)
1914       output_string ch (self # node # data)
1915   end
1916
1917 class null_extension =
1918   object (self)
1919     inherit custom_extension
1920     method print ch = assert false
1921   end
1922 ]]></programlisting>
1923 </para>
1924
1925
1926         <para>
1927 The remaining task is to configure the parser such that these extension classes
1928 are actually used. Here another problem arises: It is not possible to
dynamically select the class of an object to be created. As a workaround,
1930 &markup; allows the user to specify <emphasis>exemplar objects</emphasis> for
1931 the various element types; instead of creating the nodes of the tree by
1932 applying the <literal>new</literal> operator the nodes are produced by
1933 duplicating the exemplars. As object duplication preserves the class of the
1934 object, one can create fresh objects of every class for which previously an
1935 exemplar has been registered.
1936 </para>
1937
1938         <para>
Exemplars are meant as objects without contents; the only interesting thing is
that exemplars are instances of a certain class. An exemplar
for an element node can be created by:
1942
1943 <programlisting>
1944 let element_exemplar = new element_impl extension_exemplar
1945 </programlisting>
1946
1947 And a data node exemplar is created by:
1948
1949 <programlisting>
1950 let data_exemplar = new data_impl extension_exemplar
1951 </programlisting>
1952
1953 The classes <literal>element_impl</literal> and <literal>data_impl</literal>
1954 are defined in the module <literal>Pxp_document</literal>. The constructors
1955 initialize the fresh objects as empty objects, i.e. without children, without
1956 data contents, and so on. The <literal>extension_exemplar</literal> is the
1957 initial extension object the exemplars are associated with.
1958 </para>
1959
1960         <para>
1961 Once the exemplars are created and stored somewhere (e.g. in a hash table), you
1962 can take an exemplar and create a concrete instance (with contents) by
duplicating it. As a user of the parser you are normally not concerned with this
1964 as this is part of the internal logic of the parser, but as background knowledge
1965 it is worthwhile to mention that the two methods
1966 <literal>create_element</literal> and <literal>create_data</literal> actually
1967 perform the duplication of the exemplar for which they are invoked,
1968 additionally apply modifications to the clone, and finally return the new
1969 object. Moreover, the extension object is copied, too, and the new node object
1970 is associated with the fresh extension object. Note that this is the reason why
1971 every extension object must have a <literal>clone</literal> method.
1972 </para>
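
        <para>
The exemplar technique itself does not depend on the parser. The following
minimal sketch uses the standard function <literal>Oo.copy</literal> in place
of the <literal>create_element</literal> and <literal>create_data</literal>
methods; the classes and the function <literal>create</literal> are
hypothetical:

<programlisting>
<![CDATA[class virtual shape =
  object
    method virtual describe : string
  end

class circle = object inherit shape method describe = "circle" end
class square = object inherit shape method describe = "square" end

(* One exemplar per kind, stored in a hash table. *)
let exemplars : (string, shape) Hashtbl.t = Hashtbl.create 10

let () =
  Hashtbl.add exemplars "circle" (new circle :> shape);
  Hashtbl.add exemplars "square" (new square :> shape)

(* Duplicating the exemplar yields a fresh object of the same class,
   although the class is selected by a run-time string. *)
let create kind =
  Oo.copy (Hashtbl.find exemplars kind)
]]></programlisting>
</para>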
1973
1974         <para>
1975 The configuration of the set of exemplars is passed to the
1976 <literal>parse_document_entity</literal> function as third argument. In our
1977 example, this argument can be set up as follows:
1978
1979 <programlisting>
1980 <![CDATA[let spec =
1981   make_spec_from_alist
1982     ~data_exemplar:            (new data_impl (new null_extension))
1983     ~default_element_exemplar: (new element_impl (new null_extension))
1984     ~element_alist:
1985        [ "a",  new element_impl (new eltype_a);
1986          "b",  new element_impl (new eltype_b);
1987          "c",  new element_impl (new eltype_c);
1988        ]
1989     ()
1990 ]]></programlisting>
1991
1992 The <literal>~element_alist</literal> function argument defines the mapping
from element types to exemplars as an associative list. The argument
1994 <literal>~data_exemplar</literal> specifies the exemplar for data nodes, and
1995 the <literal>~default_element_exemplar</literal> is used whenever the parser
1996 finds an element type for which the associative list does not define an
1997 exemplar.
1998 </para>
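
        <para>
The fallback behaviour described for
<literal>~default_element_exemplar</literal> corresponds to an ordinary lookup
in an associative list with a default. A minimal sketch (the function
<literal>find_exemplar</literal> is hypothetical and only models the
behaviour; it is not claimed to be how &markup; implements it):

<programlisting>
<![CDATA[(* Return the entry registered for [name], or [default] if the
   associative list has no entry for it. *)
let find_exemplar ~default alist name =
  try List.assoc name alist
  with Not_found -> default
]]></programlisting>
</para>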
1999
2000         <para>
2001 The configuration is now complete. You can still use the same parsing
2002 functions, only the initialization is a bit different. For example, call the
2003 parser by:
2004
2005 <programlisting>
2006 let d = parse_document_entity default_config (from_file "doc.xml") spec
2007 </programlisting>
2008
2009 Note that the resulting document <literal>d</literal> has a usable type;
in particular, the <literal>print</literal> method we added is visible. So you can
2011 print your document by
2012
2013 <programlisting>
2014 d # root # extension # print stdout
2015 </programlisting>
2016 </para>
2017
2018         <para>
2019 This object-oriented approach looks rather complicated; this is mostly caused
2020 by working around some problems of the strict typing system of O'Caml. Some
auxiliary concepts such as extensions were needed, but their practical
consequences are minor. In the next section, one of the examples of the
2023 distribution is explained, a converter from <emphasis>readme</emphasis>
2024 documents to HTML.
2025 </para>
2026
2027       </sect1>
2028
2029
2030       <!-- ================================================== -->
2031
2032
2033       <sect1 id="sect.readme.to-html">
2034         <title>Example: An HTML backend for the <emphasis>readme</emphasis>
2035 DTD</title>
2036
        <para>The converter from <emphasis>readme</emphasis> documents to HTML
documents strictly follows the approach of defining one class per element
type. The HTML code is similar to the <emphasis>readme</emphasis> source;
because of this, most elements can be converted in the following way: Given the
input element
2042
2043 <programlisting>
2044 <![CDATA[<e>content</e>]]>
2045 </programlisting>
2046
the converted text is the concatenation of a computed prefix, the recursively
2048 converted content, and a computed suffix.
2049 </para>
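
        <para>
The scheme of prefix, converted content, and suffix can be sketched on a toy
tree type. The type <literal>tree</literal> and the function
<literal>convert</literal> are hypothetical; the converter of the distribution
works on node objects and extension classes instead, and here the prefix and
suffix are simply tags carrying the element's own name:

<programlisting>
<![CDATA[type tree =
    Element of string * tree list
  | Data of string

(* Character data is copied as is (no escaping in this sketch);
   elements are wrapped between a computed prefix and suffix. *)
let rec convert t =
  match t with
    Data s -> s
  | Element (name, children) ->
      let prefix = "<" ^ name ^ ">" in
      let suffix = "</" ^ name ^ ">" in
      prefix ^ String.concat "" (List.map convert children) ^ suffix
]]></programlisting>
</para>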
2050
2051         <para>
2052 Only one element type cannot be handled by this scheme:
2053 <literal>footnote</literal>. Footnotes are collected while they are found in
2054 the input text, and they are printed after the main text has been converted and
2055 printed.
2056 </para>
2057
2058         <sect2>
2059           <title>Header</title>
2060           <para>
2061 <programlisting>&readme.code.header;</programlisting>
2062 </para>
2063         </sect2>
2064
2065         <sect2>
2066           <title>Type declarations</title>
2067           <para>
2068 <programlisting>&readme.code.footnote-printer;</programlisting>
2069 </para>
2070         </sect2>
2071
2072         <sect2>
2073           <title>Class <literal>store</literal></title>
2074           <para>
2075 The <literal>store</literal> is a container for footnotes. You can add a
2076 footnote by invoking <literal>alloc_footnote</literal>; the argument is an
2077 object of the class <literal>footnote_printer</literal>, the method returns the
2078 number of the footnote. The interesting property of a footnote is that it can
2079 be converted to HTML, so a <literal>footnote_printer</literal> is an object
2080 with a method <literal>footnote_to_html</literal>. The class
2081 <literal>footnote</literal> which is defined below has a compatible method
2082 <literal>footnote_to_html</literal> such that objects created from it can be
2083 used as <literal>footnote_printer</literal>s.
2084 </para>
2085           <para>
The other method, <literal>print_footnotes</literal>, prints the footnotes as
a definition list, and is typically invoked after the main material of the page
2088 has already been printed. Every item of the list is printed by
2089 <literal>footnote_to_html</literal>.
2090 </para>
2091
2092           <para>
2093 <programlisting>&readme.code.store;</programlisting>
2094 </para>
2095         </sect2>
2096
2097         <sect2>
2098           <title>Function <literal>escape_html</literal></title>
2099           <para>
2100 This function converts the characters &lt;, &gt;, &amp;, and " to their HTML
2101 representation. For example,
2102 <literal>escape_html "&lt;&gt;" = "&amp;lt;&amp;gt;"</literal>. Other
2103 characters are left unchanged.
2104
2105 <programlisting>&readme.code.escape-html;</programlisting>
2106 </para>
2107         </sect2>
2108
2109         <sect2>
2110           <title>Virtual class <literal>shared</literal></title>
2111           <para>
2112 This virtual class is the abstract superclass of the extension classes shown
2113 below. It defines the standard methods <literal>clone</literal>,
2114 <literal>node</literal>, and <literal>set_node</literal>, and declares the type
2115 of the virtual method <literal>to_html</literal>. This method recursively
2116 traverses the whole element tree, and prints the converted HTML code to the
2117 output channel passed as second argument. The first argument is the reference
2118 to the global <literal>store</literal> object which collects the footnotes.
2119
2120 <programlisting>&readme.code.shared;</programlisting>
2121 </para>
2122         </sect2>
2123
2124         <sect2>
2125           <title>Class <literal>only_data</literal></title>
2126           <para>
2127 This class defines <literal>to_html</literal> such that the character data of
2128 the current node is converted to HTML. Note that <literal>self</literal> is an
2129 extension object, <literal>self # node</literal> is the node object, and
2130 <literal>self # node # data</literal> returns the character data of the node.
2131
2132 <programlisting>&readme.code.only-data;</programlisting>
2133 </para>
2134         </sect2>
2135
2136         <sect2>
2137           <title>Class <literal>readme</literal></title>
2138           <para>
2139 This class converts elements of type <literal>readme</literal> to HTML. Such an
2140 element is (by definition) always the root element of the document. First, the
2141 HTML header is printed; the <literal>title</literal> attribute of the element
2142 determines the title of the HTML page. Some aspects of the HTML page can be
2143 configured by setting certain parameter entities, for example the background
2144 color, the text color, and link colors. After the header, the
2145 <literal>body</literal> tag, and the headline have been printed, the contents
2146 of the page are converted by invoking <literal>to_html</literal> on all
2147 children of the current node (which is the root node). Then, the footnotes are
2148 appended to this by telling the global <literal>store</literal> object to print
2149 the footnotes. Finally, the end tags of the HTML pages are printed.
2150 </para>
2151
2152           <para>
This class is an example of how to access the value of an attribute: The value is
determined by invoking <literal>self # node # attribute "title"</literal>. As
this attribute has been declared as CDATA and as being required, the value
always has the form <literal>Value s</literal> where <literal>s</literal> is the
2157 string value of the attribute.
2158 </para>
2159
2160           <para>
2161 You can also see how entity contents can be accessed. A parameter entity object
2162 can be looked up by <literal>self # node # dtd # par_entity "name"</literal>,
2163 and by invoking <literal>replacement_text</literal> the value of the entity
2164 is returned after inner parameter and character entities have been
2165 processed. Note that you must use <literal>gen_entity</literal> instead of
2166 <literal>par_entity</literal> to access general entities.
2167 </para>
2168
2169           <para>
2170 <programlisting>&readme.code.readme;</programlisting>
2171 </para>
2172         </sect2>
2173
2174         <sect2>
2175           <title>Classes <literal>section</literal>, <literal>sect1</literal>,
2176 <literal>sect2</literal>, and <literal>sect3</literal></title>
2177           <para>
2178 As the conversion process is very similar, the conversion classes of the three
2179 section levels are derived from the more general <literal>section</literal>
2180 class. The HTML code of the section levels only differs in the type of the
2181 headline, and because of this the classes describing the section levels can be
2182 computed by replacing the class argument <literal>the_tag</literal> of
2183 <literal>section</literal> by the HTML name of the headline tag.
2184 </para>
2185
2186           <para>
2187 Section elements are converted to HTML by printing a headline and then
2188 converting the contents of the element recursively. More precisely, the first
2189 sub-element is always a <literal>title</literal> element, and the other
2190 elements are the contents of the section. This structure is declared in the
2191 DTD, and it is guaranteed that the document matches the DTD. Because of this
2192 the title node can be separated from the rest without any checks.
2193 </para>
2194
2195           <para>
Both the title node and the body nodes are then converted to HTML by calling
2197 <literal>to_html</literal> on them.
2198 </para>
2199
2200           <para>
2201 <programlisting>&readme.code.section;</programlisting>
2202 </para>
2203         </sect2>
2204
2205         <sect2>
2206           <title>Classes <literal>map_tag</literal>, <literal>p</literal>,
2207 <literal>em</literal>, <literal>ul</literal>, <literal>li</literal></title>
2208           <para>
2209 Several element types are converted to HTML by simply mapping them to
2210 corresponding HTML element types. The class <literal>map_tag</literal>
2211 implements this, and the class argument <literal>the_target_tag</literal>
2212 determines the tag name to map to. The output consists of the start tag, the
2213 recursively converted inner elements, and the end tag.
2214
2215 <programlisting>&readme.code.map-tag;</programlisting>
2216 </para>
2217         </sect2>
2218
2219         <sect2>
2220           <title>Class <literal>br</literal></title>
2221           <para>
Elements of type <literal>br</literal> are mapped to the same HTML type. Note
2223 that HTML forbids the end tag of <literal>br</literal>.
2224
2225 <programlisting>&readme.code.br;</programlisting>
2226 </para>
2227         </sect2>
2228
2229         <sect2>
2230           <title>Class <literal>code</literal></title>
2231           <para>
2232 The <literal>code</literal> type is converted to a <literal>pre</literal>
2233 section (preformatted text). As the meaning of tabs is unspecified in HTML,
2234 tabs are expanded to spaces.
2235
2236 <programlisting>&readme.code.code;</programlisting>
2237 </para>
2238         </sect2>
2239
2240         <sect2>
2241           <title>Class <literal>a</literal></title>
2242           <para>
2243 Hyperlinks, expressed by the <literal>a</literal> element type, are converted
2244 to the HTML <literal>a</literal> type. If the target of the hyperlink is given
2245 by <literal>href</literal>, the URL of this attribute can be used
2246 directly. Alternatively, the target can be given by
2247 <literal>readmeref</literal> in which case the ".html" suffix must be added to
2248 the file name.
2249 </para>
2250
2251           <para>
2252 Note that within <literal>a</literal> only #PCDATA is allowed, so the contents
2253 can be converted directly by applying <literal>escape_html</literal> to the
2254 character data contents.
2255
2256 <programlisting>&readme.code.a;</programlisting>
2257 </para>
2258         </sect2>
2259
2260         <sect2>
2261           <title>Class <literal>footnote</literal></title>
2262           <para>
2263 The <literal>footnote</literal> class has two methods:
2264 <literal>to_html</literal> to convert the footnote reference to HTML, and
2265 <literal>footnote_to_html</literal> to convert the footnote text itself.
2266 </para>
2267
2268           <para>
2269 The footnote reference is converted to a local hyperlink; more precisely, to
2270 two anchor tags which are connected with each other. The text anchor points to
2271 the footnote anchor, and the footnote anchor points to the text anchor.
2272 </para>
2273
2274           <para>
2275 The footnote must be allocated in the <literal>store</literal> object. By
2276 allocating the footnote, you get the number of the footnote, and the text of
2277 the footnote is stored until the end of the HTML page is reached when the
footnotes can be printed. The <literal>to_html</literal> method simply stores
2279 the object itself, such that the <literal>footnote_to_html</literal> method is
2280 invoked on the same object that encountered the footnote.
2281 </para>
2282
2283           <para>
The <literal>to_html</literal> method only allocates the footnote and prints
the reference anchor; it neither prints nor converts the contents of the
note. This is deferred until the footnotes actually get printed, i.e. the
recursive call of <literal>to_html</literal> on the sub nodes is done by
<literal>footnote_to_html</literal>.
2289 </para>
2290
2291           <para>
2292 Note that this technique does not work if you make another footnote within a
2293 footnote; the second footnote gets allocated but not printed.
2294 </para>
2295
2296           <para>
2297 <programlisting>&readme.code.footnote;</programlisting>
2298 </para>
2299         </sect2>
2300
2301         <sect2>
2302           <title>The specification of the document model</title>
2303           <para>
2304 This code sets up the hash table that connects element types with the exemplars
2305 of the extension classes that convert the elements to HTML.
2306
2307 <programlisting>&readme.code.tag-map;</programlisting>
2308 </para>
2309         </sect2>
2310
2311 <!-- <![RCDATA[&readme.code.to-html;]]> -->
2312       </sect1>
2313
2314     </chapter>
2315
2316 <!-- ********************************************************************** -->
2317
2318     <chapter>
2319       <title>The objects representing the document</title>
2320
2321       <para>
2322 <emphasis>This description might be out-of-date. See the module interface files
2323 for updated information.</emphasis></para>
2324
2325       <sect1>
2326         <title>The <literal>document</literal> class</title>
2327         <para>
2328 <programlisting>
2329 <![CDATA[
2330 class [ 'ext ] document :
2331   Pxp_types.collect_warnings ->
2332   object
2333     method init_xml_version : string -> unit
2334     method init_root : 'ext node -> unit
2335
2336     method xml_version : string
2337     method xml_standalone : bool
2338     method dtd : dtd
2339     method root : 'ext node
2340
2341     method encoding : Pxp_types.rep_encoding
2342
2343     method add_pinstr : proc_instruction -> unit
2344     method pinstr : string -> proc_instruction list
2345     method pinstr_names : string list
2346
2347     method write : Pxp_types.output_stream -> Pxp_types.encoding -> unit
2348
2349   end
2350 ;;
2351 ]]>
2352 </programlisting>
2353
2354 The methods beginning with <literal>init_</literal> are only for internal use
2355 of the parser.
2356 </para>
2357
2358         <itemizedlist mark="bullet" spacing="compact">
2359           <listitem>
2360             <para>
2361 <literal>xml_version</literal>: returns the version string at the beginning of
2362 the document. For example, "1.0" is returned if the document begins with
2363 <literal>&lt;?xml version="1.0"?&gt;</literal>.</para>
2364           </listitem>
2365           <listitem>
2366             <para>
<literal>xml_standalone</literal>: returns the boolean value of the
<literal>standalone</literal> declaration in the XML declaration. If the
<literal>standalone</literal> attribute is missing, <literal>false</literal> is
returned. </para>
2371           </listitem>
2372           <listitem>
2373             <para>
2374 <literal>dtd</literal>: returns a reference to the global DTD object.</para>
2375           </listitem>
2376           <listitem>
2377             <para>
2378 <literal>root</literal>: returns a reference to the root element.</para>
2379           </listitem>
2380           <listitem>
2381             <para>
2382 <literal>encoding</literal>: returns the internal encoding of the
2383 document. This means that all strings of which the document consists are
2384 encoded in this character set.
2385 </para>
2386           </listitem>
2387           <listitem>
2388             <para>
2389 <literal>pinstr</literal>: returns the processing instructions outside the DTD
2390 and outside the root element. The argument passed to the method names a
2391 <emphasis>target</emphasis>, and the method returns all instructions with this
2392 target. The target is the first word inside <literal>&lt;?</literal> and
2393 <literal>?&gt;</literal>.</para>
2394           </listitem>
2395           <listitem>
2396             <para>
<literal>pinstr_names</literal>: returns the names (i.e. the targets) of the
processing instructions.</para>
2398           </listitem>
2399           <listitem>
2400             <para>
2401 <literal>add_pinstr</literal>: adds another processing instruction. This method
2402 is used by the parser itself to enter the instructions returned by
2403 <literal>pinstr</literal>, but you can also enter additional instructions.
2404 </para>
2405           </listitem>
2406           <listitem>
2407             <para>
2408 <literal>write</literal>: writes the document to the passed stream as XML
2409 text using the passed (external) encoding. The generated text is always valid
2410 XML and can be parsed by PXP; however, the text is badly formatted (this is not
2411 a pretty printer).</para>
2412           </listitem>
2413         </itemizedlist>
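        <para>
As a minimal sketch of how these observers might be combined, the following
function builds a one-line summary of a document object. The name
<literal>describe_document</literal> is hypothetical and not part of PXP; the
function uses only the methods <literal>xml_version</literal>,
<literal>xml_standalone</literal> and <literal>pinstr_names</literal>, and it
assumes that their signatures are still as listed above.

<programlisting>
<![CDATA[
(* Hypothetical helper, not part of PXP. It accepts any object that has
   the three observer methods with the types shown in the class listing. *)
let describe_document doc =
  Printf.sprintf "XML %s, standalone: %b, PI targets: [%s]"
    doc#xml_version
    doc#xml_standalone
    (String.concat "; " doc#pinstr_names)
]]>
</programlisting>
</para>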
2414       </sect1>
2415
2416 <!-- ********************************************************************** -->
2417
2418       <sect1>
2419         <title>The class type <literal>node</literal></title>
2420         <para>
2421
2422 From <literal>Pxp_document</literal>:
2423
2424 <programlisting>
2425 type node_type =
2426   T_data
2427 | T_element of string
2428 | T_super_root
2429 | T_pinstr of string
2430 | T_comment
2431 <replaceable>and some other, reserved types</replaceable>
2432 ;;
2433
2434 class type [ 'ext ] node =
2435   object ('self)
2436     constraint 'ext = 'ext node #extension
2437
2438     <anchor id="type-node-general.sig"
2439    >(* <link linkend="type-node-general" endterm="type-node-general.title"
2440        ></link> *)
2441
2442     method extension : 'ext
2443     method dtd : dtd
2444     method parent : 'ext node
2445     method root : 'ext node
2446     method sub_nodes : 'ext node list
2447     method iter_nodes : ('ext node &fun; unit) &fun; unit
2448     method iter_nodes_sibl :
2449            ('ext node option &fun; 'ext node &fun; 'ext node option &fun; unit) &fun; unit
2450     method node_type : node_type
2451     method encoding : Pxp_types.rep_encoding
2452     method data : string
2453     method position : (string * int * int)
2454     method comment : string option
2455     method pinstr : string &fun; proc_instruction list
2456     method pinstr_names : string list
2457     method write : Pxp_types.output_stream -> Pxp_types.encoding -> unit
2458
2459     <anchor id="type-node-atts.sig"
2460    >(* <link linkend="type-node-atts" endterm="type-node-atts.title"
2461        ></link> *)
2462
2463     method attribute : string &fun; Pxp_types.att_value
2464     method required_string_attribute : string &fun; string
2465     method optional_string_attribute : string &fun; string option
2466     method required_list_attribute : string &fun; string list
2467     method optional_list_attribute : string &fun; string list
2468     method attribute_names : string list
2469     method attribute_type : string &fun; Pxp_types.att_type
2470     method attributes : (string * Pxp_types.att_value) list
2471     method id_attribute_name : string
2472     method id_attribute_value : string
    method idref_attribute_names : string list
2474
2475     <anchor id="type-node-mods.sig"
2476    >(* <link linkend="type-node-mods" endterm="type-node-mods.title"
2477        ></link> *)
2478
2479     method add_node : ?force:bool &fun; 'ext node &fun; unit
2480     method add_pinstr : proc_instruction &fun; unit
2481     method delete : unit
2482     method set_nodes : 'ext node list &fun; unit
2483     method quick_set_attributes : (string * Pxp_types.att_value) list &fun; unit
2484     method set_comment : string option &fun; unit
2485
2486     <anchor id="type-node-cloning.sig"
2487    >(* <link linkend="type-node-cloning" endterm="type-node-cloning.title"
2488        ></link> *)
2489
2490     method orphaned_clone : 'self
2491     method orphaned_flat_clone : 'self
2492     method create_element :
2493               ?position:(string * int * int) &fun;
2494               dtd &fun; node_type &fun; (string * string) list &fun;
2495                   'ext node
2496     method create_data : dtd &fun; string &fun; 'ext node
2497     method keep_always_whitespace_mode : unit
2498
2499     <anchor id="type-node-weird.sig"
2500    >(* <link linkend="type-node-weird" endterm="type-node-weird.title"
2501        ></link> *)
2502
2503     method local_validate : ?use_dfa:bool -> unit -> unit
2504
2505     (* ... Internal methods are undocumented. *)
2506
2507   end
2508 ;;
2509 </programlisting>
2510
2511 In the module <literal>Pxp_types</literal> you can find another type
2512 definition that is important in this context:
2513
2514 <programlisting>
2515 type Pxp_types.att_value =
2516     Value     of string
2517   | Valuelist of string list
2518   | Implied_value
2519 ;;
2520 </programlisting>
2521 </para>
2522
2523         <sect2>
2524           <title>The structure of document trees</title>
2525
2526 <para>
2527 A node represents either an element or a character data section. There are two
2528 classes implementing the two aspects of nodes: <literal>element_impl</literal>
2529 and <literal>data_impl</literal>. The latter class does not implement all
2530 methods because some methods do not make sense for data nodes.
2531 </para>
2532
2533 <para>
(Note: PXP also supports a mode in which processing instructions and
comments are represented as nodes of the document tree. However, these nodes
2536 are instances of <literal>element_impl</literal> with node types
2537 <literal>T_pinstr</literal> and <literal>T_comment</literal>,
2538 respectively. This mode must be explicitly configured; the basic representation
2539 knows only element and data nodes.)
2540 </para>
2541
2542         <para>The following figure
(<link linkend="node-term" endterm="node-term"></link>) shows an example of how
2544 a tree is constructed from element and data nodes. The circular areas
2545 represent element nodes whereas the ovals denote data nodes. Only elements
2546 may have subnodes; data nodes are always leaves of the tree. The subnodes
2547 of an element can be either element or data nodes; in both cases the O'Caml
2548 objects storing the nodes have the class type <literal>node</literal>.</para>
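
        <para>
Because element and data nodes share the class type <literal>node</literal>, a
recursive traversal needs no case distinction between them. The following is
only a sketch; the function name <literal>count_nodes</literal> is
hypothetical, and the code relies solely on <literal>sub_nodes</literal>
returning the empty list for data nodes.

<programlisting>
<![CDATA[
(* Hypothetical helper: counts the node itself plus all nodes below it.
   Data nodes have no subnodes, so each of them contributes exactly 1. *)
let rec count_nodes n =
  List.fold_left (fun acc child -> acc + count_nodes child) 1 n#sub_nodes
]]>
</programlisting>
</para>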
2549
2550         <para>Attributes (the clouds in the picture) are not directly
2551 integrated into the tree; there is always an extra link to the attribute
2552 list. This is also true for processing instructions (not shown in the
picture). This means that there are separate access methods for attributes and
2554 processing instructions.</para>
2555
2556 <figure id="node-term" float="1">
2557 <title>A tree with element nodes, data nodes, and attributes</title>
2558 <graphic fileref="pic/node_term" format="GIF"></graphic>
2559 </figure>
2560
2561         <para>Only elements, data sections, attributes and processing
2562 instructions (and comments, if configured) can, directly or indirectly, occur
2563 in the document tree. It is impossible to add entity references to the tree; if
2564 the parser finds such a reference, not the reference as such but the referenced
2565 text (i.e. the tree representing the structured text) is included in the
2566 tree.</para>
2567
2568         <para>Note that the parser collapses as much data material into one
2569 data node as possible such that there are normally never two adjacent data
2570 nodes. This invariant is enforced even if data material is included by entity
2571 references or CDATA sections, or if a data sequence is interrupted by
comments. So <literal>a &amp;amp; b &lt;!-- comment --&gt; c &lt;![CDATA[
&lt;&gt; d]]&gt;</literal> is represented by only one data node, for
instance. However, this is only the way the parser forms the tree; you can
manually create document trees that break this invariant.
2576 </para>
2577
2578 <figure id="node-general" float="1">
2579 <title>Nodes are doubly linked trees</title>
2580 <graphic fileref="pic/node_general" format="GIF"></graphic>
2581 </figure>
2582
2583         <para>
2584 The node tree has links in both directions: Every node has a link to its parent
2585 (if any), and it has links to the subnodes (see
2586 figure <link linkend="node-general" endterm="node-general"></link>). Obviously,
this doubly-linked structure simplifies navigation in the tree, but it also
has some consequences for the possible operations on trees.</para>
2589
2590         <para>
2591 Because every node must have at most <emphasis>one</emphasis> parent node,
2592 operations are illegal if they violate this condition. The following figure
2593 (<link linkend="node-add" endterm="node-add"></link>) shows on the left side
2594 that node <literal>y</literal> is added to <literal>x</literal> as new subnode
2595 which is allowed because <literal>y</literal> does not have a parent yet. The
2596 right side of the picture illustrates what would happen if <literal>y</literal>
2597 had a parent node; this is illegal because <literal>y</literal> would have two
2598 parents after the operation.</para>
2599
2600 <figure id="node-add" float="1">
2601 <title>A node can only be added if it is a root</title>
2602 <graphic fileref="pic/node_add" format="GIF">
2603 </graphic>
2604 </figure>
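
        <para>
Since only roots may be added, it can be useful to test this condition
beforehand. The following sketch assumes that <literal>parent</literal> raises
<literal>Not_found</literal> for a node without parent, as described for that
method; the name <literal>is_root</literal> is hypothetical.

<programlisting>
<![CDATA[
(* Hypothetical helper: a node counts as a root exactly when asking for
   its parent raises Not_found. *)
let is_root n =
  match n#parent with
  | _ -> false
  | exception Not_found -> true
]]>
</programlisting>
</para>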
2605
2606         <para>
2607 The "delete" operation simply removes the links between two nodes. In the
2608 picture (<link linkend="node-delete" endterm="node-delete"></link>) the node
2609 <literal>x</literal> is deleted from the list of subnodes of
2610 <literal>y</literal>. After that, <literal>x</literal> becomes the root of the
2611 subtree starting at this node.</para>
2612
2613 <figure id="node-delete" float="1">
2614 <title>A deleted node becomes the root of the subtree</title>
2615 <graphic fileref="pic/node_delete" format="GIF"></graphic>
2616 </figure>
2617
2618         <para>
It is also possible to make a clone of a subtree; this is illustrated in
2620 <link linkend="node-clone" endterm="node-clone"></link>. In this case, the
2621 clone is a copy of the original subtree except that it is no longer a
2622 subnode. Because cloning never keeps the connection to the parent, the clones
2623 are called <emphasis>orphaned</emphasis>.
2624 </para>
2625
2626 <figure id="node-clone" float="1">
2627 <title>The clone of a subtree</title>
2628 <graphic fileref="pic/node_clone" format="GIF"></graphic>
2629 </figure>
2630         </sect2>
2631
2632         <sect2>
2633           <title>The methods of the class type <literal>node</literal></title>
2634
2635           <anchor id="type-node-general">
2636           <formalpara>
2637             <title id="type-node-general.title">
2638               <link linkend="type-node-general.sig">General observers</link>
2639             </title>
2640
2641             <para>
2642               <itemizedlist mark="bullet" spacing="compact">
2643                 <listitem>
2644                   <para>
2645 <literal>extension</literal>: The reference to the extension object which
2646 belongs to this node (see ...).</para>
2647                 </listitem>
2648                 <listitem>
2649                   <para>
2650 <literal>dtd</literal>: Returns a reference to the global DTD. All nodes
2651 of a tree must share the same DTD.
2652 </para>
2653                 </listitem>
2654                 <listitem>
2655                   <para>
<literal>parent</literal>: Returns the parent node. Raises
<literal>Not_found</literal> if the node does not have a
parent, i.e. the node is the root.</para>
2659                 </listitem>
2660                 <listitem>
2661                   <para>
2662 <literal>root</literal>: Gets the reference to the root node of the tree.
2663 Every node is contained in a tree with a root, so this method always
2664 succeeds. Note that this method <emphasis>searches</emphasis> the root,
2665 which costs time proportional to the length of the path to the root.
2666 </para>
2667                 </listitem>
2668                 <listitem>
2669                   <para>
2670 <literal>sub_nodes</literal>: Returns references to the children. The returned
2671 list reflects the order of the children. For data nodes, this method returns
2672 the empty list.
2673 </para>
2674                 </listitem>
2675                 <listitem>
2676                   <para>
2677 <literal>iter_nodes f</literal>: Iterates over the children, and calls
2678 <literal>f</literal> for every child in turn.
2679 </para>
2680                 </listitem>
2681                 <listitem>
2682                   <para>
2683 <literal>iter_nodes_sibl f</literal>: Iterates over the children, and calls
2684 <literal>f</literal> for every child in turn. <literal>f</literal> gets as
2685 arguments the previous node, the current node, and the next node.</para>
2686                 </listitem>
2687                 <listitem>
2688                   <para>
2689 <literal>node_type</literal>: Returns either <literal>T_data</literal> which
2690 means that the node is a data node, or <literal>T_element n</literal>
2691 which means that the node is an element of type <literal>n</literal>.
2692 If configured, possible node types are also <literal>T_pinstr t</literal>
2693 indicating that the node represents a processing instruction with target
2694 <literal>t</literal>, and <literal>T_comment</literal> in which case the node
2695 is a comment.
2696 </para>
2697                 </listitem>
2698                 <listitem>
2699                   <para>
2700 <literal>encoding</literal>: Returns the encoding of the strings.</para>
2701                 </listitem>
2702                 <listitem>
2703                   <para>
2704 <literal>data</literal>: Returns the character data of this node and all
2705 children, concatenated as one string. The encoding of the string is what
2706 the method <literal>encoding</literal> returns.
2707 - For data nodes, this method simply returns the represented characters.
2708 For elements, the meaning of the method has been extended such that it
2709 returns something useful, i.e. the effectively contained characters, without
2710 markup. (For <literal>T_pinstr</literal> and <literal>T_comment</literal>
2711 nodes, the method returns the empty string.)
2712 </para>
2713                 </listitem>
2714                 <listitem>
2715                   <para>
<literal>position</literal>: If configured, this method returns the position of
the element as a triple (entity, line, byteposition). For data nodes, the
position is not stored. If the position is not available, the triple
<literal>("?", 0, 0)</literal> is returned.
2720 </para>
2721                 </listitem>
2722                 <listitem>
2723                   <para>
2724 <literal>comment</literal>: Returns <literal>Some text</literal> for comment
2725 nodes, and <literal>None</literal> for other nodes. The <literal>text</literal>
is everything between the comment delimiters <literal>&lt;!--</literal> and
2727 <literal>--&gt;</literal>.
2728 </para>
2729                 </listitem>
2730                 <listitem>
2731                   <para>
2732 <literal>pinstr n</literal>: Returns all processing instructions that are
2733 directly contained in this element and that have a <emphasis>target</emphasis>
2734 specification of <literal>n</literal>. The target is the first word after
2735 the <literal>&lt;?</literal>.
2736 </para>
2737                 </listitem>
2738                 <listitem>
2739                   <para>
2740 <literal>pinstr_names</literal>: Returns the list of all targets of processing
2741 instructions directly contained in this element.</para>
2742                 </listitem>
2743                 <listitem>
2744                   <para>
2745 <literal>write s enc</literal>: Prints the node and all subnodes to the passed
2746 output stream as valid XML text, using the passed external encoding.
2747 </para>
2748                 </listitem>
2749               </itemizedlist>
2750             </para>
2751           </formalpara>
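
          <para>
A small sketch of <literal>iter_nodes_sibl</literal>: the function below
concatenates the <literal>data</literal> strings of all children and puts a
separator between them. The name <literal>join_children_data</literal> is
hypothetical, and the code assumes that the third argument passed to the
callback is <literal>None</literal> exactly for the last child.

<programlisting>
<![CDATA[
(* Hypothetical helper. The separator is emitted only if a next sibling
   exists, so there is no trailing separator. *)
let join_children_data sep n =
  let buf = Buffer.create 64 in
  let () =
    n#iter_nodes_sibl
      (fun _prev cur next ->
         Buffer.add_string buf cur#data;
         match next with
         | Some _ -> Buffer.add_string buf sep
         | None -> ())
  in
  Buffer.contents buf
]]>
</programlisting>
</para>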
2752
2753           <anchor id="type-node-atts">
2754           <formalpara>
2755             <title id="type-node-atts.title">
2756               <link linkend="type-node-atts.sig">Attribute observers</link>
2757             </title>
2758             <para>
2759               <itemizedlist mark="bullet" spacing="compact">
2760                 <listitem>
2761                   <para>
2762 <literal>attribute n</literal>: Returns the value of the attribute with name
2763 <literal>n</literal>. This method returns a value for every declared
2764 attribute, and it raises <literal>Not_found</literal> for any undeclared
2765 attribute. Note that it even returns a value if the attribute is actually
2766 missing but is declared as <literal>#IMPLIED</literal> or has a default
2767 value. - Possible values are:
2768                   <itemizedlist mark="bullet" spacing="compact">
2769                       <listitem>
2770                         <para>
2771 <literal>Implied_value</literal>: The attribute has been declared with the
2772 keyword <literal>#IMPLIED</literal>, and the attribute is missing in the
2773 attribute list of this element.</para>
2774                       </listitem>
2775                       <listitem>
2776                         <para>
2777 <literal>Value s</literal>: The attribute has been declared as type
2778 <literal>CDATA</literal>, as <literal>ID</literal>, as
2779 <literal>IDREF</literal>, as <literal>ENTITY</literal>, or as
2780 <literal>NMTOKEN</literal>, or as enumeration or notation, and one of the two
2781 conditions holds: (1) The attribute value is present in the attribute list in
2782 which case the value is returned in the string <literal>s</literal>. (2) The
2783 attribute has been omitted, and the DTD declared the attribute with a default
2784 value. The default value is returned in <literal>s</literal>.
2785 - Summarized, <literal>Value s</literal> is returned for non-implied, non-list
2786 attribute values.
2787 </para>
2788                       </listitem>
2789                       <listitem>
2790                         <para>
2791 <literal>Valuelist l</literal>: The attribute has been declared as type
2792 <literal>IDREFS</literal>, as <literal>ENTITIES</literal>, or
2793 as <literal>NMTOKENS</literal>, and one of the two conditions holds: (1) The
2794 attribute value is present in the attribute list in which case the
2795 space-separated tokens of the value are returned in the string list
2796 <literal>l</literal>. (2) The attribute has been omitted, and the DTD declared
2797 the attribute with a default value. The default value is returned in
2798 <literal>l</literal>.
2799 - Summarized, <literal>Valuelist l</literal> is returned for all list-type
2800 attribute values.
2801 </para>
2802                       </listitem>
2803                     </itemizedlist>
2804
2805 Note that before the attribute value is returned, the value is normalized. This
2806 means that newlines are converted to spaces, and that references to character
2807 entities (i.e. <literal>&amp;#<replaceable>n</replaceable>;</literal>) and
2808 general entities
2809 (i.e. <literal>&amp;<replaceable>name</replaceable>;</literal>) are expanded;
2810 if necessary, expansion is performed recursively.
2811 </para>
2812
2813 <para>
2814 In well-formedness mode, there is no DTD which could declare an
attribute. Because of this, every occurring attribute is considered as a CDATA
2816 attribute.
2817 </para>
2818                 </listitem>
2819                 <listitem>
2820                   <para>
2821 <literal>required_string_attribute n</literal>: returns the Value attribute
2822 called n, or the Valuelist attribute as a string where the list elements
2823 are separated by spaces. If the attribute value is implied, or if the
attribute does not exist, the method will fail. - This method is convenient
2825 if you expect a non-implied and non-list attribute value.
2826 </para>
2827                 </listitem>
2828                 <listitem>
2829                   <para>
2830 <literal>optional_string_attribute n</literal>: returns the Value attribute
2831 called n, or the Valuelist attribute as a string where the list elements
2832 are separated by spaces. If the attribute value is implied, or if the
attribute does not exist, the method returns None. - This method is
2834 convenient if you expect a non-list attribute value including the implied
2835 value.
2836 </para>
2837                 </listitem>
2838                 <listitem>
2839                   <para>
2840 <literal>required_list_attribute n</literal>: returns the Valuelist attribute
2841 called n, or the Value attribute as a list with a single element.
2842 If the attribute value is implied, or if the
attribute does not exist, the method will fail. - This method is
2844 convenient if you expect a list attribute value.
2845 </para>
2846                 </listitem>
2847                 <listitem>
2848                   <para>
2849 <literal>optional_list_attribute n</literal>: returns the Valuelist attribute
2850 called n, or the Value attribute as a list with a single element.
2851 If the attribute value is implied, or if the
attribute does not exist, an empty list will be returned. - This method
2853 is convenient if you expect a list attribute value or the implied value.
2854 </para>
2855                 </listitem>
2856                 <listitem>
2857                   <para>
2858 <literal>attribute_names</literal>: returns the list of all attribute names of
2859 this element. As this is a validating parser, this list is equal to the
2860 list of declared attributes.
2861 </para>
2862                 </listitem>
2863                 <listitem>
2864                   <para>
2865 <literal>attribute_type n</literal>: returns the type of the attribute called
2866 <literal>n</literal>. See the module <literal>Pxp_types</literal> for a
2867 description of the encoding of the types.
2868 </para>
2869                 </listitem>
2870                 <listitem>
2871                   <para>
2872 <literal>attributes</literal>: returns the list of pairs of names and values
2873 for all attributes of
2874 this element.</para>
2875                 </listitem>
2876                 <listitem>
2877                   <para>
2878 <literal>id_attribute_name</literal>: returns the name of the attribute that is
2879 declared with type ID. There is at most one such attribute. The method raises
2880 <literal>Not_found</literal> if there is no declared ID attribute for the
2881 element type.</para>
2882                 </listitem>
2883                 <listitem>
2884                   <para>
2885 <literal>id_attribute_value</literal>: returns the value of the attribute that
2886 is declared with type ID. There is at most one such attribute. The method raises
2887 <literal>Not_found</literal> if there is no declared ID attribute for the
2888 element type.</para>
2889                 </listitem>
2890                 <listitem>
2891                   <para>
2892 <literal>idref_attribute_names</literal>: returns the list of attribute names
2893 that are declared as IDREF or IDREFS.</para>
2894                 </listitem>
2895               </itemizedlist>
2896           </para>
2897           </formalpara>
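
          <para>
The three cases of <literal>att_value</literal> can be handled by ordinary
pattern matching. The sketch below mirrors the behaviour described for
<literal>optional_string_attribute</literal>; it is not the implementation
used by PXP. The function name <literal>string_of_att_value</literal> is
hypothetical, and the type is repeated locally only to keep the example
self-contained.

<programlisting>
<![CDATA[
(* Local copy of Pxp_types.att_value as listed earlier in this chapter;
   with PXP installed, refer to the type in Pxp_types instead. *)
type att_value =
    Value     of string
  | Valuelist of string list
  | Implied_value

(* Hypothetical helper: list values are joined by single spaces, and the
   implied value is mapped to None. *)
let string_of_att_value v =
  match v with
  | Value s       -> Some s
  | Valuelist l   -> Some (String.concat " " l)
  | Implied_value -> None
]]>
</programlisting>
</para>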
2898
2899           <anchor id="type-node-mods">
2900           <formalpara>
2901             <title id="type-node-mods.title">
2902               <link linkend="type-node-mods.sig">Modifying methods</link>
2903             </title>
2904
2905             <para>
2906 The following methods are only defined for element nodes (more exactly:
the methods are defined for data nodes, too, but always fail).
2908
2909               <itemizedlist mark="bullet" spacing="compact">
2910                 <listitem>
2911                   <para>
2912 <literal>add_node sn</literal>: Adds sub node <literal>sn</literal> to the list
2913 of children. This operation is illustrated in the picture
2914 <link linkend="node-add" endterm="node-add"></link>. This method expects that
2915 <literal>sn</literal> is a root, and it requires that <literal>sn</literal> and
2916 the current object share the same DTD.
2917 </para>
2918
2919 <para>Because <literal>add_node</literal> is the method the parser itself uses
2920 to add new nodes to the tree, it performs by default some simple validation
2921 checks: If the content model is a regular expression, it is not allowed to add
2922 data nodes to this node unless the new nodes consist only of whitespace. In
2923 this case, the new data nodes are silently dropped (you can change this by
2924 invoking <literal>keep_always_whitespace_mode</literal>).
2925 </para>
2926
<para>If the document is flagged as stand-alone, even these whitespace-only
data nodes are forbidden if the element declaration is
contained in an external entity. This case is detected and rejected.</para>
2930
2931 <para>If the content model is <literal>EMPTY</literal>, it is not allowed to
2932 add any data node unless the data node is empty. In this case, the new data
2933 node is silently dropped.
2934 </para>
2935
<para>These checks only apply if there is a DTD. In well-formedness mode, it is
assumed that every element is declared with content model
<literal>ANY</literal>, which rules out any validation check. Furthermore, you
can turn these checks off by passing <literal>~force:true</literal> as first
argument.</para>
2941                 </listitem>
2942                 <listitem>
2943                   <para>
2944 <literal>add_pinstr pi</literal>: Adds the processing instruction
2945 <literal>pi</literal> to the list of processing instructions.
2946 </para>
2947                 </listitem>
2948
2949                 <listitem>
2950                   <para>
2951 <literal>delete</literal>: Deletes this node from the tree. After this
2952 operation, this node is no longer the child of the former father node; and the
2953 node loses the connection to the father as well. This operation is illustrated
2954 by the figure <link linkend="node-delete" endterm="node-delete"></link>.
2955 </para>
2956                 </listitem>
2957                 <listitem>
2958                   <para>
2959 <literal>set_nodes nl</literal>: Sets the list of children to
2960 <literal>nl</literal>. It is required that every member of <literal>nl</literal>
2961 is a root, and that all members and the current object share the same DTD.
2962 Unlike <literal>add_node</literal>, no validation checks are performed.
2963 </para>
2964               </listitem>
2965               <listitem>
2966                   <para>
2967 <literal>quick_set_attributes atts</literal>: sets the attributes of this
2968 element to <literal>atts</literal>. It is <emphasis>not</emphasis> checked
2969 whether <literal>atts</literal> matches the DTD or not; it is up to the
2970 caller of this method to ensure this. (This method may be useful to transform
2971 the attribute values, i.e. apply a mapping to every attribute.)
2972 </para>
2973                 </listitem>
2974                 <listitem>
2975                   <para>
2976 <literal>set_comment text</literal>: This method is only applicable to
2977 <literal>T_comment</literal> nodes; it sets the comment text contained by such
2978 nodes. </para>
2979                 </listitem>
2980               </itemizedlist>
2981 </para>
2982           </formalpara>
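
          <para>
The remark that <literal>quick_set_attributes</literal> may be used to apply
a mapping to every attribute can be sketched as follows. The name
<literal>map_attribute_values</literal> is hypothetical; the code only
combines <literal>attributes</literal> and
<literal>quick_set_attributes</literal>, so the caller remains responsible for
keeping the result consistent with the DTD.

<programlisting>
<![CDATA[
(* Hypothetical helper: replaces every attribute value v of node n by f v
   and leaves the attribute names untouched. No DTD check is performed. *)
let map_attribute_values f n =
  n#quick_set_attributes
    (List.map (fun (name, value) -> (name, f value)) n#attributes)

(* Possible use, assuming the constructors of Pxp_types.att_value:
     map_attribute_values
       (function
          | Pxp_types.Value s -> Pxp_types.Value (String.uppercase_ascii s)
          | other -> other)
       node
*)
]]>
</programlisting>
</para>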
2983
2984           <anchor id="type-node-cloning">
2985           <formalpara>
2986             <title id="type-node-cloning.title">
2987               <link linkend="type-node-cloning.sig">Cloning methods</link>
2988             </title>
2989
2990             <para>
2991               <itemizedlist mark="bullet" spacing="compact">
2992                 <listitem>
2993                   <para>
<literal>orphaned_clone</literal>: Returns a clone of the node and the complete
tree below this node (deep clone). The clone does not have a parent (i.e. the
reference to the parent node is <emphasis>not</emphasis> cloned). While
copying the subtree, strings are not duplicated; the original tree
and the copy are likely to share strings. Extension objects are cloned by invoking
the <literal>clone</literal> method on the original objects; how much of
the extension objects is cloned depends on the implementation of this method.
3001 </para>
3002                   <para>This operation is illustrated by the figure
3003 <link linkend="node-clone" endterm="node-clone"></link>.
3004 </para>
3005                 </listitem>
3006                 <listitem>
3007                   <para>
3008 <literal>orphaned_flat_clone</literal>: Returns a clone of the node,
3009 but sets the list of sub nodes to [], i.e. the sub nodes are not cloned.
3010 </para>
3011                 </listitem>
3012                 <listitem>
3013                   <para>
3014 <anchor id="type-node-meth-create-element">
3015 <literal>create_element dtd nt al</literal>: Returns a flat copy of this node
3016 (which must be an element) with the following modifications: The DTD is set to
3017 <literal>dtd</literal>; the node type is set to <literal>nt</literal>, and the
3018 new attribute list is set to <literal>al</literal> (given as list of
(name,value) pairs). The copy has neither children nor a parent, and it does not
contain processing instructions. See
3021 <link linkend="type-node-ex-create-element">the example below</link>.
3022 </para>
3023
3024                   <para>Note that you can specify the position of the new node
3025 by the optional argument <literal>~position</literal>.</para>
3026                 </listitem>
3027                 <listitem>
3028                   <para>
3029 <anchor id="type-node-meth-create-data">
3030 <literal>create_data dtd cdata</literal>: Returns a flat copy of this node
3031 (which must be a data node) with the following modifications: The DTD is set to
3032 <literal>dtd</literal>; the node type is set to <literal>T_data</literal>; the
3033 attribute list is empty (data nodes never have attributes); the list of
3034 children and PIs is empty, too (same reason). The new node does not have a
3035 parent. The value <literal>cdata</literal> is the new character content of the
3036 node. See
3037 <link linkend="type-node-ex-create-data">the example below</link>.
3038 </para>
3039                 </listitem>
3040                 <listitem>
3041                   <para>
<literal>keep_always_whitespace_mode</literal>: Once this mode is turned on,
even data nodes that would normally be dropped because they contain only
ignorable whitespace can be added to this node. (This mode is useful for
producing canonical XML.)
3046 </para>
3047                 </listitem>
3048               </itemizedlist>
3049 </para>
3050           </formalpara>
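          <para>The difference between the deep and the flat clone can be
sketched with a toy model. The record type <literal>toy_node</literal> and the
two functions below are illustrative stand-ins written for this example; they
are not the implementation used by &markup;, and they ignore DTDs, attributes
and extension objects.

<programlisting><![CDATA[
(* Toy model of a node; not the real representation. *)
type toy_node = {
  label : string;
  mutable children : toy_node list;
  mutable parent : toy_node option;
}

(* Deep clone: copies the whole subtree. The copy of the top node has no
   parent; the copied children point to the copied top node. *)
let rec orphaned_clone n =
  let copy = { label = n.label; children = []; parent = None } in
  let kids = List.map orphaned_clone n.children in
  List.iter (fun k -> k.parent <- Some copy) kids;
  copy.children <- kids;
  copy

(* Flat clone: copies only the node itself; no children, no parent. *)
let orphaned_flat_clone n =
  { label = n.label; children = []; parent = None }

let root = { label = "a"; children = []; parent = None }
let child = { label = "b"; children = []; parent = Some root }
let () = root.children <- [ child ]

let deep_root = orphaned_clone root       (* has one copied child *)
let deep_child = orphaned_clone child     (* has lost its parent *)
let flat_root = orphaned_flat_clone root  (* has no children *)
]]></programlisting>

In this sketch the label strings are copied by reference only, so the original
and the clone share them, which mirrors the remark about shared strings above.
</para>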
3051
3052           <anchor id="type-node-weird">
3053           <formalpara>
3054             <title id="type-node-weird.title">
3055               <link linkend="type-node-weird.sig">Validating methods</link>
3056             </title>
3057             <para>
3058 There is one method which locally validates the node, i.e. checks whether the
3059 subnodes match the content model of this node.
3060
3061               <itemizedlist mark="bullet" spacing="compact">
3062                 <listitem>
3063                   <para>
3064 <literal>local_validate</literal>: Checks that this node conforms to the
3065 DTD by comparing the type of the subnodes with the content model for this
3066 node. (Applications need not call this method unless they add new nodes
3067 themselves to the tree.)
3068 </para>
3069                 </listitem>
3070               </itemizedlist>
3071 </para>
3072           </formalpara>
3073         </sect2>
3074
3075         <sect2>
3076           <title>The class <literal>element_impl</literal></title>
3077           <para>
3078 This class is an implementation of <literal>node</literal> which
3079 realizes element nodes:
3080
3081 <programlisting>
3082 <![CDATA[
3083 class [ 'ext ] element_impl : 'ext -> [ 'ext ] node
3084 ]]>
3085 </programlisting>
3086
3087 </para>
3088           <formalpara>
3089             <title>Constructor</title>
3090             <para>
3091 You can create a new instance by
3092
3093 <programlisting>
3094 new element_impl <replaceable>extension_object</replaceable>
3095 </programlisting>
3096
3097 which creates a special form of empty element which already contains a
3098 reference to the <replaceable>extension_object</replaceable>, but is
3099 otherwise empty. This special form is called an
3100 <emphasis>exemplar</emphasis>. The purpose of exemplars is that they serve as
3101 patterns that can be duplicated and filled with data. The method
3102 <link linkend="type-node-meth-create-element">
3103 <literal>create_element</literal></link> is designed to perform this action.
3104 </para>
3105           </formalpara>
3106
3107           <anchor id="type-node-ex-create-element">
3108           <formalpara>
3109             <title>Example</title>
3110
3111             <para>First, create an exemplar by
3112
3113 <programlisting>
3114 let exemplar_ext = ... in
3115 let exemplar     = new element_impl exemplar_ext in
3116 </programlisting>
3117
3118 The <literal>exemplar</literal> is not used in node trees, but only as
3119 a pattern when the element nodes are created:
3120
3121 <programlisting>
3122 let element = exemplar # <link linkend="type-node-meth-create-element">create_element</link> dtd (T_element name) attlist
3123 </programlisting>
3124
3125 The <literal>element</literal> is a copy of <literal>exemplar</literal>
3126 (even the extension <literal>exemplar_ext</literal> has been copied)
3127 which ensures that <literal>element</literal> and its extension are objects
of the same class as the exemplars; note that you do not need to pass a
3129 class name or other meta information. The copy is initially connected
3130 with the <literal>dtd</literal>, it gets a node type, and the attribute list
3131 is filled. The <literal>element</literal> is now fully functional; it can
3132 be added to another element as child, and it can contain references to
3133 subnodes.
3134 </para>
3135           </formalpara>
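          <para>Why copying an exemplar preserves its class can be shown
with a small, self-contained sketch. The classes <literal>toy_element</literal>
and <literal>loud_element</literal> are hypothetical and only imitate the idea
behind <literal>create_element</literal>; the real method also takes a DTD and
a node type, and it copies the extension object as well.

<programlisting><![CDATA[
(* Toy stand-in for element_impl; the real class has many more methods. *)
class toy_element =
  object
    val name = ""
    val attributes = ([] : (string * string) list)
    method name = name
    method attributes = attributes
    (* Copy this object, replacing only the name and the attribute list.
       The copy has the same class as the object the method is called on. *)
    method create_element n al = {< name = n; attributes = al >}
  end

(* A subclass with an additional method. *)
class loud_element =
  object
    inherit toy_element
    method shout = String.uppercase_ascii name
  end

let exemplar = new loud_element
let e1 = exemplar # create_element "a" ["att", "apple"]
let e2 = exemplar # create_element "b" []

(* e1 and e2 are loud_element objects, so "shout" is available. *)
let shouted = e1 # shout
]]></programlisting>

No class name is passed to <literal>create_element</literal>; the functional
update <literal>{&lt; ... &gt;}</literal> duplicates whatever object it is
invoked on, and the exemplar itself stays unchanged.
</para>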
3136
3137         </sect2>
3138
3139         <sect2>
3140           <title>The class <literal>data_impl</literal></title>
3141           <para>
3142 This class is an implementation of <literal>node</literal> which
3143 should be used for all character data nodes:
3144
3145 <programlisting>
3146 <![CDATA[
3147 class [ 'ext ] data_impl : 'ext -> [ 'ext ] node
3148 ]]>
3149 </programlisting>
3150
3151 </para>
3152
3153           <formalpara>
3154             <title>Constructor</title>
3155             <para>
3156 You can create a new instance by
3157
3158 <programlisting>
3159 new data_impl <replaceable>extension_object</replaceable>
3160 </programlisting>
3161
3162 which creates an empty exemplar node which is connected to
3163 <replaceable>extension_object</replaceable>. The node does not contain a
3164 reference to any DTD, and because of this it cannot be added to node trees.
3165 </para>
3166           </formalpara>
3167
3168           <para>To get a fully working data node, apply the method
3169 <link linkend="type-node-meth-create-data"><literal>create_data</literal>
3170 </link> to the exemplar (see example).
3171 </para>
3172
3173           <anchor id="type-node-ex-create-data">
3174           <formalpara>
3175             <title>Example</title>
3176
3177             <para>First, create an exemplar by
3178
3179 <programlisting>
3180 let exemplar_ext = ... in
let exemplar     = new data_impl exemplar_ext in
3182 </programlisting>
3183
3184 The <literal>exemplar</literal> is not used in node trees, but only as
3185 a pattern when the data nodes are created:
3186
3187 <programlisting>
3188 let data_node = exemplar # <link
3189                                  linkend="type-node-meth-create-data">create_data</link> dtd "The characters contained in the data node"
3190 </programlisting>
3191
3192 The <literal>data_node</literal> is a copy of <literal>exemplar</literal>.
3193 The copy is initially connected
3194 with the <literal>dtd</literal>, and it is filled with character material.
3195 The <literal>data_node</literal> is now fully functional; it can
3196 be added to an element as child.
3197 </para>
3198           </formalpara>
3199         </sect2>
3200
3201         <sect2>
3202           <title>The type <literal>spec</literal></title>
3203           <para>
3204 The type <literal>spec</literal> defines a way to handle the details of
3205 creating nodes from exemplars.
3206
3207 <programlisting><![CDATA[
3208 type 'ext spec
3209 constraint 'ext = 'ext node #extension
3210
3211 val make_spec_from_mapping :
3212       ?super_root_exemplar : 'ext node ->
3213       ?comment_exemplar : 'ext node ->
3214       ?default_pinstr_exemplar : 'ext node ->
3215       ?pinstr_mapping : (string, 'ext node) Hashtbl.t ->
3216       data_exemplar: 'ext node ->
3217       default_element_exemplar: 'ext node ->
3218       element_mapping: (string, 'ext node) Hashtbl.t ->
3219       unit ->
3220         'ext spec
3221
3222 val make_spec_from_alist :
3223       ?super_root_exemplar : 'ext node ->
3224       ?comment_exemplar : 'ext node ->
3225       ?default_pinstr_exemplar : 'ext node ->
3226       ?pinstr_alist : (string * 'ext node) list ->
3227       data_exemplar: 'ext node ->
3228       default_element_exemplar: 'ext node ->
3229       element_alist: (string * 'ext node) list ->
3230       unit ->
3231         'ext spec
3232 ]]></programlisting>
3233
The two functions <literal>make_spec_from_mapping</literal> and
<literal>make_spec_from_alist</literal> create <literal>spec</literal>
values. They are functionally equivalent; the only difference is
that the first takes hash tables and the second association lists to
describe the mappings from names to exemplars.
3239 </para>
3240
3241 <para>
3242 You can specify exemplars for the various kinds of nodes that need to be
3243 generated when an XML document is parsed:
3244
3245 <itemizedlist mark="bullet" spacing="compact">
3246               <listitem>
3247                 <para><literal>~super_root_exemplar</literal>: This exemplar
3248 is used to create the super root. This special node is only created if the
3249 corresponding configuration option has been selected; it is the parent node of
3250 the root node which may be convenient if every working node must have a parent.</para>
3251               </listitem>
3252               <listitem>
3253                 <para><literal>~comment_exemplar</literal>: This exemplar is
3254 used when a comment node must be created. Note that such nodes are only created
3255 if the corresponding configuration option is "on".
3256 </para>
3257               </listitem>
3258               <listitem>
3259                 <para><literal>~default_pinstr_exemplar</literal>: If a node
3260 for a processing instruction must be created, and the instruction is not listed
3261 in the table passed by <literal>~pinstr_mapping</literal> or
3262 <literal>~pinstr_alist</literal>, this exemplar is used.
3263 Again the configuration option must be "on" in order to create such nodes at
3264 all.
3265 </para>
3266               </listitem>
3267               <listitem>
3268                 <para><literal>~pinstr_mapping</literal> or
3269 <literal>~pinstr_alist</literal>: Map the target names of processing
3270 instructions to exemplars. These mappings are only used when nodes for
3271 processing instructions are created.</para>
3272               </listitem>
3273               <listitem>
3274                 <para><literal>~data_exemplar</literal>: The exemplar for
3275 ordinary data nodes.</para>
3276               </listitem>
3277               <listitem>
3278                 <para><literal>~default_element_exemplar</literal>: This
3279 exemplar is used if an element node must be created, but the element type
3280 cannot be found in the tables <literal>element_mapping</literal> or
3281 <literal>element_alist</literal>.</para>
3282               </listitem>
3283               <listitem>
3284                 <para><literal>~element_mapping</literal> or
3285 <literal>~element_alist</literal>: Map the element types to exemplars. These
3286 mappings are used to create element nodes.</para>
3287               </listitem>
3288             </itemizedlist>
3289
3290 In most cases, you only want to create <literal>spec</literal> values to pass
3291 them to the parser functions found in <literal>Pxp_yacc</literal>. However, it
3292 might be useful to apply <literal>spec</literal> values directly.
3293 </para>
3294
3295 <para>The following functions create various types of nodes by selecting the
3296 corresponding exemplar from the passed <literal>spec</literal> value, and by
3297 calling <literal>create_element</literal> or <literal>create_data</literal> on
3298 the exemplar.
3299
3300 <programlisting><![CDATA[
3301 val create_data_node :
3302       'ext spec ->
3303       dtd ->
3304       (* data material: *) string ->
3305           'ext node
3306
3307 val create_element_node :
3308       ?position:(string * int * int) ->
3309       'ext spec ->
3310       dtd ->
3311       (* element type: *) string ->
3312       (* attributes: *) (string * string) list ->
3313           'ext node
3314
3315 val create_super_root_node :
3316       ?position:(string * int * int) ->
3317       'ext spec ->
3318        dtd ->
3319            'ext node
3320
3321 val create_comment_node :
3322       ?position:(string * int * int) ->
3323       'ext spec ->
3324       dtd ->
3325       (* comment text: *) string ->
3326           'ext node
3327
3328 val create_pinstr_node :
3329       ?position:(string * int * int) ->
3330       'ext spec ->
3331       dtd ->
3332       proc_instruction ->
3333           'ext node
3334 ]]></programlisting>
3335 </para>
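          <para>The selection step performed by these functions can be
sketched as follows. The types <literal>toy_node</literal> and
<literal>toy_spec</literal> and the functions <literal>make_toy_spec</literal>
and <literal>create_toy_element_node</literal> are invented for this example
and only model the lookup "specific exemplar if registered, otherwise the
default"; the real <literal>spec</literal> type is abstract.

<programlisting><![CDATA[
(* Toy node: made_from records which exemplar the node was copied from. *)
type toy_node = { made_from : string; name : string }

type toy_spec = {
  default_element_exemplar : toy_node;
  element_alist : (string * toy_node) list;
}

let make_toy_spec ~default_element_exemplar ~element_alist () =
  { default_element_exemplar; element_alist }

(* Pick the exemplar registered for the element type, fall back to the
   default exemplar otherwise, and return a filled-in copy of it. *)
let create_toy_element_node spec element_type =
  let exemplar =
    match List.assoc_opt element_type spec.element_alist with
    | Some ex -> ex
    | None -> spec.default_element_exemplar
  in
  { exemplar with name = element_type }

let spec =
  make_toy_spec
    ~default_element_exemplar:{ made_from = "default"; name = "" }
    ~element_alist:[ "p", { made_from = "p-exemplar"; name = "" } ]
    ()

let p = create_toy_element_node spec "p"   (* copied from "p-exemplar" *)
let q = create_toy_element_node spec "q"   (* copied from "default" *)
]]></programlisting>
</para>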
3336         </sect2>
3337
3338         <sect2>
3339           <title>Examples</title>
3340
3341           <formalpara>
3342             <title>Building trees.</title>
3343
3344             <para>Here is the piece of code that creates the tree of
3345 the figure <link linkend="node-term" endterm="node-term"></link>. The extension
3346 object and the DTD are beyond the scope of this example.
3347
3348 <programlisting>
3349 let exemplar_ext = ... (* some extension *) in
3350 let dtd = ... (* some DTD *) in
3351
3352 let element_exemplar = new element_impl exemplar_ext in
3353 let data_exemplar    = new data_impl    exemplar_ext in
3354
3355 let a1 = element_exemplar # create_element dtd (T_element "a") ["att", "apple"]
3356 and b1 = element_exemplar # create_element dtd (T_element "b") []
3357 and c1 = element_exemplar # create_element dtd (T_element "c") []
3358 and a2 = element_exemplar # create_element dtd (T_element "a") ["att", "orange"]
3359 in
3360
3361 let cherries = data_exemplar # create_data dtd "Cherries" in
3362 let orange   = data_exemplar # create_data dtd "An orange" in
3363
3364 a1 # add_node b1;
3365 a1 # add_node c1;
3366 b1 # add_node a2;
3367 b1 # add_node cherries;
3368 a2 # add_node orange;
3369 </programlisting>
3370
3371 Alternatively, the last block of statements could also be written as:
3372
3373 <programlisting>
3374 a1 # set_nodes [b1; c1];
3375 b1 # set_nodes [a2; cherries];
3376 a2 # set_nodes [orange];
3377 </programlisting>
3378
3379 The root of the tree is <literal>a1</literal>, i.e. it is true that
3380
3381 <programlisting>
3382 x # root == a1
3383 </programlisting>
3384
3385 for every x from { <literal>a1</literal>, <literal>a2</literal>,
3386 <literal>b1</literal>, <literal>c1</literal>, <literal>cherries</literal>,
3387 <literal>orange</literal> }.
3388 </para>
3389           </formalpara>
3390           <para>
3391 Furthermore, the following properties hold:
3392
3393 <programlisting>
3394   a1 # attribute "att" = Value "apple"
3395 & a2 # attribute "att" = Value "orange"
3396
3397 & cherries # data = "Cherries"
3398 &   orange # data = "An orange"
&       a1 # data = "An orangeCherries"
3400
3401 &       a1 # node_type = T_element "a"
3402 &       a2 # node_type = T_element "a"
3403 &       b1 # node_type = T_element "b"
3404 &       c1 # node_type = T_element "c"
3405 & cherries # node_type = T_data
3406 &   orange # node_type = T_data
3407
3408 &       a1 # sub_nodes = [ b1; c1 ]
3409 &       a2 # sub_nodes = [ orange ]
3410 &       b1 # sub_nodes = [ a2; cherries ]
3411 &       c1 # sub_nodes = []
3412 & cherries # sub_nodes = []
3413 &   orange # sub_nodes = []
3414
&       a2 # parent == b1
&       b1 # parent == a1
3417 &       c1 # parent == a1
3418 & cherries # parent == b1
3419 &   orange # parent == a2
3420 </programlisting>
3421 </para>
3422           <formalpara>
3423             <title>Searching nodes.</title>
3424
            <para>The following function returns all nodes of a tree
that satisfy a given predicate <literal>p</literal>:
3427
3428 <programlisting>
3429 let rec search p t =
3430   if p t then
3431     t :: search_list p (t # sub_nodes)
3432   else
3433     search_list p (t # sub_nodes)
3434
3435 and search_list p l =
3436   match l with
3437     []      -&gt; []
3438   | t :: l' -&gt; (search p t) @ (search_list p l')
3439 ;;
3440 </programlisting>
3441 </para>
3442           </formalpara>
3443
          <para>For example, to find all elements of a certain
type <literal>et</literal>, the function <literal>search</literal> can be
applied as follows:
3447
3448 <programlisting>
3449 let search_element_type et t =
3450   search (fun x -&gt; x # node_type = T_element et) t
3451 ;;
3452 </programlisting>
3453 </para>
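          <para>For large trees, the repeated list concatenation in
<literal>search</literal> can become expensive. A possible variant collects the
matches in an accumulator and reverses the result once. The sketch below runs
on a toy record type <literal>toy_node</literal>, which is an assumption made
to keep the example self-contained; with real nodes the field accesses would be
method calls such as <literal>n # sub_nodes</literal>.

<programlisting><![CDATA[
(* Toy tree type standing in for real nodes. *)
type toy_node = { node_type : string; sub_nodes : toy_node list }

(* Return all nodes satisfying p, in document order (parent before its
   children, children from left to right). *)
let search p t =
  let rec go acc n =
    let acc = if p n then n :: acc else acc in
    List.fold_left go acc n.sub_nodes
  in
  List.rev (go [] t)

let leaf ty = { node_type = ty; sub_nodes = [] }

let tree =
  { node_type = "a";
    sub_nodes =
      [ { node_type = "b"; sub_nodes = [ leaf "a"; leaf "data" ] };
        leaf "c" ] }

(* Finds the root and the inner "a" node, in this order. *)
let found = search (fun n -> n.node_type = "a") tree
]]></programlisting>
</para>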
3454
3455           <formalpara>
3456             <title>Getting attribute values.</title>
3457
3458             <para>Suppose we have the declaration:
3459
3460 <programlisting><![CDATA[
3461 <!ATTLIST e a CDATA #REQUIRED
3462             b CDATA #IMPLIED
3463             c CDATA "12345">]]>
3464 </programlisting>
3465
3466 In this case, every element <literal>e</literal> must have an attribute
3467 <literal>a</literal>, otherwise the parser would indicate an error. If
3468 the O'Caml variable <literal>n</literal> holds the node of the tree
3469 corresponding to the element, you can get the value of the attribute
3470 <literal>a</literal> by
3471
3472 <programlisting>
3473 let value_of_a = n # required_string_attribute "a"
3474 </programlisting>
3475
3476 which is more or less an abbreviation for
3477
3478 <programlisting><![CDATA[
3479 let value_of_a =
3480   match n # attribute "a" with
3481     Value s -> s
3482   | _       -> assert false]]>
3483 </programlisting>
3484
Because the attribute is required, the <literal>attribute</literal> method always
returns a <literal>Value</literal>.
3487 </para>
3488           </formalpara>
3489
3490           <para>In contrast to this, the attribute <literal>b</literal> can be
3491 omitted. In this case, the method <literal>required_string_attribute</literal>
3492 works only if the attribute is there, and the method will fail if the attribute
3493 is missing. To get the value, you can apply the method
3494 <literal>optional_string_attribute</literal>:
3495
3496 <programlisting>
3497 let value_of_b = n # optional_string_attribute "b"
3498 </programlisting>
3499
3500 Now, <literal>value_of_b</literal> is of type <literal>string option</literal>,
3501 and <literal>None</literal> represents the omitted attribute. Alternatively,
3502 you could also use <literal>attribute</literal>:
3503
3504 <programlisting><![CDATA[
3505 let value_of_b =
3506   match n # attribute "b" with
3507     Value s       -> Some s
3508   | Implied_value -> None
3509   | _             -> assert false]]>
3510 </programlisting>
3511 </para>
3512
3513           <para>The attribute <literal>c</literal> behaves much like
<literal>a</literal>, because it always has a value. If the attribute is
3515 omitted, the default, here "12345", will be returned instead. Because of this,
3516 you can again use <literal>required_string_attribute</literal> to get the
3517 value.
3518 </para>
3519
3520           <para>The type <literal>CDATA</literal> is the most general string
3521 type. The types <literal>NMTOKEN</literal>, <literal>ID</literal>,
3522 <literal>IDREF</literal>, <literal>ENTITY</literal>, and all enumerators and
3523 notations are special forms of string types that restrict the possible
3524 values. From O'Caml, they behave like <literal>CDATA</literal>, i.e. you can
3525 use the methods <literal>required_string_attribute</literal> and
3526 <literal>optional_string_attribute</literal>, too.
3527 </para>
3528
3529           <para>In contrast to this, the types <literal>NMTOKENS</literal>,
3530 <literal>IDREFS</literal>, and <literal>ENTITIES</literal> mean lists of
3531 strings. Suppose we have the declaration:
3532
3533 <programlisting><![CDATA[
3534 <!ATTLIST f d NMTOKENS #REQUIRED
3535             e NMTOKENS #IMPLIED>]]>
3536 </programlisting>
3537
3538 The type <literal>NMTOKENS</literal> stands for lists of space-separated
3539 tokens; for example the value <literal>"1 abc 23ef"</literal> means the list
3540 <literal>["1"; "abc"; "23ef"]</literal>. (Again, <literal>IDREFS</literal>
3541 and <literal>ENTITIES</literal> have more restricted values.) To get the
3542 value of attribute <literal>d</literal>, one can use
3543
3544 <programlisting>
3545 let value_of_d = n # required_list_attribute "d"
3546 </programlisting>
3547
3548 or
3549
3550 <programlisting><![CDATA[
3551 let value_of_d =
3552   match n # attribute "d" with
3553     Valuelist l -> l
3554   | _           -> assert false]]>
3555 </programlisting>
3556
3557 As <literal>d</literal> is required, the attribute cannot be omitted, and
the <literal>attribute</literal> method always returns a
3559 <literal>Valuelist</literal>.
3560 </para>
3561
3562           <para>For optional attributes like <literal>e</literal>, apply
3563
3564 <programlisting>
3565 let value_of_e = n # optional_list_attribute "e"
3566 </programlisting>
3567
3568 or
3569
3570 <programlisting><![CDATA[
3571 let value_of_e =
3572   match n # attribute "e" with
3573     Valuelist l   -> l
3574   | Implied_value -> []
3575   | _             -> assert false]]>
3576 </programlisting>
3577
Here, a missing attribute is treated as the empty list.
3579 </para>
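          <para>The four access patterns can be tried out without a parser
by modelling the attribute values locally. In the following sketch the type
<literal>att_value</literal> is a local stand-in that reuses the constructor
names from the text, and the attribute list <literal>atts</literal> replaces a
real node. The helper names and the decision to report an attribute that is
absent from the list as <literal>Implied_value</literal> are assumptions made
for this example.

<programlisting><![CDATA[
(* Local stand-in for the attribute value type. *)
type att_value =
  | Value of string
  | Valuelist of string list
  | Implied_value

(* Toy replacement for "n # attribute name". *)
let attribute atts name =
  try List.assoc name atts with Not_found -> Implied_value

let required_string atts name =
  match attribute atts name with
  | Value s -> s
  | _ -> failwith ("no string value for attribute " ^ name)

let optional_string atts name =
  match attribute atts name with
  | Value s -> Some s
  | Implied_value -> None
  | Valuelist _ -> failwith ("attribute " ^ name ^ " is a list")

let optional_list atts name =
  match attribute atts name with
  | Valuelist l -> l
  | Implied_value -> []
  | Value _ -> failwith ("attribute " ^ name ^ " is not a list")

(* Split an NMTOKENS-like value on spaces, dropping empty pieces. *)
let tokens s =
  List.filter (fun t -> t <> "") (String.split_on_char ' ' s)

(* Attributes as they might look for <e a="x"> with the default for c
   filled in, plus a list-valued attribute d. *)
let atts =
  [ "a", Value "x";
    "c", Value "12345";
    "d", Valuelist (tokens "1 abc 23ef") ]
]]></programlisting>

With these definitions, <literal>optional_string atts "b"</literal> evaluates
to <literal>None</literal> and <literal>optional_list atts "e"</literal> to the
empty list, while <literal>required_string atts "b"</literal> fails.
</para>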
3580
3581         </sect2>
3582
3583
3584         <sect2>
3585           <title>Iterators</title>
3586
3587           <para>There are also several iterators in Pxp_document; please see
3588 the mli file for details. You can find examples for them in the
3589 "simple_transformation" directory.
3590
3591 <programlisting><![CDATA[
3592 val find : ?deeply:bool ->
3593            f:('ext node -> bool) -> 'ext node -> 'ext node
3594
3595 val find_all : ?deeply:bool ->
3596                f:('ext node -> bool) -> 'ext node -> 'ext node list
3597
3598 val find_element : ?deeply:bool ->
3599                    string -> 'ext node -> 'ext node
3600
3601 val find_all_elements : ?deeply:bool ->
3602                         string -> 'ext node -> 'ext node list
3603
3604 exception Skip
3605 val map_tree :  pre:('exta node -> 'extb node) ->
3606                ?post:('extb node -> 'extb node) ->
3607                'exta node ->
3608                    'extb node
3609
3610
3611 val map_tree_sibl :
3612         pre: ('exta node option -> 'exta node -> 'exta node option ->
3613                   'extb node) ->
3614        ?post:('extb node option -> 'extb node -> 'extb node option ->
3615                   'extb node) ->
3616        'exta node ->
3617            'extb node
3618
3619 val iter_tree : ?pre:('ext node -> unit) ->
3620                 ?post:('ext node -> unit) ->
3621                 'ext node ->
3622                     unit
3623
3624 val iter_tree_sibl :
3625        ?pre: ('ext node option -> 'ext node -> 'ext node option -> unit) ->
3626        ?post:('ext node option -> 'ext node -> 'ext node option -> unit) ->
3627        'ext node ->
3628            unit
3629 ]]></programlisting>
3630 </para>
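          <para>The role of the <literal>~pre</literal> and
<literal>~post</literal> arguments can be illustrated with a toy version of
<literal>iter_tree</literal>. The sketch assumes that <literal>pre</literal> is
called before the children of a node are visited and <literal>post</literal>
afterwards, as the names suggest; the mli file is authoritative for the real
function. The record type <literal>toy_node</literal> is invented for the
example.

<programlisting><![CDATA[
type toy_node = { name : string; sub_nodes : toy_node list }

(* Visit the tree depth-first; both callbacks default to doing nothing. *)
let iter_tree ?(pre = fun _ -> ()) ?(post = fun _ -> ()) n =
  let rec go n =
    pre n;
    List.iter go n.sub_nodes;
    post n
  in
  go n

let tree =
  { name = "a";
    sub_nodes = [ { name = "b"; sub_nodes = [] };
                  { name = "c"; sub_nodes = [] } ] }

(* Record the order of the callbacks as start and end tags. *)
let log = Buffer.create 16

let () =
  iter_tree
    ~pre:(fun n -> Buffer.add_string log ("<" ^ n.name ^ ">"))
    ~post:(fun n -> Buffer.add_string log ("</" ^ n.name ^ ">"))
    tree
]]></programlisting>
</para>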
3631         </sect2>
3632
3633       </sect1>
3634
3635 <!-- ********************************************************************** -->
3636
3637       <sect1>
3638         <title>The class type <literal>extension</literal></title>
3639         <para>
3640
3641 <programlisting>
3642 <![CDATA[
3643 class type [ 'node ] extension =
3644   object ('self)
3645     method clone : 'self
3646       (* "clone" should return an exact deep copy of the object. *)
3647     method node : 'node
3648       (* "node" returns the corresponding node of this extension. This method
       * is intended to return exactly what has previously been set by "set_node".
3650        *)
3651     method set_node : 'node -> unit
3652       (* "set_node" is invoked once the extension is associated to a new
3653        * node object.
3654        *)
3655   end
3656 ]]>
3657 </programlisting>
3658
3659 This is the type of classes used for node extensions. For every node of the
3660 document tree, there is not only the <literal>node</literal> object, but also
3661 an <literal>extension</literal> object. The latter has minimal
3662 functionality; it has only the necessary methods to be attached to the node
3663 object containing the details of the node instance. The extension object is
3664 called extension because its purpose is extensibility.</para>
3665
        <para>For various reasons, it is impossible to derive subclasses from the
<literal>node</literal> classes (i.e. <literal>element_impl</literal> and
<literal>data_impl</literal>) such that the subclasses can be extended by
new methods. Subclassing nodes would be a valuable feature, however, because
it would allow the user to provide
different classes for different types of nodes. The extension objects are a
workaround that is as powerful as direct subclassing, at the cost of
some notational overhead.
3674 </para>
3675
3676 <figure id="extension-general" float="1">
3677 <title>The structure of nodes and extensions</title>
3678 <graphic fileref="pic/extension_general" format="GIF">
3679 </graphic>
3680 </figure>
3681
3682         <para>The picture shows how the nodes and extensions are linked
3683 together. Every node has a reference to its extension, and every extension has
3684 a reference to its node. The methods <literal>extension</literal> and
3685 <literal>node</literal> follow these references; a typical phrase is
3686
3687 <programlisting>
3688 self # node # attribute "xy"
3689 </programlisting>
3690
3691 to get the value of an attribute from a method defined in the extension object;
3692 or
3693
3694 <programlisting>
3695 self # node # iter
3696   (fun n -&gt; n # extension # my_method ...)
3697 </programlisting>
3698
3699 to iterate over the subnodes and to call <literal>my_method</literal> of the
3700 corresponding extension objects.
3701 </para>
3702
3703         <para>Note that extension objects do not have references to subnodes
3704 (or "subextensions") themselves; in order to get one of the children of an
3705 extension you must first go to the node object, then get the child node, and
3706 finally reach the extension that is logically the child of the extension you
3707 started with.</para>
3708
3709         <sect2>
3710           <title>How to define an extension class</title>
3711
3712           <para>At minimum, you must define the methods
3713 <literal>clone</literal>, <literal>node</literal>, and
3714 <literal>set_node</literal> such that your class is compatible with the type
3715 <literal>extension</literal>. The method <literal>set_node</literal> is called
3716 during the initialization of the node, or after a node has been cloned; the
3717 node object invokes <literal>set_node</literal> on the extension object to tell
3718 it that this node is now the object the extension is linked to. The extension
3719 must return the node object passed as argument of <literal>set_node</literal>
3720 when the <literal>node</literal> method is called.</para>
3721
          <para>The <literal>clone</literal> method must return a copy of the
extension object; at least the object itself must be duplicated, but if
required, the copy should also deeply duplicate all objects and values that are
referred to by the extension. Whether this is required depends on the
application; <literal>clone</literal> is invoked by the node object when one of
its cloning methods is called.</para>
3728
3729           <para>A good starting point for an extension class:
3730
3731 <programlisting>
3732 <![CDATA[class custom_extension =
3733   object (self)
3734
3735     val mutable node = (None : custom_extension node option)
3736
3737     method clone = {< >}
3738
3739     method node =
3740       match node with
3741           None ->
3742             assert false
3743         | Some n -> n
3744
3745     method set_node n =
3746       node <- Some n
3747
3748   end
3749 ]]>
3750 </programlisting>
3751
3752 This class is compatible with <literal>extension</literal>. The purpose of
3753 defining such a class is, of course, adding further methods; and you can do it
3754 without restriction.
3755 </para>
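          <para>To show how such a class interacts with its node, the
following sketch replaces the node classes with a small record type. The names
<literal>toy_node</literal>, <literal>counting_extension</literal>,
<literal>make_node</literal> and <literal>ext_of</literal> are hypothetical; in
a real program the node object calls <literal>set_node</literal> itself, which
<literal>make_node</literal> imitates here.

<programlisting><![CDATA[
(* Stand-in for the node type, reduced to what the example needs. *)
type 'ext toy_node = {
  label : string;
  mutable children : 'ext toy_node list;
  mutable ext : 'ext option;
}

class counting_extension =
  object (self)
    val mutable node = (None : counting_extension toy_node option)

    method clone = {< >}

    method node =
      match node with
      | None -> assert false
      | Some n -> n

    method set_node n = node <- Some n

    (* Application-specific method: children are reached via the node. *)
    method child_labels =
      List.map (fun c -> c.label) (self # node).children
  end

(* Create a node and link node and extension in both directions. *)
let make_node label (e : counting_extension) =
  let n = { label; children = []; ext = Some e } in
  e # set_node n;
  n

let ext_of n =
  match n.ext with
  | Some e -> e
  | None -> assert false

let a = make_node "a" (new counting_extension)
let b = make_node "b" (new counting_extension)
let () = a.children <- [ b ]

let labels = (ext_of a) # child_labels
]]></programlisting>

Note that a clone made with <literal>{&lt; &gt;}</literal> still refers to the
node of the original until <literal>set_node</literal> is called on it.
</para>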
3756
          <para>Often you need more than one extension class. In this case,
the simplest approach is to give all your classes (for one kind of document)
the same type. Only the interface matters here: the classes may differ in
their private methods and instance variables, but they must have the same
public methods. This approach avoids lots of coercions and problems with
type incompatibilities. It is simple to implement:
3763
3764 <programlisting>
<![CDATA[class virtual custom_extension =
3766   object (self)
3767     val mutable node = (None : custom_extension node option)
3768
3769     method clone = ...      (* see above *)
3770     method node = ...       (* see above *)
3771     method set_node n = ... (* see above *)
3772
3773     method virtual my_method1 : ...
3774     method virtual my_method2 : ...
3775     ... (* etc. *)
3776   end
3777
3778 class custom_extension_kind_A =
3779   object (self)
3780     inherit custom_extension
3781
3782     method my_method1 = ...
3783     method my_method2 = ...
3784   end
3785
3786 class custom_extension_kind_B =
3787   object (self)
3788     inherit custom_extension
3789
3790     method my_method1 = ...
3791     method my_method2 = ...
3792   end
3793 ]]>
3794 </programlisting>
3795
If a class does not need a method (e.g. because it does not make sense, or it
would violate some important condition), you can still define the method
and have it always raise an exception when it is invoked
(e.g. <literal>assert false</literal>).
3800 </para>
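          <para>A complete, compilable miniature of this scheme might look
as follows. The class and method names are invented for the example, and the
methods <literal>clone</literal>, <literal>node</literal> and
<literal>set_node</literal> are left out to keep it short. The point is that
objects of both subclasses have the same type, so they can be stored together
without coercions.

<programlisting><![CDATA[
(* Common interface; a class with virtual methods must be marked virtual. *)
class virtual doc_extension =
  object
    method virtual title : string
    method virtual render : string -> string
  end

class heading_extension =
  object
    inherit doc_extension
    method title = "heading"
    method render s = "== " ^ s ^ " =="
  end

class comment_extension =
  object
    inherit doc_extension
    method title = "comment"
    (* Rendering makes no sense here, so the method exists but fails. *)
    method render _ = failwith "comment_extension: render is not supported"
  end

(* Same public interface, hence one list type for both kinds. *)
let extensions : doc_extension list =
  [ new heading_extension; new comment_extension ]

let titles = List.map (fun e -> e # title) extensions
]]></programlisting>
</para>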
3801
          <para>Using a single type for all extension classes is strongly
recommended: do not try to further
specialize the types of extension objects. It is difficult, sometimes even
impossible, and almost never worthwhile.</para>
3805         </sect2>
3806
3807         <sect2>
3808           <title>How to bind extension classes to element types</title>
3809
3810           <para>Once you have defined your extension classes, you can bind them
3811 to element types. The simplest case is that you have only one class and that
3812 this class is to be always used. The parsing functions in the module
3813 <literal>Pxp_yacc</literal> take a <literal>spec</literal> argument which
can be customized. If <literal>c</literal> is an instance of your single extension class,
3815 this argument should be
3816
3817 <programlisting>
3818 let spec =
3819   make_spec_from_alist
3820     ~data_exemplar:            (new data_impl c)
3821     ~default_element_exemplar: (new element_impl c)
3822     ~element_alist:            []
3823     ()
3824 </programlisting>
3825
This means that data nodes will be created from the exemplar passed as
<literal>~data_exemplar</literal> and that all element nodes will be made from
the exemplar passed as <literal>~default_element_exemplar</literal>. With
<literal>~element_alist</literal>, you can specify that different exemplars are
to be used for different element types; this is an optional feature. If you do
not need it, pass the empty list.
3831 </para>
3832
3833 <para>
Remember that an exemplar is a (node, extension) pair that serves as a pattern
when new nodes (and the corresponding extension objects) are added to the
document tree. In this case, the exemplar contains <literal>c</literal> as
its extension. When nodes are created, the exemplar is cloned, and cloning
also copies <literal>c</literal>, so every node of the document
tree has its own copy of <literal>c</literal> as extension.
3840 </para>
3841
3842           <para>The <literal>~element_alist</literal> argument can bind
3843 specific element types to specific exemplars; as exemplars may be instances of
3844 different classes it is effectively possible to bind element types to
3845 classes. For example, if the element type "p" is implemented by class "c_p",
3846 and "q" is realized by "c_q", you can pass the following value:
3847
3848 <programlisting>
3849 let spec =
3850   make_spec_from_alist
3851     ~data_exemplar:            (new data_impl c)
3852     ~default_element_exemplar: (new element_impl c)
3853     ~element_alist:
3854       [ "p", new element_impl c_p;
3855         "q", new element_impl c_q;
3856       ]
3857     ()
3858 </programlisting>
3859
3860 The extension object <literal>c</literal> is still used for all data nodes and
3861 for all other element types.
3862 </para>
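          <para>The selection rule described above (look up the element type
in the list, otherwise fall back to the default, then clone) can be modelled
in a few lines. This is a sketch under simplifying assumptions, not the code of
the parser: the record type <literal>exemplar</literal> and the functions
<literal>clone</literal> and <literal>create_element</literal> are
hypothetical stand-ins for the real (node, extension) objects.

<programlisting><![CDATA[
(* Simplified stand-in for an exemplar: a record instead of a
   (node, extension) object pair. *)
type exemplar = { class_name : string; mutable content : string }

(* Cloning builds a fresh record, so nodes never share mutable state. *)
let clone ex = { class_name = ex.class_name; content = ex.content }

(* Pick the exemplar bound to the element type, or fall back to the
   default exemplar, and return a copy of it. *)
let create_element ~default_element_exemplar ~element_alist name =
  let ex =
    match List.assoc_opt name element_alist with
    | Some ex -> ex
    | None -> default_element_exemplar
  in
  clone ex
]]></programlisting>

With a list that binds only "p", a "p" element is made from the "p" exemplar
and every other element type from the default one; modifying a created node
leaves the exemplar itself untouched.</para>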
3863
3864         </sect2>
3865
3866       </sect1>
3867
3868 <!-- ********************************************************************** -->
3869
3870       <sect1>
3871         <title>Details of the mapping from XML text to the tree representation
3872 </title>
3873
3874         <sect2>
3875           <title>The representation of character-free elements</title>
3876
3877           <para>If an element declaration does not allow the element to
3878 contain character data, the following rules apply.</para>
3879
3880           <para>If the element must be empty, i.e. it is declared with the
3881 keyword <literal>EMPTY</literal>, the element instance must be effectively
empty (it must not even contain whitespace characters). The parser guarantees
that a declared <literal>EMPTY</literal> element never contains a data
node, not even one that represents the empty string.</para>
3885
3886           <para>If the element declaration only permits other elements to occur
3887 within that element but not character data, it is still possible to insert
3888 whitespace characters between the subelements. The parser ignores these
3889 characters, too, and does not create data nodes for them.</para>
3890
3891           <formalpara>
3892             <title>Example.</title>
3893
3894             <para>Consider the following element types:
3895
3896 <programlisting><![CDATA[
3897 <!ELEMENT x ( #PCDATA | z )* >
3898 <!ELEMENT y ( z )* >
3899 <!ELEMENT z EMPTY>
3900 ]]></programlisting>
3901
Only <literal>x</literal> may contain character data; the keyword
<literal>#PCDATA</literal> indicates this. The other types are character-free.
3904 </para>
3905           </formalpara>
3906
3907           <para>The XML term
3908
3909 <programlisting><![CDATA[
3910 <x><z/> <z/></x>
3911 ]]></programlisting>
3912
3913 will be internally represented by an element node for <literal>x</literal>
3914 with three subnodes: the first <literal>z</literal> element, a data node
3915 containing the space character, and the second <literal>z</literal> element.
3916 In contrast to this, the term
3917
3918 <programlisting><![CDATA[
3919 <y><z/> <z/></y>
3920 ]]></programlisting>
3921
is represented by an element node for <literal>y</literal> with only
3923 <emphasis>two</emphasis> subnodes, the two <literal>z</literal> elements. There
3924 is no data node for the space character because spaces are ignored in the
3925 character-free element <literal>y</literal>.
3926 </para>
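          <para>The difference between the two terms can be modelled by a
small filter. The sketch below uses a hypothetical <literal>node</literal>
variant instead of the node objects of the parser, and it assumes that the
caller already knows whether the element type is character-free. It does not
model the validation error that non-whitespace data would cause in such an
element.

<programlisting><![CDATA[
(* Hypothetical, simplified tree type. *)
type node = Element of string * node list | Data of string

let is_whitespace s =
  let ws = function ' ' | '\t' | '\n' | '\r' -> true | _ -> false in
  let rec check i = i >= String.length s || (ws s.[i] && check (i + 1)) in
  check 0

(* Drop whitespace-only data from the content of a character-free element;
   the content of other elements is kept unchanged. *)
let content_nodes ~character_free children =
  if character_free then
    List.filter
      (function Data s -> not (is_whitespace s) | Element _ -> true)
      children
  else children
]]></programlisting>

Applied to the children of <literal>x</literal> (not character-free), all
three nodes survive; applied to the children of <literal>y</literal>, only
the two <literal>z</literal> elements remain.</para>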
3927
3928         </sect2>
3929
3930         <sect2>
3931           <title>The representation of character data</title>
3932
3933           <para>The XML specification allows all Unicode characters in XML
3934 texts. This parser can be configured such that UTF-8 is used to represent the
3935 characters internally; however, the default character encoding is
3936 ISO-8859-1. (Currently, no other encodings are possible for the internal string
3937 representation; the type <literal>Pxp_types.rep_encoding</literal> enumerates
the possible encodings. In principle, the parser could use any encoding that is
ASCII-compatible, but there are currently only lexical analyzers for UTF-8 and
ISO-8859-1. It is currently impossible to use UTF-16 or UCS-4 as internal
encodings (or other multibyte encodings which are not ASCII-compatible) unless
major parts of the parser are rewritten, which is unlikely.)
3943 </para>
3944
3945 <para>
3946 The internal encoding may be different from the external encoding (specified
3947 in the XML declaration <literal>&lt;?xml ... encoding="..."?&gt;</literal>); in
3948 this case the strings are automatically converted to the internal encoding.
3949 </para>
3950
3951 <para>
3952 If the internal encoding is ISO-8859-1, it is possible that there are
3953 characters that cannot be represented. In this case, the parser ignores such
3954 characters and prints a warning (to the <literal>collect_warning</literal>
3955 object that must be passed when the parser is called).
3956 </para>
3957
3958           <para>The XML specification allows lines to be separated by single LF
3959 characters, by CR LF character sequences, or by single CR
3960 characters. Internally, these separators are always converted to single LF
3961 characters.</para>
3962
3963           <para>The parser guarantees that there are never two adjacent data
3964 nodes; if necessary, data material that would otherwise be represented by
3965 several nodes is collapsed into one node. Note that you can still create node
3966 trees with adjacent data nodes; however, the parser does not return such trees.
3967 </para>
3968
3969           <para>Note that CDATA sections are not represented specially; such
sections are added to the current data material that is being collected for the
3971 next data node.</para>
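          <para>The guarantee about adjacent data nodes amounts to a simple
merge over the list of children. The following sketch illustrates the idea
with a hypothetical <literal>item</literal> type; material from a CDATA
section is modelled as ordinary text, as described above. It is not the
algorithm used inside the parser, which collects the material while reading.

<programlisting><![CDATA[
(* Hypothetical, simplified child type. *)
type item = Elem of string | Text of string

(* Merge every run of adjacent Text items into a single Text item. *)
let rec collapse = function
  | Text a :: Text b :: rest -> collapse (Text (a ^ b) :: rest)
  | x :: rest -> x :: collapse rest
  | [] -> []
]]></programlisting>

For example, text, CDATA material and further text that follow each other
directly end up in one item, while text separated by an element stays in
separate items.</para>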
3972         </sect2>
3973
3974
3975         <sect2>
3976           <title>The representation of entities within documents</title>
3977
3978           <para><emphasis>Entities are not represented within
3979 documents!</emphasis> If the parser finds an entity reference in the document
3980 content, the reference is immediately expanded, and the parser reads the
3981 expansion text instead of the reference.
3982 </para>
3983         </sect2>
3984
3985         <sect2>
3986           <title>The representation of attributes</title> <para>As attribute
3987 values are composed of Unicode characters, too, the same problems with the
3988 character encoding arise as for character material. Attribute values are
3989 converted to the internal encoding, too; and if there are characters that
3990 cannot be represented, these are dropped, and a warning is printed.</para>
3991
3992           <para>Attribute values are normalized before they are returned by
3993 methods like <literal>attribute</literal>. First, any remaining entity
3994 references are expanded; if necessary, expansion is performed recursively.
3995 Second, newline characters (any of LF, CR LF, or CR characters) are converted
to single space characters. Note that the latter action in particular is
prescribed by the XML standard (but the character reference
<literal>&amp;#10;</literal> is not converted, so it is still possible to
include line feeds in attributes).
3999 </para>
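          <para>The newline rule can be sketched as follows. The type
<literal>piece</literal> and the function
<literal>normalize_attribute</literal> are hypothetical; the sketch assumes
that entity references have already been expanded and that the attribute value
arrives as a list of literal text and numeric character references. Only
character codes below 256 are handled.

<programlisting><![CDATA[
type piece =
  | Literal of string   (* text as it appears between the quotes *)
  | Char_ref of int     (* a numeric character reference such as &#10; *)

let normalize_attribute pieces =
  let b = Buffer.create 16 in
  let add_literal s =
    let n = String.length s in
    let rec loop i =
      if i < n then begin
        match s.[i] with
        | '\r' ->
            Buffer.add_char b ' ';
            (* A CR LF pair yields only one space. *)
            if i + 1 < n && s.[i + 1] = '\n' then loop (i + 2)
            else loop (i + 1)
        | '\n' -> Buffer.add_char b ' '; loop (i + 1)
        | c -> Buffer.add_char b c; loop (i + 1)
      end
    in
    loop 0
  in
  List.iter
    (function
      | Literal s -> add_literal s
      (* Character references are inserted as they are. *)
      | Char_ref code -> Buffer.add_char b (Char.chr code))
    pieces;
  Buffer.contents b
]]></programlisting>

Literal line breaks become spaces, whereas a line feed written as a character
reference survives normalization.</para>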
4000         </sect2>
4001
4002         <sect2>
4003           <title>The representation of processing instructions</title>
4004 <para>Processing instructions are parsed to some extent: The first word of the
PI is called the target, and it is stored separately from the rest of the PI:
4006
4007 <programlisting><![CDATA[
4008 <?target rest?>
4009 ]]></programlisting>
4010
4011 The exact location where a PI occurs is not represented (by default). The
4012 parser puts the PI into the object that represents the embracing construct (an
4013 element, a DTD, or the whole document); that means you can find out which PIs
4014 occur in a certain element, in the DTD, or in the whole document, but you
cannot look up the exact position within the construct.
4016 </para>
4017
4018           <para>If you require the exact location of PIs, it is possible to
create extra nodes for them. This mode is controlled by the option
4020 <literal>enable_pinstr_nodes</literal>. The additional nodes have the node type
4021 <literal>T_pinstr <replaceable>target</replaceable></literal>, and are created
4022 from special exemplars contained in the <literal>spec</literal> (see
4023 pxp_document.mli).</para>
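          <para>The split into target and rest can be illustrated by a small
function working on the text between <literal>&lt;?</literal> and
<literal>?&gt;</literal>. The name <literal>split_pi</literal> is
hypothetical, and dropping the whitespace between target and rest is an
assumption of this sketch, not a statement about what the parser stores.

<programlisting><![CDATA[
let split_pi s =
  let is_ws c = c = ' ' || c = '\t' || c = '\n' || c = '\r' in
  let n = String.length s in
  (* End of the first word, i.e. of the target. *)
  let rec target_end i =
    if i < n && not (is_ws s.[i]) then target_end (i + 1) else i in
  (* Start of the rest, after the separating whitespace. *)
  let rec skip_ws i = if i < n && is_ws s.[i] then skip_ws (i + 1) else i in
  let e = target_end 0 in
  let r = skip_ws e in
  (String.sub s 0 e, String.sub s r (n - r))
]]></programlisting>

A PI without any further text simply yields the empty string as its
rest.</para>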
4024         </sect2>
4025
4026         <sect2>
4027           <title>The representation of comments</title>
4028
4029 <para>Normally, comments are not represented; they are dropped by
4030 default. However, if you require them, it is possible to create
4031 <literal>T_comment</literal> nodes for them. This mode can be specified by the
4032 option <literal>enable_comment_nodes</literal>. Comment nodes are created from
4033 special exemplars contained in the <literal>spec</literal> (see
4034 pxp_document.mli). You can access the contents of comments through the
4035 method <literal>comment</literal>.</para>
4036         </sect2>
4037
4038         <sect2>
4039           <title>The attributes <literal>xml:lang</literal> and
4040 <literal>xml:space</literal></title>
4041
4042           <para>These attributes are not supported specially; they are handled
4043 like any other attribute.</para>
4044         </sect2>
4045
4046
4047         <sect2>
4048           <title>And what about namespaces?</title>
4049           <para>Currently, there is no special support for namespaces.
However, the parser allows the colon to occur in names, so it is
possible to implement namespaces on top of the current API.</para>
4052
          <para>Some future release of PXP will support namespaces as a built-in
feature...</para>
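          <para>A first step towards namespaces on top of the current API is
to split names at the colon. The function <literal>split_qname</literal>
below is a hypothetical helper, not part of the library; mapping the prefix
to a namespace URI would be a separate step.

<programlisting><![CDATA[
(* Split "prefix:local" into (Some "prefix", "local");
   names without a colon have no prefix. *)
let split_qname name =
  match String.index_opt name ':' with
  | Some i ->
      (Some (String.sub name 0 i),
       String.sub name (i + 1) (String.length name - i - 1))
  | None -> (None, name)
]]></programlisting>
</para>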
4055         </sect2>
4056
4057       </sect1>
4058
4059     </chapter>
4060
4061 <!-- ********************************************************************** -->
4062
4063     <chapter>
4064       <title>Configuring and calling the parser</title>
4065
4066 <!--
4067       <para>
4068 <emphasis>
4069 Sorry, this chapter has not yet been written. For an introduction into parser
4070 configuration, see the previous chapters. As a first approximation, the
4071 interface definition of Markup_yacc outlines what could go here.
4072 </emphasis>
4073 </para>
4074 -->
4075
4076 <!--
4077       <para>
4078 <programlisting>&markup-yacc.mli;</programlisting>
4079 </para>
4080 -->
4081
4082       <sect1>
4083         <title>Overview</title>
4084         <para>
The following main functions invoke the parser (in Pxp_yacc):
4086
4087           <itemizedlist mark="bullet" spacing="compact">
4088             <listitem>
4089               <para><emphasis>parse_document_entity:</emphasis> You want to
4090 parse a complete and closed document consisting of a DTD and the document body;
4091 the body is validated against the DTD. This mode is interesting if you have a
4092 file
4093
4094 <programlisting><![CDATA[
4095 <!DOCTYPE root ... [ ... ] > <root> ... </root>
4096 ]]></programlisting>
4097
4098 and you can accept any DTD that is included in the file (e.g. because the file
4099 is under your control).
4100 </para>
4101             </listitem>
4102             <listitem>
4103               <para><emphasis>parse_wfdocument_entity:</emphasis> You want to
4104 parse a complete and closed document consisting of a DTD and the document body;
4105 but the body is not validated, only checked for well-formedness. This mode is
4106 preferred if validation costs too much time or if the DTD is missing.
4107 </para>
4108             </listitem>
4109             <listitem>
              <para><emphasis>parse_dtd_entity:</emphasis> You only want to
4111 parse an entity (file) containing the external subset of a DTD. Sometimes it is
4112 interesting to read such a DTD, for example to compare it with the DTD included
4113 in a document, or to apply the next mode:
4114 </para>
4115             </listitem>
4116             <listitem>
              <para><emphasis>parse_content_entity:</emphasis> You only want to
parse an entity (file) containing a fragment of a document body; this fragment
is validated against the DTD you pass to the function. In particular, the fragment
must not have a <literal>&lt;!DOCTYPE&gt;</literal> clause, and must directly
4121 begin with an element.  The element is validated against the DTD.  This mode is
4122 interesting if you want to check documents against a fixed, immutable DTD.
4123 </para>
4124             </listitem>
4125             <listitem>
4126               <para><emphasis>parse_wfcontent_entity:</emphasis> This function
4127 also parses a single element without DTD, but does not validate it.</para>
4128             </listitem>
4129             <listitem>
4130               <para><emphasis>extract_dtd_from_document_entity:</emphasis> This
4131 function extracts the DTD from a closed document consisting of a DTD and a
4132 document body. Both the internal and the external subsets are extracted.</para>
4133             </listitem>
4134           </itemizedlist>
4135 </para>
4136
4137 <para>
4138 In many cases, <literal>parse_document_entity</literal> is the preferred mode
4139 to parse a document in a validating way, and
4140 <literal>parse_wfdocument_entity</literal> is the mode of choice to parse a
4141 file while only checking for well-formedness.
4142 </para>
4143
4144 <para>
4145 There are a number of variations of these modes. One important application of a
4146 parser is to check documents of an untrusted source against a fixed DTD. One
4147 solution is to not allow the <literal>&lt;!DOCTYPE&gt;</literal> clause in
4148 these documents, and treat the document like a fragment (using mode
4149 <emphasis>parse_content_entity</emphasis>). This is very simple, but
4150 inflexible; users of such a system cannot even define additional entities to
4151 abbreviate frequent phrases of their text.
4152 </para>
4153
4154 <para>
It may be necessary to have a more intelligent checker. For example, it is also
possible to fully parse the document to be checked, i.e. including its DTD, and
to compare this DTD with the prescribed one. To fully parse the document, mode
<emphasis>parse_document_entity</emphasis> is applied; to obtain the prescribed
DTD for the comparison, mode <emphasis>parse_dtd_entity</emphasis> can be used.
4160 </para>
4161
4162 <para>
4163 There is another very important configurable aspect of the parser: the
4164 so-called resolver. The task of the resolver is to locate the contents of an
4165 (external) entity for a given entity name, and to make the contents accessible
4166 as a character stream. (Furthermore, it also normalizes the character set;
but this is a detail we can ignore here.) Suppose you have a file called
4168 <literal>"main.xml"</literal> containing
4169
4170 <programlisting><![CDATA[
4171 <!ENTITY % sub SYSTEM "sub/sub.xml">
4172 %sub;
4173 ]]></programlisting>
4174
4175 and a file stored in the subdirectory <literal>"sub"</literal> with name
4176 <literal>"sub.xml"</literal> containing
4177
4178 <programlisting><![CDATA[
4179 <!ENTITY % subsub SYSTEM "subsub/subsub.xml">
4180 %subsub;
4181 ]]></programlisting>
4182
4183 and a file stored in the subdirectory <literal>"subsub"</literal> of
4184 <literal>"sub"</literal> with name <literal>"subsub.xml"</literal> (the
4185 contents of this file do not matter). Here, the resolver must track that
4186 the second entity <literal>subsub</literal> is located in the directory
4187 <literal>"sub/subsub"</literal>, i.e. the difficulty is to interpret the
4188 system (file) names of entities relative to the entities containing them,
4189 even if the entities are deeply nested.
4190 </para>
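<para>
The bookkeeping for the three files can be sketched with plain strings. The
helpers <literal>directory_of</literal> and
<literal>resolve_relative</literal> are hypothetical, and the sketch assumes
slash-separated names without <literal>..</literal> components or URL
schemes; a real resolver has to deal with both.

<programlisting><![CDATA[
(* Directory part of a slash-separated name, including the trailing slash. *)
let directory_of name =
  match String.rindex_opt name '/' with
  | Some i -> String.sub name 0 (i + 1)
  | None -> ""

(* Interpret [name] relative to the entity [base] that contains the
   reference; absolute names are returned unchanged. *)
let resolve_relative ~base name =
  if name <> "" && name.[0] = '/' then name
  else directory_of base ^ name

let sub = resolve_relative ~base:"main.xml" "sub/sub.xml"
let subsub = resolve_relative ~base:sub "subsub/subsub.xml"
]]></programlisting>

The important point is that the second call uses the result of the first one
as its base, so the name found in <literal>"sub.xml"</literal> is interpreted
relative to <literal>"sub"</literal> and not relative to the directory of
<literal>"main.xml"</literal>.
</para>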
4191
4192 <para>
There is no single resolver that already does everything right; resolving entity
names is a task that depends heavily on the environment. The XML specification
only demands that <literal>SYSTEM</literal> entities are interpreted like URLs
(which is not very precise, as there are lots of URL schemes in use), hoping
that this helps to overcome the local peculiarities of the environment; the idea
is that if you do not know your environment you can refer to other entities by
denoting URLs for them. I think that this interpretation of
<literal>SYSTEM</literal> names may have some applications on the Internet, but
it is not the first choice in general. Because of this, the resolver is a
separate module of the parser that can be exchanged for another one if
necessary; more precisely, the parser already defines several resolvers.
4204 </para>
4205
4206 <para>
The following resolvers already exist:
4208
4209           <itemizedlist mark="bullet" spacing="compact">
4210             <listitem>
4211               <para>Resolvers reading from arbitrary input channels. These
4212 can be configured such that a certain ID is associated with the channel; in
4213 this case inner references to external entities can be resolved. There is also
4214 a special resolver that interprets SYSTEM IDs as URLs; this resolver can
4215 process relative SYSTEM names and determine the corresponding absolute URL.
4216 </para>
4217             </listitem>
4218             <listitem>
              <para>A resolver that always reads from a given O'Caml
string. This resolver is not able to resolve further names if the string is
not associated with any name; i.e. if the document contained in the string
refers to an external entity, this reference cannot be followed in this
case.</para>
4224             </listitem>
4225             <listitem>
4226               <para>A resolver for file names. The <literal>SYSTEM</literal>
name is interpreted as a file URL with the slash "/" as separator for
directories. This resolver is derived from the generic URL resolver.</para>
4229             </listitem>
4230           </itemizedlist>
4231
4232 The interface a resolver must have is documented, so it is possible to write
4233 your own resolver. For example, you could connect the parser with an HTTP
client, and resolve URLs of the HTTP namespace. The resolver classes allow
several independent resolvers to be combined into one more powerful resolver;
thus it is possible to combine a self-written resolver with the already
existing resolvers.
4238 </para>
4239
4240 <para>
4241 Note that the existing resolvers only interpret <literal>SYSTEM</literal>
4242 names, not <literal>PUBLIC</literal> names. If it helps you, it is possible to
4243 define resolvers for <literal>PUBLIC</literal> names, too; for example, such a
4244 resolver could look up the public name in a hash table, and map it to a system
4245 name which is passed over to the existing resolver for system names. It is
4246 relatively simple to provide such a resolver.
4247 </para>
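<para>
The hash table idea can be sketched as follows. Everything here is an
assumption made for illustration: the local <literal>ext_id</literal> type and
the exception only mimic the definitions of the library, the catalog entry is
invented, and falling back to the system name that accompanies an unknown
public name is a design choice of this sketch.

<programlisting><![CDATA[
(* Local stand-ins, so that the sketch does not depend on the library. *)
type ext_id = System of string | Public of string * string | Anonymous
exception Not_competent

(* Hypothetical catalog mapping public names to system names. *)
let catalog : (string, string) Hashtbl.t = Hashtbl.create 10
let () =
  Hashtbl.add catalog "-//EXAMPLE//DTD Sample//EN" "file:///tmp/sample.dtd"

(* Map an identifier to a SYSTEM identifier that an existing resolver for
   system names could then open. *)
let to_system_id = function
  | System s -> System s
  | Public (pubid, sysid) ->
      (match Hashtbl.find_opt catalog pubid with
       | Some mapped -> System mapped
       | None ->
           if sysid <> "" then System sysid else raise Not_competent)
  | Anonymous -> raise Not_competent
]]></programlisting>
</para>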
4248
4249
4250       </sect1>
4251
4252       <sect1>
4253         <title>Resolvers and sources</title>
4254
4255         <sect2>
4256           <title>Using the built-in resolvers (called sources)</title>
4257
          <para>The type <literal>source</literal> enumerates the two
possible origins of the document to parse.
4260
4261 <programlisting>
4262 type source =
4263     Entity of ((dtd -&gt; Pxp_entity.entity) * Pxp_reader.resolver)
4264   | ExtID of (ext_id * Pxp_reader.resolver)
4265 </programlisting>
4266
You normally need not worry about this type, as there are convenience
4268 functions that create <literal>source</literal> values:
4269
4270
4271             <itemizedlist mark="bullet" spacing="compact">
4272               <listitem>
4273                 <para><literal>from_file s</literal>: The document is read from
4274 file <literal>s</literal>; you may specify absolute or relative path names.
4275 The file name must be encoded as UTF-8 string.
4276 </para>
4277
4278 <para>There is an optional argument <literal>~system_encoding</literal>
4279 specifying the character encoding which is used for the names of the file
4280 system. For example, if this encoding is ISO-8859-1 and <literal>s</literal> is
also an ISO-8859-1 string, you can form the source:
4282
4283 <programlisting><![CDATA[
4284 let s_utf8  =  recode_string ~in_enc:`Enc_iso88591 ~out_enc:`Enc_utf8 s in
4285 from_file ~system_encoding:`Enc_iso88591 s_utf8
4286 ]]></programlisting>
4287 </para>
4288
4289 <para>
4290 This <literal>source</literal> has the advantage that
4291 it is able to resolve inner external entities; i.e. if your document includes
4292 data from another file (using the <literal>SYSTEM</literal> attribute), this
4293 mode will find that file. However, this mode cannot resolve
4294 <literal>PUBLIC</literal> identifiers nor <literal>SYSTEM</literal> identifiers
4295 other than "file:".
4296 </para>
4297               </listitem>
4298               <listitem>
4299                 <para><literal>from_channel ch</literal>: The document is read
4300 from the channel <literal>ch</literal>. In general, this source also supports
4301 file URLs found in the document; however, by default only absolute URLs are
4302 understood. It is possible to associate an ID with the channel such that the
4303 resolver knows how to interpret relative URLs:
4304
4305 <programlisting>
4306 from_channel ~id:(System "file:///dir/dir1/") ch
4307 </programlisting>
4308
There is also the <literal>~system_encoding</literal> argument specifying how
file names are encoded. The example from above can also be written as follows
(but it is then no longer possible to interpret relative URLs because there is
no <literal>~id</literal> argument, and computing this argument is relatively
complicated because it must be a valid URL):
4314
4315 <programlisting>
4316 let ch = open_in s in
4317 let src = from_channel ~system_encoding:`Enc_iso88591 ch in
4318 ...;
4319 close_in ch
4320 </programlisting>
4321 </para>
4322               </listitem>
4323               <listitem>
4324                 <para><literal>from_string s</literal>: The string
4325 <literal>s</literal> is the document to parse. This mode is not able to
interpret file names of <literal>SYSTEM</literal> clauses, nor can it look up
4327 <literal>PUBLIC</literal> identifiers. </para>
4328
4329                 <para>Normally, the encoding of the string is detected as usual
4330 by analyzing the XML declaration, if any. However, it is also possible to
4331 specify the encoding directly:
4332
4333 <programlisting>
let src = from_string ~fixenc:`Enc_iso88592 s
4335 </programlisting>
4336 </para>
4337               </listitem>
4338               <listitem>
4339                 <para><literal>ExtID (id, r)</literal>: The document to parse
4340 is denoted by the identifier <literal>id</literal> (either a
4341 <literal>SYSTEM</literal> or <literal>PUBLIC</literal> clause), and this
4342 identifier is interpreted by the resolver <literal>r</literal>. Use this mode
4343 if you have written your own resolver.</para>
4344                 <para>Which character sets are possible depends on the passed
4345 resolver <literal>r</literal>.</para>
4346               </listitem>
4347               <listitem>
4348                 <para><literal>Entity (get_entity, r)</literal>: The document
4349 to parse is returned by the function invocation <literal>get_entity
4350 dtd</literal>, where <literal>dtd</literal> is the DTD object to use (it may be
empty). Inner external references occurring in this entity are resolved using
4352 the resolver <literal>r</literal>.</para>
4353                 <para>Which character sets are possible depends on the passed
4354 resolver <literal>r</literal>.</para>
4355               </listitem>
4356             </itemizedlist></para>
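          <para>The recoding step used together with
<literal>from_file</literal> above can be illustrated without the library.
The function <literal>utf8_of_latin1</literal> below is a hypothetical
stand-in that covers only the direction ISO-8859-1 to UTF-8; in a real
program you would call <literal>recode_string</literal> as shown in the
example.

<programlisting><![CDATA[
(* Every ISO-8859-1 byte is a code point below 256. Code points below 128
   are copied; the others become a two-byte UTF-8 sequence. *)
let utf8_of_latin1 s =
  let b = Buffer.create (2 * String.length s) in
  String.iter
    (fun c ->
      let code = Char.code c in
      if code < 0x80 then Buffer.add_char b c
      else begin
        Buffer.add_char b (Char.chr (0xC0 lor (code lsr 6)));
        Buffer.add_char b (Char.chr (0x80 lor (code land 0x3F)))
      end)
    s;
  Buffer.contents b
]]></programlisting>
</para>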
4357         </sect2>
4358
4359
4360         <sect2>
4361           <title>The resolver API</title>
4362
          <para>A resolver is an object that can be opened like a file;
however, instead of a file name you pass the XML identifier of the entity
to read from (either a <literal>SYSTEM</literal> or <literal>PUBLIC</literal>
clause). When opened, the resolver must return the
<literal>Lexing.lexbuf</literal> that reads the characters.  The resolver can
be closed, and it can be cloned. Furthermore, it is possible to tell the
resolver which character set it should assume. The following is taken from Pxp_reader:
4370
4371 <programlisting><![CDATA[
4372 exception Not_competent
4373 exception Not_resolvable of exn
4374
4375 class type resolver =
4376   object
4377     method init_rep_encoding : rep_encoding -> unit
4378     method init_warner : collect_warnings -> unit
4379     method rep_encoding : rep_encoding
4380     method open_in : ext_id -> Lexing.lexbuf
4381     method close_in : unit
4382     method change_encoding : string -> unit
4383     method clone : resolver
4384     method close_all : unit
4385   end
4386 ]]></programlisting>
4387
4388 The resolver object must work as follows:</para>
4389
4390 <para>
4391             <itemizedlist mark="bullet" spacing="compact">
4392               <listitem>
4393                 <para>When the parser is called, it tells the resolver the
4394 warner object and the internal encoding by invoking
4395 <literal>init_warner</literal> and <literal>init_rep_encoding</literal>. The
4396 resolver should store these values. The method <literal>rep_encoding</literal>
4397 should return the internal encoding.
4398 </para>
4399               </listitem>
4400               <listitem>
4401                 <para>If the parser wants to read from the resolver, it invokes
4402 the method <literal>open_in</literal>. Either the resolver succeeds, in which
4403 case the <literal>Lexing.lexbuf</literal> reading from the file or stream must
4404 be returned, or opening fails. In the latter case the method implementation
4405 should raise an exception (see below).</para>
4406               </listitem>
4407               <listitem>
4408                 <para>If the parser finishes reading, it calls the
4409 <literal>close_in</literal> method.</para>
4410               </listitem>
4411               <listitem>
4412                 <para>If the parser finds a reference to another external
4413 entity in the input stream, it calls <literal>clone</literal> to get a second
4414 resolver which must be initially closed (not yet connected with an input
4415 stream).  The parser then invokes <literal>open_in</literal> and the other
4416 methods as described.</para>
4417               </listitem>
4418               <listitem>
4419                 <para>If you already know the character set of the input
4420 stream, you should recode it to the internal encoding, and define the method
4421 <literal>change_encoding</literal> as an empty method.</para>
4422               </listitem>
4423               <listitem>
4424                 <para>If you want to support multiple external character sets,
4425 the object must follow a much more complicated protocol. Directly after
4426 <literal>open_in</literal> has been called, the resolver must return a lexical
4427 buffer that only reads one byte at a time. This is only possible if you create
4428 the lexical buffer with <literal>Lexing.from_function</literal>; the function
4429 must then always return 1 if the EOF is not yet reached, and 0 if EOF is
4430 reached. If the parser has read the first line of the document, it will invoke
4431 <literal>change_encoding</literal> to tell the resolver which character set to
4432 assume. From this moment, the object can return more than one byte at once. The
4433 argument of <literal>change_encoding</literal> is either the parameter of the
4434 "encoding" attribute of the XML declaration, or the empty string if there is
4435 not any XML declaration or if the declaration does not contain an encoding
4436 attribute. </para>
4437
4438                 <para>At the beginning the resolver must only return one
4439 character every time something is read from the lexical buffer. The reason for
4440 this is that you otherwise would not exactly know at which position in the
4441 input stream the character set changes.</para>
4442
4443                 <para>If you want automatic recognition of the character set,
4444 it is up to the resolver object to implement this.</para>
4445               </listitem>
4446
4447               <listitem><para>If an error occurs, the parser calls the method
4448 <literal>close_all</literal> for the top-level resolver; this method should
4449 close itself (if not already done) and all clones.</para>
4450               </listitem>
4451             </itemizedlist>
4452 </para>
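          <para>The byte-wise phase of this protocol can be sketched with
<literal>Lexing.from_function</literal>. The names
<literal>make_refill</literal> and <literal>lexbuf_of_string</literal> are
hypothetical, the input is a plain string, and no recoding takes place: the
sketch shows only how the refill function delivers one byte at a time until
<literal>change_encoding</literal> has been called.

<programlisting><![CDATA[
(* Returns a refill function for Lexing.from_function and a function that
   lifts the one-byte restriction. *)
let make_refill (input : string) =
  let pos = ref 0 in
  let byte_wise = ref true in
  let refill (buf : bytes) (n : int) : int =
    let available = String.length input - !pos in
    let limit = if !byte_wise then 1 else n in
    let k = min limit (min n available) in
    Bytes.blit_string input !pos buf 0 k;
    pos := !pos + k;
    k                                   (* 0 signals EOF *)
  in
  (* A real resolver would also select a decoder for [_enc] here. *)
  let change_encoding (_enc : string) = byte_wise := false in
  (refill, change_encoding)

let lexbuf_of_string input =
  let refill, change_encoding = make_refill input in
  (Lexing.from_function refill, change_encoding)
]]></programlisting>

As long as the restriction is active, the lexical buffer never holds bytes
beyond the current position, so the switch to another character set takes
effect exactly at the next byte.</para>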
4453           <formalpara><title>Exceptions</title>
4454             <para>
4455 It is possible to chain resolvers such that when the first resolver is not able
4456 to open the entity, the other resolvers of the chain are tried in turn. The
4457 method <literal>open_in</literal> should raise the exception
4458 <literal>Not_competent</literal> to indicate that the next resolver should try
4459 to open the entity. If the resolver is able to handle the ID, but some other
4460 error occurs, the exception <literal>Not_resolvable</literal> should be raised
so that the chain is broken.
4462           </para>
4463           </formalpara>
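          <para>The way the two exceptions steer a chain can be modelled with
ordinary functions. In this sketch the exceptions are local stand-ins with the
same names as those of Pxp_reader, resolvers are reduced to functions from an
ID to a result, and <literal>open_with_chain</literal> is a hypothetical
helper; the classes of the library are not used.

<programlisting><![CDATA[
(* Local stand-ins for the exceptions of Pxp_reader. *)
exception Not_competent
exception Not_resolvable of exn

(* Try the resolvers in turn. Not_competent passes control to the next
   resolver; any other exception, such as Not_resolvable, propagates and
   thereby breaks the chain. *)
let rec open_with_chain resolvers id =
  match resolvers with
  | [] -> raise Not_competent
  | r :: rest ->
      (match r id with
       | result -> result
       | exception Not_competent -> open_with_chain rest id)
]]></programlisting>

If no resolver of the chain feels competent, the chain as a whole reports
<literal>Not_competent</literal>, so that chains can be nested.</para>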
4464
4465         <para>Example: How to define a resolver that is equivalent to
4466 from_string: ...</para>
4467
4468         </sect2>
4469
4470         <sect2>
4471           <title>Predefined resolver components</title>
4472           <para>
4473 There are some classes in Pxp_reader that define common resolver behaviour.
4474
4475 <programlisting><![CDATA[
4476 class resolve_read_this_channel :
4477     ?id:ext_id ->
4478     ?fixenc:encoding ->
4479     ?auto_close:bool ->
4480     in_channel ->
4481         resolver
4482 ]]></programlisting>
4483
Reads from the passed channel (it may even be a pipe). If the
<literal>~id</literal> argument is passed to the object, the created resolver
accepts only this ID. Otherwise all IDs are accepted. Once the resolver has
been cloned, it does not accept any ID. This means that this resolver cannot
handle inner references to external entities. Note that you can combine this
resolver with another resolver that can handle inner references (such as
resolve_as_file); see class 'combine' below. If you pass the
<literal>~fixenc</literal> argument, the encoding of the channel is set to the
passed value, regardless of any auto-recognition or any XML declaration. If
<literal>~auto_close = true</literal> (which is the default), the channel is
closed after use. If <literal>~auto_close = false</literal>, the channel is
left open.
4496  </para>
4497
4498           <para>
4499 <programlisting><![CDATA[
4500 class resolve_read_any_channel :
4501     ?auto_close:bool ->
4502     channel_of_id:(ext_id -> (in_channel * encoding option)) ->
4503         resolver
4504 ]]></programlisting>
4505
4506 This resolver calls the function <literal>~channel_of_id</literal> to open a
4507 new channel for the passed <literal>ext_id</literal>. This function must either
4508 return the channel and the encoding, or it must fail with Not_competent.  The
4509 function must return <literal>None</literal> as encoding if the default
4510 mechanism to recognize the encoding should be used. It must return
4511 <literal>Some e</literal> if it is already known that the encoding of the
4512 channel is <literal>e</literal>.  If <literal>~auto_close = true</literal>
4513 (which is the default), the channel is closed after use. If
4514 <literal>~auto_close = false</literal>, the channel is left open.
4515 </para>
4516
4517           <para>
4518 <programlisting><![CDATA[
4519 class resolve_read_url_channel :
4520     ?base_url:Neturl.url ->
4521     ?auto_close:bool ->
4522     url_of_id:(ext_id -> Neturl.url) ->
4523     channel_of_url:(Neturl.url -> (in_channel * encoding option)) ->
4524         resolver
4525 ]]></programlisting>
4526
4527 When this resolver gets an ID to read from, it calls the function
4528 <literal>~url_of_id</literal> to get the corresponding URL. This URL may be a
4529 relative URL; however, a URL scheme must be used which contains a path.  The
4530 resolver converts the URL to an absolute URL if necessary.  The second
4531 function, <literal>~channel_of_url</literal>, is fed with the absolute URL as
4532 input. This function opens the resource to read from, and returns the channel
4533 and the encoding of the resource.
4534 </para>
4535 <para>
4536 Both functions, <literal>~url_of_id</literal> and
4537 <literal>~channel_of_url</literal>, can raise Not_competent to indicate that
4538 the object is not able to read from the specified resource. However, there is a
4539 difference: A Not_competent from <literal>~url_of_id</literal> is left as it
4540 is, but a Not_competent from <literal>~channel_of_url</literal> is converted to
Not_resolvable. So only <literal>~url_of_id</literal> decides which URLs are
accepted by the resolver and which are not.
4543 </para>
4544 <para>
4545 The function <literal>~channel_of_url</literal> must return
4546 <literal>None</literal> as encoding if the default mechanism to recognize the
4547 encoding should be used. It must return <literal>Some e</literal> if it is
4548 already known that the encoding of the channel is <literal>e</literal>.
4549 </para>
4550 <para>
4551 If <literal>~auto_close = true</literal> (which is the default), the channel is
4552 closed after use. If <literal>~auto_close = false</literal>, the channel is
4553 left open.
4554 </para>
4555 <para>
4556 Objects of this class contain a base URL relative to which relative URLs are
4557 interpreted. When creating a new object, you can specify the base URL by
4558 passing it as <literal>~base_url</literal> argument. When an existing object is
cloned, the base URL of the clone is the URL of the original object. Note
that the term "base URL" has a strict definition in RFC 1808.
4561 </para>
4562
4563           <para>
4564 <programlisting><![CDATA[
4565 class resolve_read_this_string :
4566     ?id:ext_id ->
4567     ?fixenc:encoding ->
4568     string ->
4569         resolver
4570 ]]></programlisting>
4571
Reads from the passed string. If the <literal>~id</literal> argument is passed
to the object, the created resolver accepts only this ID; otherwise all IDs are
accepted. Once the resolver has been cloned, it does not accept any ID. This
means that this resolver cannot handle inner references to external
entities. Note that you can combine this resolver with another resolver that
can handle inner references (such as <literal>resolve_as_file</literal>); see
the class <literal>combine</literal> below. If you pass the
<literal>~fixenc</literal> argument, the encoding of the string is set to the
passed value, regardless of any auto-recognition or any XML declaration.
4581 </para>
4582
4583           <para>
4584 <programlisting><![CDATA[
4585 class resolve_read_any_string :
4586     string_of_id:(ext_id -> (string * encoding option)) ->
4587         resolver
4588 ]]></programlisting>
4589
4590 This resolver calls the function <literal>~string_of_id</literal> to get the
4591 string for the passed <literal>ext_id</literal>. This function must either
4592 return the string and the encoding, or it must fail with Not_competent.  The
4593 function must return <literal>None</literal> as encoding if the default
4594 mechanism to recognize the encoding should be used. It must return
4595 <literal>Some e</literal> if it is already known that the encoding of the
4596 string is <literal>e</literal>.
4597 </para>
4598
4599           <para>
4600 <programlisting><![CDATA[
4601 class resolve_as_file :
4602     ?file_prefix:[ `Not_recognized | `Allowed | `Required ] ->
4603     ?host_prefix:[ `Not_recognized | `Allowed | `Required ] ->
4604     ?system_encoding:encoding ->
4605     ?url_of_id:(ext_id -> Neturl.url) ->
4606     ?channel_of_url: (Neturl.url -> (in_channel * encoding option)) ->
4607     unit ->
4608         resolver
4609 ]]></programlisting>
Reads from the local file system. Every file name is interpreted as a
file name of the local file system, and the file it refers to is read.
4612 </para>
4613 <para>
4614 The full form of a file URL is: file://host/path, where
'host' specifies the host system where the file identified by 'path'
4616 resides. host = "" or host = "localhost" are accepted; other values
4617 will raise Not_competent. The standard for file URLs is
4618 defined in RFC 1738.
4619 </para>
4620 <para>
4621 Option <literal>~file_prefix</literal>: Specifies how the "file:" prefix of
4622 file names is handled:
4623             <itemizedlist mark="bullet" spacing="compact">
4624               <listitem>
                <para><literal>`Not_recognized:</literal> The prefix is not
4626 recognized.</para>
4627               </listitem>
4628               <listitem>
4629                 <para><literal>`Allowed:</literal> The prefix is allowed but
4630 not required (the default).</para>
4631               </listitem>
4632               <listitem>
4633                 <para><literal>`Required:</literal> The prefix is
4634 required.</para>
4635               </listitem>
4636             </itemizedlist>
4637 </para>
4638 <para>
4639 Option <literal>~host_prefix:</literal> Specifies how the "//host" phrase of
4640 file names is handled:
4641             <itemizedlist mark="bullet" spacing="compact">
4642               <listitem>
                <para><literal>`Not_recognized:</literal> The prefix is not
4644 recognized.</para>
4645               </listitem>
4646               <listitem>
4647                 <para><literal>`Allowed:</literal> The prefix is allowed but
4648 not required (the default).</para>
4649               </listitem>
4650               <listitem>
4651                 <para><literal>`Required:</literal> The prefix is
4652 required.</para>
4653               </listitem>
4654             </itemizedlist>
4655 </para>
4656 <para>
4657 Option <literal>~system_encoding:</literal> Specifies the encoding of file
4658 names of the local file system. Default: UTF-8.
4659 </para>
4660 <para>
4661 Options <literal>~url_of_id</literal>, <literal>~channel_of_url</literal>: Not
4662 for the casual user!
4663 </para>
4664
4665           <para>
4666 <programlisting><![CDATA[
4667 class combine :
4668     ?prefer:resolver ->
4669     resolver list ->
4670         resolver
4671 ]]></programlisting>
4672
4673 Combines several resolver objects. If a concrete entity with an
4674 <literal>ext_id</literal> is to be opened, the combined resolver tries the
4675 contained resolvers in turn until a resolver accepts opening the entity
4676 (i.e. it does not raise Not_competent on open_in).
4677 </para>
4678 <para>
4679 Clones: If the 'clone' method is invoked before 'open_in', all contained
4680 resolvers are cloned separately and again combined. If the 'clone' method is
4681 invoked after 'open_in' (i.e. while the resolver is open), additionally the
4682 clone of the active resolver is flagged as being preferred, i.e. it is tried
4683 first.
4684 </para>
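<para>
The following sketch shows one way the predefined components might be combined
so that a document held in memory can still refer to files. The file name
<literal>chapter.xml</literal> is a placeholder, and, as an assumption,
<literal>ExtID</literal> and <literal>Anonymous</literal> are taken to be
available from <literal>Pxp_yacc</literal> and <literal>Pxp_types</literal>.

<programlisting><![CDATA[
(* Sketch under the assumptions stated above. *)
open Pxp_types
open Pxp_yacc
open Pxp_reader

(* The main document lives in a string and refers to an external entity. *)
let text =
  "<?xml version='1.0'?>\n\
   <!DOCTYPE doc [ <!ENTITY chapter SYSTEM 'chapter.xml'> ]>\n\
   <doc>&chapter;</doc>"

let resolver =
  new combine
    [ (* serves the string itself, but no ID at all once it is cloned *)
      new resolve_read_this_string text;
      (* serves inner references such as chapter.xml *)
      new resolve_as_file ()
    ]

let source = ExtID (Anonymous, resolver)

let doc = parse_wfdocument_entity default_config source default_spec
]]></programlisting>

When the parser opens the entity <literal>chapter</literal>, it works on a clone
of the combined resolver. The clone of the string resolver is tried first
because it is the active one, it declines with Not_competent, and the file
resolver then takes over.
</para>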
4685
4686         </sect2>
4687       </sect1>
4688
4689       <sect1>
4690         <title>The DTD classes</title> <para><emphasis>Sorry, not yet
4691 written. Perhaps the interface definition of Pxp_dtd expresses the same:
4692 </emphasis></para>
4693         <para>
4694 <programlisting>&markup-dtd1.mli;&markup-dtd2.mli;</programlisting>
4695 </para>
4696       </sect1>
4697
4698       <sect1>
4699         <title>Invoking the parser</title>
4700
        <para>This section describes the module Pxp_yacc.</para>
4702
4703         <sect2>
4704           <title>Defaults</title>
4705           <para>The following defaults are available:
4706
4707 <programlisting>
4708 val default_config : config
4709 val default_extension : ('a node extension) as 'a
4710 val default_spec : ('a node extension as 'a) spec
4711 </programlisting>
4712 </para>
4713         </sect2>
4714
4715         <sect2>
4716           <title>Parsing functions</title>
4717           <para>In the following, the term "closed document" refers to
4718 an XML structure like
4719
4720 <programlisting>
4721 &lt;!DOCTYPE ... [ <replaceable>declarations</replaceable> ] &gt;
4722 &lt;<replaceable>root</replaceable>&gt;
4723 ...
4724 &lt;/<replaceable>root</replaceable>&gt;
4725 </programlisting>
4726
4727 The term "fragment" refers to an XML structure like
4728
4729 <programlisting>
4730 &lt;<replaceable>root</replaceable>&gt;
4731 ...
4732 &lt;/<replaceable>root</replaceable>&gt;
4733 </programlisting>
4734
4735 i.e. only to one isolated element instance.
4736 </para>
4737
4738           <para>
4739 <programlisting><![CDATA[
4740 val parse_dtd_entity : config -> source -> dtd
4741 ]]></programlisting>
4742
Parses the declarations which are contained in the entity, and returns them as
a <literal>dtd</literal> object.
4745 </para>
4746
4747           <para>
4748 <programlisting><![CDATA[
4749 val extract_dtd_from_document_entity : config -> source -> dtd
4750 ]]></programlisting>
4751
4752 Extracts the DTD from a closed document. Both the internal and the external
subsets are extracted and combined into one <literal>dtd</literal> object. This
4754 function does not parse the whole document, but only the parts that are
4755 necessary to extract the DTD.
4756 </para>
4757
4758           <para>
4759 <programlisting><![CDATA[
4760 val parse_document_entity :
4761     ?transform_dtd:(dtd -> dtd) ->
4762     ?id_index:('ext index) ->
4763     config ->
4764     source ->
4765     'ext spec ->
4766         'ext document
4767 ]]></programlisting>
4768
4769 Parses a closed document and validates it against the DTD that is contained in
4770 the document (internal and external subsets). The option
4771 <literal>~transform_dtd</literal> can be used to transform the DTD in the
4772 document, and to use the transformed DTD for validation. If
4773 <literal>~id_index</literal> is specified, an index of all ID attributes is
4774 created.
4775 </para>
4776
4777           <para>
4778 <programlisting><![CDATA[
4779 val parse_wfdocument_entity :
4780     config ->
4781     source ->
4782     'ext spec ->
4783         'ext document
4784 ]]></programlisting>
4785
Parses a closed document, but checks it only for well-formedness.
4787 </para>
4788
4789           <para>
4790 <programlisting><![CDATA[
4791 val parse_content_entity  :
4792     ?id_index:('ext index) ->
4793     config ->
4794     source ->
4795     dtd ->
4796     'ext spec ->
4797         'ext node
4798 ]]></programlisting>
4799
Parses a fragment, and validates the element against the passed
<literal>dtd</literal>.
4801 </para>
4802
4803           <para>
4804 <programlisting><![CDATA[
4805 val parse_wfcontent_entity :
4806     config ->
4807     source ->
4808     'ext spec ->
4809         'ext node
4810 ]]></programlisting>
4811
Parses a fragment, but checks it only for well-formedness.
4813 </para>
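          <para>
A minimal calling sequence for a validating parse could look like the sketch
below. The file name <literal>sample.xml</literal> is a placeholder. The sketch
also assumes that <literal>from_file</literal> in <literal>Pxp_yacc</literal>
turns a file name into a <literal>source</literal> and that
<literal>Pxp_types.string_of_exn</literal> formats parser exceptions; verify
both against the interface files of your version.

<programlisting><![CDATA[
(* Sketch under the assumptions stated above. *)
open Pxp_types
open Pxp_yacc
open Pxp_document

let () =
  try
    (* Parse a closed document and validate it against its own DTD. *)
    let doc =
      parse_document_entity default_config (from_file "sample.xml") default_spec
    in
    let root = doc # root in
    (match root # node_type with
     | T_element name -> print_endline ("Root element: " ^ name)
     | _ -> print_endline "The root node is not an element")
  with
    e ->
      (* string_of_exn is assumed to render PXP errors readably. *)
      prerr_endline (string_of_exn e)
]]></programlisting>

Replacing <literal>parse_document_entity</literal> by
<literal>parse_wfdocument_entity</literal> in this sketch gives the
corresponding well-formedness check.
</para>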
4814         </sect2>
4815
4816         <sect2>
4817           <title>Configuration options</title>
4818           <para>
4819
4820 <programlisting><![CDATA[
4821 type config =
4822     { warner : collect_warnings;
4823       errors_with_line_numbers : bool;
4824       enable_pinstr_nodes : bool;
4825       enable_super_root_node : bool;
4826       enable_comment_nodes : bool;
4827       encoding : rep_encoding;
4828       recognize_standalone_declaration : bool;
4829       store_element_positions : bool;
4830       idref_pass : bool;
4831       validate_by_dfa : bool;
4832       accept_only_deterministic_models : bool;
4833       ...
4834     }
4835 ]]></programlisting>
4836
4837 <itemizedlist mark="bullet" spacing="compact">
              <listitem><para><literal>warner:</literal> The parser prints
warnings by invoking the method <literal>warn</literal> for this warner
object. (Default: all warnings are dropped)</para>
              </listitem>
              <listitem><para><literal>errors_with_line_numbers:</literal> If
true, errors contain line numbers; if false, errors contain only byte
positions. The latter mode is faster. (Default: true)</para>
              </listitem>
              <listitem><para><literal>enable_pinstr_nodes:</literal> If true,
the parser creates extra nodes for processing instructions. If false,
processing instructions are simply added to the element or document surrounding
the instructions. (Default: false)</para>
              </listitem>
              <listitem><para><literal>enable_super_root_node:</literal> If
true, the parser creates an extra node which is the parent of the root of the
document tree. This node is called super root; it is an element with type
<literal>T_super_root</literal>. If there are processing instructions outside
the root element and outside the DTD, they are added to the super root instead
of the document. If false, the super root node is not created. (Default:
false)</para>
              </listitem>
              <listitem><para><literal>enable_comment_nodes:</literal> If true,
the parser creates nodes for comments with type <literal>T_comment</literal>;
if false, such nodes are not created. (Default: false)</para>
              </listitem>
              <listitem><para><literal>encoding:</literal> Specifies the
internal encoding of the parser. Most strings are then represented according to
this encoding; however, there are some exceptions (especially
<literal>ext_id</literal> values, which are always UTF-8 encoded).
(Default: `Enc_iso88591)</para>
4868               </listitem>
              <listitem><para><literal>
recognize_standalone_declaration:</literal> If true and if the parser is
validating, a <literal>standalone="yes"</literal> declaration causes the parser
to check whether the document really is a standalone document. If false, or if
the parser is in well-formedness mode, such declarations are ignored.
(Default: true)
4875 </para>
4876               </listitem>
4877               <listitem><para><literal>store_element_positions:</literal> If
4878 true, for every non-data node the source position is stored. If false, the
4879 position information is lost. If available, you can get the positions of nodes
4880 by invoking the <literal>position</literal> method.
4881 (Default: true)</para>
4882               </listitem>
              <listitem><para><literal>idref_pass:</literal> If true and if
there is an ID index, the parser checks whether every IDREF or IDREFS attribute
refers to an existing node; this requires that the parser traverses the whole
document tree. If false, this check is left out. (Default: false)</para>
              </listitem>
              <listitem><para><literal>validate_by_dfa:</literal> If true and if
4889 the content model for an element type is deterministic, a deterministic finite
4890 automaton is used to validate whether the element contents match the content
4891 model of the type. If false, or if a DFA is not available, a backtracking
4892 algorithm is used for validation. (Default: true)
4893 </para>
4894               </listitem>
4895               <listitem><para><literal>
4896 accept_only_deterministic_models:</literal> If true, only deterministic content
4897 models are accepted; if false, any syntactically correct content models can be
4898 processed. (Default: true)</para>
4899               </listitem>
4900             </itemizedlist></para>
4901         </sect2>
4902
4903         <sect2>
4904           <title>Which configuration should I use?</title>
          <para>First, I recommend varying the default configuration instead of
4906 creating a new configuration record. For instance, to set
4907 <literal>idref_pass</literal> to <literal>true</literal>, change the default
4908 as in:
4909 <programlisting>
4910 let config = { default_config with idref_pass = true }
4911 </programlisting>
The reason is that I can add more options to the record in future versions
4913 of the parser without breaking your programs.</para>
4914
4915           <formalpara>
4916             <title>Do I need extra nodes for processing instructions?</title>
4917 <para>By default, such nodes are not created. This does not mean that the
4918 processing instructions are lost; however, you cannot find out the exact
4919 location where they occur. For example, the following XML text
4920
4921 <programlisting><![CDATA[
4922 <x><?pi1?><y/><?pi2?></x>
4923 ]]></programlisting>
4924
4925 will normally create one element node for <literal>x</literal> containing
4926 <emphasis>one</emphasis> subnode for <literal>y</literal>. The processing
4927 instructions are attached to <literal>x</literal> in a separate hash table; you
4928 can access them using <literal>x # pinstr "pi1"</literal> and <literal>x #
pinstr "pi2"</literal>, respectively. However, the information about where the
instructions occur within <literal>x</literal> is lost.
4931 </para>
4932           </formalpara>
4933
4934             <para>If the option <literal>enable_pinstr_nodes</literal> is
4935 turned on, the parser creates extra nodes <literal>pi1</literal> and
4936 <literal>pi2</literal> such that the subnodes of <literal>x</literal> are now:
4937
4938 <programlisting><![CDATA[
4939 x # sub_nodes = [ pi1; y; pi2 ]
4940 ]]></programlisting>
4941
4942 The extra nodes contain the processing instructions in the usual way, i.e. you
4943 can access them using <literal>pi1 # pinstr "pi1"</literal> and <literal>pi2 #
4944 pinstr "pi2"</literal>, respectively.
4945 </para>
4946
4947           <para>Note that you will need an exemplar for the PI nodes (see
4948 <literal>make_spec_from_alist</literal>).</para>
4949
4950           <formalpara>
4951             <title>Do I need a super root node?</title>
4952             <para>By default, there is no super root node. The
4953 <literal>document</literal> object refers directly to the node representing the
4954 root element of the document, i.e.
4955
4956 <programlisting><![CDATA[
4957 doc # root = r
4958 ]]></programlisting>
4959
4960 if <literal>r</literal> is the root node. This is sometimes inconvenient: (1)
4961 Some algorithms become simpler if every node has a parent, even the root
4962 node. (2) Some standards such as XPath call the "root node" the node whose
4963 child represents the root of the document. (3) The super root node can serve
as a container for processing instructions outside the root element. For
these reasons, it is possible to create an extra super root node, whose child
4966 is the root node:
4967
4968 <programlisting><![CDATA[
4969 doc # root = sr         &&
4970 sr # sub_nodes = [ r ]
4971 ]]></programlisting>
4972
4973 When extra nodes are also created for processing instructions, these nodes can
4974 be added to the super root node if they occur outside the root element (reason
4975 (3)), and the order reflects the order in the source text.</para>
4976           </formalpara>
4977
4978           <para>Note that you will need an exemplar for the super root node
4979 (see <literal>make_spec_from_alist</literal>).</para>
4980
4981           <formalpara>
4982             <title>What is the effect of the UTF-8 encoding?</title>
            <para>By default, the parser represents strings (with a few
4984 exceptions) as ISO-8859-1 strings. These are well-known, and there are tools
4985 and fonts for this encoding.</para>
4986           </formalpara>
4987           <para>However, internationalization may require that you switch over
4988 to UTF-8 encoding. In most environments, the immediate effect will be that you
4989 cannot read strings with character codes >= 160 any longer; your terminal will
4990 only show funny glyph combinations. It is strongly recommended to install
4991 Unicode fonts (<ulink URL="http://czyborra.com/unifont/">GNU Unifont</ulink>,
4992 <ulink URL="http://www.cl.cam.ac.uk/~mgk25/download/ucs-fonts.tar.gz">
4993 Markus Kuhn's fonts</ulink>) and <ulink
4994 URL="http://myweb.clark.net/pub/dickey/xterm/xterm.html">terminal emulators
4995 that can handle UTF-8 byte sequences</ulink>. Furthermore, a Unicode editor may
4996 be helpful (such as <ulink
URL="ftp://metalab.unc.edu/pub/Linux/apps/editors/X/">Yudit</ulink>). There is
also a <ulink URL="http://www.cl.cam.ac.uk/~mgk25/unicode.html">FAQ</ulink> by
Markus Kuhn.
5000 </para>
5001           <para>By setting <literal>encoding</literal> to
5002 <literal>`Enc_utf8</literal> all strings originating from the parsed XML
5003 document are represented as UTF-8 strings. This includes not only character
5004 data and attribute values but also element names, attribute names and so on, as
5005 it is possible to use any Unicode letter to form such names.  Strictly
5006 speaking, PXP is only XML-compliant if the UTF-8 mode is used; otherwise it
5007 will have difficulties when validating documents containing
5008 non-ISO-8859-1-names.
5009 </para>
5010
          <para>This mode does not have any impact on the external
representation of documents. The character set assumed when reading a document
is set in the XML declaration, and the character set used when writing a
document must be passed to the <literal>write</literal> method.
5015 </para>
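          <para>
The separation between internal and external encoding can be sketched as
follows. The file name is a placeholder. As assumptions,
<literal>from_file</literal> is taken from <literal>Pxp_yacc</literal>, and the
<literal>write</literal> method is taken to expect an output stream built with
the constructor <literal>Out_channel</literal> from
<literal>Pxp_types</literal>, followed by the encoding; please verify both
before use.

<programlisting><![CDATA[
(* Sketch; from_file, Out_channel and the argument order of write
   are assumptions to verify. *)
open Pxp_types
open Pxp_yacc

(* Internal representation of all strings: UTF-8. *)
let config = { default_config with encoding = `Enc_utf8 }

(* The encoding of the input is taken from its XML declaration. *)
let doc = parse_document_entity config (from_file "sample.xml") default_spec

(* The output encoding is chosen independently of the internal one. *)
let () = doc # write (Out_channel stdout) `Enc_iso88591
]]></programlisting>
</para>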
5016
5017           <formalpara>
            <title>How do I check that nodes exist which are referred to by IDREF attributes?</title>
5019             <para>First, you must create an index of all occurring ID
5020 attributes:
5021
5022 <programlisting><![CDATA[
5023 let index = new hash_index
5024 ]]></programlisting>
5025
5026 This index must be passed to the parsing function:
5027
5028 <programlisting><![CDATA[
5029 parse_document_entity
  ~id_index:(index :> 'ext index)
5031   config source spec
5032 ]]></programlisting>
5033
You must also turn on the <literal>idref_pass</literal> mode in the
<literal>config</literal> record that is passed to the parsing function:
5035
5036 <programlisting><![CDATA[
5037 let config = { default_config with idref_pass = true }
5038 ]]></programlisting>
5039
5040 Note that now the whole document tree will be traversed, and every node will be
5041 checked for IDREF and IDREFS attributes. If the tree is big, this may take some
5042 time.
5043 </para>
5044           </formalpara>
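          <para>
Put together in the order in which the definitions are needed, the steps above
might read as in the following sketch. The file name
<literal>sample.xml</literal> and the ID <literal>chapter1</literal> are
placeholders. It is further assumed that <literal>from_file</literal> exists in
<literal>Pxp_yacc</literal> and that the index object has a method
<literal>find</literal> that raises <literal>Not_found</literal> for unknown
IDs.

<programlisting><![CDATA[
(* Sketch; from_file and the find method are assumptions to verify. *)
open Pxp_types
open Pxp_yacc
open Pxp_document

(* 1. Turn on the IDREF check. *)
let config = { default_config with idref_pass = true }

(* 2. Create the index that collects all ID attributes. *)
let index = new hash_index

(* 3. Parse; the index is filled while the document is read. *)
let doc =
  parse_document_entity
    ~id_index:(index :> 'ext index)
    config (from_file "sample.xml") default_spec

(* 4. The same index can afterwards be used to look up a node by its ID. *)
let () =
  try
    let n = index # find "chapter1" in
    (match n # node_type with
     | T_element name -> print_endline ("chapter1 is an element " ^ name)
     | _ -> ())
  with
    Not_found -> print_endline "No element has the ID chapter1"
]]></programlisting>
</para>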
5045
5046           <formalpara>
5047             <title>What are deterministic content models?</title>
            <para>Models of this type can speed up the validation checks;
5049 furthermore they ensure SGML-compatibility. In particular, a content model is
5050 deterministic if the parser can determine the actually used alternative by
5051 inspecting only the current token. For example, this element has
5052 non-deterministic contents:
5053
5054 <programlisting><![CDATA[
5055 <!ELEMENT x ((u,v) | (u,y+) | v)>
5056 ]]></programlisting>
5057
5058 If the first element in <literal>x</literal> is <literal>u</literal>, the
5059 parser does not know which of the alternatives <literal>(u,v)</literal> or
5060 <literal>(u,y+)</literal> will work; the parser must also inspect the second
5061 element to be able to distinguish between the alternatives. Because such
5062 look-ahead (or "guessing") is required, this example is
5063 non-deterministic.</para>
5064           </formalpara>
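          <para>
The look-ahead problem can be made concrete with a deliberately simplified,
self-contained sketch. It is not the algorithm used by PXP: a content model is
reduced to a top-level choice between plain sequences of element names,
occurrence indicators such as <literal>+</literal> are ignored, and only the
first token of every alternative is compared. The names
<literal>toy_model</literal> and <literal>first_token_conflict</literal> are
hypothetical.

<programlisting><![CDATA[
(* Simplified illustration, NOT PXP's determinism check. *)

(* A top-level choice between sequences of element names. *)
type toy_model = string list list

(* Scan the alternatives from left to right and return the first element
   name that starts more than one alternative, if there is any. *)
let rec find_conflict seen = function
  | [] -> None
  | [] :: rest -> find_conflict seen rest
  | (x :: _) :: rest ->
      if List.mem x seen then Some x else find_conflict (x :: seen) rest

let first_token_conflict (m : toy_model) : string option =
  find_conflict [] m

(* ((u,v) | (u,y+) | v): two alternatives start with u. *)
let non_deterministic = [ ["u"; "v"]; ["u"; "y"]; ["v"] ]

(* After factoring out u, as in ((u,(v|y+)) | v), the top-level
   alternatives start with u and v, respectively. *)
let factored = [ ["u"]; ["v"] ]
]]></programlisting>

For <literal>non_deterministic</literal> the function reports the conflict on
<literal>u</literal>, while for <literal>factored</literal> it finds none. The
factored model <literal>((u,(v|y+)) | v)</literal> accepts the same contents as
the example above, but the parser can always decide by looking at the current
element only.
</para>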
5065
          <para>The XML standard requires content models to be
deterministic, so it is recommended to turn the option
5068 <literal>accept_only_deterministic_models</literal> on; however, PXP can also
5069 process non-deterministic models using a backtracking algorithm.</para>
5070
5071           <para>Deterministic models ensure that validation can be performed in
5072 linear time. In order to get the maximum benefits, PXP also implements a
5073 special validator that profits from deterministic models; this is the
5074 deterministic finite automaton (DFA). This validator is enabled per element
5075 type if the element type has a deterministic model and if the option
5076 <literal>validate_by_dfa</literal> is turned on.</para>
5077
5078           <para>In general, I expect that the DFA method is faster than the
5079 backtracking method; especially in the worst case the DFA takes only linear
time. However, if the content model has only a few alternatives and the
5081 alternatives do not nest, the backtracking algorithm may be better.</para>
5082
5083         </sect2>
5084
5085
5086       </sect1>
5087
5088
5089       <sect1>
5090         <title>Updates</title>
5091
        <para><emphasis>Some (often later added) features that are otherwise
not explained in the manual but are worth mentioning.</emphasis></para>
5094
5095         <itemizedlist mark="bullet" spacing="compact">
5096           <listitem><para>Methods node_position, node_path, nth_node,
5097 previous_node, next_node for nodes: See pxp_document.mli</para>
5098           </listitem>
5099           <listitem><para>Functions to determine the document order of nodes:
5100 compare, create_ord_index, ord_number, ord_compare: See pxp_document.mli</para>
5101           </listitem>
5102         </itemizedlist>
5103       </sect1>
5104
5105     </chapter>
5106
5107   </part>
5108 </book>
5109