X-Git-Url: http://matita.cs.unibo.it/gitweb/?a=blobdiff_plain;f=helm%2FDEVEL%2Fpxp%2Fpxp%2Fdoc%2Fmanual%2Fsrc%2Fmarkup.sgml;fp=helm%2FDEVEL%2Fpxp%2Fpxp%2Fdoc%2Fmanual%2Fsrc%2Fmarkup.sgml;h=0000000000000000000000000000000000000000;hb=c7514aaa249a96c5fdd39b1123fbdb38d92f20b6;hp=1cb2064cbe929408fd111826e8d866decc441be0;hpb=1c7fb836e2af4f2f3d18afd0396701f2094265ff;p=helm.git diff --git a/helm/DEVEL/pxp/pxp/doc/manual/src/markup.sgml b/helm/DEVEL/pxp/pxp/doc/manual/src/markup.sgml deleted file mode 100644 index 1cb2064cb..000000000 --- a/helm/DEVEL/pxp/pxp/doc/manual/src/markup.sgml +++ /dev/null @@ -1,5109 +0,0 @@ -PXP"> -PXP"> - - - - - -%readme.code.to-html; -%get.markup-yacc.mli; -%get.markup-dtd.mli; - - - -]> - - - - - The PXP user's guide - - - - - Gerd - Stolpmann - - -
- gerd@gerd-stolpmann.de -
-
-
-
-
- - - 1999, 2000Gerd Stolpmann - - - - - -&markup; is a validating parser for XML-1.0 which has been -written entirely in Objective Caml. - - - Download &markup;: - -The free &markup; library can be downloaded at - -http://www.ocaml-programming.de/packages/ -. This user's guide is included. -Newest releases of &markup; will be announced in -The OCaml Link -Database. - - - - - - License - -This document, and the described software, "&markup;", are copyright by -Gerd Stolpmann. - - - -Permission is hereby granted, free of charge, to any person obtaining -a copy of this document and the "&markup;" software (the -"Software"), to deal in the Software without restriction, including -without limitation the rights to use, copy, modify, merge, publish, -distribute, sublicense, and/or sell copies of the Software, and to -permit persons to whom the Software is furnished to do so, subject to -the following conditions: - - -The above copyright notice and this permission notice shall be included -in all copies or substantial portions of the Software. - - -The Software is provided ``as is'', without warranty of any kind, express -or implied, including but not limited to the warranties of -merchantability, fitness for a particular purpose and noninfringement. -In no event shall Gerd Stolpmann be liable for any claim, damages or -other liability, whether in an action of contract, tort or otherwise, -arising from, out of or in connection with the Software or the use or -other dealings in the software. - - - -
- - - - - - User's guide - - - What is XML? - - - Introduction - - XML (short for Extensible Markup Language) -generalizes the idea that text documents are typically structured in sections, -sub-sections, paragraphs, and so on. The format of the document is not fixed -(as, for example, in HTML), but can be declared by a so-called DTD (document -type definition). The DTD describes only the rules for how the document can be -structured, but not how the document can be processed. For example, if you want -to publish a book that uses XML markup, you will need a processor that converts -the XML file into a printable format such as PostScript. On the one hand, the -structure of XML documents is configurable; on the other hand, there is no -longer a canonical interpretation of the elements of the document; for example -one XML DTD might require paragraphs to be delimited by -para tags, and another DTD expects p tags -for the same purpose. As a result, for every DTD a new processor is required. - - - -Although XML can be used to express structured text documents it is not limited -to this kind of application. For example, XML can also be used to exchange -structured data over a network, or to simply store structured data in -files. Note that XML documents cannot contain arbitrary binary data because -some characters are forbidden; for some applications you need to encode binary -data as text (e.g. the base 64 encoding). - - - - - The "hello world" example - -The following example shows a very simple DTD, and a corresponding document -instance. The document is structured such that it consists of sections, and -that sections consist of paragraphs, and that paragraphs contain plain text: - - - - - - -]]> - - - The following document is an instance of this DTD: - - - - - -
- This is a paragraph of the first section. - This is another paragraph of the first section. -
-
- This is the only paragraph of the second section. -
-
-]]> -
- - As in HTML (and, of course, in grand-father SGML), the "pieces" of -the document are delimited by element braces, i.e. such a piece begins with -<name-of-the-type-of-the-piece> and ends with -</name-of-the-type-of-the-piece>, and the pieces are -called elements. Unlike HTML and SGML, both start tags and -end tags (i.e. the delimiters written in angle brackets) can never be left -out. For example, HTML calls the paragraphs simply p, and -because paragraphs never contain paragraphs, a sequence of several paragraphs -can be written as: - -First paragraph -

Second paragraph]]> - -This is not possible in XML; continuing our example above we must always write - -First paragraph -Second paragraph]]> - -The rationale behind that is to (1) simplify the development of XML parsers -(you need not convert the DTD into a deterministic finite automaton which is -required to detect omitted tags), and to (2) make it possible to parse the -document independent of whether the DTD is known or not. - - - -The first line of our sample document, - - -]]> - - -is the so-called XML declaration. It expresses that the -document follows the conventions of XML version 1.0, and that the document is -encoded using characters from the ISO-8859-1 character set (often known as -"Latin 1", mostly used in Western Europe). Although the XML declaration is not -mandatory, it is good style to include it; everybody sees at the first glance -that the document uses XML markup and not the similar-looking HTML and SGML -markup languages. If you omit the XML declaration, the parser will assume -that the document is encoded as UTF-8 or UTF-16 (there is a rule that makes -it possible to distinguish between UTF-8 and UTF-16 automatically); these -are encodings of Unicode's universal character set. (Note that &pxp;, unlike its -predecessor "Markup", fully supports Unicode.) - - - -The second line, - - -]]> - - -names the DTD that is going to be used for the rest of the document. In -general, it is possible that the DTD consists of two parts, the so-called -external and the internal subset. "External" means that the DTD exists as a -second file; "internal" means that the DTD is included in the same file. In -this example, there is only an external subset, and the system identifier -"simple.dtd" specifies where the DTD file can be found. 
System identifiers are -interpreted as URLs; for instance this would be legal: - - -]]> - - -Please note that &pxp; cannot interpret HTTP identifiers by default, but it is -possible to change the interpretation of system identifiers. - - - -The word immediately following DOCTYPE determines which of -the declared element types (here "document", "section", and "paragraph") is -used for the outermost element, the root element. In this -example it is document because the outermost element is -delimited by <document> and -</document>. - - - -The DTD consists of three declarations for element types: -document, section, and -paragraph. Such a declaration has two parts: - - -<!ELEMENT name content-model> - - -The content model is a regular expression which describes the possible inner -structure of the element. Here, document contains one or -more sections, and a section contains one or more -paragraphs. Note that these two element types are not allowed to contain -arbitrary text. Only the paragraph element type is declared -such that parsed character data (indicated by the symbol -#PCDATA) is permitted. - - - -See below for a detailed discussion of content models. - - - - - XML parsers and processors - -XML documents are human-readable, but this is not the main purpose of this -language. XML has been designed such that documents can be read by a program -called an XML parser. The parser checks that the document -is well-formatted, and it represents the document as objects of the programming -language. There are two aspects when checking the document: First, the document -must follow some basic syntactic rules, such as that tags are written in angle -brackets, that for every start tag there must be a corresponding end tag and so -on. A document respecting these rules is -well-formed. Second, the document must match the DTD in -which case the document is valid. 
Many parsers check only -well-formedness and ignore the DTD; &pxp; is designed such that it can -also validate the document. - - - -A parser alone does not make a sensible application; it only reads XML -documents. The whole application working with XML-formatted data is called an -XML processor. Often XML processors convert documents into -another format, such as HTML or PostScript. Sometimes processors extract data -from the documents and output the processed data as XML again. The parser -can help the application processing the document; for example it can provide -means to access the document in a specific manner. &pxp; specifically supports an -object-oriented access layer. - - - - - Discussion - -As we have seen, there are two levels of description: On the one hand, XML can -define rules about the format of a document (the DTD), on the other hand, XML -expresses structured documents. There are a number of possible applications: - - - - - -XML can be used to express structured texts. Unlike HTML, there is no canonical -interpretation; one would have to write a backend for the DTD that translates -the structured texts into a format that existing browsers, printers -etc. understand. The advantage of a self-defined document format is that it is -possible to design the format in a more problem-oriented way. For example, if -the task is to extract reports from a database, one can use a DTD that reflects -the structure of the report or the database. A possible approach would be to -have an element type for every database table and for every column. Once the -DTD has been designed, the report procedure can be split into a part that -selects the database rows and outputs them as an XML document according to the -DTD, and a part that translates the document into other formats. Of course, -the latter part can be solved in a generic way, e.g. there may be configurable -backends for all DTDs that follow the approach and have element types for -tables and columns.
- - - -XML plays the role of a configurable intermediate format. The database -extraction function can be written without having to know the details of -typesetting; the backends can be written without having to know the details of -the database. - - - -Of course, there are traditional solutions. One can define an ad hoc -intermediate text file format. The disadvantage is that there are no names for -the pieces of the format, and that such formats usually lack documentation -because of this. Another solution would be to have a binary representation, -either as a language-dependent or a language-independent structure (examples of the -latter can be found in RPC implementations). The disadvantage is that it is -harder to view such representations; one has to write pretty printers for this -purpose. It is also more difficult to enter test data; XML is plain text that -can be written using an arbitrary editor (Emacs even has a good XML mode, -PSGML). All these alternatives suffer from a missing structure checker, -i.e. the programs processing these formats usually do not check the input file -or input object in detail; XML parsers check the syntax of the input (the -so-called well-formedness check), and the advanced parsers like &markup; even -verify that the structure matches the DTD (the so-called validation). - - - - - - -XML can be used as a configurable communication language. A fundamental problem -of every communication is that sender and receiver must follow the same -conventions about the language. For data exchange, the question is usually -which data records and fields are available, how they are syntactically -composed, and which values are possible for the various fields. Similar -questions arise for text document exchange.
XML does not answer these problems -completely, but it reduces the number of ambiguities for such conventions: The -outlines of the syntax are specified by the DTD (but not necessarily the -details), and XML introduces canonical names for the components of documents -such that it is simpler to describe the rest of the syntax and the semantics -informally. - - - - - -XML is a data storage format. Currently, every software product tends to use -its own way to store data; commercial software often does not describe such -formats, and it is a pain to integrate such software into a bigger project. -XML can help to improve this situation when several applications share the same -syntax of data files. DTDs are then neutral instances that check the format of -data files independent of applications. - - - - - - - - - - - - - Highlights of XML - - -This section explains many of the features of XML, but not all, and some -features not in detail. For a complete description, see the XML -specification. - - - - The DTD and the instance - -The DTD contains various declarations; in general you can only use a feature if -you have previously declared it. The document instance file may contain the -full DTD, but it is also possible to split the DTD into an internal and an -external subset. A document must begin as follows if the full DTD is included: - - -<?xml version="1.0" encoding="Your encoding"?> -<!DOCTYPE root [ - Declarations -]> - - -These declarations are called the internal subset. Note -that the usage of entities and conditional sections is restricted within the -internal subset. - - -If the declarations are located in a different file, you can refer to this file -as follows: - - -<?xml version="1.0" encoding="Your encoding"?> -<!DOCTYPE root SYSTEM "file name"> - - -The declarations in the file are called the external -subset. The file name is called the system -identifier. 
-It is also possible to refer to the file by a so-called -public identifier, but most XML applications won't use -this feature. - - -You can also specify both internal and external subsets. In this case, the -declarations of both subsets are mixed, and if there are conflicts, the -declaration of the internal subset overrides those of the external subset with -the same name. This looks as follows: - - -<?xml version="1.0" encoding="Your encoding"?> -<!DOCTYPE root SYSTEM "file name" [ - Declarations -]> - - - - -The XML declaration (the string beginning with <?xml and -ending at ?>) should specify the encoding of the -file. Common values are UTF-8, and the ISO-8859 series of character sets. Note -that every file parsed by the XML processor can begin with an XML declaration -and that every file may have its own encoding. - - - -The name of the root element must be mentioned directly after the -DOCTYPE string. This means that a full document instance -looks like - - -<?xml version="1.0" encoding="Your encoding"?> -<!DOCTYPE root SYSTEM "file name" [ - Declarations -]> - -<root> - inner contents -</root> - - - - - - - - Reserved characters - -Some characters are generally reserved to indicate markup such that they cannot -be used for character data. These characters are <, >, and -&. Furthermore, single and double quotes are sometimes reserved. If you -want to include such a character as character, write it as follows: - - - - -&lt; instead of < - - - - -&gt; instead of > - - - - -&amp; instead of & - - - - -&apos; instead of ' - - - - -&quot; instead of " - - - - -All other characters are free in the document instance. It is possible to -include a character by its position in the Unicode alphabet: - - -&#n; - - -where n is the decimal number of the -character. Alternatively, you can specify the character by its hexadecimal -number: - - -&#xn; - - -In the scope of declarations, the character % is no longer free. 
To include it -as a character, you must use the notations &#37; or -&#x25;. - - - Note that besides &lt;, &gt;, &amp;, -&apos;, and &quot; there are no predefined character entities. This is -different from HTML which defines a list of characters that can be referenced -by name (e.g. &auml; for ä); however, if you prefer named characters, you -can declare such entities yourself (see below). - - - - - - - Elements and ELEMENT declarations - - -Elements structure the document instance in a hierarchical way. There is a -top-level element, the root element, which contains a -sequence of inner elements and character sections. The inner elements are -structured in the same way. Every element has an element -type. The beginning of the element is indicated by a start -tag, written - - -<element-type> - - -and the element continues until the corresponding end tag -is reached: - - -</element-type> - - -In XML, it is not allowed to omit start or end tags, even if the DTD would -permit this. Note that there are no special rules for how to interpret spaces or -newlines near start or end tags; all spaces and newlines count. - - - -Every element type must be declared before it can be used. The declaration -consists of two parts: the ELEMENT declaration describes the content model, -i.e. which inner elements are allowed; the ATTLIST declaration describes the -attributes of the element. - - - -An element can simply allow everything as content. This is written: - - -<!ELEMENT name ANY> - - -Conversely, an element can be forced to be empty; this is declared by: - - -<!ELEMENT name EMPTY> - - -Note that there is an abbreviated notation for empty element instances: -<name/>. - - - -There are two more sophisticated forms of declarations: so-called -mixed declarations, and regular -expressions. An element with mixed content contains character data -interspersed with inner elements, and the set of allowed inner elements can be -specified.
In contrast to this, a regular expression declaration does not allow -character data, but the inner elements can be described by the more powerful -means of regular expressions. - - - -A declaration for mixed content looks as follows: - - -<!ELEMENT name (#PCDATA | element1 | ... | elementn )*> - - -or if you do not want to allow any inner element, simply - - -<!ELEMENT name (#PCDATA)> - - - - -

- Example - -If element type q is declared as - - -]]> - - -this is a legal instance: - - -This is character datawith inner elements]]> - - -But this is illegal because t has not been enumerated in the -declaration: - - -This is character datawith inner elements]]> - - -
- - -The other form uses a regular expression to describe the possible contents: - - -<!ELEMENT name regexp> - - -The following well-known regexp operators are allowed: - - - - -element-name - - - - - -(subexpr1 , ... , subexprn ) - - - - - -(subexpr1 | ... | subexprn ) - - - - - -subexpr* - - - - - -subexpr+ - - - - - -subexpr? - - - - -The , operator indicates a sequence of sub-models, the -| operator describes alternative sub-models. The -* indicates zero or more repetitions, and -+ one or more repetitions. Finally, ? can -be used for optional sub-models. As atoms the regexp can contain names of -elements; note that it is not allowed to include #PCDATA. - - - -The exact syntax of the regular expressions is rather strange. This can be -explained best by a list of constraints: - - - - -The outermost expression must not be -element-name. - - Illegal: -]]>; this must be written as -]]>. - - - -For the unary operators subexpr*, -subexpr+, and -subexpr?, the -subexpr must not itself be an -expression with a unary operator. - - Illegal: -]]>; this must be written as -]]>. - - - -Between ) and one of the unary operators -*, +, or ?, there must -not be whitespace. - Illegal: -]]>; this must be written as -]]>. - - There is the additional constraint that the -right parenthesis must be contained in the same entity as the left parenthesis; -see the section about parsed entities below. - - - - - - -Note that there is another restriction: regular expressions must be -deterministic. This means that the parser must be able to see by looking at the -next token which alternative is actually used, or whether the repetition -stops. The reason for this is simply compatibility with SGML (there is no -intrinsic reason for this rule; XML can live without this restriction). - - -
- Example - -The elements are declared as follows: - - - - - - -]]> - -This is a legal instance: - - -Some characters]]> - - -(Note: <s/> is an abbreviation for -<s></s>.) - -It would be illegal to leave ]]> out because at -least one instance of s or t must be -present. It would be illegal, too, if characters existed outside the -r element; the only exception is white space. -- This is -legal, too: - - -]]> - - -
- -
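The idea that a content model is a regular expression over child element names can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the content model `(r?, (s | t)+)` for `q` is an assumed model chosen to agree with the prose of the example above, the helper `children_match` is hypothetical, and the standard-library parser used here checks well-formedness only, so the regular expression is translated by hand. It is not how &pxp; validates.

```python
import re
import xml.etree.ElementTree as ET

# Assumed content model for q: (r?, (s | t)+), translated by hand into a
# regular expression over child element names, each followed by one space.
CONTENT_MODEL = re.compile(r"(r )?((s|t) )+")

def children_match(xml_text):
    """Return True if the children of the root element fit CONTENT_MODEL."""
    root = ET.fromstring(xml_text)
    names = "".join(child.tag + " " for child in root)
    return CONTENT_MODEL.fullmatch(names) is not None

legal = children_match("<q><r>Some characters</r><s/></q>")
also_legal = children_match("<q>  <s/>   <t/>   </q>")
illegal = children_match("<q><r>Some characters</r></q>")  # no s or t present
print(legal, also_legal, illegal)
```

The sketch looks only at the sequence of child elements; it does not reject character data placed directly inside `q`, which a validating parser would do.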
- - - - - Attribute lists and ATTLIST declarations - -Elements may have attributes. These are put into the start tag of an element as -follows: - - -<element-name attribute1="value1" ... attributen="valuen"> - - -Instead of -"valuek" -it is also possible to use single quotes as in -'valuek'. -Note that you cannot use double quotes literally within the value of the -attribute if double quotes are the delimiters; the same applies to single -quotes. You can generally not use < and & as characters in attribute -values. It is possible to include the paraphrases &lt;, &gt;, -&amp;, &apos;, and &quot; (and any other reference to a general -entity as long as the entity is not defined by an external file) as well as -&#n;. - - - -Before you can use an attribute you must declare it. An ATTLIST declaration -looks as follows: - - -<!ATTLIST element-name - attribute-name attribute-type attribute-default - ... - attribute-name attribute-type attribute-default -> - - -There are a lot of types, but most important are: - - - - -CDATA: Every string is allowed as attribute value. - - - - -NMTOKEN: Every nametoken is allowed as attribute -value. Nametokens consist (mainly) of letters, digits, ., :, -, _ in arbitrary -order. - - - - -NMTOKENS: A space-separated list of nametokens is allowed as -attribute value. - - - - -The most interesting default declarations are: - - - - -#REQUIRED: The attribute must be specified. - - - - -#IMPLIED: The attribute can be specified but also can be -left out. The application can find out whether the attribute was present or -not. - - - - -"value" or -'value': This particular value is -used as default if the attribute is omitted in the element. - - - - - -
- Example - -This is a valid attribute declaration for element type r: - - - -]]> - -This means that x is a required attribute that cannot be -left out, while y and z are optional. The -XML parser tells the application whether y is present or -not, but if z is missing the default value -"one two three" is returned automatically. - - - -This is a valid use of these attributes: - - -]]> - - -
- -
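The quoting rules for attribute values can be tried out with any XML parser. The following is a minimal sketch using Python's standard `xml.etree.ElementTree` module, which checks well-formedness but does not read ATTLIST declarations for validation; the element name `item`, its attribute names, and the helper `is_well_formed` are made up for the illustration.

```python
import xml.etree.ElementTree as ET

SOURCE = """<item single='say "hi"'
      double="it's"
      refs="a &lt; b &amp; c &#65;&#x42;"/>"""

elem = ET.fromstring(SOURCE)
single = elem.get("single")    # double quotes are fine inside '...'
double = elem.get("double")    # a single quote is fine inside "..."
refs = elem.get("refs")        # references are replaced by their characters
missing = elem.get("absent")   # None: the attribute was not written

def is_well_formed(text):
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# A literal < is not allowed in an attribute value.
literal_lt_ok = is_well_formed('<item refs="a < b"/>')
print(single, double, refs, missing, literal_lt_ok)
```

An absent attribute shows up as `None` here, which corresponds to the case where the application can find out that an optional attribute was left out.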
- - - Parsed entities - -Elements describe the logical structure of the document, while -entities determine the physical structure. Entities are -the pieces of text the parser operates on, mostly files and macros. Entities -may be parsed in which case the parser reads the text and -interprets it as XML markup, or unparsed which simply -means that the data of the entity has a foreign format (e.g. a GIF icon). - - - If the parsed entity is going to be used as part of the DTD, it -is called a parameter entity. You can declare a parameter -entity with a fixed text as content by: - - -<!ENTITY % name "value"> - - -Within the DTD, you can refer to this entity, i.e. read -the text of the entity, by: - - -%name; - - -Such entities behave like macros, i.e. when they are referred to, the -macro text is inserted and read instead of the original text. - -
- Example - -For example, you can declare two elements with the same content model by: - - - - - -]]> - - - -
- -If the contents of the entity are given as a string constant, the entity is -called an internal entity. It is also possible to name a -file to be used as content (an external entity): - - -<!ENTITY % name SYSTEM "file name"> - - -There are some restrictions for parameter entities: - - - - -If the internal parameter entity contains the first token of a declaration -(i.e. <!), it must also contain the last token of the -declaration, i.e. the >. This means that the entity -either contains a whole number of complete declarations, or some text from the -middle of one declaration. - -Illegal: - -"> - Because <! is contained in the main -entity, and the corresponding > is contained in the -entity e. - - - -If the internal parameter entity contains a left parenthesis, it must also -contain the corresponding right parenthesis. - -Illegal: - - - -]]> Because ( is contained in the entity -e, and the corresponding ) is -contained in the main entity. - - - -When reading text from an entity, the parser automatically inserts one space -character before the entity text and one space character after the entity -text. However, this rule is not applied within the definition of another -entity. -Legal: - - - -]]> Because %suffix; is referenced within -the definition text for iconfile, no additional spaces are -added. - -Illegal: - - - -]]> -Because %suffix; is referenced outside the definition -text of another entity, the parser replaces %suffix; by -a space, the text test, and another space. -Illegal: - - - -]]> Because there is whitespace between ) -and *, which is illegal. - - - -An external parameter entity must always consist of a whole number of complete -declarations. - - - - -In the internal subset of the DTD, a reference to a parameter entity (internal -or external) is only allowed at positions where a new declaration can start. - - - -
- - -If the parsed entity is going to be used in the document instance, it is called -a general entity. Such entities can be used as -abbreviations for frequent phrases, or to include external files. Internal -general entities are declared as follows: - - -<!ENTITY name "value"> - - -External general entities are declared this way: - - -<!ENTITY name SYSTEM "file name"> - - -References to general entities are written as: - - -&name; - - -The main difference between parameter and general entities is that the former -are only recognized in the DTD and that the latter are only recognized in the -document instance. As the DTD is parsed before the document, the parameter -entities are expanded first; for example it is possible to use the content of a -parameter entity as the name of a general entity: -&#38;%name;;This construct is only -allowed within the definition of another entity; otherwise extra spaces would -be added (as explained above). Such indirection is not recommended. - -Complete example: - - - - - -]]> -You can now write &text; in the document instance, and -depending on the value of variant either -text-a or text-b is inserted. -. - - -General entities must respect the element hierarchy. This means that there must -be an end tag for every start tag in the entity value, and that end tags -without corresponding start tags are not allowed. - - -
- Example - -If the author of a document changes sometimes, it is worthwhile to set up a -general entity containing the names of the authors. If the author changes, you -need only to change the definition of the entity, and do not need to check all -occurrences of authors' names: - - - -]]> - - -In the document text, you can now refer to the author names by writing -&authors;. - - - -Illegal: -The following two entities are illegal because the elements in the definition -do not nest properly: - - -"> -"> -]]> - -
- - -Earlier in this introduction we explained that there are substitutes for -reserved characters: &lt;, &gt;, &amp;, &apos;, and -&quot;. These are simply predefined general entities; note that they are -the only predefined entities. It is allowed to define these entities again -as long as the meaning is unchanged. - -
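The behaviour of predefined entities, character references, and internal general entities can be observed with a small Python sketch. It uses the standard `xml.etree.ElementTree` module rather than &pxp;, and the element name `note` is an assumption made for the illustration; the `authors` entity follows the example above.

```python
import xml.etree.ElementTree as ET

DOC = """<!DOCTYPE note [
  <!ENTITY authors "Gerd Stolpmann">
]>
<note>&authors; says: 1 &lt; 2 &amp; 3 &gt; 2; 100&#37; = 100&#x25;</note>"""

# The general entity, the predefined entities, and both forms of
# character reference are replaced while parsing.
text = ET.fromstring(DOC).text

# A named character that was never declared is an error in XML.
try:
    ET.fromstring("<note>&auml;</note>")
    undeclared_accepted = True
except ET.ParseError:
    undeclared_accepted = False

# Declaring the entity yourself makes the same reference legal.
DECLARED = """<!DOCTYPE note [
  <!ENTITY auml "&#228;">
]>
<note>&auml;</note>"""
declared_text = ET.fromstring(DECLARED).text
print(text, undeclared_accepted, declared_text)
```

The second document shows the workaround mentioned for named characters: once `auml` is declared in the internal subset, the reference is replaced like any other general entity.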
- - - Notations and unparsed entities - -Unparsed entities have a foreign format and can thus not be read by the XML -parser. Unparsed entities are always external. The format of an unparsed entity -must have been declared; such a format is called a -notation. The entity can then be declared by referring to -this notation. As unparsed entities do not contain XML text, it is not possible -to include them directly into the document; you can only declare attributes -such that names of unparsed entities are acceptable values. - - - -As you can see, unparsed entities are too complicated to be of much practical -use. It is almost always better to simply pass the name of the data file as -a normal attribute value, and let the application recognize and process the -foreign format. - - - -
- - - - - - - A complete example: The <emphasis>readme</emphasis> DTD - -The reason for readme was that I often wrote two versions -of files such as README and INSTALL which explain aspects of a distributed -software archive; one version was ASCII-formatted, the other was written in -HTML. Maintaining both versions means double amount of work, and changes -of one version may be forgotten in the other version. To improve this situation -I invented the readme DTD which allows me to maintain only -one source written as XML document, and to generate the ASCII and the HTML -version from it. - - - -In this section, I explain only the DTD. The readme DTD is -contained in the &markup; distribution together with the two converters to -produce ASCII and HTML. Another section of this manual describes the HTML -converter. - - - -The documents have a simple structure: There are up to three levels of nested -sections, paragraphs, item lists, footnotes, hyperlinks, and text emphasis. The -outermost element has usually the type readme, it is -declared by - - - - -]]> - -This means that this element contains one or more sections of the first level -(element type sect1), and that the element has a required -attribute title containing character data (CDATA). Note that -readme elements must not contain text data. - - - -The three levels of sections are declared as follows: - - - - - - - -]]> - -Every section has a title element as first subelement. After -the title an arbitrary but non-empty sequence of inner sections, paragraphs and -item lists follows. Note that the inner sections must belong to the next higher -section level; sect3 elements must not contain inner -sections because there is no next higher level. - - - -Obviously, all three declarations allow paragraphs (p) and -item lists (ul). 
The definition can be simplified at this -point by using a parameter entity: - - - - - - - - - -]]> - -Here, the entity p.like is nothing but a macro abbreviating -the same sequence of declarations; if new elements on the same level as -p and ul are later added, it is -sufficient to change only the entity definition. Note that there are some -restrictions on the usage of entities in this context; most importantly, entities -containing a left parenthesis must also contain the corresponding right -parenthesis. - - - -Note that the entity p.like is a -parameter entity, i.e. the ENTITY declaration contains a -percent sign, and the entity is referred to by -%p.like;. This kind of entity must be used to abbreviate -parts of the DTD; the general entities declared without -percent sign and referred to as &name; are not allowed -in this context. - - - -The title element specifies the title of the section in -which it occurs. The title is given as character data, optionally interspersed -with line breaks (br): - - - -]]> - -Compared with the title attribute of -the readme element, this element allows inner markup -(i.e. br) while attribute values do not: It is an error if -an attribute value contains the left angle bracket < literally, so it -is impossible to include inner elements. - - - -The paragraph element p has a structure similar to -title, but it allows more inner elements: - - - - - -]]> - -Line breaks do not have inner structure, so they are declared as being empty: - - - -]]> - -This means that really nothing is allowed within br; you -must always write
]]>
or abbreviated -]]>. -
- - -Code samples should be marked up by the code tag; emphasized -text can be indicated by em: - - - - - -]]> - -That code elements are not allowed to contain further markup -while em elements do is a design decision by the author of -the DTD. - - - -Unordered lists simply consists of one or more list items, and a list item may -contain paragraph-level material: - - - - - -]]> - -Footnotes are described by the text of the note; this text may contain -text-level markup. There is no mechanism to describe the numbering scheme of -footnotes, or to specify how footnote references are printed. - - - -]]> - -Hyperlinks are written as in HTML. The anchor tag contains the text describing -where the link points to, and the href attribute is the -pointer (as URL). There is no way to describe locations of "hash marks". If the -link refers to another readme document, the attribute -readmeref should be used instead of href. -The reason is that the converted document has usually a different system -identifier (file name), and the link to a converted document must be -converted, too. - - - - -]]> - -Note that although it is only sensible to specify one of the two attributes, -the DTD has no means to express this restriction. - - - -So far the DTD. Finally, here is a document for it: - - - - - - - Usage -

- The readme converter is invoked on the command line by: -

-

- readme [ -text | -html ] input.xml -

-

- Here is a list of options: -

-
    -
  • -

    -text: specifies that ASCII output should be produced

    -
  • -
  • -

    -html: specifies that HTML output should be produced

    -
  • -
-

- The input file must be given on the command line. The converted output is - printed to stdout. -

-
- - Author -

- The program has been written by - Gerd Stolpmann. -

-
-
-]]>
- -
- - -
-
- - - - - Using &markup; - - - Validation - -The parser can be used to validate a document. This means -that all the constraints that must hold for a valid document are actually -checked. Validation is the default mode of &markup;, i.e. every document is -validated while it is being parsed. - - - -In the examples directory of the distribution you find the -pxpvalidate application. It is invoked in the following way: - - -pxpvalidate [ -wf ] file... - - -The files mentioned on the command line are validated, and every warning and -every error message is printed to stderr. - - - -The -wf switch modifies the behaviour such that a well-formedness parser is -simulated. In this mode, the ELEMENT, ATTLIST, and NOTATION declarations of the -DTD are ignored, and only the ENTITY declarations will take effect. This mode -is intended for documents lacking a DTD. Please note that the parser still -scans the DTD fully and will report all errors in the DTD; such checks are not -required by a well-formedness parser. - - - -The pxpvalidate application is the simplest sensible program -using &markup;; you may consider it the "hello world" program. - - - - - - - - - How to parse a document from an application - -Let me first give a rough overview of the object model of the parser. The -following items are represented by objects: - - - - -Documents: The document representation is more or less the -anchor for the application; all accesses to the parsed entities start here. It -is described by the class document contained in the module -Pxp_document. You can get some global information, such -as the XML declaration the document begins with, the DTD of the document, -global processing instructions, and most importantly, the document tree. - - - - - -The contents of documents: The contents have the structure -of a tree: Elements contain other elements and text. Elements may -also contain processing instructions. 
Unlike other document models, &markup; -separates processing instructions from the rest of the text and provides a -second interface to access them (method pinstr). However, -there is a parser option (enable_pinstr_nodes) which changes -the behaviour of the parser such that extra nodes for processing instructions -are included into the tree. -Furthermore, the tree does normally not contain nodes for XML comments; -they are ignored by default. Again, there is an option -(enable_comment_nodes) changing this. -. - -The common type to represent both kinds of content is node -which is a class type that unifies the properties of elements and character -data. Every node has a list of children (which is empty if the element is empty -or the node represents text); nodes may have attributes; nodes have always text -contents. There are two implementations of node, the class -element_impl for elements, and the class -data_impl for text data. You find these classes and class -types in the module Pxp_document, too. - - - -Note that attribute lists are represented by non-class values. - - - - - -The node extension: For advanced usage, every node of the -document may have an associated extension which is simply -a second object. This object must have the three methods -clone, node, and -set_node as bare minimum, but you are free to add methods as -you want. This is the preferred way to add functionality to the document -treeDue to the typing system it is more or less impossible to -derive recursive classes in O'Caml. To get around this, it is common practice -to put the modifiable or extensible part of recursive objects into parallel -objects. . The class type extension is -defined in Pxp_document, too. - - - - - -The DTD: Sometimes it is necessary to access the DTD of a -document; the average application does not need this feature. 
The class -dtd describes DTDs, and makes it possible to get -representations of element, entity, and notation declarations as well as -processing instructions contained in the DTD. This class, and -dtd_element, dtd_notation, and -proc_instruction can be found in the module -Pxp_dtd. There are a couple of classes representing -different kinds of entities; these can be found in the module -Pxp_entity. - - - - -Additionally, the following modules play a role: - - - - -Pxp_yacc: Here the main parsing functions such as -parse_document_entity are located. Some additional types and -functions allow the parser to be configured in a non-standard way. - - - - - -Pxp_types: This is a collection of basic types and -exceptions. - - - - -There are some further modules that are needed internally but are not part of -the API. - - - -Let the document to be parsed be stored in a file called -doc.xml. The parsing process is started by calling the -function - - -val parse_document_entity : config -> source -> 'ext spec -> 'ext document - - -defined in the module Pxp_yacc. The first argument -specifies some global properties of the parser; it is recommended to start with -the default_config. The second argument determines where the -document to be parsed comes from; this may be a file, a channel, or an entity -ID. To parse doc.xml, it is sufficient to pass -from_file "doc.xml". - - - -The third argument passes the object specification to use. Roughly -speaking, it determines which classes implement the node objects of which -element types, and which extensions are to be used. The 'ext -polymorphic variable is the type of the extension. For the moment, let us -simply pass default_spec as this argument, and ignore it. - - - -So the following expression parses doc.xml: - - -open Pxp_yacc -let d = parse_document_entity default_config (from_file "doc.xml") default_spec - - -Note that default_config implies that warnings are collected -but not printed. 
Errors raise one of the exceptions defined in -Pxp_types; to get readable errors and warnings catch the -exceptions as follows: - - - - print_endline (Pxp_types.string_of_exn e) -]]> - -Now d is an object of the document -class. If you want the node tree, you can get the root element by - - -let root = d # root - - -and if you would rather access the DTD, obtain it by - - -let dtd = d # dtd - - -As it is more interesting, let us investigate the node tree now. Given the root -element, it is possible to recursively traverse the whole tree. The children of -a node n are returned by the method -sub_nodes, and the type of a node is returned by -node_type. This function traverses the tree, and prints the -type of each node: - - - - print_endline ("Element of type " ^ name); - let children = n # sub_nodes in - List.iter print_structure children - | T_data -> - print_endline "Data" - | _ -> - (* Other node types are not possible unless the parser is configured - differently. - *) - assert false -]]> - -You can call this function by - - -print_structure root - - -The type returned by node_type is either T_element -name or T_data. The name of the -element type is the string included in the angle brackets. Note that only -elements have children; data nodes are always leaves of the tree. - - - -There are some more methods for accessing a parsed node tree: - - - - -n # parent: Returns the parent node, or raises -Not_found if the node is already the root. - - - - -n # root: Returns the root of the node tree. - - - - -n # attribute a: Returns the value of the attribute with -name a. The method returns a value for every -declared attribute, independently of whether the attribute -instance is defined or not. If the attribute is not declared, -Not_found will be raised. (In well-formedness mode, every -attribute is considered as being implicitly declared with type -CDATA.) - - - -The following return values are possible: Value s, -Valuelist sl, and Implied_value. 
-The first two value types indicate that the attribute value is available, -either because there is a definition -a="value" -in the XML text, or because there is a default value (declared in the -DTD). Only if both the instance definition and the default declaration are -missing, the latter value Implied_value will be returned. - - - -In the DTD, every attribute is typed. There are single-value types (CDATA, ID, -IDREF, ENTITY, NMTOKEN, enumerations), in which case the method passes -Value s back, where s is the normalized -string value of the attribute. The other types (IDREFS, ENTITIES, NMTOKENS) -represent list values, and the parser splits the XML literal into several -tokens and returns these tokens as Valuelist sl. - - - -Normalization means that entity references (the -&name; tokens) and -character references -(&#number;) are replaced -by the text they represent, and that white space characters are converted into -plain spaces. - - - - -n # data: Returns the character data contained in the -node. For data nodes, the meaning is obvious as this is the main content of -data nodes. For element nodes, this method returns the concatenated contents of -all inner data nodes. - - -Note that entity references included in the text are resolved while they are -being parsed; for example the text will be returned -as b"]]> by this method. Spaces of data nodes are always -preserved. Newlines are preserved, but always converted to \n characters even -if newlines are encoded as \r\n or \r. Normally you will never see two adjacent -data nodes because the parser collapses all data material at one location into -one node. (However, if you create your own tree or transform the parsed tree, -it is possible to have adjacent data nodes.) - - -Note that elements that do not allow #PCDATA as content -will not have data nodes as children. This means that spaces and newlines, the -only character material allowed for such elements, are silently dropped. 
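The handling of the three attribute value forms can be sketched as follows. This is only a minimal, stand-alone illustration: the att_value type is repeated locally (as it is described for Pxp_types) so that the code compiles without PXP installed, and the helpers normalize_space, tokens_of_literal, and string_of_att_value are hypothetical names introduced here, not part of the PXP API. They merely approximate the whitespace normalization and token splitting described above.

```ocaml
(* Local copy of the variant described for Pxp_types.att_value, so that
   this sketch is self-contained. *)
type att_value =
  | Value of string
  | Valuelist of string list
  | Implied_value

(* Hypothetical helper: turn tabs and line ends into plain spaces. *)
let normalize_space s =
  String.map (function '\t' | '\n' | '\r' -> ' ' | c -> c) s

(* Hypothetical helper: split a list-valued literal (IDREFS, ENTITIES,
   NMTOKENS) into its tokens, ignoring runs of spaces. *)
let tokens_of_literal lit =
  String.split_on_char ' ' (normalize_space lit)
  |> List.filter (fun tok -> tok <> "")

(* Hypothetical helper: one printable string for every possible case. *)
let string_of_att_value = function
  | Value s -> s
  | Valuelist sl -> String.concat " " sl
  | Implied_value -> "(no value)"
```

Matching on all three constructors, as string_of_att_value does, lets the compiler check that the Implied_value case is not forgotten when an attribute has neither an instance value nor a default.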
- - - - -For example, if the task is to print all contents of elements with type -"valuable" whose attribute "priority" is "1", this function can help: - - - - print_endline "Valuable node with priority 1 found:"; - print_endline (n # data) - | (T_element _ | T_data) -> - let children = n # sub_nodes in - List.iter print_valuable_prio1 children - | _ -> - assert false -]]> - -You can call this function by: - - -print_valuable_prio1 root - - -If you like a DSSSL-like style, you can make the function -process_children explicit: - - - - print_endline "Valuable node with priority 1 found:"; - print_endline (n # data) - | (T_element _ | T_data) -> - process_children n - | _ -> - assert false -]]> - -So far, O'Caml is now a simple "style-sheet language": You can form a big -"match" expression to distinguish between all significant cases, and provide -different reactions on different conditions. But this technique has -limitations; the "match" expression tends to get larger and larger, and it is -difficult to store intermediate values as there is only one big -recursion. Alternatively, it is also possible to represent the various cases as -classes, and to use dynamic method lookup to find the appropriate class. The -next section explains this technique in detail. - - - - - - - - - - Class-based processing of the node tree - -By default, the parsed node tree consists of objects of the same class; this is -a good design as long as you want only to access selected parts of the -document. For complex transformations, it may be better to use different -classes for objects describing different element types. - - - -For example, if the DTD declares the element types a, -b, and c, and if the task is to convert -an arbitrary document into a printable format, the idea is to define for every -element type a separate class that has a method print. 
The -classes are eltype_a, eltype_b, and -eltype_c, and every class implements -print such that elements of the type corresponding to the -class are converted to the output format. - - - -The parser supports such a design directly. As it is impossible to derive -recursive classes in O'CamlThe problem is that the subclass is -usually not a subtype in this case because O'Caml has a contravariant subtyping -rule. , the specialized element classes cannot be formed by -simply inheriting from the built-in classes of the parser and adding methods -for customized functionality. To get around this limitation, every node of the -document tree is represented by two objects, one called -"the node" and containing the recursive definition of the tree, one called "the -extension". Every node object has a reference to the extension, and the -extension has a reference to the node. The advantage of this model is that it -is now possible to customize the extension without affecting the typing -constraints of the recursive node definition. - - - -Every extension must have the three methods clone, -node, and set_node. The method -clone creates a deep copy of the extension object and -returns it; node returns the node object for this extension -object; and set_node is used to tell the extension object -which node is associated with it, this method is automatically called when the -node tree is initialized. The following definition is a good starting point -for these methods; usually clone must be further refined -when instance variables are added to the class: - - -} - method node = - match node with - None -> - assert false - | Some n -> n - method set_node n = - node <- Some n - - end -]]> - - -This part of the extension is usually the same for all classes, so it is a good -idea to consider custom_extension as the super-class of the -further class definitions. 
Continuing the example above, we can define the -element type classes as follows: - - - unit - end - -class eltype_a = - object (self) - inherit custom_extension - method print ch = ... - end - -class eltype_b = - object (self) - inherit custom_extension - method print ch = ... - end - -class eltype_c = - object (self) - inherit custom_extension - method print ch = ... - end -]]> - -The method print can now be implemented for every element -type separately. Note that you get the associated node by invoking - - -self # node - - -and you get the extension object of a node n by writing - - -n # extension - - -It is guaranteed that - - -self # node # extension == self - - -always holds. - - - Here are sample definitions of the print -methods: - -... are only containers: *) - output_string ch "("; - List.iter - (fun n -> n # extension # print ch) - (self # node # sub_nodes); - output_string ch ")"; - end - -class eltype_b = - object (self) - inherit custom_extension - method print ch = - (* Print the value of the CDATA attribute "print": *) - match self # node # attribute "print" with - Value s -> output_string ch s - | Implied_value -> output_string ch "" - | Valuelist l -> assert false - (* not possible because the att is CDATA *) - end - -class eltype_c = - object (self) - inherit custom_extension - method print ch = - (* Print the contents of this element: *) - output_string ch (self # node # data) - end - -class null_extension = - object (self) - inherit custom_extension - method print ch = assert false - end -]]> - - - - -The remaining task is to configure the parser such that these extension classes -are actually used. Here another problem arises: It is not possible to -dynamically select the class of an object to be created. As a workaround, -&markup; allows the user to specify exemplar objects for -the various element types; instead of creating the nodes of the tree by -applying the new operator, the nodes are produced by -duplicating the exemplars. 
As object duplication preserves the class of the -object, one can create fresh objects of every class for which previously an -exemplar has been registered. - - - -Exemplars are meant as objects without contents, the only interesting thing is -that exemplars are instances of a certain class. The creation of an exemplar -for an element node can be done by: - - -let element_exemplar = new element_impl extension_exemplar - - -And a data node exemplar is created by: - - -let data_exemplar = new data_impl extension_exemplar - - -The classes element_impl and data_impl -are defined in the module Pxp_document. The constructors -initialize the fresh objects as empty objects, i.e. without children, without -data contents, and so on. The extension_exemplar is the -initial extension object the exemplars are associated with. - - - -Once the exemplars are created and stored somewhere (e.g. in a hash table), you -can take an exemplar and create a concrete instance (with contents) by -duplicating it. As user of the parser you are normally not concerned with this -as this is part of the internal logic of the parser, but as background knowledge -it is worthwhile to mention that the two methods -create_element and create_data actually -perform the duplication of the exemplar for which they are invoked, -additionally apply modifications to the clone, and finally return the new -object. Moreover, the extension object is copied, too, and the new node object -is associated with the fresh extension object. Note that this is the reason why -every extension object must have a clone method. - - - -The configuration of the set of exemplars is passed to the -parse_document_entity function as third argument. In our -example, this argument can be set up as follows: - - - - -The ~element_alist function argument defines the mapping -from element types to exemplars as associative list. 
The argument -~data_exemplar specifies the exemplar for data nodes, and -the ~default_element_exemplar is used whenever the parser -finds an element type for which the associative list does not define an -exemplar. - - - -The configuration is now complete. You can still use the same parsing -functions, only the initialization is a bit different. For example, call the -parser by: - - -let d = parse_document_entity default_config (from_file "doc.xml") spec - - -Note that the resulting document d has a usable type; -especially the print method we added is visible. So you can -print your document by - - -d # root # extension # print stdout - - - - -This object-oriented approach looks rather complicated; this is mostly caused -by working around some problems of the strict typing system of O'Caml. Some -auxiliary concepts such as extensions were needed, but the practical -consequences are low. In the next section, one of the examples of the -distribution is explained, a converter from readme -documents to HTML. - - - - - - - - - - Example: An HTML backend for the <emphasis>readme</emphasis> -DTD - - The converter from readme documents to HTML -documents follows strictly the approach to define one class per element -type. The HTML code is similar to the readme source, -because of this most elements can be converted in the following way: Given the -input element - - -content]]> - - -the conversion text is the concatenation of a computed prefix, the recursively -converted content, and a computed suffix. - - - -Only one element type cannot be handled by this scheme: -footnote. Footnotes are collected while they are found in -the input text, and they are printed after the main text has been converted and -printed. - - - - Header - -&readme.code.header; - - - - - Type declarations - -&readme.code.footnote-printer; - - - - - Class <literal>store</literal> - -The store is a container for footnotes. 
You can add a -footnote by invoking alloc_footnote; the argument is an -object of the class footnote_printer, the method returns the -number of the footnote. The interesting property of a footnote is that it can -be converted to HTML, so a footnote_printer is an object -with a method footnote_to_html. The class -footnote which is defined below has a compatible method -footnote_to_html such that objects created from it can be -used as footnote_printers. - - -The other method, print_footnotes prints the footnotes as -definition list, and is typically invoked after the main material of the page -has already been printed. Every item of the list is printed by -footnote_to_html. - - - -&readme.code.store; - - - - - Function <literal>escape_html</literal> - -This function converts the characters <, >, &, and " to their HTML -representation. For example, -escape_html "<>" = "&lt;&gt;". Other -characters are left unchanged. - -&readme.code.escape-html; - - - - - Virtual class <literal>shared</literal> - -This virtual class is the abstract superclass of the extension classes shown -below. It defines the standard methods clone, -node, and set_node, and declares the type -of the virtual method to_html. This method recursively -traverses the whole element tree, and prints the converted HTML code to the -output channel passed as second argument. The first argument is the reference -to the global store object which collects the footnotes. - -&readme.code.shared; - - - - - Class <literal>only_data</literal> - -This class defines to_html such that the character data of -the current node is converted to HTML. Note that self is an -extension object, self # node is the node object, and -self # node # data returns the character data of the node. - -&readme.code.only-data; - - - - - Class <literal>readme</literal> - -This class converts elements of type readme to HTML. Such an -element is (by definition) always the root element of the document. 
First, the -HTML header is printed; the title attribute of the element -determines the title of the HTML page. Some aspects of the HTML page can be -configured by setting certain parameter entities, for example the background -color, the text color, and link colors. After the header, the -body tag, and the headline have been printed, the contents -of the page are converted by invoking to_html on all -children of the current node (which is the root node). Then, the footnotes are -appended to this by telling the global store object to print -the footnotes. Finally, the end tags of the HTML pages are printed. - - - -This class is an example how to access the value of an attribute: The value is -determined by invoking self # node # attribute "title". As -this attribute has been declared as CDATA and as being required, the value has -always the form Value s where s is the -string value of the attribute. - - - -You can also see how entity contents can be accessed. A parameter entity object -can be looked up by self # node # dtd # par_entity "name", -and by invoking replacement_text the value of the entity -is returned after inner parameter and character entities have been -processed. Note that you must use gen_entity instead of -par_entity to access general entities. - - - -&readme.code.readme; - - - - - Classes <literal>section</literal>, <literal>sect1</literal>, -<literal>sect2</literal>, and <literal>sect3</literal> - -As the conversion process is very similar, the conversion classes of the three -section levels are derived from the more general section -class. The HTML code of the section levels only differs in the type of the -headline, and because of this the classes describing the section levels can be -computed by replacing the class argument the_tag of -section by the HTML name of the headline tag. - - - -Section elements are converted to HTML by printing a headline and then -converting the contents of the element recursively. 
More precisely, the first -sub-element is always a title element, and the other -elements are the contents of the section. This structure is declared in the -DTD, and it is guaranteed that the document matches the DTD. Because of this -the title node can be separated from the rest without any checks. - - - -Both the title node, and the body nodes are then converted to HTML by calling -to_html on them. - - - -&readme.code.section; - - - - - Classes <literal>map_tag</literal>, <literal>p</literal>, -<literal>em</literal>, <literal>ul</literal>, <literal>li</literal> - -Several element types are converted to HTML by simply mapping them to -corresponding HTML element types. The class map_tag -implements this, and the class argument the_target_tag -determines the tag name to map to. The output consists of the start tag, the -recursively converted inner elements, and the end tag. - -&readme.code.map-tag; - - - - - Class <literal>br</literal> - -Element of type br are mapped to the same HTML type. Note -that HTML forbids the end tag of br. - -&readme.code.br; - - - - - Class <literal>code</literal> - -The code type is converted to a pre -section (preformatted text). As the meaning of tabs is unspecified in HTML, -tabs are expanded to spaces. - -&readme.code.code; - - - - - Class <literal>a</literal> - -Hyperlinks, expressed by the a element type, are converted -to the HTML a type. If the target of the hyperlink is given -by href, the URL of this attribute can be used -directly. Alternatively, the target can be given by -readmeref in which case the ".html" suffix must be added to -the file name. - - - -Note that within a only #PCDATA is allowed, so the contents -can be converted directly by applying escape_html to the -character data contents. - -&readme.code.a; - - - - - Class <literal>footnote</literal> - -The footnote class has two methods: -to_html to convert the footnote reference to HTML, and -footnote_to_html to convert the footnote text itself. 
- - - -The footnote reference is converted to a local hyperlink; more precisely, to -two anchor tags which are connected with each other. The text anchor points to -the footnote anchor, and the footnote anchor points to the text anchor. - - - -The footnote must be allocated in the store object. By -allocating the footnote, you get the number of the footnote, and the text of -the footnote is stored until the end of the HTML page is reached when the -footnotes can be printed. The to_html method stores simply -the object itself, such that the footnote_to_html method is -invoked on the same object that encountered the footnote. - - - -The to_html only allocates the footnote, and prints the -reference anchor, but it does not print nor convert the contents of the -note. This is deferred until the footnotes actually get printed, i.e. the -recursive call of to_html on the sub nodes is done by -footnote_to_html. - - - -Note that this technique does not work if you make another footnote within a -footnote; the second footnote gets allocated but not printed. - - - -&readme.code.footnote; - - - - - The specification of the document model - -This code sets up the hash table that connects element types with the exemplars -of the extension classes that convert the elements to HTML. - -&readme.code.tag-map; - - - - - - - - - - - - The objects representing the document - - -This description might be out-of-date. See the module interface files -for updated information. 
- - - The <literal>document</literal> class - - - - object - method init_xml_version : string -> unit - method init_root : 'ext node -> unit - - method xml_version : string - method xml_standalone : bool - method dtd : dtd - method root : 'ext node - - method encoding : Pxp_types.rep_encoding - - method add_pinstr : proc_instruction -> unit - method pinstr : string -> proc_instruction list - method pinstr_names : string list - - method write : Pxp_types.output_stream -> Pxp_types.encoding -> unit - - end -;; -]]> - - -The methods beginning with init_ are only for internal use -of the parser. - - - - - -xml_version: returns the version string at the beginning of -the document. For example, "1.0" is returned if the document begins with -<?xml version="1.0"?>. - - - -xml_standalone: returns the boolean value of -standalone declaration in the XML declaration. If the -standalone attribute is missing, false is -returned. - - - -dtd: returns a reference to the global DTD object. - - - -root: returns a reference to the root element. - - - -encoding: returns the internal encoding of the -document. This means that all strings of which the document consists are -encoded in this character set. - - - - -pinstr: returns the processing instructions outside the DTD -and outside the root element. The argument passed to the method names a -target, and the method returns all instructions with this -target. The target is the first word inside <? and -?>. - - - -pinstr_names: returns the names of the processing instructions - - - -add_pinstr: adds another processing instruction. This method -is used by the parser itself to enter the instructions returned by -pinstr, but you can also enter additional instructions. - - - - -write: writes the document to the passed stream as XML -text using the passed (external) encoding. The generated text is always valid -XML and can be parsed by PXP; however, the text is badly formatted (this is not -a pretty printer). 
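The relationship between add_pinstr, pinstr, and pinstr_names described above can be sketched with a toy model. Everything in it is an assumption made for illustration: the record type pi stands in for the real proc_instruction objects, the global table store is hypothetical, and returning the instructions in insertion order is a choice of this sketch rather than a documented property of PXP.

```ocaml
(* Toy stand-in for a processing instruction: the target is the first
   word after "<?", the value is the remaining text. *)
type pi = { target : string; value : string }

(* Hypothetical store, keyed by target; one target may have many entries. *)
let store : (string, pi) Hashtbl.t = Hashtbl.create 8

(* Model of add_pinstr: enter one more instruction under its target. *)
let add_pinstr pi = Hashtbl.add store pi.target pi

(* Model of pinstr: all instructions for one target. Hashtbl.find_all
   lists the newest binding first, so the list is reversed. *)
let pinstr target = List.rev (Hashtbl.find_all store target)

(* Model of pinstr_names: every target once, sorted for stable output. *)
let pinstr_names () =
  Hashtbl.fold
    (fun t _ acc -> if List.mem t acc then acc else t :: acc)
    store []
  |> List.sort compare
```

In this model a lookup for an unknown target simply yields the empty list, which matches the list-valued result type of pinstr in the class signature.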
-
-    The class type <literal>node</literal>
-
-From Pxp_document:
-
-type node_type =
-  T_data
-| T_element of string
-| T_super_root
-| T_pinstr of string
-| T_comment
-and some other, reserved types
-;;
-
-class type [ 'ext ] node =
-  object ('self)
-    constraint 'ext = 'ext node #extension
-
-    (* General observers *)
-
-    method extension : 'ext
-    method dtd : dtd
-    method parent : 'ext node
-    method root : 'ext node
-    method sub_nodes : 'ext node list
-    method iter_nodes : ('ext node -> unit) -> unit
-    method iter_nodes_sibl :
-      ('ext node option -> 'ext node -> 'ext node option -> unit) -> unit
-    method node_type : node_type
-    method encoding : Pxp_types.rep_encoding
-    method data : string
-    method position : (string * int * int)
-    method comment : string option
-    method pinstr : string -> proc_instruction list
-    method pinstr_names : string list
-    method write : Pxp_types.output_stream -> Pxp_types.encoding -> unit
-
-    (* Attribute observers *)
-
-    method attribute : string -> Pxp_types.att_value
-    method required_string_attribute : string -> string
-    method optional_string_attribute : string -> string option
-    method required_list_attribute : string -> string list
-    method optional_list_attribute : string -> string list
-    method attribute_names : string list
-    method attribute_type : string -> Pxp_types.att_type
-    method attributes : (string * Pxp_types.att_value) list
-    method id_attribute_name : string
-    method id_attribute_value : string
-    method idref_attribute_names : string list
-
-    (* Modifying methods *)
-
-    method add_node : ?force:bool -> 'ext node -> unit
-    method add_pinstr : proc_instruction -> unit
-    method delete : unit
-    method set_nodes : 'ext node list -> unit
-    method quick_set_attributes : (string * Pxp_types.att_value) list -> unit
-    method set_comment : string option -> unit
-
-    (* Cloning methods *)
-
-    method orphaned_clone : 'self
-    method orphaned_flat_clone : 'self
-    method create_element :
-             ?position:(string * int * int) ->
-             dtd -> node_type -> (string * string) list ->
-                 'ext node
-    method create_data : dtd -> string -> 'ext node
-    method keep_always_whitespace_mode : unit
-
-    (* Validating methods *)
-
-    method local_validate : ?use_dfa:bool -> unit -> unit
-
-    (* ... Internal methods are undocumented. *)
-
-  end
-;;
-
-In the module Pxp_types you can find another type
-definition that is important in this context:
-
-type Pxp_types.att_value =
-    Value     of string
-  | Valuelist of string list
-  | Implied_value
-;;
-
-    The structure of document trees
-
-A node represents either an element or a character data section. There are two
-classes implementing the two aspects of nodes: element_impl
-and data_impl. The latter class does not implement all
-methods because some methods do not make sense for data nodes.
-
-(Note: PXP also supports a mode in which processing instructions and
-comments are represented as nodes of the document tree. However, these nodes
-are instances of element_impl with node types
-T_pinstr and T_comment,
-respectively. This mode must be explicitly configured; the basic representation
-knows only element and data nodes.)
-
-The following figure shows an example of how
-a tree is constructed from element and data nodes. The circular areas
-represent element nodes whereas the ovals denote data nodes. Only elements
-may have subnodes; data nodes are always leaves of the tree. The subnodes
-of an element can be either element or data nodes; in both cases the O'Caml
-objects storing the nodes have the class type node.
-
-Attributes (the clouds in the picture) are not directly
-integrated into the tree; there is always an extra link to the attribute
-list. This is also true for processing instructions (not shown in the
-picture). This means that there are separate access methods for attributes and
-processing instructions.
-
-A tree with element nodes, data nodes, and attributes - -
-
-Only elements, data sections, attributes and processing
-instructions (and comments, if configured) can, directly or indirectly, occur
-in the document tree. It is impossible to add entity references to the tree; if
-the parser finds such a reference, it includes the referenced text (i.e. the
-tree representing the structured text) in the tree, not the reference as such.
-
-Note that the parser collapses as much data material as possible into one
-data node, so that there are normally never two adjacent data
-nodes. This invariant is enforced even if data material is included by entity
-references or CDATA sections, or if a data sequence is interrupted by
-comments. For instance, a &amp; b <!-- comment --> c <![CDATA[
-<> d]]> is represented by only one data node. However, the invariant only
-describes how the parser forms the tree; you can create document trees
-manually which break it.
-
-Nodes are doubly linked trees - -
-
-The node tree has links in both directions: every node has a link to its parent
-(if any), and it has links to the subnodes (see the figure "Nodes are doubly
-linked trees"). This doubly-linked structure simplifies navigation in the tree,
-but it also has consequences for the possible operations on trees.
-
-Because every node may have at most one parent node,
-operations are illegal if they violate this condition. The left side of the
-figure "A node can only be added if it is a root" shows
-that node y is added to x as new subnode,
-which is allowed because y does not have a parent yet. The
-right side of the picture illustrates what would happen if y
-had a parent node; this is illegal because y would have two
-parents after the operation.
-
-A node can only be added if it is a root - - -
-
-The "delete" operation simply removes the links between two nodes. In the
-figure "A deleted node becomes the root of the subtree", the node
-x is deleted from the list of subnodes of
-y. After that, x becomes the root of the
-subtree starting at this node.
-
-A deleted node becomes the root of the subtree - -
-
-It is also possible to make a clone of a subtree, as illustrated in the
-figure "The clone of a subtree". In this case, the
-clone is a copy of the original subtree except that it is no longer a
-subnode. Because cloning never keeps the connection to the parent, the clones
-are called orphaned.
-
-The clone of a subtree - -
-
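The rules described in this section (at most one parent per node, "delete" cuts both links, clones are orphaned) can be sketched with a small stand-alone model. This is not PXP code: the record type node and the functions make, add_node, delete, root and orphaned_clone are assumptions that only mirror the names of the corresponding PXP methods, and a plain Failure stands in for whatever error the library reports.

```ocaml
(* Toy model of a doubly linked tree; not the PXP implementation. *)
type node = {
  name : string;
  mutable parent : node option;
  mutable children : node list;
}

let make name = { name; parent = None; children = [] }

(* Only a root (a node without parent) may be added as a subnode. *)
let add_node parent child =
  match child.parent with
  | Some _ -> failwith "add_node: the node already has a parent"
  | None ->
      child.parent <- Some parent;
      parent.children <- parent.children @ [ child ]

(* Remove both links between a node and its parent. *)
let delete n =
  match n.parent with
  | None -> ()
  | Some p ->
      p.children <- List.filter (fun c -> c != n) p.children;
      n.parent <- None

(* Follow the parent links up to the root; the cost is proportional
   to the length of the path. *)
let rec root n = match n.parent with None -> n | Some p -> root p

(* Deep copy that does not keep the link to the parent. *)
let rec orphaned_clone n =
  let copy = make n.name in
  List.iter (fun c -> add_node copy (orphaned_clone c)) n.children;
  copy

let () =
  let x = make "x" and y = make "y" and z = make "z" in
  add_node x y;
  add_node y z;
  assert (root z == x);
  delete y;                      (* y becomes the root of its subtree *)
  assert (root z == y);
  let y' = orphaned_clone y in
  assert (y' != y && (List.hd y'.children).name = "z")
```

Physical equality (== and !=) is used throughout because the records are cyclic once parent and child point at each other.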
- - - The methods of the class type <literal>node</literal> - - - - - <link linkend="type-node-general.sig">General observers</link> - - - - - - -extension: The reference to the extension object which -belongs to this node (see ...). - - - -dtd: Returns a reference to the global DTD. All nodes -of a tree must share the same DTD. - - - - -parent: Get the father node. Raises -Not_found in the case the node does not have a -parent, i.e. the node is the root. - - - -root: Gets the reference to the root node of the tree. -Every node is contained in a tree with a root, so this method always -succeeds. Note that this method searches the root, -which costs time proportional to the length of the path to the root. - - - - -sub_nodes: Returns references to the children. The returned -list reflects the order of the children. For data nodes, this method returns -the empty list. - - - - -iter_nodes f: Iterates over the children, and calls -f for every child in turn. - - - - -iter_nodes_sibl f: Iterates over the children, and calls -f for every child in turn. f gets as -arguments the previous node, the current node, and the next node. - - - -node_type: Returns either T_data which -means that the node is a data node, or T_element n -which means that the node is an element of type n. -If configured, possible node types are also T_pinstr t -indicating that the node represents a processing instruction with target -t, and T_comment in which case the node -is a comment. - - - - -encoding: Returns the encoding of the strings. - - - -data: Returns the character data of this node and all -children, concatenated as one string. The encoding of the string is what -the method encoding returns. -- For data nodes, this method simply returns the represented characters. -For elements, the meaning of the method has been extended such that it -returns something useful, i.e. the effectively contained characters, without -markup. 
(For T_pinstr and T_comment
-nodes, the method returns the empty string.)
-
-position: If configured, this method returns the position of
-the element as a triple (entity, line, byte position). For data nodes, the
-position is not stored. If the position is not available, the triple
-("?", 0, 0) is returned.
-
-comment: Returns Some text for comment
-nodes, and None for other nodes. The text
-is everything between the comment delimiters <!-- and
--->.
-
-pinstr n: Returns all processing instructions that are
-directly contained in this element and that have a target
-specification of n. The target is the first word after
-the <?.
-
-pinstr_names: Returns the list of all targets of processing
-instructions directly contained in this element.
-
-write s enc: Prints the node and all subnodes to the passed
-output stream as valid XML text, using the passed external encoding.
-
-    <link linkend="type-node-atts.sig">Attribute observers</link>
-
-attribute n: Returns the value of the attribute with name
-n. This method returns a value for every declared
-attribute, and it raises Not_found for any undeclared
-attribute. Note that it even returns a value if the attribute is actually
-missing but is declared as #IMPLIED or has a default
-value.
-Possible values are:
-
-Implied_value: The attribute has been declared with the
-keyword #IMPLIED, and the attribute is missing in the
-attribute list of this element.
-
-Value s: The attribute has been declared as type
-CDATA, as ID, as
-IDREF, as ENTITY, or as
-NMTOKEN, or as enumeration or notation, and one of the two
-conditions holds: (1) The attribute value is present in the attribute list, in
-which case the value is returned in the string s. (2) The
-attribute has been omitted, and the DTD declared the attribute with a default
-value. The default value is returned in s.
-In summary, Value s is returned for non-implied, non-list
-attribute values.
-
-Valuelist l: The attribute has been declared as type
-IDREFS, as ENTITIES, or
-as NMTOKENS, and one of the two conditions holds: (1) The
-attribute value is present in the attribute list, in which case the
-space-separated tokens of the value are returned in the string list
-l. (2) The attribute has been omitted, and the DTD declared
-the attribute with a default value. The default value is returned in
-l.
-In summary, Valuelist l is returned for all list-type
-attribute values.
-
-Note that before the attribute value is returned, the value is normalized. This
-means that newlines are converted to spaces, and that references to character
-entities (i.e. &#n;) and
-general entities
-(i.e. &name;) are expanded;
-if necessary, expansion is performed recursively.
-
-In well-formedness mode, there is no DTD which could declare an
-attribute. Because of this, every occurring attribute is considered a CDATA
-attribute.
-
-required_string_attribute n: returns the Value attribute
-called n, or the Valuelist attribute as a string where the list elements
-are separated by spaces. If the attribute value is implied, or if the
-attribute does not exist, the method will fail.
-This method is convenient
-if you expect a non-implied and non-list attribute value.
-
-optional_string_attribute n: returns the Value attribute
-called n, or the Valuelist attribute as a string where the list elements
-are separated by spaces. If the attribute value is implied, or if the
-attribute does not exist, the method returns None.
-This method is
-convenient if you expect a non-list attribute value including the implied
-value.
-
-required_list_attribute n: returns the Valuelist attribute
-called n, or the Value attribute as a list with a single element.
-If the attribute value is implied, or if the
-attribute does not exist, the method will fail.
-This method is
-convenient if you expect a list attribute value.
-
-optional_list_attribute n: returns the Valuelist attribute
-called n, or the Value attribute as a list with a single element.
-If the attribute value is implied, or if the
-attribute does not exist, an empty list will be returned.
-This method
-is convenient if you expect a list attribute value or the implied value.
-
-attribute_names: returns the list of all attribute names of
-this element. As this is a validating parser, this list is equal to the
-list of declared attributes.
-
-attribute_type n: returns the type of the attribute called
-n. See the module Pxp_types for a
-description of the encoding of the types.
-
-attributes: returns the list of pairs of names and values
-for all attributes of
-this element.
-
-id_attribute_name: returns the name of the attribute that is
-declared with type ID. There is at most one such attribute. The method raises
-Not_found if there is no declared ID attribute for the
-element type.
-
-id_attribute_value: returns the value of the attribute that
-is declared with type ID. There is at most one such attribute. The method raises
-Not_found if there is no declared ID attribute for the
-element type.
-
-idref_attribute_names: returns the list of attribute names
-that are declared as IDREF or IDREFS.
-
-    <link linkend="type-node-mods.sig">Modifying methods</link>
-
-The following methods are only meaningful for element nodes (more exactly:
-the methods are defined for data nodes, too, but always fail).
-
-add_node sn: Adds sub node sn to the list
-of children. This operation is illustrated in the figure
-"A node can only be added if it is a root". This method expects that
-sn is a root, and it requires that sn and
-the current object share the same DTD.
-
-Because add_node is the method the parser itself uses
-to add new nodes to the tree, it performs some simple validation
-checks by default: If the content model is a regular expression, it is not
-allowed to add data nodes to this node unless the new nodes consist only of
-whitespace. In
-this case, the new data nodes are silently dropped (you can change this by
-invoking keep_always_whitespace_mode).
-
-If the document is flagged as stand-alone, these data nodes only
-containing whitespace are even forbidden if the element declaration is
-contained in an external entity. This case is detected and rejected.
-
-If the content model is EMPTY, it is not allowed to
-add any data node unless the data node is empty. In this case, the new data
-node is silently dropped.
-
-These checks only apply if there is a DTD. In well-formedness mode, it is
-assumed that every element is declared with content model
-ANY, which rules out any validation check. Furthermore, you
-can turn these checks off by passing ~force:true as first
-argument.
-
-add_pinstr pi: Adds the processing instruction
-pi to the list of processing instructions.
-
-delete: Deletes this node from the tree. After this
-operation, this node is no longer a child of the former father node, and the
-node loses the connection to the father as well. This operation is illustrated
-by the figure "A deleted node becomes the root of the subtree".
-
-set_nodes nl: Sets the list of children to
-nl. It is required that every member of nl
-is a root, and that all members and the current object share the same DTD.
-Unlike add_node, no validation checks are performed.
-
-quick_set_attributes atts: sets the attributes of this
-element to atts. It is not checked
-whether atts matches the DTD or not; it is up to the
-caller of this method to ensure this. (This method may be useful to transform
-the attribute values, i.e. apply a mapping to every attribute.)
-
-set_comment text: This method is only applicable to
-T_comment nodes; it sets the comment text contained by such
-nodes.
-
-    <link linkend="type-node-cloning.sig">Cloning methods</link>
-
-orphaned_clone: Returns a clone of the node and the complete
-tree below this node (deep clone). The clone does not have a parent (i.e. the
-reference to the parent node is not cloned). While
-copying the subtree, strings are not duplicated; it is likely that the original
-tree and the copy share strings. Extension objects are cloned by invoking
-the clone method on the original objects; how much of
-the extension objects is cloned depends on the implementation of this method.
-
-This operation is illustrated by the figure
-"The clone of a subtree".
-
-orphaned_flat_clone: Returns a clone of the node,
-but sets the list of sub nodes to [], i.e. the sub nodes are not cloned.
-
-create_element dtd nt al: Returns a flat copy of this node
-(which must be an element) with the following modifications: The DTD is set to
-dtd; the node type is set to nt, and the
-new attribute list is set to al (given as list of
-(name,value) pairs). The copy has neither children nor a parent. It does not
-contain processing instructions. See
-the example below.
-
-Note that you can specify the position of the new node
-by the optional argument ~position.
-
-create_data dtd cdata: Returns a flat copy of this node
-(which must be a data node) with the following modifications: The DTD is set to
-dtd; the node type is set to T_data; the
-attribute list is empty (data nodes never have attributes); the list of
-children and PIs is empty, too (same reason). The new node does not have a
-parent. The value cdata is the new character content of the
-node. See
-the example below.
-
-keep_always_whitespace_mode: Even data nodes which are
-normally dropped because they only contain ignorable whitespace can be added to
-this node once this mode is turned on.
(This mode is useful to produce
-canonical XML.)
-
-    <link linkend="type-node-weird.sig">Validating methods</link>
-
-There is one method which locally validates the node, i.e. checks whether the
-subnodes match the content model of this node.
-
-local_validate: Checks that this node conforms to the
-DTD by comparing the type of the subnodes with the content model for this
-node. (Applications need not call this method unless they add new nodes
-themselves to the tree.)
-
-    The class <literal>element_impl</literal>
-
-This class is an implementation of node which
-realizes element nodes:
-
-class [ 'ext ] element_impl : 'ext -> [ 'ext ] node
-
-    Constructor
-
-You can create a new instance by
-
-new element_impl extension_object
-
-which creates a special form of empty element which already contains a
-reference to the extension_object, but is
-otherwise empty. This special form is called an
-exemplar. The purpose of exemplars is that they serve as
-patterns that can be duplicated and filled with data. The method
-create_element is designed to perform this action.
-
-    Example
-
-First, create an exemplar by
-
-let exemplar_ext = ... in
-let exemplar = new element_impl exemplar_ext in
-
-The exemplar is not used in node trees, but only as
-a pattern when the element nodes are created:
-
-let element = exemplar # create_element dtd (T_element name) attlist
-
-The element is a copy of exemplar
-(even the extension exemplar_ext has been copied),
-which ensures that element and its extension are objects
-of the same class as the exemplars; note that you need not pass a
-class name or other meta information. The copy is initially connected
-with the dtd, it gets a node type, and the attribute list
-is filled. The element is now fully functional; it can
-be added to another element as child, and it can contain references to
-subnodes.
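The exemplar idea (duplicate a pattern object so that the copy has the same class as the pattern) can be sketched with OCaml's functional object update. This is a stand-alone illustration, not the PXP implementation: the classes element_sketch and counting_element are hypothetical, and the simplified create_element below takes only a name and an attribute list, whereas the real method also takes the DTD and a node type.

```ocaml
(* Simplified stand-in for an element class (assumption). *)
class element_sketch =
  object
    val name = "(exemplar)"
    val attributes : (string * string) list = []
    method name = name
    method attributes = attributes
    (* Functional update: returns a copy of the receiver, of the
       receiver's own class, with the two fields replaced. *)
    method create_element new_name new_attributes =
      {< name = new_name; attributes = new_attributes >}
  end

(* A subclass adding a method; copies made from its exemplar keep it. *)
class counting_element =
  object (self)
    inherit element_sketch
    method attribute_count = List.length self#attributes
  end

let exemplar = new counting_element
let a1 = exemplar#create_element "a" [ ("att", "apple") ]

let () =
  Printf.printf "%s has %d attribute(s); the exemplar is still %s\n"
    a1#name a1#attribute_count exemplar#name
```

No class name is passed to create_element; the copy a1 offers attribute_count simply because the exemplar did.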
-
-    The class <literal>data_impl</literal>
-
-This class is an implementation of node which
-should be used for all character data nodes:
-
-class [ 'ext ] data_impl : 'ext -> [ 'ext ] node
-
-    Constructor
-
-You can create a new instance by
-
-new data_impl extension_object
-
-which creates an empty exemplar node which is connected to
-extension_object. The node does not contain a
-reference to any DTD, and because of this it cannot be added to node trees.
-
-To get a fully working data node, apply the method
-create_data
-to the exemplar (see example).
-
-    Example
-
-First, create an exemplar by
-
-let exemplar_ext = ... in
-let exemplar = new data_impl exemplar_ext in
-
-The exemplar is not used in node trees, but only as
-a pattern when the data nodes are created:
-
-let data_node = exemplar # create_data dtd "The characters contained in the data node"
-
-The data_node is a copy of exemplar.
-The copy is initially connected
-with the dtd, and it is filled with character material.
-The data_node is now fully functional; it can
-be added to an element as child.
-
-    The type <literal>spec</literal>
-
-The type spec defines a way to handle the details of
-creating nodes from exemplars.
-
-val make_spec_from_mapping :
-      ?super_root_exemplar : 'ext node ->
-      ?comment_exemplar : 'ext node ->
-      ?default_pinstr_exemplar : 'ext node ->
-      ?pinstr_mapping : (string, 'ext node) Hashtbl.t ->
-      data_exemplar: 'ext node ->
-      default_element_exemplar: 'ext node ->
-      element_mapping: (string, 'ext node) Hashtbl.t ->
-      unit ->
-        'ext spec
-
-val make_spec_from_alist :
-      ?super_root_exemplar : 'ext node ->
-      ?comment_exemplar : 'ext node ->
-      ?default_pinstr_exemplar : 'ext node ->
-      ?pinstr_alist : (string * 'ext node) list ->
-      data_exemplar: 'ext node ->
-      default_element_exemplar: 'ext node ->
-      element_alist: (string * 'ext node) list ->
-      unit ->
-        'ext spec
-
-The two functions make_spec_from_mapping and
-make_spec_from_alist create spec
-values.
Both functions are functionally equivalent and the only difference is -that the first function prefers hashtables and the latter associative lists to -describe mappings from names to exemplars. - - - -You can specify exemplars for the various kinds of nodes that need to be -generated when an XML document is parsed: - - - - ~super_root_exemplar: This exemplar -is used to create the super root. This special node is only created if the -corresponding configuration option has been selected; it is the parent node of -the root node which may be convenient if every working node must have a parent. - - - ~comment_exemplar: This exemplar is -used when a comment node must be created. Note that such nodes are only created -if the corresponding configuration option is "on". - - - - ~default_pinstr_exemplar: If a node -for a processing instruction must be created, and the instruction is not listed -in the table passed by ~pinstr_mapping or -~pinstr_alist, this exemplar is used. -Again the configuration option must be "on" in order to create such nodes at -all. - - - - ~pinstr_mapping or -~pinstr_alist: Map the target names of processing -instructions to exemplars. These mappings are only used when nodes for -processing instructions are created. - - - ~data_exemplar: The exemplar for -ordinary data nodes. - - - ~default_element_exemplar: This -exemplar is used if an element node must be created, but the element type -cannot be found in the tables element_mapping or -element_alist. - - - ~element_mapping or -~element_alist: Map the element types to exemplars. These -mappings are used to create element nodes. - - - -In most cases, you only want to create spec values to pass -them to the parser functions found in Pxp_yacc. However, it -might be useful to apply spec values directly. - - -The following functions create various types of nodes by selecting the -corresponding exemplar from the passed spec value, and by -calling create_element or create_data on -the exemplar. 
- - - dtd -> - (* data material: *) string -> - 'ext node - -val create_element_node : - ?position:(string * int * int) -> - 'ext spec -> - dtd -> - (* element type: *) string -> - (* attributes: *) (string * string) list -> - 'ext node - -val create_super_root_node : - ?position:(string * int * int) -> - 'ext spec -> - dtd -> - 'ext node - -val create_comment_node : - ?position:(string * int * int) -> - 'ext spec -> - dtd -> - (* comment text: *) string -> - 'ext node - -val create_pinstr_node : - ?position:(string * int * int) -> - 'ext spec -> - dtd -> - proc_instruction -> - 'ext node -]]> - - - - - Examples - - - Building trees. - - Here is the piece of code that creates the tree of -the figure . The extension -object and the DTD are beyond the scope of this example. - - -let exemplar_ext = ... (* some extension *) in -let dtd = ... (* some DTD *) in - -let element_exemplar = new element_impl exemplar_ext in -let data_exemplar = new data_impl exemplar_ext in - -let a1 = element_exemplar # create_element dtd (T_element "a") ["att", "apple"] -and b1 = element_exemplar # create_element dtd (T_element "b") [] -and c1 = element_exemplar # create_element dtd (T_element "c") [] -and a2 = element_exemplar # create_element dtd (T_element "a") ["att", "orange"] -in - -let cherries = data_exemplar # create_data dtd "Cherries" in -let orange = data_exemplar # create_data dtd "An orange" in - -a1 # add_node b1; -a1 # add_node c1; -b1 # add_node a2; -b1 # add_node cherries; -a2 # add_node orange; - - -Alternatively, the last block of statements could also be written as: - - -a1 # set_nodes [b1; c1]; -b1 # set_nodes [a2; cherries]; -a2 # set_nodes [orange]; - - -The root of the tree is a1, i.e. it is true that - - -x # root == a1 - - -for every x from { a1, a2, -b1, c1, cherries, -orange }. 
-
-Furthermore, the following properties hold:
-
-  a1 # attribute "att" = Value "apple"
-& a2 # attribute "att" = Value "orange"
-
-& cherries # data = "Cherries"
-&   orange # data = "An orange"
-&       a1 # data = "An orangeCherries"
-
-&       a1 # node_type = T_element "a"
-&       a2 # node_type = T_element "a"
-&       b1 # node_type = T_element "b"
-&       c1 # node_type = T_element "c"
-& cherries # node_type = T_data
-&   orange # node_type = T_data
-
-&       a1 # sub_nodes = [ b1; c1 ]
-&       a2 # sub_nodes = [ orange ]
-&       b1 # sub_nodes = [ a2; cherries ]
-&       c1 # sub_nodes = []
-& cherries # sub_nodes = []
-&   orange # sub_nodes = []
-
-&       a2 # parent == b1
-&       b1 # parent == a1
-&       c1 # parent == a1
-& cherries # parent == b1
-&   orange # parent == a2
-
-    Searching nodes.
-
-The following function collects all nodes of a tree
-for which a certain condition holds:
-
-let rec search p t =
-  if p t then
-    t :: search_list p (t # sub_nodes)
-  else
-    search_list p (t # sub_nodes)
-
-and search_list p l =
-  match l with
-    []      -> []
-  | t :: l' -> (search p t) @ (search_list p l')
-;;
-
-For example, if you want to search all elements of a certain
-type et, the function search can be
-applied as follows:
-
-let search_element_type et t =
-  search (fun x -> x # node_type = T_element et) t
-;;
-
-    Getting attribute values.
-
-Suppose we have the declaration:
-
-<!ATTLIST e a CDATA #REQUIRED
-            b CDATA #IMPLIED
-            c CDATA "12345">
-
-In this case, every element e must have an attribute
-a, otherwise the parser would indicate an error. If
-the O'Caml variable n holds the node of the tree
-corresponding to the element, you can get the value of the attribute
-a by
-
-let value_of_a = n # required_string_attribute "a"
-
-which is more or less an abbreviation for
-
-let value_of_a =
-  match n # attribute "a" with
-    Value s -> s
-  | _       -> assert false
-
-As the attribute is required, the attribute method always
-returns a Value.
-
-In contrast to this, the attribute b can be
-omitted.
In this case, the method required_string_attribute
-works only if the attribute is there, and the method will fail if the attribute
-is missing. To get the value, you can apply the method
-optional_string_attribute:
-
-let value_of_b = n # optional_string_attribute "b"
-
-Now, value_of_b is of type string option,
-and None represents the omitted attribute. Alternatively,
-you could also use attribute:
-
-let value_of_b =
-  match n # attribute "b" with
-    Value s       -> Some s
-  | Implied_value -> None
-  | _             -> assert false
-
-The attribute c behaves much like
-a, because it always has a value. If the attribute is
-omitted, the default, here "12345", will be returned instead. Because of this,
-you can again use required_string_attribute to get the
-value.
-
-The type CDATA is the most general string
-type. The types NMTOKEN, ID,
-IDREF, ENTITY, and all enumerators and
-notations are special forms of string types that restrict the possible
-values. From O'Caml, they behave like CDATA, i.e. you can
-use the methods required_string_attribute and
-optional_string_attribute, too.
-
-In contrast to this, the types NMTOKENS,
-IDREFS, and ENTITIES mean lists of
-strings. Suppose we have the declaration:
-
-]]>
-
-The type NMTOKENS stands for lists of space-separated
-tokens; for example the value "1 abc 23ef" means the list
-["1"; "abc"; "23ef"]. (Again, IDREFS
-and ENTITIES have more restricted values.) To get the
-value of attribute d, one can use
-
-let value_of_d = n # required_list_attribute "d"
-
-or
-
-let value_of_d =
-  match n # attribute "d" with
-    Valuelist l -> l
-  | _           -> assert false
-
-As d is required, the attribute cannot be omitted, and
-the attribute method always returns a
-Valuelist.
-
-For optional attributes like e, apply
-
-let value_of_e = n # optional_list_attribute "e"
-
-or
-
-let value_of_e =
-  match n # attribute "e" with
-    Valuelist l   -> l
-  | Implied_value -> []
-  | _             -> assert false
-
-Here, a missing attribute is treated like the empty list.
-
-    Iterators
-
-There are also several iterators in Pxp_document; please see
-the mli file for details.
You can find examples for them in the -"simple_transformation" directory. - - - f:('ext node -> bool) -> 'ext node -> 'ext node - -val find_all : ?deeply:bool -> - f:('ext node -> bool) -> 'ext node -> 'ext node list - -val find_element : ?deeply:bool -> - string -> 'ext node -> 'ext node - -val find_all_elements : ?deeply:bool -> - string -> 'ext node -> 'ext node list - -exception Skip -val map_tree : pre:('exta node -> 'extb node) -> - ?post:('extb node -> 'extb node) -> - 'exta node -> - 'extb node - - -val map_tree_sibl : - pre: ('exta node option -> 'exta node -> 'exta node option -> - 'extb node) -> - ?post:('extb node option -> 'extb node -> 'extb node option -> - 'extb node) -> - 'exta node -> - 'extb node - -val iter_tree : ?pre:('ext node -> unit) -> - ?post:('ext node -> unit) -> - 'ext node -> - unit - -val iter_tree_sibl : - ?pre: ('ext node option -> 'ext node -> 'ext node option -> unit) -> - ?post:('ext node option -> 'ext node -> 'ext node option -> unit) -> - 'ext node -> - unit -]]> - - - -
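To make the role of the two callbacks of iter_tree concrete, here is a stand-alone sketch. It is not the PXP implementation: it assumes that pre is called on a node before its children are visited and post afterwards, which the signatures above do not state, and the class toy_node is a hypothetical stand-in for real nodes. The function is written against the sub_nodes method only, so it accepts any object tree with that method.

```ocaml
(* Sketch of a pre/post tree walk (assumed semantics: pre before the
   children, post after them). Works on any object with sub_nodes. *)
let rec iter_tree ?(pre = fun _ -> ()) ?(post = fun _ -> ()) node =
  pre node;
  List.iter (fun child -> iter_tree ~pre ~post child) node#sub_nodes;
  post node

(* Hypothetical stand-in for document nodes. *)
class toy_node (name : string) (children : toy_node list) =
  object
    method name = name
    method sub_nodes = children
  end

let tree =
  new toy_node "a"
    [ new toy_node "b" [ new toy_node "d" [] ]; new toy_node "c" [] ]

let () =
  iter_tree
    ~pre:(fun n -> print_string ("<" ^ n#name ^ ">"))
    ~post:(fun n -> print_string ("</" ^ n#name ^ ">"))
    tree;
  print_newline ()
```

With both callbacks given, the walk visits every node twice, once on the way down and once on the way up, which is the natural shape for writing start and end tags.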
-
-    The class type <literal>extension</literal>
-
- unit
-      (* "set_node" is invoked once the extension is associated to a new
-       * node object.
-       *)
-  end
-]]>
-
-This is the type of classes used for node extensions. For every node of the
-document tree, there is not only the node object, but also
-an extension object. The latter has minimal
-functionality; it has only the necessary methods to be attached to the node
-object containing the details of the node instance. The extension object is
-called extension because its purpose is extensibility.
-
-For several reasons, it is impossible to derive the
-node classes (i.e. element_impl and
-data_impl) such that the subclasses can be extended by
-new methods. But
-subclassing nodes is a great feature, because it allows the user to provide
-different classes for different types of nodes. The extension objects are a
-workaround that is as powerful as direct subclassing; the cost is
-some notational overhead.
-
-The structure of nodes and extensions - - -
-
-The picture shows how the nodes and extensions are linked
-together. Every node has a reference to its extension, and every extension has
-a reference to its node. The methods extension and
-node follow these references; a typical phrase is
-
-self # node # attribute "xy"
-
-to get the value of an attribute from a method defined in the extension object;
-or
-
-self # node # iter_nodes
-  (fun n -> n # extension # my_method ...)
-
-to iterate over the subnodes and to call my_method of the
-corresponding extension objects.
-
-Note that extension objects do not have references to subnodes
-(or "subextensions") themselves; in order to get one of the children of an
-extension you must first go to the node object, then get the child node, and
-finally reach the extension that is logically the child of the extension you
-started with.
-
-    How to define an extension class
-
-At minimum, you must define the methods
-clone, node, and
-set_node such that your class is compatible with the type
-extension. The method set_node is called
-during the initialization of the node, or after a node has been cloned; the
-node object invokes set_node on the extension object to tell
-it that this node is now the object the extension is linked to. The extension
-must return the node object passed as argument of set_node
-when the node method is called.
-
-The clone method must return a copy of the
-extension object; at least the object itself must be duplicated, but if
-required, the copy should deeply duplicate all objects and values that are
-referred to by the extension, too. Whether this is required depends on the
-application; clone is invoked by the node object when one of
-its cloning methods is called.
-
-A good starting point for an extension class:
-
-}
-
-    method node =
-      match node with
-          None ->
-            assert false
-        | Some n -> n
-
-    method set_node n =
-      node <- Some n
-
-  end
-]]>
-
-This class is compatible with extension.
The purpose of -defining such a class is, of course, adding further methods; and you can do it -without restriction. - - - Often, you want not only one extension class. In this case, -it is the simplest way that all your classes (for one kind of document) have -the same type (with respect to the interface; i.e. it does not matter if your -classes differ in the defined private methods and instance variables, but -public methods count). This approach avoids lots of coercions and problems with -type incompatibilities. It is simple to implement: - - - - - -If a class does not need a method (e.g. because it does not make sense, or it -would violate some important condition), it is possible to define the method -and to always raise an exception when the method is invoked -(e.g. assert false). - - - The latter is a strong recommendation: do not try to further -specialize the types of extension objects. It is difficult, sometimes even -impossible, and almost never worth-while. - - - - How to bind extension classes to element types - - Once you have defined your extension classes, you can bind them -to element types. The simplest case is that you have only one class and that -this class is to be always used. The parsing functions in the module -Pxp_yacc take a spec argument which -can be customized. If your single class has the name c, -this argument should be - - -let spec = - make_spec_from_alist - ~data_exemplar: (new data_impl c) - ~default_element_exemplar: (new element_impl c) - ~element_alist: [] - () - - -This means that data nodes will be created from the exemplar passed by -~data_exemplar and that all element nodes will be made from the exemplar -specified by ~default_element_exemplar. In ~element_alist, you can -pass that different exemplars are to be used for different element types; but -this is an optional feature. If you do not need it, pass the empty list. 
- - - -Remember that an exemplar is a (node, extension) pair that serves as pattern -when new nodes (and the corresponding extension objects) are added to the -document tree. In this case, the exemplar contains c as -extension, and when nodes are created, the exemplar is cloned, and cloning -makes also a copy of c such that all nodes of the document -tree will have a copy of c as extension. - - - The ~element_alist argument can bind -specific element types to specific exemplars; as exemplars may be instances of -different classes it is effectively possible to bind element types to -classes. For example, if the element type "p" is implemented by class "c_p", -and "q" is realized by "c_q", you can pass the following value: - - -let spec = - make_spec_from_alist - ~data_exemplar: (new data_impl c) - ~default_element_exemplar: (new element_impl c) - ~element_alist: - [ "p", new element_impl c_p; - "q", new element_impl c_q; - ] - () - - -The extension object c is still used for all data nodes and -for all other element types. - - - - -
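The lookup performed by a spec can be pictured with a small self-contained model. This is only a sketch and does not use PXP: the types `node` and `spec` and the function `create_element` are hypothetical stand-ins, and the string field `ext` merely plays the role of the extension object.

```ocaml
(* Toy model of a spec; NOT the PXP API. All names here are hypothetical. *)
type node = { node_type : string; ext : string }

type spec = {
  default_element_exemplar : node;
  element_alist : (string * node) list;
}

(* Pick the exemplar bound to the element type, or fall back to the default
   exemplar, then "clone" it by building a fresh record for the new node. *)
let create_element spec name =
  let exemplar =
    match List.assoc_opt name spec.element_alist with
    | Some e -> e
    | None -> spec.default_element_exemplar
  in
  { exemplar with node_type = name }

(* Mirrors the example above: "p" and "q" get their own extensions,
   every other element type gets the default extension "c". *)
let spec = {
  default_element_exemplar = { node_type = ""; ext = "c" };
  element_alist =
    [ "p", { node_type = ""; ext = "c_p" };
      "q", { node_type = ""; ext = "c_q" } ];
}
```

In this model, `create_element spec "p"` carries the extension `"c_p"`, while an unbound element type such as `"z"` falls back to `"c"`.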
- - - - - Details of the mapping from XML text to the tree representation - - - - The representation of character-free elements - - If an element declaration does not allow the element to -contain character data, the following rules apply. - - If the element must be empty, i.e. it is declared with the -keyword EMPTY, the element instance must be effectively -empty (it must not even contain whitespace characters). The parser guarantees -that a declared EMPTY element does never contain a data -node, even if the data node represents the empty string. - - If the element declaration only permits other elements to occur -within that element but not character data, it is still possible to insert -whitespace characters between the subelements. The parser ignores these -characters, too, and does not create data nodes for them. - - - Example. - - Consider the following element types: - - - - -]]> - -Only x may contain character data, the keyword -#PCDATA indicates this. The other types are character-free. - - - - The XML term - - -]]> - -will be internally represented by an element node for x -with three subnodes: the first z element, a data node -containing the space character, and the second z element. -In contrast to this, the term - - -]]> - -is represented by an element node for y with only -two subnodes, the two z elements. There -is no data node for the space character because spaces are ignored in the -character-free element y. - - - - - - The representation of character data - - The XML specification allows all Unicode characters in XML -texts. This parser can be configured such that UTF-8 is used to represent the -characters internally; however, the default character encoding is -ISO-8859-1. (Currently, no other encodings are possible for the internal string -representation; the type Pxp_types.rep_encoding enumerates -the possible encodings. 
In principle, the parser could use any encoding that is -ASCII-compatible, but there are currently only lexical analyzers for UTF-8 and -ISO-8859-1. It is currently impossible to use UTF-16 or UCS-4 as internal -encodings (or other multibyte encodings which are not ASCII-compatible) unless -major parts of the parser are rewritten - unlikely...) - - -The internal encoding may be different from the external encoding (specified -in the XML declaration <?xml ... encoding="..."?>); in -this case the strings are automatically converted to the internal encoding. - - -If the internal encoding is ISO-8859-1, it is possible that there are -characters that cannot be represented. In this case, the parser ignores such -characters and prints a warning (to the collect_warning -object that must be passed when the parser is called). - - - The XML specification allows lines to be separated by single LF -characters, by CR LF character sequences, or by single CR -characters. Internally, these separators are always converted to single LF -characters. - - The parser guarantees that there are never two adjacent data -nodes; if necessary, data material that would otherwise be represented by -several nodes is collapsed into one node. Note that you can still create node -trees with adjacent data nodes; however, the parser does not return such trees. - - - Note that CDATA sections are not represented specially; such -sections are added to the current data material that is being collected for the -next data node. - - - - - The representation of entities within documents - - Entities are not represented within -documents! If the parser finds an entity reference in the document -content, the reference is immediately expanded, and the parser reads the -expansion text instead of the reference. - - - - - The representation of attributes As attribute -values are composed of Unicode characters, too, the same problems with the -character encoding arise as for character material. 
Attribute values are -converted to the internal encoding, too; and if there are characters that -cannot be represented, these are dropped, and a warning is printed. - - Attribute values are normalized before they are returned by -methods like attribute. First, any remaining entity -references are expanded; if necessary, expansion is performed recursively. -Second, newline characters (any of LF, CR LF, or CR characters) are converted -to single space characters. Note that especially the latter action is -prescribed by the XML standard (but the character reference &#10; is not converted, -so it is still possible to include line feeds in attribute values). - - - - - The representation of processing instructions -Processing instructions are parsed to some extent: The first word of the -PI is called the target, and it is stored separately from the rest of the PI: - - -]]> - -The exact location where a PI occurs is not represented (by default). The -parser puts the PI into the object that represents the enclosing construct (an -element, a DTD, or the whole document); that means you can find out which PIs -occur in a certain element, in the DTD, or in the whole document, but you -cannot look up the exact position within the construct. - - - If you require the exact location of PIs, it is possible to -create extra nodes for them. This mode is controlled by the option -enable_pinstr_nodes. The additional nodes have the node type -T_pinstr target, and are created -from special exemplars contained in the spec (see -pxp_document.mli). - - - - The representation of comments - -Normally, comments are not represented; they are dropped by -default. However, if you require them, it is possible to create -T_comment nodes for them. This mode can be specified by the -option enable_comment_nodes. Comment nodes are created from -special exemplars contained in the spec (see -pxp_document.mli). You can access the contents of comments through the -method comment. 
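The line-separator handling described in this chapter for character data and for attribute values can be sketched as follows. This is a minimal illustration written against the OCaml standard library only; the function names `normalize_line_ends` and `normalize_attr_newlines` are made up for this sketch and are not part of PXP, and the other normalization steps (entity expansion, encoding conversion) are deliberately left out.

```ocaml
(* Convert CR LF pairs and lone CR characters to single LF characters. *)
let normalize_line_ends (s : string) : string =
  let n = String.length s in
  let b = Buffer.create n in
  let i = ref 0 in
  while !i < n do
    (match s.[!i] with
     | '\r' ->
         Buffer.add_char b '\n';
         (* Skip the LF of a CR LF pair so that the pair yields one LF. *)
         if !i + 1 < n && s.[!i + 1] = '\n' then incr i
     | c -> Buffer.add_char b c);
    incr i
  done;
  Buffer.contents b

(* For attribute values, every line separator becomes one space. *)
let normalize_attr_newlines (s : string) : string =
  String.map (fun c -> if c = '\n' then ' ' else c) (normalize_line_ends s)
```

Normalizing the separators first keeps the attribute case simple: a CR LF pair is reduced to one LF before it is turned into a space, so it never produces two spaces.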
- - - - The attributes <literal>xml:lang</literal> and -<literal>xml:space</literal> - - These attributes are not supported specially; they are handled -like any other attribute. - - - - - And what about namespaces? - Currently, there is no special support for namespaces. -However, the parser allows it that the colon occurs in names such that it is -possible to implement namespaces on top of the current API. - - Some future release of PXP will support namespaces as built-in -feature... - - - - -
- - - - - Configuring and calling the parser - - - - - - - Overview - -There are the following main functions invoking the parser (in Pxp_yacc): - - - - parse_document_entity: You want to -parse a complete and closed document consisting of a DTD and the document body; -the body is validated against the DTD. This mode is interesting if you have a -file - - ... -]]> - -and you can accept any DTD that is included in the file (e.g. because the file -is under your control). - - - - parse_wfdocument_entity: You want to -parse a complete and closed document consisting of a DTD and the document body; -but the body is not validated, only checked for well-formedness. This mode is -preferred if validation costs too much time or if the DTD is missing. - - - - parse_dtd_entity: You want only to -parse an entity (file) containing the external subset of a DTD. Sometimes it is -interesting to read such a DTD, for example to compare it with the DTD included -in a document, or to apply the next mode: - - - - parse_content_entity: You want only to -parse an entity (file) containing a fragment of a document body; this fragment -is validated against the DTD you pass to the function. Especially, the fragment -must not have a <!DOCTYPE> clause, and must directly -begin with an element. The element is validated against the DTD. This mode is -interesting if you want to check documents against a fixed, immutable DTD. - - - - parse_wfcontent_entity: This function -also parses a single element without DTD, but does not validate it. - - - extract_dtd_from_document_entity: This -function extracts the DTD from a closed document consisting of a DTD and a -document body. Both the internal and the external subsets are extracted. - - - - - -In many cases, parse_document_entity is the preferred mode -to parse a document in a validating way, and -parse_wfdocument_entity is the mode of choice to parse a -file while only checking for well-formedness. 
- - - -There are a number of variations of these modes. One important application of a -parser is to check documents of an untrusted source against a fixed DTD. One -solution is to not allow the <!DOCTYPE> clause in -these documents, and treat the document like a fragment (using mode -parse_content_entity). This is very simple, but -inflexible; users of such a system cannot even define additional entities to -abbreviate frequent phrases of their text. - - - -It may be necessary to have a more intelligent checker. For example, it is also -possible to parse the document to check fully, i.e. with DTD, and to compare -this DTD with the prescribed one. In order to fully parse the document, mode -parse_document_entity is applied, and to get the DTD to -compare with mode parse_dtd_entity can be used. - - - -There is another very important configurable aspect of the parser: the -so-called resolver. The task of the resolver is to locate the contents of an -(external) entity for a given entity name, and to make the contents accessible -as a character stream. (Furthermore, it also normalizes the character set; -but this is a detail we can ignore here.) Consider you have a file called -"main.xml" containing - - -%sub; -]]> - -and a file stored in the subdirectory "sub" with name -"sub.xml" containing - - -%subsub; -]]> - -and a file stored in the subdirectory "subsub" of -"sub" with name "subsub.xml" (the -contents of this file do not matter). Here, the resolver must track that -the second entity subsub is located in the directory -"sub/subsub", i.e. the difficulty is to interpret the -system (file) names of entities relative to the entities containing them, -even if the entities are deeply nested. - - - -There is not a fixed resolver already doing everything right - resolving entity -names is a task that highly depends on the environment. 
The XML specification -only demands that SYSTEM entities are interpreted like URLs -(which is not very precise, as there are lots of URL schemes in use), hoping -that this helps overcoming the local peculiarities of the environment; the idea -is that if you do not know your environment you can refer to other entities by -denoting URLs for them. I think that this interpretation of -SYSTEM names may have some applications in the internet, but -it is not the first choice in general. Because of this, the resolver is a -separate module of the parser that can be exchanged by another one if -necessary; more precisely, the parser already defines several resolvers. - - - -The following resolvers do already exist: - - - - Resolvers reading from arbitrary input channels. These -can be configured such that a certain ID is associated with the channel; in -this case inner references to external entities can be resolved. There is also -a special resolver that interprets SYSTEM IDs as URLs; this resolver can -process relative SYSTEM names and determine the corresponding absolute URL. - - - - A resolver that reads always from a given O'Caml -string. This resolver is not able to resolve further names unless the string is -not associated with any name, i.e. if the document contained in the string -refers to an external entity, this reference cannot be followed in this -case. - - - A resolver for file names. The SYSTEM -name is interpreted as file URL with the slash "/" as separator for -directories. - This resolver is derived from the generic URL resolver. - - - -The interface a resolver must have is documented, so it is possible to write -your own resolver. For example, you could connect the parser with an HTTP -client, and resolve URLs of the HTTP namespace. The resolver classes support -that several independent resolvers are combined to one more powerful resolver; -thus it is possible to combine a self-written resolver with the already -existing resolvers. 
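The difficulty with nested relative names in the "main.xml" example can be made concrete with a small sketch. It uses only the OCaml standard library; the helper names `dir_of` and `resolve` are hypothetical, and the system names "sub/sub.xml" and "subsub/subsub.xml" are assumed here because the declarations themselves are not shown above. The sketch ignores ".." segments and URL schemes, which a real URL-based resolver has to handle.

```ocaml
(* Directory part of a slash-separated name, including the trailing slash;
   the empty string if the name has no directory part. *)
let dir_of (name : string) : string =
  match String.rindex_opt name '/' with
  | Some i -> String.sub name 0 (i + 1)
  | None -> ""

(* Interpret [rel] relative to the entity [base] that contains the reference.
   Names starting with "/" are taken as absolute and returned unchanged. *)
let resolve ~(base : string) (rel : string) : string =
  if String.length rel > 0 && rel.[0] = '/' then rel
  else dir_of base ^ rel

(* Follow the chain main.xml -> sub.xml -> subsub.xml. *)
let sub = resolve ~base:"main.xml" "sub/sub.xml"
let subsub = resolve ~base:sub "subsub/subsub.xml"
```

The point of the sketch is that each reference is resolved against the entity that contains it, not against the top-level document, which is why a resolver has to remember the name of the entity it is currently reading.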
- - - -Note that the existing resolvers only interpret SYSTEM -names, not PUBLIC names. If it helps you, it is possible to -define resolvers for PUBLIC names, too; for example, such a -resolver could look up the public name in a hash table, and map it to a system -name which is passed over to the existing resolver for system names. It is -relatively simple to provide such a resolver. - - - - - - - Resolvers and sources - - - Using the built-in resolvers (called sources) - - The type source enumerates the two -possibilities where the document to parse comes from. - - -type source = - Entity of ((dtd -> Pxp_entity.entity) * Pxp_reader.resolver) - | ExtID of (ext_id * Pxp_reader.resolver) - - -You normally need not to worry about this type as there are convenience -functions that create source values: - - - - - from_file s: The document is read from -file s; you may specify absolute or relative path names. -The file name must be encoded as UTF-8 string. - - -There is an optional argument ~system_encoding -specifying the character encoding which is used for the names of the file -system. For example, if this encoding is ISO-8859-1 and s is -also a ISO-8859-1 string, you can form the source: - - - - - -This source has the advantage that -it is able to resolve inner external entities; i.e. if your document includes -data from another file (using the SYSTEM attribute), this -mode will find that file. However, this mode cannot resolve -PUBLIC identifiers nor SYSTEM identifiers -other than "file:". - - - - from_channel ch: The document is read -from the channel ch. In general, this source also supports -file URLs found in the document; however, by default only absolute URLs are -understood. It is possible to associate an ID with the channel such that the -resolver knows how to interpret relative URLs: - - -from_channel ~id:(System "file:///dir/dir1/") ch - - -There is also the ~system_encoding argument specifying how file names are -encoded. 
- The example from above can also be written (but it is no -longer possible to interpret relative URLs because there is no ~id argument, -and computing this argument is relatively complicated because it must -be a valid URL): - - -let ch = open_in s in -let src = from_channel ~system_encoding:`Enc_iso88591 ch in -...; -close_in ch - - - - - from_string s: The string -s is the document to parse. This mode is not able to -interpret file names of SYSTEM clauses, nor can it look up -PUBLIC identifiers. - - Normally, the encoding of the string is detected as usual -by analyzing the XML declaration, if any. However, it is also possible to -specify the encoding directly: - - -let src = from_string ~fixenc:`Enc_iso88592 s - - - - - ExtID (id, r): The document to parse -is denoted by the identifier id (either a -SYSTEM or PUBLIC clause), and this -identifier is interpreted by the resolver r. Use this mode -if you have written your own resolver. - Which character sets are possible depends on the passed -resolver r. - - - Entity (get_entity, r): The document -to parse is returned by the function invocation get_entity -dtd, where dtd is the DTD object to use (it may be -empty). Inner external references occurring in this entity are resolved using -the resolver r. - Which character sets are possible depends on the passed -resolver r. - - - - - - - The resolver API - - A resolver is an object that can be opened like a file, except that -you pass it the XML identifier of the entity to read from -(either a SYSTEM or PUBLIC -clause) instead of a file name. When opened, the resolver must return the -Lexing.lexbuf that reads the characters. The resolver can -be closed, and it can be cloned. Furthermore, it is possible to tell the -resolver which character set it should assume. 
- The following from Pxp_reader: - - unit - method init_warner : collect_warnings -> unit - method rep_encoding : rep_encoding - method open_in : ext_id -> Lexing.lexbuf - method close_in : unit - method change_encoding : string -> unit - method clone : resolver - method close_all : unit - end -]]> - -The resolver object must work as follows: - - - - - When the parser is called, it tells the resolver the -warner object and the internal encoding by invoking -init_warner and init_rep_encoding. The -resolver should store these values. The method rep_encoding -should return the internal encoding. - - - - If the parser wants to read from the resolver, it invokes -the method open_in. Either the resolver succeeds, in which -case the Lexing.lexbuf reading from the file or stream must -be returned, or opening fails. In the latter case the method implementation -should raise an exception (see below). - - - If the parser finishes reading, it calls the -close_in method. - - - If the parser finds a reference to another external -entity in the input stream, it calls clone to get a second -resolver which must be initially closed (not yet connected with an input -stream). The parser then invokes open_in and the other -methods as described. - - - If you already know the character set of the input -stream, you should recode it to the internal encoding, and define the method -change_encoding as an empty method. - - - If you want to support multiple external character sets, -the object must follow a much more complicated protocol. Directly after -open_in has been called, the resolver must return a lexical -buffer that only reads one byte at a time. This is only possible if you create -the lexical buffer with Lexing.from_function; the function -must then always return 1 if the EOF is not yet reached, and 0 if EOF is -reached. If the parser has read the first line of the document, it will invoke -change_encoding to tell the resolver which character set to -assume. 
From this moment, the object can return more than one byte at once. The -argument of change_encoding is either the parameter of the -"encoding" attribute of the XML declaration, or the empty string if there is -not any XML declaration or if the declaration does not contain an encoding -attribute. - - At the beginning the resolver must only return one -character every time something is read from the lexical buffer. The reason for -this is that you otherwise would not exactly know at which position in the -input stream the character set changes. - - If you want automatic recognition of the character set, -it is up to the resolver object to implement this. - - - If an error occurs, the parser calls the method -close_all for the top-level resolver; this method should -close itself (if not already done) and all clones. - - - - Exceptions - -It is possible to chain resolvers such that when the first resolver is not able -to open the entity, the other resolvers of the chain are tried in turn. The -method open_in should raise the exception -Not_competent to indicate that the next resolver should try -to open the entity. If the resolver is able to handle the ID, but some other -error occurs, the exception Not_resolvable should be raised -to force that the chain breaks. - - - - Example: How to define a resolver that is equivalent to -from_string: ... - - - - - Predefined resolver components - -There are some classes in Pxp_reader that define common resolver behaviour. - - - ?fixenc:encoding -> - ?auto_close:bool -> - in_channel -> - resolver -]]> - -Reads from the passed channel (it may be even a pipe). If the -~id argument is passed to the object, the created resolver -accepts only this ID. Otherwise all IDs are accepted. - Once the resolver has -been cloned, it does not accept any ID. This means that this resolver cannot -handle inner references to external entities. 
Note that you can combine this -resolver with another resolver that can handle inner references (such as -resolve_as_file); see class 'combine' below. - If you pass the -~fixenc argument, the encoding of the channel is set to the -passed value, regardless of any auto-recognition or any XML declaration. - If -~auto_close = true (which is the default), the channel is -closed after use. If ~auto_close = false, the channel is -left open. - - - - - channel_of_id:(ext_id -> (in_channel * encoding option)) -> - resolver -]]> - -This resolver calls the function ~channel_of_id to open a -new channel for the passed ext_id. This function must either -return the channel and the encoding, or it must fail with Not_competent. The -function must return None as encoding if the default -mechanism to recognize the encoding should be used. It must return -Some e if it is already known that the encoding of the -channel is e. If ~auto_close = true -(which is the default), the channel is closed after use. If -~auto_close = false, the channel is left open. - - - - - ?auto_close:bool -> - url_of_id:(ext_id -> Neturl.url) -> - channel_of_url:(Neturl.url -> (in_channel * encoding option)) -> - resolver -]]> - -When this resolver gets an ID to read from, it calls the function -~url_of_id to get the corresponding URL. This URL may be a -relative URL; however, a URL scheme must be used which contains a path. The -resolver converts the URL to an absolute URL if necessary. The second -function, ~channel_of_url, is fed with the absolute URL as -input. This function opens the resource to read from, and returns the channel -and the encoding of the resource. - - -Both functions, ~url_of_id and -~channel_of_url, can raise Not_competent to indicate that -the object is not able to read from the specified resource. However, there is a -difference: A Not_competent from ~url_of_id is left as it -is, but a Not_competent from ~channel_of_url is converted to -Not_resolvable. 
So only ~url_of_id decides which URLs are -accepted by the resolver and which not. - - -The function ~channel_of_url must return -None as encoding if the default mechanism to recognize the -encoding should be used. It must return Some e if it is -already known that the encoding of the channel is e. - - -If ~auto_close = true (which is the default), the channel is -closed after use. If ~auto_close = false, the channel is -left open. - - -Objects of this class contain a base URL relative to which relative URLs are -interpreted. When creating a new object, you can specify the base URL by -passing it as ~base_url argument. When an existing object is -cloned, the base URL of the clone is the URL of the original object. - Note -that the term "base URL" has a strict definition in RFC 1808. - - - - - ?fixenc:encoding -> - string -> - resolver -]]> - -Reads from the passed string. If the ~id argument is passed -to the object, the created resolver accepts only this ID. Otherwise all IDs are -accepted. - Once the resolver has been cloned, it does not accept any ID. This -means that this resolver cannot handle inner references to external -entities. Note that you can combine this resolver with another resolver that -can handle inner references (such as resolve_as_file); see class 'combine' -below. - If you pass the ~fixenc argument, the encoding of -the string is set to the passed value, regardless of any auto-recognition or -any XML declaration. - - - - (string * encoding option)) -> - resolver -]]> - -This resolver calls the function ~string_of_id to get the -string for the passed ext_id. This function must either -return the string and the encoding, or it must fail with Not_competent. The -function must return None as encoding if the default -mechanism to recognize the encoding should be used. It must return -Some e if it is already known that the encoding of the -string is e. 
- - - - - ?host_prefix:[ `Not_recognized | `Allowed | `Required ] -> - ?system_encoding:encoding -> - ?url_of_id:(ext_id -> Neturl.url) -> - ?channel_of_url: (Neturl.url -> (in_channel * encoding option)) -> - unit -> - resolver -]]> -Reads from the local file system. Every file name is interpreted as -file name of the local file system, and the referred file is read. - - -The full form of a file URL is: file://host/path, where -'host' specifies the host system where the file identified 'path' -resides. host = "" or host = "localhost" are accepted; other values -will raise Not_competent. The standard for file URLs is -defined in RFC 1738. - - -Option ~file_prefix: Specifies how the "file:" prefix of -file names is handled: - - - `Not_recognized:The prefix is not -recognized. - - - `Allowed: The prefix is allowed but -not required (the default). - - - `Required: The prefix is -required. - - - - -Option ~host_prefix: Specifies how the "//host" phrase of -file names is handled: - - - `Not_recognized:The prefix is not -recognized. - - - `Allowed: The prefix is allowed but -not required (the default). - - - `Required: The prefix is -required. - - - - -Option ~system_encoding: Specifies the encoding of file -names of the local file system. Default: UTF-8. - - -Options ~url_of_id, ~channel_of_url: Not -for the casual user! - - - - - resolver list -> - resolver -]]> - -Combines several resolver objects. If a concrete entity with an -ext_id is to be opened, the combined resolver tries the -contained resolvers in turn until a resolver accepts opening the entity -(i.e. it does not raise Not_competent on open_in). - - -Clones: If the 'clone' method is invoked before 'open_in', all contained -resolvers are cloned separately and again combined. If the 'clone' method is -invoked after 'open_in' (i.e. while the resolver is open), additionally the -clone of the active resolver is flagged as being preferred, i.e. it is tried -first. 
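The way a combined resolver tries its members in turn can be illustrated with a toy model. This is a sketch under stated assumptions and does not use Pxp_reader: the two exceptions are local stand-ins named after the ones described above (the string payload of `Not_resolvable` is chosen for this sketch only), a "resolver" is reduced to a plain function from an ID to the entity text, and cloning with a preferred resolver is not modelled.

```ocaml
(* Local stand-ins for the exceptions described above; not Pxp_reader's. *)
exception Not_competent
exception Not_resolvable of string

(* In this toy model a resolver maps an ID to the text of the entity. *)
type resolver = string -> string

(* Try the resolvers in order. Not_competent hands the ID to the next
   resolver; any other exception ends the search and propagates. *)
let rec combine (rs : resolver list) : resolver = fun id ->
  match rs with
  | [] -> raise Not_competent
  | r :: rest -> (try r id with Not_competent -> combine rest id)

(* Two hypothetical resolvers with disjoint responsibilities. *)
let only_a : resolver = fun id ->
  if id = "a" then "text of a" else raise Not_competent

let only_b : resolver = fun id ->
  if id = "b" then "text of b"
  else if id = "broken" then raise (Not_resolvable "cannot open")
  else raise Not_competent

let both = combine [ only_a; only_b ]
```

Here `both "b"` succeeds although `only_a` declines the ID, whereas `both "broken"` stops with `Not_resolvable` because `only_b` accepted the ID and then failed.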
- - - - - - - The DTD classes Sorry, not yet -written. Perhaps the interface definition of Pxp_dtd expresses the same: - - -&markup-dtd1.mli;&markup-dtd2.mli; - - - - - Invoking the parser - - Here a description of Pxp_yacc. - - - Defaults - The following defaults are available: - - -val default_config : config -val default_extension : ('a node extension) as 'a -val default_spec : ('a node extension as 'a) spec - - - - - - Parsing functions - In the following, the term "closed document" refers to -an XML structure like - - -<!DOCTYPE ... [ declarations ] > -<root> -... -</root> - - -The term "fragment" refers to an XML structure like - - -<root> -... -</root> - - -i.e. only to one isolated element instance. - - - - source -> dtd -]]> - -Parses the declarations which are contained in the entity, and returns them as -dtd object. - - - - source -> dtd -]]> - -Extracts the DTD from a closed document. Both the internal and the external -subsets are extracted and combined to one dtd object. This -function does not parse the whole document, but only the parts that are -necessary to extract the DTD. - - - - dtd) -> - ?id_index:('ext index) -> - config -> - source -> - 'ext spec -> - 'ext document -]]> - -Parses a closed document and validates it against the DTD that is contained in -the document (internal and external subsets). The option -~transform_dtd can be used to transform the DTD in the -document, and to use the transformed DTD for validation. If -~id_index is specified, an index of all ID attributes is -created. - - - - - source -> - 'ext spec -> - 'ext document -]]> - -Parses a closed document, but checks it only on well-formedness. - - - - - config -> - source -> - dtd -> - 'ext spec -> - 'ext node -]]> - -Parses a fragment, and validates the element. - - - - - source -> - 'ext spec -> - 'ext node -]]> - -Parses a fragment, but checks it only on well-formedness. 
- - - - - Configuration options - - - - - - warner:The parser prints -warnings by invoking the method warn for this warner -object. (Default: all warnings are dropped) - - errors_with_line_numbers:If -true, errors contain line numbers; if false, errors contain only byte -positions. The latter mode is faster. (Default: true) - - enable_pinstr_nodes:If true, -the parser creates extra nodes for processing instructions. If false, -processing instructions are simply added to the element or document surrounding -the instructions. (Default: false) - - enable_super_root_node:If -true, the parser creates an extra node which is the parent of the root of the -document tree. This node is called super root; it is an element with type -T_super_root. - If there are processing instructions outside -the root element and outside the DTD, they are added to the super root instead -of the document. - If false, the super root node is not created. (Default: -false) - - enable_comment_nodes:If true, -the parser creates nodes for comments with type T_comment; -if false, such nodes are not created. (Default: false) - - encoding:Specifies the -internal encoding of the parser. Most strings are then represented according to -this encoding; however there are some exceptions (especially -ext_id values which are always UTF-8 encoded). -(Default: `Enc_iso88591) - - -recognize_standalone_declaration: If true and if the parser is -validating, the standalone="yes" declaration forces that it -is checked whether the document is a standalone document. - If false, or if the -parser is in well-formedness mode, such declarations are ignored. -(Default: true) - - - store_element_positions: If -true, for every non-data node the source position is stored. If false, the -position information is lost. If available, you can get the positions of nodes -by invoking the position method. 
-(Default: true) - - idref_pass:If true and if -there is an ID index, the parser checks whether every IDREF or IDREFS attribute -refers to an existing node; this requires that the parser traverses the whole -document tree. If false, this check is left out. (Default: false) - - validate_by_dfa:If true and if -the content model for an element type is deterministic, a deterministic finite -automaton is used to validate whether the element contents match the content -model of the type. If false, or if a DFA is not available, a backtracking -algorithm is used for validation. (Default: true) - - - -accept_only_deterministic_models: If true, only deterministic content -models are accepted; if false, any syntactically correct content models can be -processed. (Default: true) - - - - - - Which configuration should I use? - First, I recommend varying the default configuration instead of -creating a new configuration record. For instance, to set -idref_pass to true, change the default -as in: - -let config = { default_config with idref_pass = true } - -The reason is that I can add more options to the record in future versions -of the parser without breaking your programs. - - - Do I need extra nodes for processing instructions? -By default, such nodes are not created. This does not mean that the -processing instructions are lost; however, you cannot find out the exact -location where they occur. For example, the following XML text - - -]]> - -will normally create one element node for x containing -one subnode for y. The processing -instructions are attached to x in a separate hash table; you -can access them using x # pinstr "pi1" and x # -pinstr "pi2", respectively. The information about where the -instructions occur within x is lost. - - - - If the option enable_pinstr_nodes is -turned on, the parser creates extra nodes pi1 and -pi2 such that the subnodes of x are now: - - - -The extra nodes contain the processing instructions in the usual way, i.e. 
you -can access them using pi1 # pinstr "pi1" and pi2 # -pinstr "pi2", respectively. - - - Note that you will need an exemplar for the PI nodes (see -make_spec_from_alist). - - - Do I need a super root node? - By default, there is no super root node. The -document object refers directly to the node representing the -root element of the document, i.e. - - - -if r is the root node. This is sometimes inconvenient: (1) -Some algorithms become simpler if every node has a parent, even the root -node. (2) Some standards such as XPath use the term "root node" for the node whose -child represents the root of the document. (3) The super root node can serve -as a container for processing instructions outside the root element. For -these reasons, it is possible to create an extra super root node, whose child -is the root node: - - - -When extra nodes are also created for processing instructions, these nodes can -be added to the super root node if they occur outside the root element (reason -(3)), and the order reflects the order in the source text. - - - Note that you will need an exemplar for the super root node -(see make_spec_from_alist). - - - What is the effect of the UTF-8 encoding? - By default, the parser represents strings (with a few -exceptions) as ISO-8859-1 strings. These are well-known, and there are tools -and fonts for this encoding. - - However, internationalization may require that you switch over -to UTF-8 encoding. In most environments, the immediate effect will be that you -cannot read strings with character codes >= 160 any longer; your terminal will -only show funny glyph combinations. It is strongly recommended to install -Unicode fonts (GNU Unifont, - -Markus Kuhn's fonts) and terminal emulators -that can handle UTF-8 byte sequences. Furthermore, a Unicode editor may -be helpful (such as Yudit). There is -also a FAQ by -Markus Kuhn. - - By setting encoding to -`Enc_utf8, all strings originating from the parsed XML -document are represented as UTF-8 strings. 
This includes not only character -data and attribute values but also element names, attribute names and so on, as -it is possible to use any Unicode letter to form such names. Strictly -speaking, PXP is only XML-compliant if the UTF-8 mode is used; otherwise it -will have difficulties when validating documents containing -non-ISO-8859-1 names. - - - This mode does not have any impact on the external -representation of documents. The character set assumed when reading a document -is set in the XML declaration, and the character set used when writing a document must -be passed to the write method. - - - - How do I check that the nodes referred to by IDREF attributes exist? - First, you must create an index of all occurring ID -attributes: - - - -This index must be passed to the parsing function: - - index) - config source spec -]]> - -Next, you must turn on the idref_pass mode: - - - -Note that now the whole document tree will be traversed, and every node will be -checked for IDREF and IDREFS attributes. If the tree is big, this may take some -time. - - - - - What are deterministic content models? - These types of models can speed up the validation checks; -furthermore, they ensure SGML-compatibility. In particular, a content model is -deterministic if the parser can determine which alternative is actually used by -inspecting only the current token. For example, this element has -non-deterministic contents: - - -]]> - -If the first element in x is u, the -parser does not know which of the alternatives (u,v) or -(u,y+) will work; the parser must also inspect the second -element to be able to distinguish between the alternatives. Because such -look-ahead (or "guessing") is required, this example is -non-deterministic. - - - The XML standard requires that content models be -deterministic. So it is recommended to turn the option -accept_only_deterministic_models on; however, PXP can also -process non-deterministic models using a backtracking algorithm. 
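The difference between the two kinds of models can be illustrated with a minimal sketch in plain OCaml. It does not use PXP at all: the type model and the functions matches and valid are hypothetical names invented for this illustration, and the matcher is a generic backtracking algorithm, not PXP's actual validator. The sketch assumes that the non-deterministic model is the choice between the two alternatives (u,v) and (u,y+) named above, and compares it with the left-factored model (u,(v|y+)), which accepts the same child sequences but can be decided by looking at the current element only.

```ocaml
(* Toy content models; these names are hypothetical and are NOT PXP's API. *)
type model =
  | Elem of string       (* a single child element, e.g. u *)
  | Seq of model list    (* (a,b,...) *)
  | Alt of model list    (* (a|b|...) *)
  | Plus of model        (* a+ *)

(* Backtracking matcher in continuation-passing style: [k] receives the
   children not yet consumed and decides whether the match can be completed. *)
let rec matches (m : model) (input : string list) (k : string list -> bool) : bool =
  match m with
  | Elem name ->
      (match input with
       | x :: rest when x = name -> k rest
       | _ -> false)
  | Seq [] -> k input
  | Seq (first :: others) ->
      matches first input (fun rest -> matches (Seq others) rest k)
  | Alt alternatives ->
      (* Try the alternatives in order; a failure falls back to the next one. *)
      List.exists (fun alt -> matches alt input k) alternatives
  | Plus inner ->
      (* Match one occurrence, then either stop or repeat. The length test
         guards against looping when [inner] consumes nothing. *)
      matches inner input (fun rest ->
        k rest
        || (List.length rest < List.length input && matches m rest k))

(* The children are valid if the model consumes all of them. *)
let valid (m : model) (children : string list) : bool =
  matches m children (fun rest -> rest = [])

(* Non-deterministic: both alternatives start with u. *)
let ambiguous =
  Alt [ Seq [ Elem "u"; Elem "v" ]; Seq [ Elem "u"; Plus (Elem "y") ] ]

(* Deterministic, left-factored equivalent: (u,(v|y+)). *)
let factored = Seq [ Elem "u"; Alt [ Elem "v"; Plus (Elem "y") ] ]
```

For the children u, y, y the matcher first tries (u,v) on the ambiguous model, fails at the second element, and has to start again with (u,y+). With the factored model, u is consumed exactly once and the choice between v and y+ is made by inspecting the next element only.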
- - Deterministic models ensure that validation can be performed in -linear time. In order to get the maximum benefit, PXP also implements a -special validator that profits from deterministic models; this is the -deterministic finite automaton (DFA). This validator is enabled per element -type if the element type has a deterministic model and if the option -validate_by_dfa is turned on. - - In general, I expect that the DFA method is faster than the -backtracking method; in particular, even in the worst case the DFA takes only linear -time. However, if the content model has only a few alternatives and the -alternatives do not nest, the backtracking algorithm may be better. - - - - - - - - - Updates - - Some features, often added later, that are not otherwise -explained in the manual but are worth mentioning. - - - Methods node_position, node_path, nth_node, -previous_node, next_node for nodes: See pxp_document.mli - - Functions to determine the document order of nodes: -compare, create_ord_index, ord_number, ord_compare: See pxp_document.mli - - - - - - - 
-
-