******************************************************************************
README - PXP, the XML parser for O'Caml
******************************************************************************

==============================================================================
==============================================================================

PXP is a validating parser for XML-1.0 which has been written entirely in
O'Caml.

PXP is the new name of the parser formerly known as "Markup". PXP means
"Polymorphic XML parser" and emphasizes its most useful property: the API is
polymorphic and can be configured such that different objects are used to
store different types of elements.

==============================================================================
==============================================================================

You can download PXP as a gzip'ed tarball [1]. The parser needs the Netstring
[2] package (0.9.3). Note that PXP requires O'Caml 3.00.

==============================================================================
==============================================================================

The manual is included in the distribution both as a PostScript document and
as a set of HTML files. An online version is available at [3].

==============================================================================
Author, Credits, Copying
==============================================================================

PXP has been written by Gerd Stolpmann [4]; it contains contributions by
Claudio Sacerdoti Coen. You may copy it as you like, and you may even use it
for commercial purposes, as long as the license conditions are respected; see
the file LICENSE that comes with the distribution. It allows almost everything.

Thanks also to Alain Frisch and Haruo Hosoya for discussions and bug reports.

==============================================================================
==============================================================================

PXP is a validating XML parser for O'Caml [5]. It strictly complies with the
XML-1.0 standard [6].

The parser is simple to call: usually a single statement (a function call) is
sufficient to parse an XML document and to represent it as an object tree.

Once the document is parsed, it can be accessed using a class interface. The
interface allows arbitrary access, including transformations. One of the
features of the document representation is its polymorphic nature; it is simple
to add custom methods to the document classes. Furthermore, the parser can be
configured such that different XML elements are represented by objects created
from different classes. This is a very powerful feature, because it simplifies
the structure of programs processing XML documents.
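
The idea of representing different elements by objects of different classes
can be illustrated with a small self-contained sketch. It does not use the PXP
API: the class type "node", the classes "element", "text" and "title", and the
function "make_node" are all hypothetical names chosen for this illustration.
A lookup table maps element names to constructors, so that <title> elements
get their own class with a custom rendering method.

```ocaml
(* Toy model, not the PXP API: every name below is hypothetical. *)

(* Common interface shared by all node objects. *)
class type node = object
  method name : string
  method children : node list
  method to_text : string
end

(* Default representation: concatenates the text of the children. *)
class element (name : string) (children : node list) = object
  method name = name
  method children = children
  method to_text = String.concat "" (List.map (fun c -> c#to_text) children)
end

(* Character data leaf. *)
class text (s : string) = object
  method name = "#text"
  method children : node list = []
  method to_text = s
end

(* Custom class for <title> elements: overrides the rendering method. *)
class title (name : string) (children : node list) = object
  inherit element name children as super
  method! to_text = "== " ^ super#to_text ^ " ==\n"
end

(* Maps an element name to the constructor used for that element type;
   unknown names fall back to the default class. *)
let make_node : string -> node list -> node =
  let table = Hashtbl.create 7 in
  Hashtbl.add table "title" (fun n c -> (new title n c :> node));
  fun name children ->
    match Hashtbl.find_opt table name with
    | Some ctor -> ctor name children
    | None -> (new element name children :> node)

(* A small tree built by hand; a parser would call make_node instead. *)
let doc =
  make_node "readme"
    [ make_node "title" [ (new text "PXP" :> node) ];
      make_node "p" [ (new text "A validating parser." :> node) ] ]
```

Calling doc#to_text walks the tree, and each node renders itself with the
method of its own class, so the processing code needs no case distinction on
element names.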
Note that the class interface does not comply with the DOM standard. It was
not a development goal to realize a standard API (industrial developers can do
this much better than I can); however, the API is powerful enough to be
considered equivalent to DOM. More importantly, the interface is compatible
with the XML information model required by many XML-related standards.

------------------------------------------------------------------------------
------------------------------------------------------------------------------

- The XML instance is validated against the DTD; any violation of a validation
  constraint leads to the rejection of the instance. The validator has been
  carefully implemented and conforms strictly to the standard. If needed, it
  is also possible to run the parser in a well-formedness mode.

- If possible, the validator applies a deterministic finite automaton to
  validate the content models. When it does, validation is performed in
  linear time. However, if the content models are not deterministic, the
  parser uses a backtracking algorithm, which can be much slower. It is also
  possible to reject non-deterministic content models.

- In particular, the validator also checks the complicated rules on whether
  parentheses are properly nested with respect to entities, and whether the
  standalone declaration is satisfied. On demand, it is checked whether the
  IDREF attributes only refer to existing nodes.

- Entity references are automatically resolved while the XML text is being
  scanned. It is not possible to recognize in the object tree where a
  referenced entity begins or ends; the object tree only represents the

- External entities are loaded using a configurable resolver infrastructure.
  It is possible to connect the parser with an arbitrary XML source.

- The parser can read XML text encoded in a variety of character sets.
  Independently of this, it is possible to choose the encoding of the
  internal representation of the tree nodes; the parser automatically
  converts the input text to this encoding. Currently, the parser supports
  UTF-8 and ISO-8859-1 as internal encodings.

- The interface of the parser has been designed to integrate well with the
  O'Caml language. The first goal was simplicity of use, which is achieved by
  many convenience methods and functions, and by allowing the user to select
  which parts of the XML text are actually represented in the tree. For
  example, it is possible to store processing instructions as tree nodes, but
  the parser can also be configured such that these instructions are put into
  hashtables. The information model is compatible with the requirements of
  XML-related standards such as XPath.

- In particular, the node tree can optionally contain or leave out processing
  instructions and comments. It is also possible to generate a "super root"
  object which is the parent of the root element. The attributes of elements
  are normally not stored as nodes, but it is possible to get them wrapped
  into nodes.

- There is also an interface for DTDs; you can parse and access sequences of
  declarations. The declarations are fully represented as recursive O'Caml
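
The linear-time validation mentioned in the list above can be illustrated
with a minimal sketch. It is not PXP's validator: the automaton below is
written by hand for the single content model (title, para+), and the names
"transition", "is_final" and "validate" are hypothetical. A real validator
would derive the automaton from the DTD.

```ocaml
(* Illustrative only: a hand-built DFA for the content model (title, para+).
   The state numbering and function names are hypothetical, not PXP's. *)

(* Returns the next state for a child element, or None on a violation. *)
let transition (state : int) (child : string) : int option =
  match state, child with
  | 0, "title" -> Some 1      (* the title must come first *)
  | 1, "para" -> Some 2       (* at least one para is required *)
  | 2, "para" -> Some 2       (* further paras are allowed *)
  | _ -> None

(* The content is complete only after at least one para has been seen. *)
let is_final (state : int) : bool = state = 2

(* One transition per child: the cost is linear in the number of children. *)
let validate (children : string list) : bool =
  let rec run state = function
    | [] -> is_final state
    | c :: rest ->
      (match transition state c with
       | Some next -> run next rest
       | None -> false)
  in
  run 0 children
```

Because every child name determines the next state uniquely, no backtracking
is needed. A model such as ((b, c) | (b, d)) is not deterministic in this
sense: after reading "b" the validator cannot know which branch was taken.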
------------------------------------------------------------------------------
------------------------------------------------------------------------------

This distribution contains several examples:

- validate: simply parses a document and prints all error messages.

- readme: defines a DTD for simple "README"-like documents, and offers
  conversion to HTML and text files [7].

- xmlforms: a more sophisticated application that uses XML as style sheet
  language and data storage format. It shows how a Tk user interface can be
  configured by an XML style, and how data records can be stored using XML.

------------------------------------------------------------------------------
Restrictions and missing features
------------------------------------------------------------------------------

The following restrictions apply; they are not violations of the standard:

- The attributes "xml:space" and "xml:lang" are not given special treatment.
  (The application can do this.)

- The built-in support for SYSTEM and PUBLIC identifiers is limited to local
  file access. There is no support for catalogs. The parser offers a hook to
  add missing features.

- It is currently not possible to check for interoperability with SGML.
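
As an illustration of the first restriction, an application could resolve
"xml:lang" itself by walking from an element towards the root until it finds
the attribute. The sketch below is a minimal example under that assumption.
It uses a toy record type instead of PXP's node classes; the type "elem" and
the function "effective_lang" are hypothetical names.

```ocaml
(* Toy element record, not a PXP type: all names here are hypothetical. *)
type elem = {
  name : string;
  attrs : (string * string) list;
  parent : elem option;
}

(* The effective language of an element is its own xml:lang value or,
   failing that, the value inherited from the nearest ancestor. *)
let rec effective_lang (e : elem) : string option =
  match List.assoc_opt "xml:lang" e.attrs with
  | Some lang -> Some lang
  | None ->
    (match e.parent with
     | Some p -> effective_lang p
     | None -> None)

(* A three-level example: doc (en) > sect (inherits) > quote (de). *)
let root = { name = "doc"; attrs = [ ("xml:lang", "en") ]; parent = None }
let sect = { name = "sect"; attrs = []; parent = Some root }
let quote =
  { name = "quote"; attrs = [ ("xml:lang", "de") ]; parent = Some sect }
```

Here "sect" inherits the language of "doc", while "quote" overrides it with
its own attribute.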
The following features are also missing:

- There is no special support for namespaces. (Perhaps in the next release?)

- There is no support for XPath or XSLT.

However, I hope that these features will be implemented soon, either by myself
or by contributors (who are invited to do so).

------------------------------------------------------------------------------
------------------------------------------------------------------------------

Support for document order.

Several fixes of bugs reported by Haruo Hosoya and Alain Frisch.
The class type "node" has been extended: you can go directly to the next and
previous nodes in the list; you can refer to nodes by position.
There are now some iterators for nodes: find, find_all, find_element,
find_all_elements, map_tree, iter_tree.
Experimental support for viewing attributes as nodes; I hope that helps
Alain write his XPath evaluator.
The user's manual has been revised and is almost up to date.

There are now additional node types T_super_root, T_pinstr and T_comment,
and the parser is able to create the corresponding nodes.
The functions for character set conversion have been moved to the Netstring
package; they are not specific to XML.

Implemented a check on deterministic content models. Added an alternate
validator based on a DFA. This means that all mandatory features for
an XML-1.0 parser are now implemented! The parser is now substantially
complete.

The handling of ID and IDREF attributes has changed. The index of nodes
containing an ID attribute is now separated from the document. Optionally
the parser now checks whether the IDREF attributes refer to existing
nodes.
The element nodes can optionally store the location in the source XML code.
The method 'write' writes the XML tree in every supported encoding.
(Successor of 'write_compact_as_latin1'.)
Several smaller changes and fixes.

The module Pxp_reader has been modernized. The resolver classes are simpler
to use. There is now support for URLs.
The interface of Pxp_yacc has been improved: The type 'source' is now
simpler. The type 'domspec' has gone; the new 'spec' is opaque and performs
better. There are some new parsing modes.
Many smaller changes.

The markup_* modules have been renamed to pxp_*. There is a new
compatibility API that tries to be compatible with markup-0.2.10.
The type "encoding" is now a polymorphic variant.

Added checks for the constraints about the standalone declaration.
Added regression tests for attribute normalization, attribute checks,
Fixed some minor errors of the attribute normalization function.
The bytecode/native archives are now separated into a general part, an
ISO-8859-1-relevant part, and a UTF-8-relevant part. The parser can again be
compiled with ocamlopt.

In general, this release is an early pre-release of the next stable version
1.00. I do not recommend using it for serious work; it is still very
The core of the parser has been rewritten using a self-written parser
The lexer has been restructured, and can now handle UTF-8 encoded files.
Numerous other changes.

--------------------------
[1] see http://www.ocaml-programming.de/packages/pxp-1.0.tar.gz

[2] see http://www.ocaml-programming.de/packages/documentation/netstring

[3] see http://www.ocaml-programming.de/packages/documentation/pxp/manual

[4] see mailto:gerd@gerd-stolpmann.de

[5] see http://caml.inria.fr/

[6] see http://www.w3.org/TR/1998/REC-xml-19980210.html

[7] This particular document is an example of this DTD!