<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE readme SYSTEM "readme.dtd" [

<!--
<!ENTITY url.ocaml           "http://caml.inria.fr/">
<!ENTITY url.xml-spec        "http://www.w3.org/TR/1998/REC-xml-19980210.html">
<!ENTITY url.jclark-xmltdata "ftp://ftp.jclark.com/pub/xml/xmltest.zip">
<!ENTITY url.gps-ocaml-download "http://people.darmstadt.netsurf.de/ocaml">
<!ENTITY url.markup-download    "&url.gps-ocaml-download;/markup-0.1.tar.gz">
<!ENTITY person.gps             '<a
  href="mailto:Gerd.Stolpmann@darmstadt.netsurf.de">Gerd Stolpmann</a>'>
-->

<!ENTITY % common SYSTEM "common.xml">
%common;

<!-- Special HTML config: -->
<!ENTITY % readme:html:up '<a href="../..">up</a>'>

<!ENTITY % config SYSTEM "config.xml">
%config;

]>

<readme title="README - PXP, the XML parser for O'Caml">
  <sect1>
    <title>Abstract</title>
    <p>
<em>PXP</em> is a validating parser for XML-1.0, written
entirely in Objective Caml.
</p>

    <p>PXP is the new name of the parser formerly known as "Markup".
PXP stands for "Polymorphic XML parser" and emphasizes its most useful
property: the API is polymorphic and can be configured so that
different objects are used to store different types of elements.</p>
  </sect1>

  <sect1>
    <title>Download</title>
    <p>
You can download <em>PXP</em> as a gzip'ed <a
href="&url.pxp-download;">tarball</a>. The parser needs the <a
href="&url.netstring-project;">Netstring</a> package (0.9.3). Note that PXP
requires O'Caml 3.00.
</p>
  </sect1>

  <sect1>
    <title>User's Manual</title>
    <p>
The manual is included in the distribution both as a PostScript document and
as a set of HTML files. An online version can be found <a
href="&url.pxp-manual;">here</a>.
</p>
  </sect1>

  <sect1>
    <title>Author, Credits, Copying</title>
    <p>
<em>PXP</em> has been written by &person.gps;; it contains contributions by
Claudio Sacerdoti Coen. You may copy it as you like, and you may use it even
for commercial purposes, as long as the license conditions are respected; see
the file LICENSE that comes with the distribution. The license allows
almost everything.
</p>

    <p>Thanks also to Alain Frisch and Haruo Hosoya for discussions and bug
reports.</p>
  </sect1>

  <sect1>
    <title>Description</title>
    <p>
<em>PXP</em> is a validating XML parser for <a
href="&url.ocaml;">O'Caml</a>. It strictly complies with the
<a href="&url.xml-spec;">XML-1.0</a> standard.
</p>

    <p>The parser is simple to call: usually a single statement (a function
call) is sufficient to parse an XML document and to represent it as an object
tree.</p>

    <p>
Once the document is parsed, it can be accessed using a class interface.
The interface allows arbitrary access including transformations. One of
the features of the document representation is its polymorphic nature;
it is simple to add custom methods to the document classes. Furthermore,
the parser can be configured such that different XML elements are represented
by objects created from different classes. This is a very powerful feature,
because it simplifies the structure of programs processing XML documents.
</p>

    <p>
Note that the class interface does not comply with the DOM standard.
Implementing a standard API was not a development goal (industrial developers
can do this much better than I); however, the API is powerful enough to be
considered equivalent to DOM. More importantly, the interface is compatible
with the XML information model required by many XML-related standards.
</p>

    <sect2>
      <title>Detailed feature list</title>

      <ul>
        <li><p>The XML instance is validated against the DTD; any violation of
a validity constraint leads to the rejection of the instance. The validator
has been carefully implemented, and conforms strictly to the standard. If
needed, it is also possible to run the parser in a well-formedness mode.</p>
        </li>
        <li><p>If possible, the validator applies a deterministic finite
automaton to validate the content models. In that case, validation is
performed in linear time. If a content model is not deterministic, however,
the parser falls back to a backtracking algorithm, which can be much slower.
It is also possible to reject non-deterministic content models.</p>
        </li>
        <li><p>In particular, the validator also checks the more intricate
rules: whether parentheses are properly nested with respect to entities, and
whether the standalone declaration is satisfied. On request, it also checks
whether IDREF attributes refer only to existing nodes.</p>
        </li>
        <li><p>Entity references are automatically resolved while the XML text
is being scanned. It is not possible to recognize in the object tree where a
referenced entity begins or ends; the object tree only represents the logical structure.</p>
        </li>
        <li><p>External entities are loaded using a configurable resolver
infrastructure. It is possible to connect the parser to an arbitrary XML source.</p>
        </li>
        <li><p>The parser can read XML text encoded in a variety of character
sets. Independent of this, it is possible to choose the encoding of the
internal representation of the tree nodes; the parser automatically converts
the input text to this encoding. Currently, the parser supports UTF-8 and
ISO-8859-1 as internal encodings.</p>
        </li>
        <li><p>The interface of the parser has been designed to integrate
well with the O'Caml language. The first goal was ease of use, which is
achieved by many convenience methods and functions, and by
allowing the user to select which parts of the XML text are actually
represented in the tree. For example, it is possible to store processing
instructions as tree nodes, but the parser can also be configured such that
these instructions are put into hash tables. The information model is compatible
with the requirements of XML-related standards such as XPath.</p>
        </li>
        <li><p>In particular, the node tree can optionally contain or leave out
processing instructions and comments. It is also possible to generate a "super
root" object which is the parent of the root element. The attributes of
elements are normally not stored as nodes, but it is possible to get them
wrapped into nodes.</p>
        </li>
        <li><p>There is also an interface for DTDs; you can parse and access
sequences of declarations. The declarations are fully represented as recursive
O'Caml values.
</p>
        </li>
      </ul>
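      <p>
A minimal sketch of what "deterministic" means for a content model, using the
hypothetical element names a, b, c and d (they do not come from the PXP
distribution). The first declaration is non-deterministic: after reading a "b",
a validator cannot know which of the two branches it is in without looking
ahead. The second declaration accepts exactly the same sequences, but the
choice is made only after the common "b", so it is the kind of model that a
DFA-based validator can handle:
</p>
      <p>
<code>
&lt;!-- non-deterministic: both alternatives start with b --&gt;
&lt;!ELEMENT a ((b, c) | (b, d))&gt;

&lt;!-- deterministic equivalent --&gt;
&lt;!ELEMENT a (b, (c | d))&gt;</code>
</p>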
    </sect2>


    <sect2>
      <title>Code examples</title>
      <p>
This distribution contains several examples:</p>
      <ul>
        <li><p>
<em>validate:</em> Simply parses a
document and prints all error messages.
</p></li>

        <li><p>
<em>readme:</em> Defines a DTD for simple "README"-like documents, and offers
conversion to HTML and text files<footnote>This particular document is an
example of this DTD!</footnote>.
</p></li>

        <li><p>
<em>xmlforms:</em> This is a more
sophisticated application that uses XML as a style sheet language and data
storage format. It shows how a Tk user interface can be configured by an
XML style, and how data records can be stored using XML.
</p></li>
      </ul>
    </sect2>

    <sect2>
      <title>Restrictions and missing features</title>
      <p>
The following restrictions apply; they are not violations of the standard:
</p>
      <ul>
        <li><p>
The attributes "xml:space" and "xml:lang" receive no special treatment.
  (The application can handle them itself.)</p></li>

        <li><p>
The built-in support for SYSTEM and PUBLIC identifiers is limited to
  local file access. There is no support for catalogs. The parser offers
  a hook to add missing features.</p></li>

        <li><p>
It is currently not possible to check for interoperability with SGML.
</p></li>
      </ul>

<p>The following features are also missing:</p>
      <ul>
        <li><p>There is no special support for namespaces. (Perhaps in the next release?)</p>
        </li>
        <li><p>There is no support for XPath or XSLT.</p>
        </li>
      </ul>
<p>However, I hope that these features will be implemented soon, either by
myself or by contributors (who are invited to do so).</p>
    </sect2>

    <sect2>
      <title>Recent Changes</title>
      <ul>
        <li>
          <p>Changed in 1.0:</p>
          <p>Support for document order.</p>
        </li>
        <li>
          <p>Changed in 0.99.8:</p>
          <p>Several fixes of bugs reported by Haruo Hosoya and Alain
Frisch.</p>
          <p>The class type "node" has been extended: you can go directly to
the next and previous nodes in the list; you can refer to nodes by
position.</p>
          <p>There are now some iterators for nodes: find, find_all,
find_element, find_all_elements, map_tree, iter_tree.</p>
          <p>Experimental support for viewing attributes as nodes; I hope that
this helps Alain write his XPath evaluator.</p>
          <p>The user's manual has been revised and is almost up to date.</p>
        </li>
        <li>
          <p>Changed in 0.99.7:</p>
          <p>There are now additional node types T_super_root, T_pinstr and
T_comment, and the parser is able to create the corresponding nodes.</p>
          <p>The functions for character set conversion have been moved to
the Netstring package; they are not specific to XML.</p>
        </li>
        <li>
          <p>Changed in 0.99.6:</p>
          <p>Implemented a check on deterministic content models. Added
an alternative validator based on a DFA. This means that all mandatory
features for an XML-1.0 parser are now implemented! The parser is now
substantially complete.</p>
        </li>
        <li>
          <p>Changed in 0.99.5:</p>
          <p>The handling of ID and IDREF attributes has changed. The
index of nodes containing an ID attribute is now separated from the document.
Optionally the parser now checks whether the IDREF attributes refer to
existing elements.</p>
          <p>The element nodes can optionally store the location in the
source XML code.</p>
          <p>The method 'write' writes the XML tree in every supported
encoding. (Successor of 'write_compact_as_latin1'.)</p>
          <p>Several smaller changes and fixes.</p>
        </li>
        <li>
          <p>Changed in 0.99.4:</p>
          <p>The module Pxp_reader has been modernized. The resolver classes
are simpler to use. There is now support for URLs.</p>
          <p>The interface of Pxp_yacc has been improved: The type 'source'
is now simpler. The type 'domspec' is gone; the new 'spec' is opaque and
performs better. There are some new parsing modes.</p>
          <p>Many smaller changes.</p>
        </li>
        <li>
          <p>Changed in 0.99.3:</p>
          <p>The markup_* modules have been renamed to pxp_*. There is a new
compatibility API that tries to be compatible with markup-0.2.10.</p>
          <p>The type "encoding" is now a polymorphic variant.</p>
        </li>
        <li>
          <p>Changed in 0.99.2:</p>
          <p>Added checks for the constraints about the standalone
declaration.</p>
          <p>Added regression tests about attribute normalization,
attribute checks, standalone checks.</p>
          <p>Fixed some minor errors of the attribute normalization
function.</p>
          <p>The bytecode/native archives are now separated into
a general part, an ISO-8859-1-relevant part, and a UTF-8-relevant
part. The parser can again be compiled with ocamlopt.</p>
        </li>
        <li>
          <p>Changed in 0.99.1:</p>
          <p>In general, this release is an early pre-release of the
next stable version 1.00. I do not recommend using it for serious
work; it is still very experimental!</p>
          <p>The core of the parser has been rewritten using a self-written
parser generator.</p>
          <p>The lexer has been restructured, and can now handle UTF-8
encoded files.</p>
          <p>Numerous other changes.</p>
        </li>

<!--
        <li>
          <p>Changed in 0.2.10:</p>
          <p>Bugfix: in the "allow_undeclared_attributes" feature.</p>
          <p>Bugfix: in the methods write_compact_as_latin1.</p>
          <p>Improvement: The code produced by the codewriter module can be
compiled faster and with less memory usage.</p>
        </li>

        <li>
          <p>Changed in 0.2.9:</p>
          <p>New: The module Markup_codewriter generates for a given XML
tree O'Caml code that creates the same XML tree. This is useful for
applications which use large, constant XML trees.</p>
          <p>New: Documents and DTDs have a method write_compact_as_latin1
that writes an XML tree to a buffer or to a channel. (But it is not a pretty
printer...)</p>
          <p>Enhancement: If a DTD contains the processing instruction
<code>
&lt;?xml:allow_undeclared_attributes x?&gt;</code>
where "x" is the name of an already declared element, it is allowed that
instances of this element type have attributes that have not been declared.
</p>
          <p>New function Markup_types.string_of_exn that converts an
exception from Markup into a readable string.</p>
          <p>Change: The module Markup_reader contains all resolvers.
The resolver API is now stable.</p>
          <p>New parser modes processing_instructions_inline and
virtual_root that help locating processing instructions exactly (if needed).
</p>
          <p>Many bugs regarding CRLF handling have been fixed.</p>
          <p>The distributed tarball now contains the regression test suite.
</p>
          <p>The manual has been extended (but it is still incomplete and
still behind the code).</p>
        </li>
        <li>
          <p>Changed in 0.2.8:</p>
          <p>A bit more documentation (Markup_yacc).</p>
          <p>Bugfix: In previous versions, the second attempt to refer to
an entity caused a Bad_character_stream exception. The reason was improper
re-initialization of the resolver object.</p>
        </li>
        <li>
          <p>Changed in 0.2.7:</p>
          <p>Added some methods in Markup_document.</p>
          <p>Bugfix: in method orphaned_clone</p>
        </li>
        <li>
          <p>Changed in 0.2.6:</p>
          <p>Enhancement: The config parameter has a new component
"errors_with_line_numbers". If "true", error exceptions come with line numbers
(the default; and the only option in the previous versions); if "false"
the line numbers are left out (only character positions). The parser is 10 to
20 percent faster if the lines are not tracked.</p>
          <p>Enhancement: If a DTD contains the processing instruction
<code>
&lt;?xml:allow_undeclared_elements_and_notations?&gt;</code>
it is allowed that
elements and notations are undeclared. However, the elements for which
declarations exist are still validated. The main effect is that the
keyword ALL in element declarations means that also undeclared elements
are permitted at this location.</p>
          <p>Bugfix in method "set_nodes" of class Markup_document.node_impl.
</p>
        </li>
        <li>
          <p>Changed in 0.2.5:</p>
          <p>If the XML source is a string (i.e. Latin1 some_string is passed
to the parser functions as source), resolving did not work properly in
previous releases. This is now fixed.
</p>
        </li>
        <li>
          <p>Changed in 0.2.4:</p>
          <p>A problem with some kind of DTD that does not specify the name
of the root element was fixed. As a result, the "xmlforms" application works
again. Again thanks to Haruo.</p>
          <p>According to the XML spec, parameter entities must not be
referenced within the internal subset unless the referenced text is a
complete declaration itself. This is checked, but the check was too strict;
the rule was enforced even in external entities referenced from the internal
subset. This has been corrected; in external entities it is now possible
to use parameter entities in an unrestricted way.
</p>
        </li>
        <li>
          <p>Changed in 0.2.3:</p>
          <p>A fix for a problem when installing Markup on Solaris.
Haruo detected the problem.</p>
        </li>
        <li>
          <p>Changed in 0.2.2:</p>
          <p>A single bugfix: The parser did not reject documents where the
root element was not the element declared as root element. Again thanks
to Claudio.</p>
        </li>
        <li>
          <p>Changed in 0.2.1:</p>
          <p>A single bugfix which reduces the number of warnings. Thanks
to Claudio for detecting the bug.</p>
        </li>
        <li>
          <p>Changed in 0.2:</p>
          <p>
Many more constraints are checked in the 0.2 release than in 0.1. In
particular, it is now guaranteed that entities are properly nested; parsed
entities now always match the corresponding production of the grammar.</p>
          <p>
Many weak checks have been turned into strong checks. For example, it is now
detected if the "version", "encoding", and "standalone" attributes of an XML
declaration are ordered in the right way.
</p>
          <p>
The error messages have been improved.
</p>
        </li>
-->
      </ul>
    </sect2>
  </sect1>
</readme>