   1 ******************************************************************************
   2 README - PXP, the XML parser for O'Caml
   3 ******************************************************************************
   4
   5
   6 ==============================================================================
   7 Abstract
   8 ==============================================================================
   9
  10 PXP is a validating parser for XML-1.0 which has been written entirely in
  11 Objective Caml.
  12
  13 PXP is the new name of the parser formerly known as "Markup". PXP means
  14 "Polymorphic XML parser" and emphasizes its most useful property: that the API
  15 is polymorphic and can be configured such that different objects are used to
  16 store different types of elements.
  17
  18 ==============================================================================
  19 Download
  20 ==============================================================================
  21
  22 You can download PXP as a gzip'ed tarball [1]. The parser needs the Netstring
  23 [2] package (version 0.9.3). Note that PXP requires O'Caml 3.00.
  24
  25 ==============================================================================
  26 User's Manual
  27 ==============================================================================
  28
  29 The manual is included in the distribution both as a PostScript document and
  30 as a set of HTML files. An online version can be found here [3].
  31
  32 ==============================================================================
  33 Author, Credits, Copying
  34 ==============================================================================
  35
  36 PXP has been written by Gerd Stolpmann [4]; it contains contributions by
  37 Claudio Sacerdoti Coen. You may copy it as you like, and you may use it even
  38 for commercial purposes, as long as the license conditions are respected; see
  39 the file LICENSE that comes with the distribution. It allows almost everything.
  40
  41 Thanks also to Alain Frisch and Haruo Hosoya for discussions and bug reports.
  42
  43 ==============================================================================
  44 Description
  45 ==============================================================================
  46
  47 PXP is a validating XML parser for O'Caml [5]. It strictly complies with the
  48 XML-1.0 [6] standard.
  49
  50 The parser is simple to call: usually a single statement (a function call) is
  51 sufficient to parse an XML document and to represent it as an object tree.
  52
  53 Once the document is parsed, it can be accessed using a class interface. The
  54 interface allows arbitrary access including transformations. One of the
  55 features of the document representation is its polymorphic nature; it is simple
  56 to add custom methods to the document classes. Furthermore, the parser can be
  57 configured such that different XML elements are represented by objects created
  58 from different classes. This is a very powerful feature, because it simplifies
  59 the structure of programs processing XML documents.
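
The idea of representing different elements by objects of different classes
can be illustrated with a small, self-contained sketch. It does not use the
PXP API: the classes generic_element, paragraph and section, the registry
table and the create function are hypothetical names chosen for this
illustration. Element names are looked up in a mapping, and a default class
is used for names that have no special class.

```ocaml
(* Sketch only: none of these names come from PXP. *)

(* Default representation used for any element type. *)
class generic_element (name : string) =
  object
    method name = name
    method describe = "<" ^ name ^ ">: generic element"
  end

(* Specialised classes override the custom method. *)
class paragraph =
  object
    inherit generic_element "p"
    method! describe = "<p>: a paragraph of text"
  end

class section =
  object
    inherit generic_element "sect"
    method! describe = "<sect>: a section containing paragraphs"
  end

(* Mapping from element names to constructors. *)
let registry : (string, unit -> generic_element) Hashtbl.t = Hashtbl.create 7

let () =
  Hashtbl.add registry "p" (fun () -> (new paragraph :> generic_element));
  Hashtbl.add registry "sect" (fun () -> (new section :> generic_element))

(* Create the object for an element, falling back to the default class. *)
let create (name : string) : generic_element =
  match Hashtbl.find_opt registry name with
  | Some make -> make ()
  | None -> new generic_element name

let () =
  List.iter
    (fun n -> print_endline (create n)#describe)
    [ "sect"; "p"; "em" ]
```

Code that processes the tree can then call describe on any node without
testing the element name first, which is the kind of simplification the
paragraph above refers to.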
  60
  61 Note that the class interface does not comply with the DOM standard. It was not
  62 a development goal to implement a standard API (industrial developers can do
  63 this much better than I); however, the API is powerful enough to be considered
  64 equivalent to DOM. More importantly, the interface is compatible with the XML
  65 information model required by many XML-related standards.
  66
  67 ------------------------------------------------------------------------------
  68 Detailed feature list
  69 ------------------------------------------------------------------------------
  70
  71 -  The XML instance is validated against the DTD; any violation of a validation
  72    constraint leads to the rejection of the instance. The validator has been
  73    carefully implemented, and conforms strictly to the standard. If needed, it
  74    is also possible to run the parser in a well-formedness mode.
  75
  76 -  If possible, the validator applies a deterministic finite automaton to
  77    validate the content models. In that case, validation is performed in
  78    linear time. However, if the content models are not deterministic, the
  79    parser falls back to a backtracking algorithm, which can be much slower.
  80    It is also possible to make the parser reject non-deterministic content
  81    models.
  82
  83 -  In particular, the validator also checks the complicated rules on whether
  84    parentheses are properly nested with respect to entities, and whether the
  85    standalone declaration is satisfied. On request, it also checks whether
  86    IDREF attributes refer only to existing nodes.
  87
  88 -  Entity references are automatically resolved while the XML text is being
  89    scanned. It is not possible to recognize in the object tree where a
  90    referenced entity begins or ends; the object tree only represents the
  91    logical structure.
  92
  93 -  External entities are loaded using a configurable resolver infrastructure.
  94    It is possible to connect the parser with an arbitrary XML source.
  95
  96 -  The parser can read XML text encoded in a variety of character sets.
  97    Independent of this, it is possible to choose the encoding of the internal
  98    representation of the tree nodes; the parser automatically converts the
  99    input text to this encoding. Currently, the parser supports UTF-8 and
 100    ISO-8859-1 as internal encodings.
 101
 102 -  The interface of the parser has been designed to integrate well with the
 103    O'Caml language. The first goal was ease of use, which is achieved by
 104    many convenience methods and functions, and by allowing the user to
 105    select which parts of the XML text are actually represented in the tree.
 106    For example, it is possible to store processing instructions as tree
 107    nodes, but the parser can also be configured such that these
 108    instructions are put into hashtables. The information model is compatible
 109    with the requirements of XML-related standards such as XPath.
 110
 111 -  In particular, the node tree can optionally contain or leave out processing
 112    instructions and comments. It is also possible to generate a "super root"
 113    object which is the parent of the root element. The attributes of elements
 114    are normally not stored as nodes, but it is possible to get them wrapped
 115    into nodes.
 116
 117 -  There is also an interface for DTDs; you can parse and access sequences of
 118    declarations. The declarations are fully represented as recursive O'Caml
 119    values.
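
The feature list mentions that non-deterministic content models are
validated by backtracking. The following is a minimal sketch of what such a
matcher can look like in general. It is not PXP's implementation: the model
type and the functions run and valid are assumptions made for this example,
and child elements are reduced to a list of element names.

```ocaml
(* Sketch only: a toy content-model type, not PXP's representation. *)
type model =
  | Elem of string          (* a child element with this name *)
  | Seq of model list       (* (a, b, c) *)
  | Choice of model list    (* (a | b | c) *)
  | Opt of model            (* m? *)
  | Star of model           (* m* *)
  | Plus of model           (* m+ *)

(* [run m input k] tries every way of matching a prefix of [input] against
   [m] and passes the remaining names to the continuation [k]. Returning
   [false] from [k] makes the matcher backtrack to the next alternative. *)
let rec run (m : model) (input : string list) (k : string list -> bool) : bool =
  match m with
  | Elem name ->
      (match input with
       | x :: rest when x = name -> k rest
       | _ -> false)
  | Seq [] -> k input
  | Seq (m1 :: ms) -> run m1 input (fun rest -> run (Seq ms) rest k)
  | Choice ms -> List.exists (fun alt -> run alt input k) ms
  | Opt m1 -> run m1 input k || k input
  | Plus m1 -> run (Seq [ m1; Star m1 ]) input k
  | Star m1 ->
      (* Only iterate again if the body consumed something; this guards
         against looping forever on bodies that match the empty sequence. *)
      run m1 input (fun rest ->
          List.length rest < List.length input && run (Star m1) rest k)
      || k input

(* The children are valid if the whole list is consumed. *)
let valid (m : model) (children : string list) : bool =
  run m children (fun rest -> rest = [])

let () =
  (* ((a, b) | (a, c)) is not deterministic: after reading "a" the matcher
     cannot know which branch applies, so it may have to try both. *)
  let ambiguous =
    Choice [ Seq [ Elem "a"; Elem "b" ]; Seq [ Elem "a"; Elem "c" ] ]
  in
  Printf.printf "%b %b\n"
    (valid ambiguous [ "a"; "c" ])
    (valid ambiguous [ "a"; "d" ])
```

Because every alternative may be retried, the running time of such a matcher
can grow quickly with the length of the input. An automaton-based validator
avoids this cost for deterministic models.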
 120
 121 ------------------------------------------------------------------------------
 122 Code examples
 123 ------------------------------------------------------------------------------
 124
 125 This distribution contains several examples:
 126
 127 -  validate: simply parses a document and prints all error messages
 128
 129 -  readme: Defines a DTD for simple "README"-like documents, and offers
 130    conversion to HTML and text files [7].
 131
 132 -  xmlforms: This is already a sophisticated application that uses XML as style
 133    sheet language and data storage format. It shows how a Tk user interface can
 134    be configured by an XML style, and how data records can be stored using XML.
 135
 136 ------------------------------------------------------------------------------
 137 Restrictions and missing features
 138 ------------------------------------------------------------------------------
 139
 140 The following restrictions apply; they are not violations of the standard:
 141
 142 -  The attributes "xml:space" and "xml:lang" are not treated specially. (The
 143    application can handle them itself.)
 144
 145 -  The built-in support for SYSTEM and PUBLIC identifiers is limited to local
 146    file access. There is no support for catalogs. The parser offers a hook to
 147    add missing features.
 148
 149 -  It is currently not possible to check for interoperability with SGML.
 150
 151 The following features are also missing:
 152
 153 -  There is no special support for namespaces. (Perhaps in the next release?)
 154
 155 -  There is no support for XPath or XSLT.
 156
 157 However, I hope that these features will be implemented soon, either by myself
 158 or by contributors (who are invited to do so).
 159
 160 ------------------------------------------------------------------------------
 161 Recent Changes
 162 ------------------------------------------------------------------------------
 163
 164 -  Changed in 1.0:
 165    Support for document order.
 166
 167 -  Changed in 0.99.8:
 168    Several fixes of bugs reported by Haruo Hosoya and Alain Frisch.
 169    The class type "node" has been extended: you can go directly to the next and
 170    previous nodes in the list; you can refer to nodes by position.
 171    There are now some iterators for nodes: find, find_all, find_element,
 172    find_all_elements, map_tree, iter_tree.
 173    Experimental support for viewing attributes as nodes; I hope that helps
 174    Alain write his XPath evaluator.
 175    The user's manual has been revised and is almost up to date.
 176
 177 -  Changed in 0.99.7:
 178    There are now additional node types T_super_root, T_pinstr and T_comment,
 179    and the parser is able to create the corresponding nodes.
 180    The functions for character set conversion have been moved to the Netstring
 181    package; they are not specific to XML.
 182
 183 -  Changed in 0.99.6:
 184    Implemented a check on deterministic content models. Added an alternate
 185    validator based on a DFA. This means that now all mandatory features for
 186    an XML-1.0 parser are implemented! The parser is now substantially complete.
 187
 188 -  Changed in 0.99.5:
 189    The handling of ID and IDREF attributes has changed. The index of nodes
 190    containing an ID attribute is now separated from the document. Optionally
 191    the parser now checks whether the IDREF attributes refer to existing
 192    elements.
 193    The element nodes can optionally store the location in the source XML code.
 194    The method 'write' writes the XML tree in every supported encoding.
 195    (Successor of 'write_compact_as_latin1'.)
 196    Several smaller changes and fixes.
 197
 198 -  Changed in 0.99.4:
 199    The module Pxp_reader has been modernized. The resolver classes are simpler
 200    to use. There is now support for URLs.
 201    The interface of Pxp_yacc has been improved: The type 'source' is now
 202    simpler. The type 'domspec' is gone; the new 'spec' is opaque and performs
 203    better. There are some new parsing modes.
 204    Many smaller changes.
 205
 206 -  Changed in 0.99.3:
 207    The markup_* modules have been renamed to pxp_*. There is a new
 208    compatibility API that tries to be compatible with markup-0.2.10.
 209    The type "encoding" is now a polymorphic variant.
 210
 211 -  Changed in 0.99.2:
 212    Added checks for the constraints about the standalone declaration.
 213    Added regression tests about attribute normalization, attribute checks,
 214    standalone checks.
 215    Fixed some minor errors of the attribute normalization function.
 216    The bytecode/native archives are now split into a general part, an
 217    ISO-8859-1-specific part, and a UTF-8-specific part. The parser can again be
 218    compiled with ocamlopt.
 219
 220 -  Changed in 0.99.1:
 221    In general, this release is an early pre-release of the next stable version
 222    1.00. I do not recommend using it for serious work; it is still very
 223    experimental!
 224    The core of the parser has been rewritten using a self-written parser
 225    generator.
 226    The lexer has been restructured, and can now handle UTF-8 encoded files.
 227    Numerous other changes.
 228
 229
 230 --------------------------
 231
 232 [1]   see http://www.ocaml-programming.de/packages/pxp-1.0.tar.gz
 233
 234 [2]   see http://www.ocaml-programming.de/packages/documentation/netstring
 235
 236 [3]   see http://www.ocaml-programming.de/packages/documentation/pxp/manual
 237
 238 [4]   see mailto:gerd@gerd-stolpmann.de
 239
 240 [5]   see http://caml.inria.fr/
 241
 242 [6]   see http://www.w3.org/TR/1998/REC-xml-19980210.html
 243
 244 [7]   This particular document is an example of this DTD!
 245
 246
 247