******************************************************************************
README - PXP, the XML parser for O'Caml
******************************************************************************


==============================================================================
Abstract
==============================================================================

PXP is a validating parser for XML-1.0 which has been written entirely in 
Objective Caml. 

PXP is the new name of the parser formerly known as "Markup". PXP means 
"Polymorphic XML parser" and emphasizes its most useful property: that the API 
is polymorphic and can be configured such that different objects are used to 
store different types of elements.

==============================================================================
Download
==============================================================================

You can download PXP as a gzip'ed tarball [1]. The parser needs the Netstring 
[2] package (0.9.3). Note that PXP requires O'Caml 3.00. 

==============================================================================
User's Manual
==============================================================================

The manual is included in the distribution both as a PostScript document and 
as a set of HTML files. An online version can be found here [3]. 

==============================================================================
Author, Credits, Copying
==============================================================================

PXP has been written by Gerd Stolpmann [4]; it contains contributions by 
Claudio Sacerdoti Coen. You may copy it as you like, and you may even use it 
for commercial purposes, as long as the license conditions are respected; see 
the file LICENSE that comes with the distribution. It allows almost everything. 

Thanks also to Alain Frisch and Haruo Hosoya for discussions and bug reports.

==============================================================================
Description
==============================================================================

PXP is a validating XML parser for O'Caml [5]. It strictly complies with the 
XML-1.0 [6] standard. 

The parser is simple to call: usually a single statement (a function call) is 
sufficient to parse an XML document and to represent it as an object tree.

Once the document is parsed, it can be accessed using a class interface. The 
interface allows arbitrary access including transformations. One of the 
features of the document representation is its polymorphic nature; it is simple 
to add custom methods to the document classes. Furthermore, the parser can be 
configured such that different XML elements are represented by objects created 
from different classes. This is a very powerful feature, because it simplifies 
the structure of programs processing XML documents. 
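
The idea of representing different elements by objects of different classes 
can be illustrated with a minimal, self-contained sketch. It does not use PXP 
itself: the class type node, the classes generic_node and title_node, the 
registry table and the function make_node are all hypothetical names invented 
for this illustration, and the real PXP interface is different.

```ocaml
(* Illustrative sketch only: a toy model of choosing a class per element
   type. None of these names belong to the PXP library. *)

(* Common interface shared by every node object. *)
class type node = object
  method name : string
  method describe : string
end

(* Fallback class used for elements without a registered class. *)
class generic_node (n : string) : node = object
  method name = n
  method describe = "generic element <" ^ n ^ ">"
end

(* Custom class with behaviour specific to one element type. *)
class title_node (n : string) : node = object
  method name = n
  method describe = "a title element"
end

(* Registry mapping element names to constructor functions. *)
let registry : (string, string -> node) Hashtbl.t = Hashtbl.create 16

let () = Hashtbl.add registry "title" (fun n -> new title_node n)

(* Create the object for an element, falling back to the generic class. *)
let make_node (name : string) : node =
  match Hashtbl.find_opt registry name with
  | Some make -> make name
  | None -> new generic_node name

let () =
  List.iter
    (fun n -> print_endline ((make_node n)#describe))
    ["title"; "para"]
```

Code that processes the tree can then call describe on any node without 
testing the element name first; the dispatch happens when the object is 
created.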

Note that the class interface does not comply with the DOM standard. It was 
not a development goal to realize a standard API (industrial developers can do 
this much better than I); however, the API is powerful enough to be considered 
equivalent to DOM. More importantly, the interface is compatible with the XML 
information model required by many XML-related standards. 

------------------------------------------------------------------------------
Detailed feature list
------------------------------------------------------------------------------

-  The XML instance is validated against the DTD; any violation of a validation 
   constraint leads to the rejection of the instance. The validator has been 
   carefully implemented, and conforms strictly to the standard. If needed, it 
   is also possible to run the parser in a well-formedness mode.
   
-  If possible, the validator applies a deterministic finite automaton to 
   validate the content models. In this case, validation is performed in 
   linear time. However, if a content model is not deterministic, the parser 
   uses a backtracking algorithm, which can be much slower. It is also 
   possible to reject non-deterministic content models.
   
-  In particular, the validator also checks the complicated rules about 
   whether parentheses are properly nested with respect to entities, and 
   whether the standalone declaration is satisfied. On demand, it also checks 
   whether IDREF attributes refer only to existing nodes.
   
-  Entity references are automatically resolved while the XML text is being 
   scanned. It is not possible to recognize in the object tree where a 
   referenced entity begins or ends; the object tree only represents the 
   logical structure.
   
-  External entities are loaded using a configurable resolver infrastructure. 
   It is possible to connect the parser with an arbitrary XML source.
   
-  The parser can read XML text encoded in a variety of character sets. 
   Independent of this, it is possible to choose the encoding of the internal 
   representation of the tree nodes; the parser automatically converts the 
   input text to this encoding. Currently, the parser supports UTF-8 and 
   ISO-8859-1 as internal encodings.
   
-  The interface of the parser has been designed to integrate well with the 
   O'Caml language. The first goal was simplicity of use, which is achieved by 
   many convenience methods and functions, and by allowing the user to select 
   which parts of the XML text are actually represented in the tree. For 
   example, it is possible to store processing instructions as tree nodes, but 
   the parser can also be configured such that these instructions are put into 
   hashtables. The information model is compatible with the requirements of 
   XML-related standards such as XPath.
   
-  In particular, the node tree can optionally contain or leave out processing 
   instructions and comments. It is also possible to generate a "super root" 
   object which is the parent of the root element. The attributes of elements 
   are normally not stored as nodes, but it is possible to get them wrapped 
   into nodes.
   
-  There is also an interface for DTDs; you can parse and access sequences of 
   declarations. The declarations are fully represented as recursive O'Caml 
   values. 
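
The linear-time claim for deterministic content models can be made concrete 
with a small sketch. It hard-codes a DFA for the hypothetical content model 
(title, para*) and runs it over the names of the child elements. This is not 
PXP's implementation: the transition table, the state numbers and the function 
validate are assumptions made only for this example.

```ocaml
(* Illustrative sketch only, not the validator used by PXP.
   Hand-built DFA for a content model consisting of one title element
   followed by zero or more para elements. *)

(* Transition table: (state, child element name) -> next state. *)
let transitions : ((int * string) * int) list = [
  ((0, "title"), 1);   (* the mandatory title comes first *)
  ((1, "para"), 1);    (* then any number of para elements *)
]

(* States in which the list of children may legally end. *)
let accepting = [1]

(* Run the DFA over the child element names. Each child triggers exactly
   one lookup in the fixed table, so the run is linear in the number of
   children and never backtracks. *)
let validate (children : string list) : bool =
  let rec go state = function
    | [] -> List.mem state accepting
    | c :: rest ->
      (match List.assoc_opt (state, c) transitions with
       | Some next -> go next rest
       | None -> false)
  in
  go 0 children

let () =
  List.iter
    (fun kids ->
       Printf.printf "[%s] -> %b\n" (String.concat "; " kids) (validate kids))
    [ ["title"; "para"; "para"]; ["para"; "title"] ]
```

A non-deterministic model such as ((a, b) | (a, c)) cannot be handled this 
way without further work, because after reading "a" the automaton would not 
know which branch it is in; this is the situation in which a backtracking 
validator is needed.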
   
------------------------------------------------------------------------------
Code examples
------------------------------------------------------------------------------

This distribution contains several examples:

-  validate: simply parses a document and prints all error messages 
   
-  readme: Defines a DTD for simple "README"-like documents, and offers 
   conversion to HTML and text files [7]. 
   
-  xmlforms: This is already a sophisticated application that uses XML as style 
   sheet language and data storage format. It shows how a Tk user interface can 
   be configured by an XML style, and how data records can be stored using XML. 
   
------------------------------------------------------------------------------
Restrictions and missing features
------------------------------------------------------------------------------

The following restrictions apply; they are not violations of the standard: 

-  The attributes "xml:space" and "xml:lang" are not given special treatment. 
   (The application can handle them.)
   
-  The built-in support for SYSTEM and PUBLIC identifiers is limited to local 
   file access. There is no support for catalogs. The parser offers a hook to 
   add missing features.
   
-  It is currently not possible to check for interoperability with SGML. 
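
As a rough idea of how an application could handle "xml:lang" on its own, the 
following sketch walks a toy element tree and passes the effective language 
down to the children. The record type element and the function iter_with_lang 
are hypothetical and exist only for this illustration; they are not part of 
PXP, and a real application would work on the node objects produced by the 
parser instead.

```ocaml
(* Illustrative sketch only: a toy element type, not one of the PXP
   node classes. *)
type element = {
  name : string;
  attributes : (string * string) list;
  children : element list;
}

(* Visit every element together with its effective language. An xml:lang
   attribute on the element itself overrides the value inherited from the
   parent element. *)
let rec iter_with_lang
    (f : element -> string option -> unit)
    (inherited : string option)
    (e : element) : unit =
  let lang =
    match List.assoc_opt "xml:lang" e.attributes with
    | Some l -> Some l
    | None -> inherited
  in
  f e lang;
  List.iter (iter_with_lang f lang) e.children

let doc = {
  name = "doc";
  attributes = [ ("xml:lang", "en") ];
  children = [
    { name = "para"; attributes = []; children = [] };
    { name = "para"; attributes = [ ("xml:lang", "de") ]; children = [] };
  ];
}

let () =
  iter_with_lang
    (fun e lang ->
       Printf.printf "%s: %s\n" e.name
         (match lang with Some l -> l | None -> "(none)"))
    None doc
```

The same traversal pattern would work for "xml:space", with the inherited 
value being the whitespace mode instead of the language.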
   
The following features are also missing:

-  There is no special support for namespaces. (Perhaps in the next release?)
   
-  There is no support for XPath or XSLT.
   
However, I hope that these features will be implemented soon, either by myself 
or by contributors (who are invited to do so).

------------------------------------------------------------------------------
Recent Changes
------------------------------------------------------------------------------

-  Changed in 1.0:
   Support for document order.
   
-  Changed in 0.99.8:
   Several fixes of bugs reported by Haruo Hosoya and Alain Frisch.
   The class type "node" has been extended: you can go directly to the next and 
   previous nodes in the list; you can refer to nodes by position.
   There are now some iterators for nodes: find, find_all, find_element, 
   find_all_elements, map_tree, iter_tree.
   Experimental support for viewing attributes as nodes; I hope that helps 
   Alain write his XPath evaluator.
   The user's manual has been revised and is almost up to date.
   
-  Changed in 0.99.7:
   There are now additional node types T_super_root, T_pinstr and T_comment, 
   and the parser is able to create the corresponding nodes.
   The functions for character set conversion have been moved to the Netstring 
   package; they are not specific to XML.
   
-  Changed in 0.99.6:
   Implemented a check on deterministic content models. Added an alternate 
   validator based on a DFA. This means that all mandatory features for an 
   XML-1.0 parser are now implemented! The parser is substantially complete.
   
-  Changed in 0.99.5:
   The handling of ID and IDREF attributes has changed. The index of nodes 
   containing an ID attribute is now separated from the document. Optionally 
   the parser now checks whether the IDREF attributes refer to existing 
   elements.
   The element nodes can optionally store the location in the source XML code.
   The method 'write' writes the XML tree in every supported encoding. 
   (Successor of 'write_compact_as_latin1'.)
   Several smaller changes and fixes.
   
-  Changed in 0.99.4:
   The module Pxp_reader has been modernized. The resolver classes are simpler 
   to use. There is now support for URLs.
   The interface of Pxp_yacc has been improved: The type 'source' is now 
   simpler. The type 'domspec' has gone; the new 'spec' is opaque and performs 
   better. There are some new parsing modes.
   Many smaller changes.
   
-  Changed in 0.99.3:
   The markup_* modules have been renamed to pxp_*. There is a new 
   compatibility API that tries to be compatible with markup-0.2.10.
   The type "encoding" is now a polymorphic variant.
   
-  Changed in 0.99.2:
   Added checks for the constraints about the standalone declaration.
   Added regression tests about attribute normalization, attribute checks, 
   standalone checks.
   Fixed some minor errors of the attribute normalization function.
   The bytecode/native archives are now split into a general part, an 
   ISO-8859-1-relevant part, and a UTF-8-relevant part. The parser can again be 
   compiled with ocamlopt.
   
-  Changed in 0.99.1:
   In general, this release is an early pre-release of the next stable version 
   1.00. I do not recommend using it for serious work; it is still very 
   experimental!
   The core of the parser has been rewritten using a self-written parser 
   generator.
   The lexer has been restructured, and can now handle UTF-8 encoded files.
   Numerous other changes.
   

--------------------------

[1]   see http://www.ocaml-programming.de/packages/pxp-1.0.tar.gz

[2]   see http://www.ocaml-programming.de/packages/documentation/netstring

[3]   see http://www.ocaml-programming.de/packages/documentation/pxp/manual

[4]   see mailto:gerd@gerd-stolpmann.de

[5]   see http://caml.inria.fr/

[6]   see http://www.w3.org/TR/1998/REC-xml-19980210.html

[7]   This particular document is an example of this DTD!