Initial revision

[helm.git] / helm / DEVEL / pxp / pxp / doc / design.txt
diff --git a/helm/DEVEL/pxp/pxp/doc/design.txt b/helm/DEVEL/pxp/pxp/doc/design.txt

new file mode 100644 (file)

index 0000000..bf75d06
--- /dev/null
+++ b/helm/DEVEL/pxp/pxp/doc/design.txt
@@ -0,0 +1,340 @@
+------------------------------------------------ -*- indented-text -*-
+Some Notes About the Design:
+----------------------------------------------------------------------
+
+----------------------------------------------------------------------
+Compilation
+----------------------------------------------------------------------
+
+Compilation is non-trivial because:
+
+ - The lexer and parser generators ocamlllex resp. ocamlyacc normally
+   create code such that the parser module precedes the lexer module.
+   THIS design requires that the lexer layer precedes the entity layer
+   which precedes the parser layer, because the parsing results modify
+   the behaviour of the lexer and entity layers. There is no way to get
+   around this because of the nature of XML.
+
+   So the dependency relation of the lexer and the parser is modified;
+   in particular the "token" type that is normally defined by the 
+   generated parser is moved to a common prdecessor of both lexer
+   and parser.
+
+ - Another modification of the standard way of handling parsers is that
+   the parser is turned into an object. This is necessary because the
+   whole parser is polymorphic, i.e. there is a type parameter (the
+   type of the node extension).
+
+......................................................................
+
+First some modules are generated as illustrated by the following
+diagram:
+
+
+                                    markup_yacc.mly
+                              |        |
+                              \|/      \|/  [ocamlyacc, 1]
+                               V               V
+                            markup_yacc.mli  markup_yacc.ml
+                                 |             --> renamed into markup_yacc.ml0
+           [awk, 2]     \|/                        |
+                         V                        \|/   [sed, 3]
+              markup_yacc_token.mlf                V
+                      |       |              markup_yacc.ml
+ markup_lexer_types_   |       |
+ shadow.mli  |                |       | markup_lexer_types_
+           \|/ [sed, \|/      | shadow.ml
+            V    4]   V       |     |
+       markup_lexer_types.mli |     |   [sed, 4]
+                             \|/   \|/
+                              V     V
+                       markup_lexer_types.ml
+
+
+                                  markup_yacc_shadow.mli
+                                          |
+                                 \|/  [replaces, 5]
+                                  V
+                             markup_yacc.mli
+
+
+
+                          markup_lexers.mll
+                                          |
+                                 \|/  [ocamllex, 6]
+                                  V
+                           markup_lexers.ml
+
+
+Notes:
+
+       (1) ocamlyacc generates both a module and a module interface.
+           The module is postprocessed in step (3). The interface cannot
+           be used, but it contains the definition of the "token" type.
+           This definition is extracted in step (2). The interface is
+           completely replaced in step (5) by a different file.
+
+       (2) An "awk" script extracts the definition of the type "token".
+           "token" is created by ocamlyacc upon the %token directives
+           in markup_yacc.mly, and normally "token" is defined in
+           the module generated by ocamlyacc. This turned out not to be
+           useful as the module dependency must be that the lexer is
+           an antecedent of the parser and not vice versa (as usually),
+           so the "token" type is "moved" to the module Markup_lexer_types
+           which is an antecedent of both the lexer and the parser.
+
+       (3) A "sed" script turns the generated parser into an object.
+           This is rather simple; some "let" definitions must be rewritten
+           as "val" definitions, the other "let" definitions as
+           "method" definitions. The parser object is needed because
+           the whole parser has a polymorphic type parameter.
+
+       (4) The implementation and definition of Markup_lexer_types are
+           both generated by inserting the "token" type definition
+           (in markup_lexer_types.mlf) into two pattern files,
+           markup_lexer_types_shadow.ml resp. -.mli. The point of insertion
+           is marked by the string INCLUDE_HERE.
+
+       (5) The generated interface of the Markup_yacc module is replaced
+           by a hand-written file.
+
+       (6) ocamllex generates the lexer; this process is not patched in any
+           way.
+
+......................................................................
+
+After the additional modules have been generated, compilation proceeds
+in the usual manner.
+
+
+----------------------------------------------------------------------
+Hierarchy of parsing layers:
+----------------------------------------------------------------------
+
+From top to bottom:
+
+ - Parser: Markup_yacc
+   + gets input stream from the main entity object
+   + checks most of the grammar
+   + creates the DTD object as side-effect
+   + creates the element tree as side-effect
+   + creates further entity objects that are entered into the DTD
+ - Entity layer: Markup_entity
+   + gets input stream from the lexers, or another entity object
+   + handles entity references: if a reference is encountered the
+     input stream is redirected such that the tokens come from the
+     referenced entity object
+   + handles conditional sections
+ - Lexer layer: Markup_lexers
+   + gets input from lexbuffers created by resolvers
+   + different lexers for different lexical contexts
+   + a lexer returns pairs (token,lexid), where token is the scanned
+     token, and lexid is the name of the lexer that must be used for
+     the next token
+ - Resolver layer: Markup_entity
+   + a resolver creates the lexbuf from some character source
+   + a resolver recodes the input and handles the encoding scheme
+
+----------------------------------------------------------------------
+The YACC based parser
+----------------------------------------------------------------------
+
+ocamlyacc allows it to pass an arbitrary 'next_token' function to the
+parsing functions. We always use 'en # next_token()' where 'en' is the
+main entity object representing the main file to be parsed.
+
+The parser is not functional, but uses mainly side-effects to accumulate
+the structures that have been recognized. This is very important for the
+entity definitions, because once an entity definition has been found there
+may be a reference to it which is handled by the entity layer (which is
+below the yacc layer). This means that such a definition modifies the
+token source of the parser, and this can only be handled by side-effects
+(at least in a sensible manner; a purely functional parser would have to
+pass unresolved entity references to its caller, which would have to
+resolve the reference and to re-parse the whole document!).
+
+Note that also element definitions profit from the imperative style of
+the parser; an element instance can be validated directly once the end
+tag has been read in.
+
+----------------------------------------------------------------------
+The entity layer
+----------------------------------------------------------------------
+
+The parser gets the tokens from the main entity object. This object
+controls the underlying lexing mechanism (see below), and already
+interprets the following:
+
+- Conditional sections (if they are allowed in this entity):
+  The structures <![ INCLUDE [ ... ]]> and <! IGNORE [ ... ]]> are
+  recognized and interpreted.
+
+  This would be hard to realize by the yacc parser, because:
+  - INCLUDE and IGNORE are not recognized as lexical keywords but as names.
+    This means that the parser cannot select different rules for them.
+  - The text after IGNORE requires a different lexical handling.
+
+- Entity references: &name; and %name;
+  The named entity is looked up and the input source is redirected to it, i.e.
+  if the main entity object gets the message 'next_token' this message is
+  forwarded to the referenced entity. (This entity may choose to forward the
+  message again to a third entity, and so on.)
+
+  There are some fine points:
+
+  - It is okay that redirection happens at token level, not at character level:
+    + General entities must always match the 'content' production, and because
+      of this they must always consist of a whole number of tokens.
+    + If parameter entities are resolved, the XML specification states that
+      a space character is inserted before and after the replacement text.
+      This also means that such entities always consists of a whole number
+      of tokens.
+
+  - There are some "nesting constraints":
+    + General entities must match the 'content' production. Because of this,
+      the special token Begin_entity is inserted before the first token of
+      the entity, and End_entity is inserted just before the Eof token. The
+      brace Begin_entity...End_entity is recognized by the yacc parser, but
+      only in the 'content' production.
+    + External parameter entities must match 'extSubsetDecl'. Again,
+      Begin_entity and End_entity tokens embrace the inner token stream.
+      The brace Begin_entity...End_entity is recognized by the yacc parser
+      at the appropriate position.
+      (As general and parameter entities are used in different contexts
+      (document vs. DTD), both kinds of entities can use the same brace
+      Begin_entity...End_entity.)
+    + TODO:
+      The constraints for internal parameter entities are not yet checked.
+
+  - Recursive references can be detected because entities must be opened
+    before the 'next_token' method can be invoked.
+
+----------------------------------------------------------------------
+The lexer layer
+----------------------------------------------------------------------
+
+There are five main lexers, and a number of auxiliary lexers. The five
+main lexers are:
+
+- Document (function scan_document):
+  Scans an XML document outside the DTD and outside the element instance.
+
+- Content (function scan_content):
+  Scans an element instance, but not within tags.
+
+- Within_tag (function scan_within_tag):
+  Scans within <...>, i.e. a tag denoting an element instance.
+
+- Document_type (function scan_document_type):
+  Scans after <!DOCTYPE until the corresponding >.
+
+- Declaration (function scan_declaration):
+  Scans sequences of declarations
+
+Why several lexers? Because there are different lexical rules in these
+five regions of an XML document.
+
+Every lexer not only produces tokens, but also the name of the next lexer
+to use. For example, if the Document lexer scans "<!DOCTYPE", it also
+outputs that the next token must be scanned by Document_type.
+
+It is interesting that this really works. The beginning of every lexical
+context can be recognized by the lexer of the previous context, and there
+is always a token that unambigously indicates that the context ends.
+
+----------------------------------------------------------------------
+The DTD object
+----------------------------------------------------------------------
+
+There is usually one object that collects DTD declarations. All kinds of
+declarations are entered here:
+
+- element and attribute list declarations
+- entity declarations
+- notation declarations
+
+Some properties are validated directly after a declarations has been added
+to the DTD, but most validation is done by a 'validate' method.
+
+The result of 'validate' is stored such that another invocation is cheap.
+A DTD becomes again 'unchecked' if another declaration is added.
+
+TODO: We need a special DTD object that allows every content.
+
+The DTD object is known by more or less every other object, i.e. entities
+know the DTD, element declarations and instances know the DTD, and so on.
+
+TODO: We need a method that deletes all entity declarations once the DTD
+is complete (to free memory).
+
+----------------------------------------------------------------------
+Element and Document objects
+----------------------------------------------------------------------
+
+The 'element' objects form the tree of the element instances.
+
+The 'document' object is a derivate of 'element' where properties of the
+whole document can be stored.
+
+New element objects are NOT created by the "new class" mechanism, but
+instead by an exemplar/instance scheme: A new instance is the duplicate
+of an exemplar. This has the advantage that the user can provide own
+classes for the element instances. A hashtable contains the exemplars
+for every element type (tag name), and there is a default exemplar.
+The user can configure this hashtable such that for elements A objects
+of class element_a, for elements B objects of class element_b and so on
+are used.
+
+The object for the root element must already be created before parsing
+starts, and the parser returns the (filled) root object. Because of this,
+the user determines the *static* type of the object without the need
+of back coercion (which is not possible in Ocaml).
+
+----------------------------------------------------------------------
+Newline normalization
+----------------------------------------------------------------------
+
+The XML spec states that all of \n, \r, and \r\n must be recognized
+as newline characters/character sequences. Notes:
+- The replacement text of entities always contains the orginal text,
+  i.e. \r and \r\n are NOT converted to \n.
+  It is unclear if this is a violation of the standard or not.
+- Content of elements: Newline characters are converted to \n.
+- Attribute values: Newline characters are converted to spaces.
+- Processing instructions: Newline characters are not converted.
+  It is unclear if this is a violation of the standard or not.
+
+----------------------------------------------------------------------
+Empty entities
+----------------------------------------------------------------------
+
+Many entities are artificially surrounded by a Begin_entity/End_entity pair.
+This is sometimes not done if the entity is empty:
+
+- External parameter entities are parsed entities, i.e. they must match
+  the markupdecl* production. If they are not empty, the Begin_entity/End_entity
+  trick guarantees that they match markupdecl+, and that they are only
+  referred to at positions where markupdecl+ is allowed.
+  If they are empty, they are allowed everywhere just like internal 
+  parameter entities. Because of this, the Begin_entity/End_entity pair
+  is dropped.
+
+- This does not apply to parameter entities (either external or internal)
+  which are referred to in the internal subset, nor applies to internal
+  parameter entities, nor applies to general entities:
+
+  + References in the internal subset are only allowed at positions where
+    markupdecl can occur, so Begin_entity/End_entity is added even if the
+    entity is empty.
+  + References to internal parameter entities are allowed anywhere, so
+    never Begin_entity/End_entity is added.
+  + References to general entities: An empty  Begin_entity/End_entity pair
+    is recognized by the yacc parser, so special handling is not required.
+    Moreover, there is the situation that an empty entity is referred to
+    after the toplevel element:
+    <!DOCTYPE doc ...[
+    <!ENTITY empty "">
+    ]>
+    <doc></doc>&empty;
+    - This is illegal, and the presence of an empty Begin_entity/End_entity pair
+    helps to recognize this.