+++ /dev/null
------------------------------------------------- -*- indented-text -*-
-Some Notes About the Design:
-----------------------------------------------------------------------
-
-----------------------------------------------------------------------
-Compilation
-----------------------------------------------------------------------
-
-Compilation is non-trivial because:
-
- - The lexer and parser generators ocamllex and ocamlyacc normally
- create code such that the parser module precedes the lexer module.
- This design, however, requires that the lexer layer precede the
- entity layer, which in turn precedes the parser layer, because the
- parsing results modify the behaviour of the lexer and entity layers.
- There is no way around this because of the nature of XML.
-
- So the dependency relation of the lexer and the parser is modified;
- in particular the "token" type that is normally defined by the
- generated parser is moved to a common predecessor of both the lexer
- and parser.
-
- - Another modification of the standard way of handling parsers is that
- the parser is turned into an object. This is necessary because the
- whole parser is polymorphic, i.e. there is a type parameter (the
- type of the node extension).
-
-......................................................................
-
-First some modules are generated as illustrated by the following
-diagram:
-
-
-   markup_yacc.mly
-         |
-        \|/  [ocamlyacc, 1]
-         V
-   markup_yacc.mli               markup_yacc.ml
-         |                         --> renamed into markup_yacc.ml0
-        \|/  [awk, 2]                    |
-         V                              \|/  [sed, 3]
-   markup_yacc_token.mlf                 V
-                                   markup_yacc.ml
-
-   markup_yacc_token.mlf             markup_yacc_token.mlf
-     + markup_lexer_types_shadow.mli   + markup_lexer_types_shadow.ml
-         |                               |
-        \|/  [sed, 4]                   \|/  [sed, 4]
-         V                               V
-   markup_lexer_types.mli            markup_lexer_types.ml
-
-
- markup_yacc_shadow.mli
- |
- \|/ [replaces, 5]
- V
- markup_yacc.mli
-
-
-
- markup_lexers.mll
- |
- \|/ [ocamllex, 6]
- V
- markup_lexers.ml
-
-
-Notes:
-
- (1) ocamlyacc generates both a module and a module interface.
- The module is postprocessed in step (3). The interface cannot
- be used, but it contains the definition of the "token" type.
- This definition is extracted in step (2). The interface is
- completely replaced in step (5) by a different file.
-
- (2) An "awk" script extracts the definition of the type "token".
- "token" is created by ocamlyacc upon the %token directives
- in markup_yacc.mly, and normally "token" is defined in
- the module generated by ocamlyacc. This is not usable here,
- because the module dependency must be that the lexer is an
- antecedent of the parser, and not vice versa (as is usual);
- so the "token" type is "moved" to the module Markup_lexer_types,
- which is an antecedent of both the lexer and the parser.
-
- (3) A "sed" script turns the generated parser into an object.
- This is rather simple; some "let" definitions must be rewritten
- as "val" definitions, the other "let" definitions as
- "method" definitions. The parser object is needed because
- the whole parser has a polymorphic type parameter.
-
- (4) The implementation and definition of Markup_lexer_types are
- both generated by inserting the "token" type definition
- (in markup_lexer_types.mlf) into two pattern files,
- markup_lexer_types_shadow.ml and markup_lexer_types_shadow.mli,
- respectively. The point of insertion is marked by the string
- INCLUDE_HERE.
-
- (5) The generated interface of the Markup_yacc module is replaced
- by a hand-written file.
-
- (6) ocamllex generates the lexer; this process is not patched in any
- way.
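
To illustrate step (3), here is a minimal OCaml sketch of the flavour of the rewriting (all names are invented for illustration; real ocamlyacc output is table-driven and much larger):

```ocaml
(* Before: ocamlyacc emits a plain module whose state lives in
   top-level "let" definitions.  Sketch only, not real generated code. *)
module Generated_parser = struct
  let yytables = [| 1; 2; 3 |]
  let parse_doc next_token = ignore (next_token ()); "document"
end

(* After: the sed script of step (3) rewrites value "let"s into "val"s
   and function "let"s into "method"s, so the whole parser becomes a
   class with a type parameter 'ext (the node extension type). *)
class ['ext] parser_object = object
  val yytables = [| 1; 2; 3 |]
  method table_size = Array.length yytables
  method parse_doc (next_token : unit -> 'ext) =
    ignore (next_token ()); "document"
end
```

Turning the parser into a class is what allows it to carry the type parameter; a plain module cannot be polymorphic in this way.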
-
-......................................................................
-
-After the additional modules have been generated, compilation proceeds
-in the usual manner.
-
-
-----------------------------------------------------------------------
-Hierarchy of parsing layers:
-----------------------------------------------------------------------
-
-From top to bottom:
-
- - Parser: Markup_yacc
- + gets input stream from the main entity object
- + checks most of the grammar
- + creates the DTD object as side-effect
- + creates the element tree as side-effect
- + creates further entity objects that are entered into the DTD
- - Entity layer: Markup_entity
- + gets input stream from the lexers, or another entity object
- + handles entity references: if a reference is encountered the
- input stream is redirected such that the tokens come from the
- referenced entity object
- + handles conditional sections
- - Lexer layer: Markup_lexers
- + gets input from lexbuffers created by resolvers
- + different lexers for different lexical contexts
- + a lexer returns pairs (token,lexid), where token is the scanned
- token, and lexid is the name of the lexer that must be used for
- the next token
- - Resolver layer: Markup_entity
- + a resolver creates the lexbuf from some character source
- + a resolver recodes the input and handles the encoding scheme
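
The common shape of these layers can be summarized by a small OCaml sketch (the class type and all names are invented for illustration; the real interfaces in Markup_entity and Markup_lexers differ):

```ocaml
(* Invented types, for illustration only. *)
type token = Begin_entity | End_entity | Name of string | Eof

(* Every token-producing layer (entity object or lexer driver) can be
   viewed through this interface: hand out one token per call. *)
class type token_source = object
  method next_token : unit -> token
end

(* A trivial source that replays a fixed token list and then Eof. *)
class list_source (toks : token list) = object
  val mutable remaining = toks
  method next_token () =
    match remaining with
    | [] -> Eof
    | t :: rest -> remaining <- rest; t
end
```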
-
-----------------------------------------------------------------------
-The YACC based parser
-----------------------------------------------------------------------
-
-ocamlyacc allows an arbitrary 'next_token' function to be passed to
-the parsing functions. We always use 'en # next_token()', where 'en' is
-the main entity object representing the main file to be parsed.
-
-The parser is not functional, but mainly uses side-effects to accumulate
-the structures that have been recognized. This is very important for the
-entity definitions, because once an entity definition has been found there
-may be a reference to it which is handled by the entity layer (which is
-below the yacc layer). This means that such a definition modifies the
-token source of the parser, and this can only be handled by side-effects
-(at least in a sensible manner; a purely functional parser would have to
-pass unresolved entity references to its caller, which would have to
-resolve the reference and to re-parse the whole document!).
-
-Note that element definitions also profit from the imperative style of
-the parser; an element instance can be validated directly once its end
-tag has been read.
-
-----------------------------------------------------------------------
-The entity layer
-----------------------------------------------------------------------
-
-The parser gets the tokens from the main entity object. This object
-controls the underlying lexing mechanism (see below), and already
-interprets the following:
-
-- Conditional sections (if they are allowed in this entity):
- The structures <![ INCLUDE [ ... ]]> and <![ IGNORE [ ... ]]> are
- recognized and interpreted.
-
- This would be hard to realize in the yacc parser, because:
- - INCLUDE and IGNORE are not recognized as lexical keywords but as names.
- This means that the parser cannot select different rules for them.
- - The text after IGNORE requires a different lexical handling.
-
-- Entity references: &name; and %name;
- The named entity is looked up and the input source is redirected to it, i.e.
- if the main entity object gets the message 'next_token' this message is
- forwarded to the referenced entity. (This entity may choose to forward the
- message again to a third entity, and so on.)
-
- There are some fine points:
-
- - It is okay that redirection happens at token level, not at character level:
- + General entities must always match the 'content' production, and because
- of this they must always consist of a whole number of tokens.
- + If parameter entities are resolved, the XML specification states that
- a space character is inserted before and after the replacement text.
- This also means that such entities always consist of a whole number
- of tokens.
-
- - There are some "nesting constraints":
- + General entities must match the 'content' production. Because of this,
- the special token Begin_entity is inserted before the first token of
- the entity, and End_entity is inserted just before the Eof token. The
- brace Begin_entity...End_entity is recognized by the yacc parser, but
- only in the 'content' production.
- + External parameter entities must match 'extSubsetDecl'. Again,
- Begin_entity and End_entity tokens embrace the inner token stream.
- The brace Begin_entity...End_entity is recognized by the yacc parser
- at the appropriate position.
- (As general and parameter entities are used in different contexts
- (document vs. DTD), both kinds of entities can use the same brace
- Begin_entity...End_entity.)
- + TODO:
- The constraints for internal parameter entities are not yet checked.
-
- - Recursive references can be detected because entities must be opened
- before the 'next_token' method can be invoked.
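
The redirection mechanism can be modelled in a few lines of OCaml (a simplified sketch; the real Markup_entity code additionally manages lexbufs, inserts Begin_entity/End_entity, and uses the open/close protocol for recursion detection):

```ocaml
type token = Char_data of string | Eof

class type source = object method next_token : token end

class entity (toks : token list) = object (self)
  val mutable remaining = toks
  val mutable redirect : source option = None  (* set on &name; / %name; *)
  method open_reference (en : source) = redirect <- Some en
  method next_token : token =
    match redirect with
    | Some en ->
        (* Forward to the referenced entity; return to our own
           tokens once it is exhausted. *)
        let t = en#next_token in
        if t = Eof then (redirect <- None; self#next_token) else t
    | None ->
        (match remaining with
         | [] -> Eof
         | t :: rest -> remaining <- rest; t)
end
```

After `main#open_reference (sub :> source)`, tokens come from `sub` until it is exhausted, and then from `main` again.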
-
-----------------------------------------------------------------------
-The lexer layer
-----------------------------------------------------------------------
-
-There are five main lexers, and a number of auxiliary lexers. The five
-main lexers are:
-
-- Document (function scan_document):
- Scans an XML document outside the DTD and outside the element instance.
-
-- Content (function scan_content):
- Scans an element instance, but not within tags.
-
-- Within_tag (function scan_within_tag):
- Scans within <...>, i.e. a tag denoting an element instance.
-
-- Document_type (function scan_document_type):
- Scans after <!DOCTYPE until the corresponding >.
-
-- Declaration (function scan_declaration):
- Scans sequences of declarations.
-
-Why several lexers? Because there are different lexical rules in these
-five regions of an XML document.
-
-Every lexer not only produces tokens, but also the name of the next lexer
-to use. For example, if the Document lexer scans "<!DOCTYPE", it also
-outputs that the next token must be scanned by Document_type.
-
-It is interesting that this really works. The beginning of every lexical
-context can be recognized by the lexer of the previous context, and there
-is always a token that unambiguously indicates that the context ends.
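
The (token, next lexer) protocol can be pictured with stand-in functions (hypothetical; the real scan_* functions are generated by ocamllex from markup_lexers.mll, and the token and lexer-id types are much richer):

```ocaml
type token = Doctype_begin | Tag_begin | Tag_end | Text of string | Eof
type lexid = Document | Content | Within_tag | Document_type | Declaration

(* Stand-ins for the generated lexers: each returns the scanned token
   together with the name of the lexer to use for the next token. *)
let scan_document () = (Doctype_begin, Document_type)
let scan_content () = (Tag_begin, Within_tag)
let scan_within_tag () = (Tag_end, Content)
let scan_document_type () = (Text "...", Document_type)
let scan_declaration () = (Eof, Declaration)

(* Dispatcher: select the lexer that fits the current lexical context. *)
let scan (id : lexid) : token * lexid =
  match id with
  | Document -> scan_document ()
  | Content -> scan_content ()
  | Within_tag -> scan_within_tag ()
  | Document_type -> scan_document_type ()
  | Declaration -> scan_declaration ()
```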
-
-----------------------------------------------------------------------
-The DTD object
-----------------------------------------------------------------------
-
-There is usually one object that collects DTD declarations. All kinds of
-declarations are entered here:
-
-- element and attribute list declarations
-- entity declarations
-- notation declarations
-
-Some properties are validated directly after a declaration has been added
-to the DTD, but most validation is done by a 'validate' method.
-
-The result of 'validate' is stored such that another invocation is cheap.
-A DTD becomes 'unchecked' again if another declaration is added.
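
A sketch of this caching scheme (method and field names are invented; the real DTD object stores complete declarations and performs real checks):

```ocaml
class dtd = object
  val mutable elements : (string * string) list = []  (* (name, content model) *)
  val mutable checked = false        (* cached result of the last validate *)
  method add_element (name : string) (model : string) =
    elements <- (name, model) :: elements;
    checked <- false                 (* the DTD becomes 'unchecked' again *)
  method is_checked = checked
  method validate =
    if not checked then begin
      (* ... the expensive checks would run here ... *)
      checked <- true
    end
end
```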
-
-TODO: We need a special DTD object that allows every content.
-
-The DTD object is known by more or less every other object, i.e. entities
-know the DTD, element declarations and instances know the DTD, and so on.
-
-TODO: We need a method that deletes all entity declarations once the DTD
-is complete (to free memory).
-
-----------------------------------------------------------------------
-Element and Document objects
-----------------------------------------------------------------------
-
-The 'element' objects form the tree of the element instances.
-
-The 'document' object is a derivative of 'element' in which properties of the
-whole document can be stored.
-
-New element objects are NOT created by the "new class" mechanism, but
-instead by an exemplar/instance scheme: a new instance is a duplicate
-of an exemplar. This has the advantage that the user can provide their
-own classes for the element instances. A hashtable contains an exemplar
-for every element type (tag name), and there is a default exemplar.
-The user can configure this hashtable such that objects of class
-element_a are used for elements A, objects of class element_b for
-elements B, and so on.
-
-The object for the root element must already be created before parsing
-starts, and the parser returns the (filled) root object. Because of this,
-the user determines the *static* type of the object without needing a
-coercion back to the subclass type (which is not possible in OCaml).
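
The exemplar/instance scheme can be sketched with OCaml's functional object update `{< >}` (class names are invented; the real classes carry the document state and the extension object):

```ocaml
(* The common exemplar class; a new instance is a duplicate of self. *)
class element (name : string) = object
  method name = name
  method create_instance = {< >}    (* functional update: a copy of self *)
end

(* A user-provided class for elements named "a". *)
class element_a = object
  inherit element "a"
  method special = "extra behaviour for <a> elements"
end

(* Exemplar table: element type (tag name) -> exemplar; plus a default. *)
let exemplars : (string, element) Hashtbl.t = Hashtbl.create 17
let default_exemplar = new element "#default"
let () = Hashtbl.replace exemplars "a" (new element_a :> element)

(* A new instance is the duplicate of the registered exemplar. *)
let make_element tag =
  let ex = try Hashtbl.find exemplars tag with Not_found -> default_exemplar in
  ex # create_instance
```

Because `make_element` duplicates whatever exemplar was registered, instances of user classes such as element_a are produced without the parser knowing those classes.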
-
-----------------------------------------------------------------------
-Newline normalization
-----------------------------------------------------------------------
-
-The XML spec states that all of \n, \r, and \r\n must be recognized
-as newline characters/character sequences. Notes:
-- The replacement text of entities always contains the original text,
- i.e. \r and \r\n are NOT converted to \n.
- It is unclear if this is a violation of the standard or not.
-- Content of elements: Newline characters are converted to \n.
-- Attribute values: Newline characters are converted to spaces.
-- Processing instructions: Newline characters are not converted.
- It is unclear if this is a violation of the standard or not.
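
At string level, the two conversions can be modelled as follows (a model only; the real implementation performs the normalization in the lexer layer):

```ocaml
(* Replace every \r\n sequence and every lone \r or \n by [repl]. *)
let normalize_newlines ~(repl : string) (s : string) : string =
  let b = Buffer.create (String.length s) in
  let n = String.length s in
  let i = ref 0 in
  while !i < n do
    (match s.[!i] with
     | '\r' ->
         Buffer.add_string b repl;
         (* \r\n counts as a single newline *)
         if !i + 1 < n && s.[!i + 1] = '\n' then incr i
     | '\n' -> Buffer.add_string b repl
     | c -> Buffer.add_char b c);
    incr i
  done;
  Buffer.contents b

(* Element content: newlines become \n. *)
let normalize_content = normalize_newlines ~repl:"\n"
(* Attribute values: newlines become spaces. *)
let normalize_attval = normalize_newlines ~repl:" "
```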
-
-----------------------------------------------------------------------
-Empty entities
-----------------------------------------------------------------------
-
-Many entities are artificially surrounded by a Begin_entity/End_entity pair.
-This is sometimes not done if the entity is empty:
-
-- External parameter entities are parsed entities, i.e. they must match
- the markupdecl* production. If they are not empty, the Begin_entity/End_entity
- trick guarantees that they match markupdecl+, and that they are only
- referred to at positions where markupdecl+ is allowed.
- If they are empty, they are allowed everywhere just like internal
- parameter entities. Because of this, the Begin_entity/End_entity pair
- is dropped.
-
-- This does not apply to parameter entities (either external or internal)
- which are referred to in the internal subset, nor does it apply to
- internal parameter entities or to general entities:
-
- + References in the internal subset are only allowed at positions where
- markupdecl can occur, so Begin_entity/End_entity is added even if the
- entity is empty.
- + References to internal parameter entities are allowed anywhere, so
- Begin_entity/End_entity is never added.
- + References to general entities: An empty Begin_entity/End_entity pair
- is recognized by the yacc parser, so special handling is not required.
- Moreover, there is the situation that an empty entity is referred to
- after the toplevel element:
-
-     <!DOCTYPE doc ...[
-     <!ENTITY empty "">
-     ]>
-     <doc></doc>&empty;
-
- This is illegal, and the presence of an empty Begin_entity/End_entity
- pair helps to recognize this.