--- /dev/null
+------------------------------------------------ -*- indented-text -*-
+Some Notes About the Design:
+----------------------------------------------------------------------
+
+----------------------------------------------------------------------
+Compilation
+----------------------------------------------------------------------
+
+Compilation is non-trivial because:
+
+ - The lexer and parser generators, ocamllex and ocamlyacc respectively,
+   normally create code such that the parser module precedes the lexer
+   module. THIS design requires that the lexer layer precedes the entity
+   layer, which precedes the parser layer, because the parsing results
+   modify the behaviour of the lexer and entity layers. There is no way
+   to get around this because of the nature of XML.
+
+ So the dependency relation of the lexer and the parser is modified;
+ in particular the "token" type that is normally defined by the
+   generated parser is moved to a common predecessor of both lexer
+ and parser.
+
+ - Another modification of the standard way of handling parsers is that
+ the parser is turned into an object. This is necessary because the
+ whole parser is polymorphic, i.e. there is a type parameter (the
+ type of the node extension).
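The reason for the object can be made concrete with a small, hypothetical sketch (not the real Markup_yacc interface): an ocamlyacc-generated toplevel function cannot carry a type parameter, but a class can, so the parser state is wrapped in a class parameterized by the node-extension type.

```ocaml
(* Hypothetical sketch: the whole parser is wrapped in a class so that
   the node-extension type 'ext appears as a type parameter of the
   parser itself. *)
class ['ext] parser_obj =
  object
    (* parsing results are accumulated by side-effect *)
    val mutable root : 'ext option = None
    method set_root (r : 'ext) = root <- Some r
    method root = root
  end

(* usage: instantiate the parser with whatever extension type is needed *)
let demo =
  let p = new parser_obj in
  p#set_root "extension value";
  p#root
```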
+
+......................................................................
+
+First some modules are generated as illustrated by the following
+diagram:
+
+
+                 markup_yacc.mly
+                   |         |
+                  \|/       \|/  [ocamlyacc, 1]
+                   V         V
+       markup_yacc.mli     markup_yacc.ml
+            |                --> renamed into markup_yacc.ml0
+           \|/ [awk, 2]          |
+            V                   \|/  [sed, 3]
+       markup_yacc_token.mlf     V
+                             markup_yacc.ml
+
+
+       markup_yacc_token.mlf  markup_lexer_types_shadow.mli
+                   |             |
+                  \|/           \|/  [sed, 4]
+                   V             V
+                 markup_lexer_types.mli
+
+
+       markup_yacc_token.mlf  markup_lexer_types_shadow.ml
+                   |             |
+                  \|/           \|/  [sed, 4]
+                   V             V
+                 markup_lexer_types.ml
+
+
+ markup_yacc_shadow.mli
+ |
+ \|/ [replaces, 5]
+ V
+ markup_yacc.mli
+
+
+
+ markup_lexers.mll
+ |
+ \|/ [ocamllex, 6]
+ V
+ markup_lexers.ml
+
+
+Notes:
+
+ (1) ocamlyacc generates both a module and a module interface.
+ The module is postprocessed in step (3). The interface cannot
+ be used, but it contains the definition of the "token" type.
+ This definition is extracted in step (2). The interface is
+ completely replaced in step (5) by a different file.
+
+ (2) An "awk" script extracts the definition of the type "token".
+      "token" is created by ocamlyacc from the %token directives
+      in markup_yacc.mly, and normally "token" is defined in
+      the module generated by ocamlyacc. This turned out not to be
+      useful, as the module dependency must be that the lexer is
+      an antecedent of the parser and not vice versa (as is usual),
+      so the "token" type is "moved" to the module Markup_lexer_types,
+      which is an antecedent of both the lexer and the parser.
+
+ (3) A "sed" script turns the generated parser into an object.
+ This is rather simple; some "let" definitions must be rewritten
+ as "val" definitions, the other "let" definitions as
+ "method" definitions. The parser object is needed because
+ the whole parser has a polymorphic type parameter.
+
+ (4) The implementation and definition of Markup_lexer_types are
+ both generated by inserting the "token" type definition
+      (in markup_yacc_token.mlf) into two pattern files,
+      markup_lexer_types_shadow.ml and markup_lexer_types_shadow.mli.
+      The point of insertion is marked by the string INCLUDE_HERE.
+
+ (5) The generated interface of the Markup_yacc module is replaced
+ by a hand-written file.
+
+ (6) ocamllex generates the lexer; this process is not patched in any
+ way.
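The net effect of steps (1)-(5) on the module graph can be sketched as follows; the token constructors and function bodies are invented, the point is only that both the lexer and the parser depend on Markup_lexer_types:

```ocaml
module Markup_lexer_types = struct
  (* the "token" type extracted in step (2) would be inserted here,
     at the INCLUDE_HERE mark; these constructors are illustrative *)
  type token = Name of string | Eof
end

module Markup_lexers = struct
  (* lexer layer: depends on Markup_lexer_types *)
  open Markup_lexer_types
  let scan_document () = (Name "doc", "Document")  (* (token, next lexer) *)
end

module Markup_yacc = struct
  (* parser layer: also depends on Markup_lexer_types *)
  open Markup_lexer_types
  let is_eof = function Eof -> true | _ -> false
end
```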
+
+......................................................................
+
+After the additional modules have been generated, compilation proceeds
+in the usual manner.
+
+
+----------------------------------------------------------------------
+Hierarchy of parsing layers
+----------------------------------------------------------------------
+
+From top to bottom:
+
+ - Parser: Markup_yacc
+ + gets input stream from the main entity object
+ + checks most of the grammar
+ + creates the DTD object as side-effect
+ + creates the element tree as side-effect
+ + creates further entity objects that are entered into the DTD
+ - Entity layer: Markup_entity
+ + gets input stream from the lexers, or another entity object
+ + handles entity references: if a reference is encountered the
+ input stream is redirected such that the tokens come from the
+ referenced entity object
+ + handles conditional sections
+ - Lexer layer: Markup_lexers
+ + gets input from lexbuffers created by resolvers
+ + different lexers for different lexical contexts
+ + a lexer returns pairs (token,lexid), where token is the scanned
+ token, and lexid is the name of the lexer that must be used for
+ the next token
+ - Resolver layer: Markup_entity
+ + a resolver creates the lexbuf from some character source
+ + a resolver recodes the input and handles the encoding scheme
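A rough sketch of the interfaces this layering implies (the signatures are illustrative assumptions, not the library's actual ones):

```ocaml
(* illustrative token and lexer-id types *)
type token = Begin_entity | End_entity | Name of string | Eof
type lexid = string  (* name of the lexer to use for the next token *)

(* entity layer: a token source that may redirect to other entities *)
class type entity_t = object
  method next_token : unit -> token
end

(* resolver layer: turns a character source into a lexbuf *)
class type resolver_t = object
  method lexbuf : Lexing.lexbuf
end

(* a trivial entity that only reports end of input *)
let empty_entity : entity_t = object method next_token () = Eof end
```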
+
+----------------------------------------------------------------------
+The YACC based parser
+----------------------------------------------------------------------
+
+ocamlyacc allows passing an arbitrary 'next_token' function to the
+parsing functions. We always use 'en # next_token()', where 'en' is the
+main entity object representing the main file to be parsed.
+
+The parser is not functional, but uses mainly side-effects to accumulate
+the structures that have been recognized. This is very important for the
+entity definitions, because once an entity definition has been found there
+may be a reference to it which is handled by the entity layer (which is
+below the yacc layer). This means that such a definition modifies the
+token source of the parser, and this can only be handled by side-effects
+(at least in a sensible manner; a purely functional parser would have to
+pass unresolved entity references to its caller, which would have to
+resolve the reference and to re-parse the whole document!).
+
+Note that element definitions also profit from the imperative style of
+the parser: an element instance can be validated directly once its end
+tag has been read.
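The calling convention can be sketched with a stand-in for the generated entry point; 'document' below is invented and merely has the usual ocamlyacc entry-point shape, (Lexing.lexbuf -> token) -> Lexing.lexbuf -> result:

```ocaml
type token = Tok_name of string | Tok_eof

(* stand-in for a generated parsing function: it just counts name
   tokens until Tok_eof, ignoring the lexbuf argument *)
let document (next : Lexing.lexbuf -> token) (lexbuf : Lexing.lexbuf) : int =
  let rec count n =
    match next lexbuf with
    | Tok_eof -> n
    | Tok_name _ -> count (n + 1)
  in
  count 0

(* an 'entity' object whose next_token method feeds the parser *)
let en =
  let toks = ref [ Tok_name "a"; Tok_name "b"; Tok_eof ] in
  object
    method next_token () =
      match !toks with t :: r -> toks := r; t | [] -> Tok_eof
  end

(* the parser is always called with the entity's next_token method *)
let n_names = document (fun _ -> en#next_token ()) (Lexing.from_string "")
```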
+
+----------------------------------------------------------------------
+The entity layer
+----------------------------------------------------------------------
+
+The parser gets the tokens from the main entity object. This object
+controls the underlying lexing mechanism (see below), and already
+interprets the following:
+
+- Conditional sections (if they are allowed in this entity):
+  The structures <![ INCLUDE [ ... ]]> and <![ IGNORE [ ... ]]> are
+ recognized and interpreted.
+
+ This would be hard to realize by the yacc parser, because:
+ - INCLUDE and IGNORE are not recognized as lexical keywords but as names.
+ This means that the parser cannot select different rules for them.
+ - The text after IGNORE requires a different lexical handling.
+
+- Entity references: &name; and %name;
+ The named entity is looked up and the input source is redirected to it, i.e.
+ if the main entity object gets the message 'next_token' this message is
+ forwarded to the referenced entity. (This entity may choose to forward the
+ message again to a third entity, and so on.)
+
+ There are some fine points:
+
+ - It is okay that redirection happens at token level, not at character level:
+ + General entities must always match the 'content' production, and because
+ of this they must always consist of a whole number of tokens.
+ + If parameter entities are resolved, the XML specification states that
+ a space character is inserted before and after the replacement text.
+      This also means that such entities always consist of a whole number
+ of tokens.
+
+ - There are some "nesting constraints":
+ + General entities must match the 'content' production. Because of this,
+ the special token Begin_entity is inserted before the first token of
+ the entity, and End_entity is inserted just before the Eof token. The
+ brace Begin_entity...End_entity is recognized by the yacc parser, but
+ only in the 'content' production.
+ + External parameter entities must match 'extSubsetDecl'. Again,
+ Begin_entity and End_entity tokens embrace the inner token stream.
+ The brace Begin_entity...End_entity is recognized by the yacc parser
+ at the appropriate position.
+ (As general and parameter entities are used in different contexts
+ (document vs. DTD), both kinds of entities can use the same brace
+ Begin_entity...End_entity.)
+ + TODO:
+ The constraints for internal parameter entities are not yet checked.
+
+ - Recursive references can be detected because entities must be opened
+ before the 'next_token' method can be invoked.
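The token-level redirection can be sketched with a deliberately minimal entity class (invented for illustration; the real Markup_entity objects also track lexers, encodings, and positions):

```ocaml
type token = Begin_entity | End_entity | Char_data of string | Eof

class type entity_t = object
  method open_reference : entity_t -> unit
  method next_token : unit -> token
end

(* a toy entity that serves a fixed token list and can be redirected *)
class entity (toks : token list) =
  object (self)
    val mutable rest = toks
    val mutable forward : entity_t option = None
    (* redirect the token stream to the referenced entity *)
    method open_reference (e : entity_t) = forward <- Some e
    method next_token () : token =
      match forward with
      | Some e ->
          let t = e#next_token () in
          if t = Eof then (forward <- None; self#next_token ())
          else t
      | None ->
          (match rest with
           | t :: r -> rest <- r; t
           | [] -> Eof)
  end
```

When the referenced entity is exhausted, the main entity transparently resumes its own token stream.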
+
+----------------------------------------------------------------------
+The lexer layer
+----------------------------------------------------------------------
+
+There are five main lexers, and a number of auxiliary lexers. The five
+main lexers are:
+
+- Document (function scan_document):
+ Scans an XML document outside the DTD and outside the element instance.
+
+- Content (function scan_content):
+ Scans an element instance, but not within tags.
+
+- Within_tag (function scan_within_tag):
+ Scans within <...>, i.e. a tag denoting an element instance.
+
+- Document_type (function scan_document_type):
+ Scans after <!DOCTYPE until the corresponding >.
+
+- Declaration (function scan_declaration):
+  Scans sequences of declarations.
+
+Why several lexers? Because there are different lexical rules in these
+five regions of an XML document.
+
+Every lexer not only produces tokens, but also the name of the next lexer
+to use. For example, if the Document lexer scans "<!DOCTYPE", it also
+outputs that the next token must be scanned by Document_type.
+
+It is interesting that this really works: the beginning of every lexical
+context can be recognized by the lexer of the previous context, and there
+is always a token that unambiguously indicates that the context ends.
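The (token, lexid) protocol can be mocked as follows (token and lexid constructors are illustrative; the real lexers operate on a Lexing.lexbuf, not a string):

```ocaml
type token = Doctype | Name of string
type lexid = Document | Content | Within_tag | Document_type | Declaration

(* mock of scan_document: it returns the scanned token together with
   the name of the lexer that must scan the next token *)
let scan_document (s : string) : token * lexid =
  if s = "<!DOCTYPE" then (Doctype, Document_type)  (* switch context *)
  else (Name s, Document)                           (* stay in Document *)
```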
+
+----------------------------------------------------------------------
+The DTD object
+----------------------------------------------------------------------
+
+There is usually one object that collects DTD declarations. All kinds of
+declarations are entered here:
+
+- element and attribute list declarations
+- entity declarations
+- notation declarations
+
+Some properties are validated directly after a declaration has been added
+to the DTD, but most validation is done by a 'validate' method.
+
+The result of 'validate' is stored such that another invocation is cheap.
+A DTD becomes 'unchecked' again if another declaration is added.
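A minimal sketch of this caching scheme, assuming a hypothetical dtd class (the real object stores typed declarations, not strings):

```ocaml
class dtd =
  object
    val mutable decls : string list = []
    val mutable validated = false
    method add_declaration d =
      decls <- d :: decls;
      validated <- false  (* the DTD becomes 'unchecked' again *)
    method validate =
      if not validated then
        (* ... the real checks would run here ... *)
        validated <- true
    method is_validated = validated
  end
```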
+
+TODO: We need a special DTD object that allows every content.
+
+The DTD object is known by more or less every other object, i.e. entities
+know the DTD, element declarations and instances know the DTD, and so on.
+
+TODO: We need a method that deletes all entity declarations once the DTD
+is complete (to free memory).
+
+----------------------------------------------------------------------
+Element and Document objects
+----------------------------------------------------------------------
+
+The 'element' objects form the tree of the element instances.
+
+The 'document' object is a derivative of 'element' in which properties of
+the whole document can be stored.
+
+New element objects are NOT created by the "new class" mechanism, but
+by an exemplar/instance scheme: a new instance is the duplicate of an
+exemplar. This has the advantage that the user can provide their own
+classes for the element instances. A hashtable contains the exemplar
+for every element type (tag name), and there is a default exemplar.
+The user can configure this hashtable such that elements A are
+represented by objects of class element_a, elements B by objects of
+class element_b, and so on.
+
+The object for the root element must already be created before parsing
+starts, and the parser returns the (filled) root object. Because of this,
+the user determines the *static* type of the object without any need
+for downward coercion (which is not possible in OCaml).
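The exemplar/instance scheme can be sketched with OCaml's functional update {< >}, which duplicates an object of the exemplar's (sub)class; the names here are illustrative:

```ocaml
(* a toy element class: instances are made by copying an exemplar *)
class element (name : string) =
  object
    method name = name
    method create_copy = {< >}  (* duplicate of the exemplar, same class *)
  end

(* exemplars per element type (tag name), plus a default exemplar *)
let exemplars : (string, element) Hashtbl.t = Hashtbl.create 17
let default_exemplar = new element "#default"

(* instances are created by duplicating the configured exemplar *)
let instance_for tag =
  let ex =
    try Hashtbl.find exemplars tag with Not_found -> default_exemplar in
  ex#create_copy
```

Because {< >} preserves the runtime class, a user-supplied subclass registered as an exemplar yields instances of that subclass.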
+
+----------------------------------------------------------------------
+Newline normalization
+----------------------------------------------------------------------
+
+The XML spec states that all of \n, \r, and \r\n must be recognized
+as newline characters/character sequences. Notes:
+- The replacement text of entities always contains the original text,
+ i.e. \r and \r\n are NOT converted to \n.
+ It is unclear if this is a violation of the standard or not.
+- Content of elements: Newline characters are converted to \n.
+- Attribute values: Newline characters are converted to spaces.
+- Processing instructions: Newline characters are not converted.
+ It is unclear if this is a violation of the standard or not.
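The \r and \r\n to \n rule for element content could be implemented like this (a simple sketch, not the library's code):

```ocaml
(* convert \r\n and bare \r to \n, leaving everything else unchanged *)
let normalize_newlines s =
  let b = Buffer.create (String.length s) in
  let n = String.length s in
  let rec loop i =
    if i < n then
      if s.[i] = '\r' then begin
        Buffer.add_char b '\n';
        (* skip the \n of a \r\n pair so it is not emitted twice *)
        if i + 1 < n && s.[i + 1] = '\n' then loop (i + 2) else loop (i + 1)
      end else begin
        Buffer.add_char b s.[i];
        loop (i + 1)
      end
  in
  loop 0;
  Buffer.contents b
```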
+
+----------------------------------------------------------------------
+Empty entities
+----------------------------------------------------------------------
+
+Many entities are artificially surrounded by a Begin_entity/End_entity pair.
+This is sometimes not done if the entity is empty:
+
+- External parameter entities are parsed entities, i.e. they must match
+ the markupdecl* production. If they are not empty, the Begin_entity/End_entity
+ trick guarantees that they match markupdecl+, and that they are only
+ referred to at positions where markupdecl+ is allowed.
+ If they are empty, they are allowed everywhere just like internal
+ parameter entities. Because of this, the Begin_entity/End_entity pair
+ is dropped.
+
+- This does not apply to parameter entities (either external or internal)
+  that are referred to in the internal subset, nor to internal
+  parameter entities, nor to general entities:
+
+ + References in the internal subset are only allowed at positions where
+ markupdecl can occur, so Begin_entity/End_entity is added even if the
+ entity is empty.
+  + References to internal parameter entities are allowed anywhere, so
+    Begin_entity/End_entity is never added.
+ + References to general entities: An empty Begin_entity/End_entity pair
+ is recognized by the yacc parser, so special handling is not required.
+ Moreover, there is the situation that an empty entity is referred to
+ after the toplevel element:
+ <!DOCTYPE doc ...[
+ <!ENTITY empty "">
+ ]>
+    <doc></doc>&empty;
+    This is illegal, and the presence of an empty Begin_entity/End_entity
+    pair helps to recognize it.