+++ /dev/null
------------------------------------------------- -*- indented-text -*-
-Some Notes About the Design:
-----------------------------------------------------------------------
-
-----------------------------------------------------------------------
-Compilation
-----------------------------------------------------------------------
-
-Compilation is non-trivial because:
-
- - The lexer and parser generators ocamllex and ocamlyacc normally
- create code such that the parser module precedes the lexer module.
- This design, however, requires that the lexer layer precede the
- entity layer, which in turn precedes the parser layer, because the
- parsing results modify the behaviour of the lexer and entity layers.
- There is no way around this because of the nature of XML.
-
- So the dependency relation of the lexer and the parser is modified;
- in particular the "token" type that is normally defined by the
- generated parser is moved to a common predecessor of both the lexer
- and parser.
-
- - Another modification of the standard way of handling parsers is that
- the parser is turned into an object. This is necessary because the
- whole parser is polymorphic, i.e. there is a type parameter (the
- type of the node extension).
-
-......................................................................
-
-First some modules are generated as illustrated by the following
-diagram:
-
-
-   markup_yacc.mly
-         |
-        \|/  [ocamlyacc, 1]
-         V
-   markup_yacc.mli               markup_yacc.ml
-         |                         --> renamed into markup_yacc.ml0
-        \|/  [awk, 2]                    |
-         V                              \|/  [sed, 3]
-   markup_yacc_token.mlf                 V
-                                   markup_yacc.ml
-
-   markup_yacc_token.mlf             markup_yacc_token.mlf
-     + markup_lexer_types_shadow.mli   + markup_lexer_types_shadow.ml
-         |                               |
-        \|/  [sed, 4]                   \|/  [sed, 4]
-         V                               V
-   markup_lexer_types.mli            markup_lexer_types.ml
-
-
- markup_yacc_shadow.mli
- |
- \|/ [replaces, 5]
- V
- markup_yacc.mli
-
-
-
- markup_lexers.mll
- |
- \|/ [ocamllex, 6]
- V
- markup_lexers.ml
-
-
-Notes:
-
- (1) ocamlyacc generates both a module and a module interface.
- The module is postprocessed in step (3). The interface cannot
- be used, but it contains the definition of the "token" type.
- This definition is extracted in step (2). The interface is
- completely replaced in step (5) by a different file.
-
- (2) An "awk" script extracts the definition of the type "token".
- "token" is created by ocamlyacc upon the %token directives
- in markup_yacc.mly, and normally "token" is defined in
- the module generated by ocamlyacc. This is not usable here,
- because the module dependency must be that the lexer is an
- antecedent of the parser, and not vice versa (as is usual);
- so the "token" type is "moved" to the module Markup_lexer_types,
- which is an antecedent of both the lexer and the parser.
-
- (3) A "sed" script turns the generated parser into an object.
- This is rather simple; some "let" definitions must be rewritten
- as "val" definitions, the other "let" definitions as
- "method" definitions. The parser object is needed because
- the whole parser has a polymorphic type parameter.
-
- (4) The implementation and definition of Markup_lexer_types are
- both generated by inserting the "token" type definition
- (in markup_lexer_types.mlf) into two pattern files,
- markup_lexer_types_shadow.ml and markup_lexer_types_shadow.mli,
- respectively. The point of insertion is marked by the string
- INCLUDE_HERE.
-
- (5) The generated interface of the Markup_yacc module is replaced
- by a hand-written file.
-
- (6) ocamllex generates the lexer; this process is not patched in any
- way.
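
To illustrate step (3), here is a minimal OCaml sketch of the flavour of the rewriting (all names are invented for illustration; real ocamlyacc output is table-driven and much larger):

```ocaml
(* Before: ocamlyacc emits a plain module whose state lives in
   top-level "let" definitions.  Sketch only, not real generated code. *)
module Generated_parser = struct
  let yytables = [| 1; 2; 3 |]
  let parse_doc next_token = ignore (next_token ()); "document"
end

(* After: the sed script of step (3) rewrites value "let"s into "val"s
   and function "let"s into "method"s, so the whole parser becomes a
   class with a type parameter 'ext (the node extension type). *)
class ['ext] parser_object = object
  val yytables = [| 1; 2; 3 |]
  method table_size = Array.length yytables
  method parse_doc (next_token : unit -> 'ext) =
    ignore (next_token ()); "document"
end
```

Turning the parser into a class is what allows it to carry the type parameter; a plain module cannot be polymorphic in this way.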
-
-......................................................................
-
-After the additional modules have been generated, compilation proceeds
-in the usual manner.
-
-
-----------------------------------------------------------------------
-Hierarchy of parsing layers:
-----------------------------------------------------------------------
-
-From top to bottom:
-
- - Parser: Markup_yacc
- + gets input stream from the main entity object
- + checks most of the grammar
- + creates the DTD object as side-effect
- + creates the element tree as side-effect
- + creates further entity objects that are entered into the DTD
- - Entity layer: Markup_entity
- + gets input stream from the lexers, or another entity object
- + handles entity references: if a reference is encountered the
- input stream is redirected such that the tokens come from the
- referenced entity object
- + handles conditional sections
- - Lexer layer: Markup_lexers
- + gets input from lexbuffers created by resolvers
- + different lexers for different lexical contexts
- + a lexer returns pairs (token,lexid), where token is the scanned
- token, and lexid is the name of the lexer that must be used for
- the next token
- - Resolver layer: Markup_entity
- + a resolver creates the lexbuf from some character source
- + a resolver recodes the input and handles the encoding scheme
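
The common shape of these layers can be summarized by a small OCaml sketch (the class type and all names are invented for illustration; the real interfaces in Markup_entity and Markup_lexers differ):

```ocaml
(* Invented types, for illustration only. *)
type token = Begin_entity | End_entity | Name of string | Eof

(* Every token-producing layer (entity object or lexer driver) can be
   viewed through this interface: hand out one token per call. *)
class type token_source = object
  method next_token : unit -> token
end

(* A trivial source that replays a fixed token list and then Eof. *)
class list_source (toks : token list) = object
  val mutable remaining = toks
  method next_token () =
    match remaining with
    | [] -> Eof
    | t :: rest -> remaining <- rest; t
end
```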
-
-----------------------------------------------------------------------
-The YACC based parser
-----------------------------------------------------------------------
-
-ocamlyacc allows an arbitrary 'next_token' function to be passed to
-the parsing functions. We always use 'en # next_token()', where 'en' is
-the main entity object representing the main file to be parsed.
-
-The parser is not functional, but mainly uses side-effects to accumulate
-the structures that have been recognized. This is very important for the
-entity definitions, because once an entity definition has been found there
-may be a reference to it which is handled by the entity layer (which is
-below the yacc layer). This means that such a definition modifies the
-token source of the parser, and this can only be handled by side-effects
-(at least in a sensible manner; a purely functional parser would have to
-pass unresolved entity references to its caller, which would have to
-resolve the reference and to re-parse the whole document!).
-
-Note that element definitions also profit from the imperative style of
-the parser; an element instance can be validated directly once its end
-tag has been read.
-
-----------------------------------------------------------------------
-The entity layer
-----------------------------------------------------------------------
-
-The parser gets the tokens from the main entity object. This object
-controls the underlying lexing mechanism (see below), and already
-interprets the following:
-
-- Conditional sections (if they are allowed in this entity):
- The structures <![ INCLUDE [ ... ]]> and <![ IGNORE [ ... ]]> are
- recognized and interpreted.
-
- This would be hard to realize in the yacc parser, because:
- - INCLUDE and IGNORE are not recognized as lexical keywords but as names.
- This means that the parser cannot select different rules for them.
- - The text after IGNORE requires a different lexical handling.
-
-- Entity references: &name; and %name;
- The named entity is looked up and the input source is redirected to it, i.e.
- if the main entity object gets the message 'next_token' this message is
- forwarded to the referenced entity. (This entity may choose to forward the
- message again to a third entity, and so on.)
-
- There are some fine points:
-
- - It is okay that redirection happens at token level, not at character level:
- + General entities must always match the 'content' production, and because
- of this they must always consist of a whole number of tokens.
- + If parameter entities are resolved, the XML specification states that
- a space character is inserted before and after the replacement text.
- This also means that such entities always consist of a whole number
- of tokens.
-
- - There are some "nesting constraints":
- + General entities must match the 'content' production. Because of this,
- the special token Begin_entity is inserted before the first token of
- the entity, and End_entity is inserted just before the Eof token. The
- brace Begin_entity...End_entity is recognized by the yacc parser, but
- only in the 'content' production.
- + External parameter entities must match 'extSubsetDecl'. Again,
- Begin_entity and End_entity tokens embrace the inner token stream.
- The brace Begin_entity...End_entity is recognized by the yacc parser
- at the appropriate position.
- (As general and parameter entities are used in different contexts
- (document vs. DTD), both kinds of entities can use the same brace
- Begin_entity...End_entity.)
- + TODO:
- The constraints for internal parameter entities are not yet checked.
-
- - Recursive references can be detected because entities must be opened
- before the 'next_token' method can be invoked.
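
The redirection mechanism can be modelled in a few lines of OCaml (a simplified sketch; the real Markup_entity code additionally manages lexbufs, inserts Begin_entity/End_entity, and uses the open/close protocol for recursion detection):

```ocaml
type token = Char_data of string | Eof

class type source = object method next_token : token end

class entity (toks : token list) = object (self)
  val mutable remaining = toks
  val mutable redirect : source option = None  (* set on &name; / %name; *)
  method open_reference (en : source) = redirect <- Some en
  method next_token : token =
    match redirect with
    | Some en ->
        (* Forward to the referenced entity; return to our own
           tokens once it is exhausted. *)
        let t = en#next_token in
        if t = Eof then (redirect <- None; self#next_token) else t
    | None ->
        (match remaining with
         | [] -> Eof
         | t :: rest -> remaining <- rest; t)
end
```

After `main#open_reference (sub :> source)`, tokens come from `sub` until it is exhausted, and then from `main` again.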
-
-----------------------------------------------------------------------
-The lexer layer
-----------------------------------------------------------------------
-
-There are five main lexers, and a number of auxiliary lexers. The five
-main lexers are:
-
-- Document (function scan_document):
- Scans an XML document outside the DTD and outside the element instance.
-
-- Content (function scan_content):
- Scans an element instance, but not within tags.
-
-- Within_tag (function scan_within_tag):
- Scans within <...>, i.e. a tag denoting an element instance.
-
-- Document_type (function scan_document_type):
- Scans after <!DOCTYPE until the corresponding >.
-
-- Declaration (function scan_declaration):
- Scans sequences of declarations.
-
-Why several lexers? Because there are different lexical rules in these
-five regions of an XML document.
-
-Every lexer not only produces tokens, but also the name of the next lexer
-to use. For example, if the Document lexer scans "<!DOCTYPE", it also
-outputs that the next token must be scanned by Document_type.
-
-It is interesting that this really works. The beginning of every lexical
-context can be recognized by the lexer of the previous context, and there
-is always a token that unambiguously indicates that the context ends.
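
The (token, next lexer) protocol can be pictured with stand-in functions (hypothetical; the real scan_* functions are generated by ocamllex from markup_lexers.mll, and the token and lexer-id types are much richer):

```ocaml
type token = Doctype_begin | Tag_begin | Tag_end | Text of string | Eof
type lexid = Document | Content | Within_tag | Document_type | Declaration

(* Stand-ins for the generated lexers: each returns the scanned token
   together with the name of the lexer to use for the next token. *)
let scan_document () = (Doctype_begin, Document_type)
let scan_content () = (Tag_begin, Within_tag)
let scan_within_tag () = (Tag_end, Content)
let scan_document_type () = (Text "...", Document_type)
let scan_declaration () = (Eof, Declaration)

(* Dispatcher: select the lexer that fits the current lexical context. *)
let scan (id : lexid) : token * lexid =
  match id with
  | Document -> scan_document ()
  | Content -> scan_content ()
  | Within_tag -> scan_within_tag ()
  | Document_type -> scan_document_type ()
  | Declaration -> scan_declaration ()
```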
-
-----------------------------------------------------------------------
-The DTD object
-----------------------------------------------------------------------
-
-There is usually one object that collects DTD declarations. All kinds of
-declarations are entered here:
-
-- element and attribute list declarations
-- entity declarations
-- notation declarations
-
-Some properties are validated directly after a declaration has been added
-to the DTD, but most validation is done by a 'validate' method.
-
-The result of 'validate' is stored such that another invocation is cheap.
-A DTD becomes 'unchecked' again if another declaration is added.
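
A sketch of this caching scheme (method and field names are invented; the real DTD object stores complete declarations and performs real checks):

```ocaml
class dtd = object
  val mutable elements : (string * string) list = []  (* (name, content model) *)
  val mutable checked = false        (* cached result of the last validate *)
  method add_element (name : string) (model : string) =
    elements <- (name, model) :: elements;
    checked <- false                 (* the DTD becomes 'unchecked' again *)
  method is_checked = checked
  method validate =
    if not checked then begin
      (* ... the expensive checks would run here ... *)
      checked <- true
    end
end
```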
-
-TODO: We need a special DTD object that allows every content.
-
-The DTD object is known by more or less every other object, i.e. entities
-know the DTD, element declarations and instances know the DTD, and so on.
-
-TODO: We need a method that deletes all entity declarations once the DTD
-is complete (to free memory).
-
-----------------------------------------------------------------------
-Element and Document objects
-----------------------------------------------------------------------
-
-The 'element' objects form the tree of the element instances.
-
-The 'document' object is a derivative of 'element' in which properties of the
-whole document can be stored.
-
-New element objects are NOT created by the "new class" mechanism, but
-instead by an exemplar/instance scheme: a new instance is a duplicate
-of an exemplar. This has the advantage that the user can provide their
-own classes for the element instances. A hashtable contains an exemplar
-for every element type (tag name), and there is a default exemplar.
-The user can configure this hashtable such that objects of class
-element_a are used for elements A, objects of class element_b for
-elements B, and so on.
-
-The object for the root element must already be created before parsing
-starts, and the parser returns the (filled) root object. Because of this,
-the user determines the *static* type of the object without needing a
-coercion back to the subclass type (which is not possible in OCaml).
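
The exemplar/instance scheme can be sketched with OCaml's functional object update `{< >}` (class names are invented; the real classes carry the document state and the extension object):

```ocaml
(* The common exemplar class; a new instance is a duplicate of self. *)
class element (name : string) = object
  method name = name
  method create_instance = {< >}    (* functional update: a copy of self *)
end

(* A user-provided class for elements named "a". *)
class element_a = object
  inherit element "a"
  method special = "extra behaviour for <a> elements"
end

(* Exemplar table: element type (tag name) -> exemplar; plus a default. *)
let exemplars : (string, element) Hashtbl.t = Hashtbl.create 17
let default_exemplar = new element "#default"
let () = Hashtbl.replace exemplars "a" (new element_a :> element)

(* A new instance is the duplicate of the registered exemplar. *)
let make_element tag =
  let ex = try Hashtbl.find exemplars tag with Not_found -> default_exemplar in
  ex # create_instance
```

Because `make_element` duplicates whatever exemplar was registered, instances of user classes such as element_a are produced without the parser knowing those classes.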
-
-----------------------------------------------------------------------
-Newline normalization
-----------------------------------------------------------------------
-
-The XML spec states that all of \n, \r, and \r\n must be recognized
-as newline characters/character sequences. Notes:
-- The replacement text of entities always contains the original text,
- i.e. \r and \r\n are NOT converted to \n.
- It is unclear if this is a violation of the standard or not.
-- Content of elements: Newline characters are converted to \n.
-- Attribute values: Newline characters are converted to spaces.
-- Processing instructions: Newline characters are not converted.
- It is unclear if this is a violation of the standard or not.
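
At string level, the two conversions can be modelled as follows (a model only; the real implementation performs the normalization in the lexer layer):

```ocaml
(* Replace every \r\n sequence and every lone \r or \n by [repl]. *)
let normalize_newlines ~(repl : string) (s : string) : string =
  let b = Buffer.create (String.length s) in
  let n = String.length s in
  let i = ref 0 in
  while !i < n do
    (match s.[!i] with
     | '\r' ->
         Buffer.add_string b repl;
         (* \r\n counts as a single newline *)
         if !i + 1 < n && s.[!i + 1] = '\n' then incr i
     | '\n' -> Buffer.add_string b repl
     | c -> Buffer.add_char b c);
    incr i
  done;
  Buffer.contents b

(* Element content: newlines become \n. *)
let normalize_content = normalize_newlines ~repl:"\n"
(* Attribute values: newlines become spaces. *)
let normalize_attval = normalize_newlines ~repl:" "
```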
-
-----------------------------------------------------------------------
-Empty entities
-----------------------------------------------------------------------
-
-Many entities are artificially surrounded by a Begin_entity/End_entity pair.
-This is sometimes not done if the entity is empty:
-
-- External parameter entities are parsed entities, i.e. they must match
- the markupdecl* production. If they are not empty, the Begin_entity/End_entity
- trick guarantees that they match markupdecl+, and that they are only
- referred to at positions where markupdecl+ is allowed.
- If they are empty, they are allowed everywhere just like internal
- parameter entities. Because of this, the Begin_entity/End_entity pair
- is dropped.
-
-- This does not apply to parameter entities (either external or internal)
- which are referred to in the internal subset, nor does it apply to
- internal parameter entities or to general entities:
-
- + References in the internal subset are only allowed at positions where
- markupdecl can occur, so Begin_entity/End_entity is added even if the
- entity is empty.
- + References to internal parameter entities are allowed anywhere, so
- Begin_entity/End_entity is never added.
- + References to general entities: An empty Begin_entity/End_entity pair
- is recognized by the yacc parser, so special handling is not required.
- Moreover, there is the situation that an empty entity is referred to
- after the toplevel element:
-
-     <!DOCTYPE doc ...[
-     <!ENTITY empty "">
-     ]>
-     <doc></doc>&empty;
-
- This is illegal, and the presence of an empty Begin_entity/End_entity
- pair helps to recognize this.