helm/DEVEL/pxp/pxp/doc/design.txt

   1 ------------------------------------------------ -*- indented-text -*-
   2 Some Notes About the Design:
   3 ----------------------------------------------------------------------
   4
   5 ----------------------------------------------------------------------
   6 Compilation
   7 ----------------------------------------------------------------------
   8
   9 Compilation is non-trivial because:
  10
  11  - The lexer and parser generators ocamlllex resp. ocamlyacc normally
  12    create code such that the parser module precedes the lexer module.
  13    THIS design requires that the lexer layer precedes the entity layer
  14    which precedes the parser layer, because the parsing results modify
  15    the behaviour of the lexer and entity layers. There is no way to get
  16    around this because of the nature of XML.
  17
  18    So the dependency relation of the lexer and the parser is modified;
  19    in particular the "token" type that is normally defined by the
  20    generated parser is moved to a common prdecessor of both lexer
  21    and parser.
  22
  23  - Another modification of the standard way of handling parsers is that
  24    the parser is turned into an object. This is necessary because the
  25    whole parser is polymorphic, i.e. there is a type parameter (the
  26    type of the node extension).
  27
  28 ......................................................................
  29
  30 First some modules are generated as illustrated by the following
  31 diagram:
  32
  33
  34                              markup_yacc.mly
  35                                |        |
  36                               \|/      \|/  [ocamlyacc, 1]
  37                                V        V
  38                      markup_yacc.mli  markup_yacc.ml
  39                           |             --> renamed into markup_yacc.ml0
  40             [awk, 2]     \|/                        |
  41                           V                        \|/   [sed, 3]
  42                markup_yacc_token.mlf                V
  43                        |       |              markup_yacc.ml
  44  markup_lexer_types_   |       |
  45  shadow.mli  |         |       | markup_lexer_types_
  46             \|/ [sed, \|/      | shadow.ml
  47              V    4]   V       |     |
  48         markup_lexer_types.mli |     |   [sed, 4]
  49                               \|/   \|/
  50                                V     V
  51                         markup_lexer_types.ml
  52
  53
  54                            markup_yacc_shadow.mli
  55                                    |
  56                                   \|/  [replaces, 5]
  57                                    V
  58                               markup_yacc.mli
  59
  60
  61
  62                            markup_lexers.mll
  63                                    |
  64                                   \|/  [ocamllex, 6]
  65                                    V
  66                             markup_lexers.ml
  67
  68
  69 Notes:
  70
  71         (1) ocamlyacc generates both a module and a module interface.
  72             The module is postprocessed in step (3). The interface cannot
  73             be used, but it contains the definition of the "token" type.
  74             This definition is extracted in step (2). The interface is
  75             completely replaced in step (5) by a different file.
  76
  77         (2) An "awk" script extracts the definition of the type "token".
  78             "token" is created by ocamlyacc upon the %token directives
  79             in markup_yacc.mly, and normally "token" is defined in
  80             the module generated by ocamlyacc. This turned out not to be
  81             useful as the module dependency must be that the lexer is
  82             an antecedent of the parser and not vice versa (as usually),
  83             so the "token" type is "moved" to the module Markup_lexer_types
  84             which is an antecedent of both the lexer and the parser.
  85
  86         (3) A "sed" script turns the generated parser into an object.
  87             This is rather simple; some "let" definitions must be rewritten
  88             as "val" definitions, the other "let" definitions as
  89             "method" definitions. The parser object is needed because
  90             the whole parser has a polymorphic type parameter.
  91
  92         (4) The implementation and definition of Markup_lexer_types are
  93             both generated by inserting the "token" type definition
  94             (in markup_lexer_types.mlf) into two pattern files,
  95             markup_lexer_types_shadow.ml resp. -.mli. The point of insertion
  96             is marked by the string INCLUDE_HERE.
  97
  98         (5) The generated interface of the Markup_yacc module is replaced
  99             by a hand-written file.
 100
 101         (6) ocamllex generates the lexer; this process is not patched in any
 102             way.
 103
 104 ......................................................................
 105
 106 After the additional modules have been generated, compilation proceeds
 107 in the usual manner.
 108
 109
 110 ----------------------------------------------------------------------
 111 Hierarchy of parsing layers:
 112 ----------------------------------------------------------------------
 113
 114 From top to bottom:
 115
 116  - Parser: Markup_yacc
 117    + gets input stream from the main entity object
 118    + checks most of the grammar
 119    + creates the DTD object as side-effect
 120    + creates the element tree as side-effect
 121    + creates further entity objects that are entered into the DTD
 122  - Entity layer: Markup_entity
 123    + gets input stream from the lexers, or another entity object
 124    + handles entity references: if a reference is encountered the
 125      input stream is redirected such that the tokens come from the
 126      referenced entity object
 127    + handles conditional sections
 128  - Lexer layer: Markup_lexers
 129    + gets input from lexbuffers created by resolvers
 130    + different lexers for different lexical contexts
 131    + a lexer returns pairs (token,lexid), where token is the scanned
 132      token, and lexid is the name of the lexer that must be used for
 133      the next token
 134  - Resolver layer: Markup_entity
 135    + a resolver creates the lexbuf from some character source
 136    + a resolver recodes the input and handles the encoding scheme
 137
 138 ----------------------------------------------------------------------
 139 The YACC based parser
 140 ----------------------------------------------------------------------
 141
 142 ocamlyacc allows it to pass an arbitrary 'next_token' function to the
 143 parsing functions. We always use 'en # next_token()' where 'en' is the
 144 main entity object representing the main file to be parsed.
 145
 146 The parser is not functional, but uses mainly side-effects to accumulate
 147 the structures that have been recognized. This is very important for the
 148 entity definitions, because once an entity definition has been found there
 149 may be a reference to it which is handled by the entity layer (which is
 150 below the yacc layer). This means that such a definition modifies the
 151 token source of the parser, and this can only be handled by side-effects
 152 (at least in a sensible manner; a purely functional parser would have to
 153 pass unresolved entity references to its caller, which would have to
 154 resolve the reference and to re-parse the whole document!).
 155
 156 Note that also element definitions profit from the imperative style of
 157 the parser; an element instance can be validated directly once the end
 158 tag has been read in.
 159
 160 ----------------------------------------------------------------------
 161 The entity layer
 162 ----------------------------------------------------------------------
 163
 164 The parser gets the tokens from the main entity object. This object
 165 controls the underlying lexing mechanism (see below), and already
 166 interprets the following:
 167
 168 - Conditional sections (if they are allowed in this entity):
 169   The structures <![ INCLUDE [ ... ]]> and <! IGNORE [ ... ]]> are
 170   recognized and interpreted.
 171
 172   This would be hard to realize by the yacc parser, because:
 173   - INCLUDE and IGNORE are not recognized as lexical keywords but as names.
 174     This means that the parser cannot select different rules for them.
 175   - The text after IGNORE requires a different lexical handling.
 176
 177 - Entity references: &name; and %name;
 178   The named entity is looked up and the input source is redirected to it, i.e.
 179   if the main entity object gets the message 'next_token' this message is
 180   forwarded to the referenced entity. (This entity may choose to forward the
 181   message again to a third entity, and so on.)
 182
 183   There are some fine points:
 184
 185   - It is okay that redirection happens at token level, not at character level:
 186     + General entities must always match the 'content' production, and because
 187       of this they must always consist of a whole number of tokens.
 188     + If parameter entities are resolved, the XML specification states that
 189       a space character is inserted before and after the replacement text.
 190       This also means that such entities always consists of a whole number
 191       of tokens.
 192
 193   - There are some "nesting constraints":
 194     + General entities must match the 'content' production. Because of this,
 195       the special token Begin_entity is inserted before the first token of
 196       the entity, and End_entity is inserted just before the Eof token. The
 197       brace Begin_entity...End_entity is recognized by the yacc parser, but
 198       only in the 'content' production.
 199     + External parameter entities must match 'extSubsetDecl'. Again,
 200       Begin_entity and End_entity tokens embrace the inner token stream.
 201       The brace Begin_entity...End_entity is recognized by the yacc parser
 202       at the appropriate position.
 203       (As general and parameter entities are used in different contexts
 204       (document vs. DTD), both kinds of entities can use the same brace
 205       Begin_entity...End_entity.)
 206     + TODO:
 207       The constraints for internal parameter entities are not yet checked.
 208
 209   - Recursive references can be detected because entities must be opened
 210     before the 'next_token' method can be invoked.
 211
 212 ----------------------------------------------------------------------
 213 The lexer layer
 214 ----------------------------------------------------------------------
 215
 216 There are five main lexers, and a number of auxiliary lexers. The five
 217 main lexers are:
 218
 219 - Document (function scan_document):
 220   Scans an XML document outside the DTD and outside the element instance.
 221
 222 - Content (function scan_content):
 223   Scans an element instance, but not within tags.
 224
 225 - Within_tag (function scan_within_tag):
 226   Scans within <...>, i.e. a tag denoting an element instance.
 227
 228 - Document_type (function scan_document_type):
 229   Scans after <!DOCTYPE until the corresponding >.
 230
 231 - Declaration (function scan_declaration):
 232   Scans sequences of declarations
 233
 234 Why several lexers? Because there are different lexical rules in these
 235 five regions of an XML document.
 236
 237 Every lexer not only produces tokens, but also the name of the next lexer
 238 to use. For example, if the Document lexer scans "<!DOCTYPE", it also
 239 outputs that the next token must be scanned by Document_type.
 240
 241 It is interesting that this really works. The beginning of every lexical
 242 context can be recognized by the lexer of the previous context, and there
 243 is always a token that unambigously indicates that the context ends.
 244
 245 ----------------------------------------------------------------------
 246 The DTD object
 247 ----------------------------------------------------------------------
 248
 249 There is usually one object that collects DTD declarations. All kinds of
 250 declarations are entered here:
 251
 252 - element and attribute list declarations
 253 - entity declarations
 254 - notation declarations
 255
 256 Some properties are validated directly after a declarations has been added
 257 to the DTD, but most validation is done by a 'validate' method.
 258
 259 The result of 'validate' is stored such that another invocation is cheap.
 260 A DTD becomes again 'unchecked' if another declaration is added.
 261
 262 TODO: We need a special DTD object that allows every content.
 263
 264 The DTD object is known by more or less every other object, i.e. entities
 265 know the DTD, element declarations and instances know the DTD, and so on.
 266
 267 TODO: We need a method that deletes all entity declarations once the DTD
 268 is complete (to free memory).
 269
 270 ----------------------------------------------------------------------
 271 Element and Document objects
 272 ----------------------------------------------------------------------
 273
 274 The 'element' objects form the tree of the element instances.
 275
 276 The 'document' object is a derivate of 'element' where properties of the
 277 whole document can be stored.
 278
 279 New element objects are NOT created by the "new class" mechanism, but
 280 instead by an exemplar/instance scheme: A new instance is the duplicate
 281 of an exemplar. This has the advantage that the user can provide own
 282 classes for the element instances. A hashtable contains the exemplars
 283 for every element type (tag name), and there is a default exemplar.
 284 The user can configure this hashtable such that for elements A objects
 285 of class element_a, for elements B objects of class element_b and so on
 286 are used.
 287
 288 The object for the root element must already be created before parsing
 289 starts, and the parser returns the (filled) root object. Because of this,
 290 the user determines the *static* type of the object without the need
 291 of back coercion (which is not possible in Ocaml).
 292
 293 ----------------------------------------------------------------------
 294 Newline normalization
 295 ----------------------------------------------------------------------
 296
 297 The XML spec states that all of \n, \r, and \r\n must be recognized
 298 as newline characters/character sequences. Notes:
 299 - The replacement text of entities always contains the orginal text,
 300   i.e. \r and \r\n are NOT converted to \n.
 301   It is unclear if this is a violation of the standard or not.
 302 - Content of elements: Newline characters are converted to \n.
 303 - Attribute values: Newline characters are converted to spaces.
 304 - Processing instructions: Newline characters are not converted.
 305   It is unclear if this is a violation of the standard or not.
 306
 307 ----------------------------------------------------------------------
 308 Empty entities
 309 ----------------------------------------------------------------------
 310
 311 Many entities are artificially surrounded by a Begin_entity/End_entity pair.
 312 This is sometimes not done if the entity is empty:
 313
 314 - External parameter entities are parsed entities, i.e. they must match
 315   the markupdecl* production. If they are not empty, the Begin_entity/End_entity
 316   trick guarantees that they match markupdecl+, and that they are only
 317   referred to at positions where markupdecl+ is allowed.
 318   If they are empty, they are allowed everywhere just like internal
 319   parameter entities. Because of this, the Begin_entity/End_entity pair
 320   is dropped.
 321
 322 - This does not apply to parameter entities (either external or internal)
 323   which are referred to in the internal subset, nor applies to internal
 324   parameter entities, nor applies to general entities:
 325
 326   + References in the internal subset are only allowed at positions where
 327     markupdecl can occur, so Begin_entity/End_entity is added even if the
 328     entity is empty.
 329   + References to internal parameter entities are allowed anywhere, so
 330     never Begin_entity/End_entity is added.
 331   + References to general entities: An empty  Begin_entity/End_entity pair
 332     is recognized by the yacc parser, so special handling is not required.
 333     Moreover, there is the situation that an empty entity is referred to
 334     after the toplevel element:
 335     <!DOCTYPE doc ...[
 336     <!ENTITY empty "">
 337     ]>
 338     <doc></doc>&empty;
 339     - This is illegal, and the presence of an empty Begin_entity/End_entity pair
 340     helps to recognize this.