------------------------------------------------ -*- indented-text -*-
Some Notes About the Design:
----------------------------------------------------------------------

----------------------------------------------------------------------
Compilation
----------------------------------------------------------------------

Compilation is non-trivial because:

- The lexer and parser generators ocamllex and ocamlyacc normally
  create code such that the parser module precedes the lexer module.
  THIS design requires that the lexer layer precedes the entity layer,
  which precedes the parser layer, because the parsing results modify
  the behaviour of the lexer and entity layers. There is no way to get
  around this because of the nature of XML.

  So the dependency relation of the lexer and the parser is modified;
  in particular, the "token" type that is normally defined by the
  generated parser is moved to a common predecessor of both lexer and
  parser.

- Another modification of the standard way of handling parsers is
  that the parser is turned into an object. This is necessary because
  the whole parser is polymorphic, i.e. there is a type parameter
  (the type of the node extension).

......................................................................

First some modules are generated, as illustrated by the following
diagram:

  markup_yacc.mly
     |
     | [ocamlyacc, 1]
     V
  markup_yacc.mli, markup_yacc.ml --> renamed into markup_yacc.ml0

  markup_yacc.ml0
     |
     | [sed, 3]
     V
  markup_yacc.ml

  markup_yacc.mli
     |
     | [awk, 2]
     V
  markup_yacc_token.mlf
     |
     | [sed, 4], inserted into markup_lexer_types_shadow.mli
     |           and markup_lexer_types_shadow.ml
     V
  markup_lexer_types.mli, markup_lexer_types.ml

  markup_yacc_shadow.mli
     |
     | [replaces, 5]
     V
  markup_yacc.mli

  markup_lexers.mll
     |
     | [ocamllex, 6]
     V
  markup_lexers.ml

Notes:

(1) ocamlyacc generates both a module and a module interface. The
    module is postprocessed in step (3). The interface cannot be used
    as it is, but it contains the definition of the "token" type.
    This definition is extracted in step (2). The interface is
    completely replaced in step (5) by a different file.

(2) An "awk" script extracts the definition of the type "token".
    "token" is created by ocamlyacc from the %token directives in
    markup_yacc.mly, and normally "token" is defined in the module
    generated by ocamlyacc. This turned out not to be useful, as the
    module dependency must be that the lexer is an antecedent of the
    parser and not vice versa (as is usual), so the "token" type is
    "moved" to the module Markup_lexer_types, which is an antecedent
    of both the lexer and the parser.

(3) A "sed" script turns the generated parser into an object. This is
    rather simple: some "let" definitions must be rewritten as "val"
    definitions, the other "let" definitions as "method" definitions.
    The parser object is needed because the whole parser has a
    polymorphic type parameter.

(4) The implementation and definition of Markup_lexer_types are both
    generated by inserting the "token" type definition (in
    markup_yacc_token.mlf) into two pattern files,
    markup_lexer_types_shadow.ml and markup_lexer_types_shadow.mli,
    respectively. The point of insertion is marked by the string
    INCLUDE_HERE.

(5) The generated interface of the Markup_yacc module is replaced by
    a hand-written file.

(6) ocamllex generates the lexer; this process is not patched in any
    way.

......................................................................

After the additional modules have been generated, compilation
proceeds in the usual manner.
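The object transformation of step (3) can be sketched as follows. This
is a minimal illustration with invented names, not the real generated
code: it only shows why a class is used, namely that a class can carry
a type parameter ['ext] which all its "val" and "method" definitions
share, whereas the plain top-level "let" definitions emitted by
ocamlyacc cannot.

```ocaml
(* Hypothetical sketch: "let" state becomes "val mutable", parsing
   functions become "method"s, and the node-extension type becomes
   the class parameter ['ext]. *)

type 'ext node =
  { name : string; mutable children : 'ext node list; ext : 'ext }

class ['ext] parser_obj (make_ext : string -> 'ext) =
  object
    (* a "let" holding mutable parser state becomes a "val" ... *)
    val mutable root : 'ext node option = None
    (* ... and the parsing entry points become "method"s *)
    method start_root_element name =
      root <- Some { name; children = []; ext = make_ext name }
    method root = root
  end

let () =
  (* instantiating with 'ext = int *)
  let p = new parser_obj String.length in
  p#start_root_element "doc";
  match p#root with
  | Some r -> assert (r.name = "doc" && r.ext = 3)
  | None -> assert false
```

Each instantiation of the class fixes 'ext once for the whole parse,
which is exactly what the "sed"-patched parser needs.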
---------------------------------------------------------------------- Hierarchy of parsing layers: ---------------------------------------------------------------------- From top to bottom: - Parser: Markup_yacc + gets input stream from the main entity object + checks most of the grammar + creates the DTD object as side-effect + creates the element tree as side-effect + creates further entity objects that are entered into the DTD - Entity layer: Markup_entity + gets input stream from the lexers, or another entity object + handles entity references: if a reference is encountered the input stream is redirected such that the tokens come from the referenced entity object + handles conditional sections - Lexer layer: Markup_lexers + gets input from lexbuffers created by resolvers + different lexers for different lexical contexts + a lexer returns pairs (token,lexid), where token is the scanned token, and lexid is the name of the lexer that must be used for the next token - Resolver layer: Markup_entity + a resolver creates the lexbuf from some character source + a resolver recodes the input and handles the encoding scheme ---------------------------------------------------------------------- The YACC based parser ---------------------------------------------------------------------- ocamlyacc allows it to pass an arbitrary 'next_token' function to the parsing functions. We always use 'en # next_token()' where 'en' is the main entity object representing the main file to be parsed. The parser is not functional, but uses mainly side-effects to accumulate the structures that have been recognized. This is very important for the entity definitions, because once an entity definition has been found there may be a reference to it which is handled by the entity layer (which is below the yacc layer). 
This means that such a definition modifies the token source of the
parser, and this can only be handled by side-effects (at least in a
sensible manner; a purely functional parser would have to pass
unresolved entity references to its caller, which would have to
resolve the reference and re-parse the whole document!).

Note that element definitions also profit from the imperative style
of the parser; an element instance can be validated directly once the
end tag has been read in.

----------------------------------------------------------------------
The entity layer
----------------------------------------------------------------------

The parser gets the tokens from the main entity object. This object
controls the underlying lexing mechanism (see below), and already
interprets the following:

- Conditional sections (if they are allowed in this entity):
  The structures <![ INCLUDE [ ... ]]> and <![ IGNORE [ ... ]]> are
  recognized and interpreted. This would be hard to realize in the
  yacc parser, because:
  - INCLUDE and IGNORE are not recognized as lexical keywords but as
    names. This means that the parser cannot select different rules
    for them.
  - The text after IGNORE requires a different lexical handling.

- Entity references: &name; and %name;
  The named entity is looked up, and the input source is redirected
  to it, i.e. if the main entity object gets the message
  'next_token', this message is forwarded to the referenced entity.
  (This entity may choose to forward the message again, to a third
  entity, and so on.)

  There are some fine points:

  - It is okay that redirection happens at the token level, not at
    the character level:
    + General entities must always match the 'content' production,
      and because of this they must always consist of a whole number
      of tokens.
    + If parameter entities are resolved, the XML specification
      states that a space character is inserted before and after the
      replacement text. This also means that such entities always
      consist of a whole number of tokens.
  - There are some "nesting constraints":
    + General entities must match the 'content' production. Because
      of this, the special token Begin_entity is inserted before the
      first token of the entity, and End_entity is inserted just
      before the Eof token. The brace Begin_entity...End_entity is
      recognized by the yacc parser, but only in the 'content'
      production.
    + External parameter entities must match 'extSubsetDecl'. Again,
      Begin_entity and End_entity tokens embrace the inner token
      stream. The brace Begin_entity...End_entity is recognized by
      the yacc parser at the appropriate position. (As general and
      parameter entities are used in different contexts (document
      vs. DTD), both kinds of entities can use the same brace
      Begin_entity...End_entity.)
    + TODO: The constraints for internal parameter entities are not
      yet checked.

  - Recursive references can be detected because entities must be
    opened before the 'next_token' method can be invoked.

----------------------------------------------------------------------
The lexer layer
----------------------------------------------------------------------

There are five main lexers, and a number of auxiliary lexers. The
five main lexers are:

- Document (function scan_document):
  Scans an XML document outside the DTD and outside the element
  instance.

- Content (function scan_content):
  Scans an element instance, but not within tags.

- Within_tag (function scan_within_tag):
  Scans within <...>, i.e. a tag denoting an element instance.

- Document_type (function scan_document_type):
  Scans after <!DOCTYPE.

- Declaration (function scan_declaration):
  Scans sequences of declarations.

Why several lexers? Because there are different lexical rules in
these five regions of an XML document.

Every lexer not only produces tokens, but also the name of the next
lexer to use. For example, if the Document lexer scans "<!DOCTYPE",
it also returns that the following tokens must be scanned by the
Document_type lexer.

Note that an entity with empty replacement text expands to an empty
Begin_entity...End_entity brace. Where such an empty expansion is
illegal, the presence of the empty Begin_entity/End_entity pair helps
to recognize this.
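The (token, lexid) protocol described above can be sketched as
follows. The token names, the list-based input, and the scan
functions are invented stand-ins, not the real Markup_lexers
interface: the point is only that each scan returns the lexid that
selects the lexer for the next token, and the driver dispatches on
it.

```ocaml
(* Hypothetical token and lexer names for illustration. *)
type token = Doctype | Name of string | Dtd_end | Eof
type lexid = Document | Document_type

(* Stand-ins for ocamllex entry points such as scan_document: each
   returns the scanned token paired with the lexid for the NEXT
   token, plus the remaining input. *)
let scan_document input =
  match input with
  | "<!DOCTYPE" :: rest -> (Doctype, Document_type), rest
  | [] -> (Eof, Document), []
  | s :: rest -> (Name s, Document), rest

let scan_document_type input =
  match input with
  | "]>" :: rest -> (Dtd_end, Document), rest
  | [] -> (Eof, Document_type), []
  | s :: rest -> (Name s, Document_type), rest

(* The driver keeps the current lexid and switches lexers as
   instructed by the previous scan. *)
let rec drive lexid input =
  let scan =
    match lexid with
    | Document -> scan_document
    | Document_type -> scan_document_type
  in
  match scan input with
  | (Eof, _), _ -> [Eof]
  | (tok, next_lexid), rest -> tok :: drive next_lexid rest

let () =
  let toks = drive Document ["<!DOCTYPE"; "doc"; "]>"; "x"] in
  assert (toks = [Doctype; Name "doc"; Dtd_end; Name "x"; Eof])
```

In the real parser the entity layer sits between this loop and the
yacc parser, but the lexer-switching mechanism is the same.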