X-Git-Url: http://matita.cs.unibo.it/gitweb/?a=blobdiff_plain;f=helm%2FDEVEL%2Fpxp%2Fpxp%2Fdoc%2Fdesign.txt;fp=helm%2FDEVEL%2Fpxp%2Fpxp%2Fdoc%2Fdesign.txt;h=bf75d06180e6ab31b6bb11d38eebf95762533886;hb=c03d2c1fdab8d228cb88aaba5ca0f556318bebc5;hp=0000000000000000000000000000000000000000;hpb=758057e85325f94cd88583feb1fdf6b038e35055;p=helm.git

diff --git a/helm/DEVEL/pxp/pxp/doc/design.txt b/helm/DEVEL/pxp/pxp/doc/design.txt
new file mode 100644
index 000000000..bf75d0618
--- /dev/null
+++ b/helm/DEVEL/pxp/pxp/doc/design.txt
@@ -0,0 +1,340 @@
+------------------------------------------------ -*- indented-text -*-
+Some Notes About the Design:
+----------------------------------------------------------------------
+
+----------------------------------------------------------------------
+Compilation
+----------------------------------------------------------------------
+
+Compilation is non-trivial because:
+
+ - The lexer and parser generators ocamllex and ocamlyacc normally
+   create code such that the parser module precedes the lexer module.
+   This design requires that the lexer layer precedes the entity layer,
+   which precedes the parser layer, because the parsing results modify
+   the behaviour of the lexer and entity layers. There is no way to get
+   around this because of the nature of XML.
+
+   So the dependency relation between the lexer and the parser is
+   reversed; in particular, the "token" type that is normally defined
+   by the generated parser is moved to a common predecessor of both
+   the lexer and the parser.
+
+ - Another deviation from the standard way of handling parsers is that
+   the parser is turned into an object. This is necessary because the
+   whole parser is polymorphic, i.e. there is a type parameter (the
+   type of the node extension).
+
+......................................................................
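The two points above can be illustrated with a small OCaml sketch. This is not PXP's actual source; the module name, the token constructors, and the class are only meant to mirror the structure described: the "token" type lives in a module preceding both lexer and parser, and wrapping the parser in a class lets the whole parser carry the type parameter 'ext (the node-extension type).

```ocaml
(* Sketch only: illustrative names, not PXP's real token type or parser. *)

(* Common predecessor of lexer and parser, carrying the shared token type. *)
module Markup_lexer_types = struct
  type token =
    | Name of string
    | Eof
end

open Markup_lexer_types

(* The generated parser, rewritten as a class so that the whole parser
   has the type parameter 'ext; results accumulate by side-effect. *)
class ['ext] parser_obj (make_node : string -> 'ext) = object
  val mutable nodes : 'ext list = []
  method parse (next_token : unit -> token) : 'ext list =
    let continue = ref true in
    while !continue do
      (match next_token () with
       | Eof -> continue := false
       | Name n -> nodes <- (make_node n) :: nodes)
    done;
    List.rev nodes
end

(* Usage: instantiate the parser with a concrete extension type (int). *)
let () =
  let p = new parser_obj String.length in
  let toks = ref [ Name "doc"; Name "x"; Eof ] in
  let next () = match !toks with [] -> Eof | t :: tl -> toks := tl; t in
  assert (p#parse next = [ 3; 1 ])
```

Because the class is polymorphic in 'ext, one compiled parser serves any node-extension type, which an ordinary ocamlyacc-generated module (a set of monomorphic toplevel functions) could not do.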
+
+First some modules are generated as illustrated by the following
+diagram:
+
+
+ markup_yacc.mly
+ | |
+ \|/ \|/ [ocamlyacc, 1]
+ V V
+ markup_yacc.mli markup_yacc.ml
+ | --> renamed into markup_yacc.ml0
+ [awk, 2] \|/ |
+ V \|/ [sed, 3]
+ markup_yacc_token.mlf V
+ | | markup_yacc.ml
+ markup_lexer_types_ | |
+ shadow.mli | | | markup_lexer_types_
+ \|/ [sed, \|/ | shadow.ml
+ V 4] V | |
+ markup_lexer_types.mli | | [sed, 4]
+ \|/ \|/
+ V V
+ markup_lexer_types.ml
+
+
+ markup_yacc_shadow.mli
+ |
+ \|/ [replaces, 5]
+ V
+ markup_yacc.mli
+
+
+
+ markup_lexers.mll
+ |
+ \|/ [ocamllex, 6]
+ V
+ markup_lexers.ml
+
+
+Notes:
+
+ (1) ocamlyacc generates both a module and a module interface.
+ The module is postprocessed in step (3). The interface cannot
+ be used, but it contains the definition of the "token" type.
+ This definition is extracted in step (2). The interface is
+ completely replaced in step (5) by a different file.
+
+ (2) An "awk" script extracts the definition of the type "token".
+ "token" is generated by ocamlyacc from the %token directives
+ in markup_yacc.mly, and normally "token" is defined in
+ the module generated by ocamlyacc. This is not useful here,
+ as the module dependency must be that the lexer is
+ an antecedent of the parser and not vice versa (as is usual),
+ so the "token" type is "moved" to the module Markup_lexer_types,
+ which is an antecedent of both the lexer and the parser.
+
+ (3) A "sed" script turns the generated parser into an object.
+ This is rather simple: some "let" definitions must be rewritten
+ as "val" definitions, the other "let" definitions as
+ "method" definitions. The parser object is needed because
+ the whole parser has a polymorphic type parameter.
+
+ (4) The implementation and definition of Markup_lexer_types are
+ both generated by inserting the "token" type definition
+ (in markup_yacc_token.mlf) into two pattern files,
+ markup_lexer_types_shadow.ml and markup_lexer_types_shadow.mli.
The point of insertion
+ is marked by the string INCLUDE_HERE.
+
+ (5) The generated interface of the Markup_yacc module is replaced
+ by a hand-written file.
+
+ (6) ocamllex generates the lexer; this process is not patched in any
+ way.
+
+......................................................................
+
+After the additional modules have been generated, compilation proceeds
+in the usual manner.
+
+
+----------------------------------------------------------------------
+Hierarchy of parsing layers:
+----------------------------------------------------------------------
+
+From top to bottom:
+
+ - Parser: Markup_yacc
+ + gets input stream from the main entity object
+ + checks most of the grammar
+ + creates the DTD object as a side-effect
+ + creates the element tree as a side-effect
+ + creates further entity objects that are entered into the DTD
+ - Entity layer: Markup_entity
+ + gets input stream from the lexers, or from another entity object
+ + handles entity references: if a reference is encountered, the
+ input stream is redirected such that the tokens come from the
+ referenced entity object
+ + handles conditional sections
+ - Lexer layer: Markup_lexers
+ + gets input from lexbuffers created by resolvers
+ + different lexers for different lexical contexts
+ + a lexer returns pairs (token, lexid), where token is the scanned
+ token, and lexid is the name of the lexer that must be used for
+ the next token
+ - Resolver layer: Markup_entity
+ + a resolver creates the lexbuf from some character source
+ + a resolver recodes the input and handles the encoding scheme
+
+----------------------------------------------------------------------
+The YACC-based parser
+----------------------------------------------------------------------
+
+ocamlyacc allows passing an arbitrary 'next_token' function to the
+parsing functions. We always use 'en # next_token()', where 'en' is the
+main entity object representing the main file to be parsed.
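How a next_token method gets plugged into a parsing function can be sketched as follows. The entity class and token type here are drastically simplified stand-ins (not PXP's real signatures); the point is only that the parsing function sees nothing but a next-token closure, so the entity object behind it can redirect the stream freely.

```ocaml
(* Sketch: a parsing function that receives 'en # next_token' as a closure. *)

type token = Word of string | Eof

(* A much simplified entity object: yields its tokens one by one. *)
class entity (toks : token list) = object
  val mutable rest = toks
  method next_token () =
    match rest with
    | [] -> Eof
    | t :: tl -> rest <- tl; t
end

(* In the style of an ocamlyacc entry point: it only sees the token
   function, not the entity object producing the tokens. *)
let parse (next_token : unit -> token) : string list =
  let rec loop acc =
    match next_token () with
    | Eof -> List.rev acc
    | Word w -> loop (w :: acc)
  in
  loop []

let () =
  let en = new entity [ Word "a"; Word "b"; Eof ] in
  (* corresponds to the document's 'en # next_token()' *)
  assert (parse (fun () -> en#next_token ()) = [ "a"; "b" ])
```

Because only the closure crosses the interface, the entity object can forward next_token to another entity mid-stream without the parser noticing, which is exactly what the entity layer below exploits.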
+
+The parser is not functional, but mainly uses side-effects to accumulate
+the structures that have been recognized. This is very important for the
+entity definitions, because once an entity definition has been found,
+there may be a reference to it, which is handled by the entity layer
+(which is below the yacc layer). This means that such a definition
+modifies the token source of the parser, and this can only be handled by
+side-effects (at least in a sensible manner; a purely functional parser
+would have to pass unresolved entity references to its caller, which
+would have to resolve the reference and re-parse the whole document!).
+
+Note that element definitions also profit from the imperative style of
+the parser; an element instance can be validated directly once the end
+tag has been read in.
+
+----------------------------------------------------------------------
+The entity layer
+----------------------------------------------------------------------
+
+The parser gets the tokens from the main entity object. This object
+controls the underlying lexing mechanism (see below), and already
+interprets the following:
+
+- Conditional sections (if they are allowed in this entity):
+ The structures <![INCLUDE[ ... ]]> and <![IGNORE[ ... ]]> are
+ recognized and interpreted.
+
+ This would be hard to realize in the yacc parser, because:
+ - INCLUDE and IGNORE are not recognized as lexical keywords but as names.
+ This means that the parser cannot select different rules for them.
+ - The text after IGNORE requires a different lexical handling.
+
+- Entity references: &name; and %name;
+ The named entity is looked up and the input source is redirected to it, i.e.
+ if the main entity object gets the message 'next_token', this message is
+ forwarded to the referenced entity. (This entity may choose to forward the
+ message again to a third entity, and so on.)
+
+ There are some fine points:
+
+ - It is okay that redirection happens at token level, not at character level:
+ + General entities must always match the 'content' production, and because
+ of this they must always consist of a whole number of tokens.
+ + If parameter entities are resolved, the XML specification states that
+ a space character is inserted before and after the replacement text.
+ This also means that such entities always consist of a whole number
+ of tokens.
+
+ - There are some "nesting constraints":
+ + General entities must match the 'content' production. Because of this,
+ the special token Begin_entity is inserted before the first token of
+ the entity, and End_entity is inserted just before the Eof token. The
+ brace Begin_entity...End_entity is recognized by the yacc parser, but
+ only in the 'content' production.
+ + External parameter entities must match 'extSubsetDecl'. Again,
+ Begin_entity and End_entity tokens embrace the inner token stream.
+ The brace Begin_entity...End_entity is recognized by the yacc parser
+ at the appropriate position.
+ (As general and parameter entities are used in different contexts
+ (document vs. DTD), both kinds of entities can use the same brace
+ Begin_entity...End_entity.)
+
+ TODO:
+ The constraints for internal parameter entities are not yet checked.
+
+ - Recursive references can be detected because entities must be opened
+ before the 'next_token' method can be invoked.
+
+----------------------------------------------------------------------
+The lexer layer
+----------------------------------------------------------------------
+
+There are five main lexers, and a number of auxiliary lexers. The five
+main lexers are:
+
+- Document (function scan_document):
+ Scans an XML document outside the DTD and outside the element instance.
+
+- Content (function scan_content):
+ Scans an element instance, but not within tags.
+
+- Within_tag (function scan_within_tag):
+ Scans within <...>, i.e.
a tag denoting an element instance.
+
+- Document_type (function scan_document_type):
+ Scans after "<!DOCTYPE".
+
+- Declaration (function scan_declaration):
+ Scans sequences of declarations.
+
+Why several lexers? Because there are different lexical rules in these
+five regions of an XML document.
+
+Every lexer not only produces tokens, but also the name of the next lexer
+to use. For example, if the Document lexer scans "<!DOCTYPE", it also
+returns the name of the lexer that must scan the following tokens
+(Document_type).
+
+Note that even an entity with empty replacement text (∅) produces a
+Begin_entity/End_entity pair. A reference to such an entity at a place
+where the grammar does not allow it is illegal, and the presence of the
+empty Begin_entity/End_entity pair helps to recognize this.
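The (token, next-lexer) protocol can be sketched in OCaml. The sketch below works on plain strings instead of real lexbufs, and the token and lexid constructors are illustrative stand-ins for PXP's actual definitions; it only shows the shape of the contract: every scan function returns the token together with the lexid naming the lexer for the next token.

```ocaml
(* Sketch: each scan function returns a (token, lexid) pair, where the
   lexid tells the driver which lexer must scan the next token. *)

type lexid = Document | Document_type
type token = Doctype | Name of string

(* Simplified scan_document: on "<!DOCTYPE" the lexical context changes,
   so the Document_type lexer must take over. *)
let scan_document (s : string) : token * lexid =
  if s = "<!DOCTYPE" then (Doctype, Document_type)  (* switch lexers *)
  else (Name s, Document)                           (* stay in Document *)

let () =
  assert (scan_document "<!DOCTYPE" = (Doctype, Document_type));
  assert (scan_document "foo" = (Name "foo", Document))
```

Returning the next lexid alongside each token keeps the lexer-selection decision inside the lexer layer, so neither the entity layer nor the parser needs to know when the lexical context changes.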