X-Git-Url: http://matita.cs.unibo.it/gitweb/?a=blobdiff_plain;f=helm%2FDEVEL%2Fpxp%2Fpxp%2Fdoc%2Fdesign.txt;fp=helm%2FDEVEL%2Fpxp%2Fpxp%2Fdoc%2Fdesign.txt;h=0000000000000000000000000000000000000000;hb=c7514aaa249a96c5fdd39b1123fbdb38d92f20b6;hp=bf75d06180e6ab31b6bb11d38eebf95762533886;hpb=1c7fb836e2af4f2f3d18afd0396701f2094265ff;p=helm.git

diff --git a/helm/DEVEL/pxp/pxp/doc/design.txt b/helm/DEVEL/pxp/pxp/doc/design.txt
deleted file mode 100644
index bf75d0618..000000000
--- a/helm/DEVEL/pxp/pxp/doc/design.txt
+++ /dev/null
@@ -1,340 +0,0 @@
------------------------------------------------- -*- indented-text -*-
Some Notes About the Design:
----------------------------------------------------------------------

----------------------------------------------------------------------
Compilation
----------------------------------------------------------------------

Compilation is non-trivial because:

 - The lexer and parser generators ocamllex and ocamlyacc normally
   create code such that the parser module precedes the lexer module.
   This design, however, requires that the lexer layer precedes the
   entity layer, which in turn precedes the parser layer, because the
   parsing results modify the behaviour of the lexer and entity layers.
   There is no way around this because of the nature of XML.

   So the dependency relation of the lexer and the parser is inverted;
   in particular, the "token" type that is normally defined by the
   generated parser is moved to a common predecessor of both the lexer
   and the parser.

 - Another departure from the standard way of handling parsers is that
   the parser is turned into an object. This is necessary because the
   whole parser is polymorphic, i.e. there is a type parameter (the
   type of the node extension).

......................................................................
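The second point can be illustrated with a small sketch. The names below
(parser_obj, make_ext, node) are invented for illustration and are not the
real Markup_yacc interface; the point is only that an object can carry the
node-extension type parameter 'ext through all of its methods:

```ocaml
(* Hypothetical sketch, not the real PXP API: the class carries the
   type parameter 'ext (the node extension), so all methods agree on
   the same 'ext. *)
type 'ext node = { name : string; ext : 'ext; children : 'ext node list }

class ['ext] parser_obj (make_ext : string -> 'ext) =
  object
    val mutable stack : 'ext node list = []

    (* Push a new element node whose extension is built by [make_ext]. *)
    method start_element name =
      stack <- { name; ext = make_ext name; children = [] } :: stack

    (* Pop the finished element and attach it to its parent. *)
    method end_element =
      match stack with
      | child :: parent :: rest ->
          stack <- { parent with children = child :: parent.children } :: rest
      | other -> stack <- other

    method result = List.hd stack
  end

let () =
  let p = new parser_obj (fun n -> String.length n) in
  p#start_element "doc";
  p#start_element "a";
  p#end_element;
  assert ((p#result).ext = 3)   (* extension of "doc" *)
```

With plain let-bound functions, as ocamlyacc generates them, the type
parameter has no place to live, which is why the generated code is
postprocessed into an object (see step (3) below).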
 
First some modules are generated, as illustrated by the following
diagram (one line per generation step; the bracketed numbers refer to
the notes below):

  markup_yacc.mly        --[ocamlyacc, 1]--> markup_yacc.mli, markup_yacc.ml
  markup_yacc.ml         --[renamed]-------> markup_yacc.ml0
  markup_yacc.ml0        --[sed, 3]--------> markup_yacc.ml
  markup_yacc.mli        --[awk, 2]--------> markup_yacc_token.mlf
  markup_yacc_token.mlf
    + markup_lexer_types_shadow.mli
                         --[sed, 4]--------> markup_lexer_types.mli
  markup_yacc_token.mlf
    + markup_lexer_types_shadow.ml
                         --[sed, 4]--------> markup_lexer_types.ml
  markup_yacc_shadow.mli --[replaces, 5]---> markup_yacc.mli
  markup_lexers.mll      --[ocamllex, 6]---> markup_lexers.ml

Notes:

 (1) ocamlyacc generates both a module and a module interface.
     The module is postprocessed in step (3). The interface cannot
     be used as such, but it contains the definition of the "token"
     type. This definition is extracted in step (2), and the interface
     is completely replaced in step (5) by a different file.

 (2) An "awk" script extracts the definition of the type "token".
     "token" is created by ocamlyacc from the %token directives
     in markup_yacc.mly, and normally "token" is defined in
     the module generated by ocamlyacc. This turned out not to be
     useful, as the module dependency must be that the lexer is
     an antecedent of the parser and not vice versa (as usual),
     so the "token" type is "moved" to the module Markup_lexer_types,
     which is an antecedent of both the lexer and the parser.

 (3) A "sed" script turns the generated parser into an object.
     This is rather simple: some "let" definitions must be rewritten
     as "val" definitions, the other "let" definitions as
     "method" definitions. The parser object is needed because
     the whole parser has a polymorphic type parameter.

 (4) The implementation and the definition of Markup_lexer_types are
     both generated by inserting the "token" type definition
     (in markup_lexer_types.mlf) into two pattern files,
     markup_lexer_types_shadow.ml and markup_lexer_types_shadow.mli.
The point of insertion
     is marked by the string INCLUDE_HERE.

 (5) The generated interface of the Markup_yacc module is replaced
     by a hand-written file.

 (6) ocamllex generates the lexer; this process is not patched in any
     way.

......................................................................

After the additional modules have been generated, compilation proceeds
in the usual manner.


----------------------------------------------------------------------
Hierarchy of parsing layers:
----------------------------------------------------------------------

From top to bottom:

- Parser: Markup_yacc
  + gets input stream from the main entity object
  + checks most of the grammar
  + creates the DTD object as side-effect
  + creates the element tree as side-effect
  + creates further entity objects that are entered into the DTD

- Entity layer: Markup_entity
  + gets input stream from the lexers, or another entity object
  + handles entity references: if a reference is encountered the
    input stream is redirected such that the tokens come from the
    referenced entity object
  + handles conditional sections

- Lexer layer: Markup_lexers
  + gets input from lexbuffers created by resolvers
  + different lexers for different lexical contexts
  + a lexer returns pairs (token, lexid), where token is the scanned
    token, and lexid is the name of the lexer that must be used for
    the next token

- Resolver layer: Markup_entity
  + a resolver creates the lexbuf from some character source
  + a resolver recodes the input and handles the encoding scheme

----------------------------------------------------------------------
The YACC based parser
----------------------------------------------------------------------

ocamlyacc allows an arbitrary 'next_token' function to be passed to the
parsing functions. We always use 'en # next_token()' where 'en' is the
main entity object representing the main file to be parsed.
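This arrangement can be sketched as follows. The token type, the entity
class, and the parse function below are simplified stand-ins (the real
token type and the generated Markup_yacc entry points are much richer);
only the wiring is the point:

```ocaml
(* Sketch with invented names.  An ocamlyacc entry point has the type
   (Lexing.lexbuf -> token) -> Lexing.lexbuf -> result, so any function
   of that type can be passed in -- including one that ignores the
   lexbuf and asks an entity object for the next token instead. *)
type token = Ident of string | Eof

class entity (toks : token list) =
  object
    val mutable rest = toks
    method next_token () =
      match rest with
      | [] -> Eof
      | t :: tl -> rest <- tl; t
  end

(* Stand-in for a generated parser entry point such as Markup_yacc.document: *)
let parse (next : Lexing.lexbuf -> token) (lexbuf : Lexing.lexbuf) =
  let rec loop acc =
    match next lexbuf with
    | Eof -> List.rev acc
    | Ident s -> loop (s :: acc)
  in
  loop []

let () =
  let en = new entity [ Ident "a"; Ident "b" ] in
  let lexbuf = Lexing.from_string "" in   (* dummy; tokens come from [en] *)
  assert (parse (fun _ -> en#next_token ()) lexbuf = [ "a"; "b" ])
```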
 
The parser is not functional, but uses mainly side-effects to accumulate
the structures that have been recognized. This is very important for the
entity definitions, because once an entity definition has been found there
may be a reference to it, which is handled by the entity layer (which is
below the yacc layer). This means that such a definition modifies the
token source of the parser, and this can only be handled by side-effects
(at least in a sensible manner; a purely functional parser would have to
pass unresolved entity references to its caller, which would have to
resolve the reference and re-parse the whole document!).

Note that element definitions also profit from the imperative style of
the parser; an element instance can be validated directly once its end
tag has been read.

----------------------------------------------------------------------
The entity layer
----------------------------------------------------------------------

The parser gets the tokens from the main entity object. This object
controls the underlying lexing mechanism (see below), and already
interprets the following:

- Conditional sections (if they are allowed in this entity):
  The structures <![ INCLUDE [ ... ]]> and <![ IGNORE [ ... ]]> are
  recognized and interpreted.

  This would be hard to realize in the yacc parser, because:
  - INCLUDE and IGNORE are not recognized as lexical keywords but as names.
    This means that the parser cannot select different rules for them.
  - The text after IGNORE requires a different lexical handling.

- Entity references: &name; and %name;
  The named entity is looked up and the input source is redirected to it, i.e.
  if the main entity object gets the message 'next_token' this message is
  forwarded to the referenced entity. (This entity may choose to forward the
  message again to a third entity, and so on.)
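The forwarding scheme can be sketched like this (invented names and a
toy token type; the real entity objects also manage lexers, encodings
and recursion checks):

```ocaml
(* Sketch with invented names: token-level redirection for entity
   references.  When a reference is encountered, the current entity
   redirects next_token to the referenced entity until it is exhausted,
   then resumes its own token stream. *)
type token = Char_data of string | Eof

class entity (toks : token list) =
  object (self)
    val mutable rest = toks
    val mutable redirect : < next_token : unit -> token > option = None

    method next_token () : token =
      match redirect with
      | Some en ->
          (match en#next_token () with
           | Eof -> redirect <- None; self#next_token ()  (* resume here *)
           | t -> t)
      | None ->
          (match rest with
           | [] -> Eof
           | t :: tl -> rest <- tl; t)

    (* Called when a reference like &name; has been recognized. *)
    method open_reference en = redirect <- Some en
  end

let () =
  let inner = new entity [ Char_data "replacement" ] in
  let main = new entity [ Char_data "a"; Char_data "b" ] in
  assert (main#next_token () = Char_data "a");
  main#open_reference (inner :> < next_token : unit -> token >);
  assert (main#next_token () = Char_data "replacement");
  assert (main#next_token () = Char_data "b")
```

A chain of such redirections gives exactly the behaviour described above:
the forwarded entity may itself be redirecting to a third entity.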
 
  There are some fine points:

  - It is okay that redirection happens at token level, not at character level:
    + General entities must always match the 'content' production, and because
      of this they must always consist of a whole number of tokens.
    + If parameter entities are resolved, the XML specification states that
      a space character is inserted before and after the replacement text.
      This also means that such entities always consist of a whole number
      of tokens.

  - There are some "nesting constraints":
    + General entities must match the 'content' production. Because of this,
      the special token Begin_entity is inserted before the first token of
      the entity, and End_entity is inserted just before the Eof token. The
      brace Begin_entity...End_entity is recognized by the yacc parser, but
      only in the 'content' production.
    + External parameter entities must match 'extSubsetDecl'. Again,
      Begin_entity and End_entity tokens embrace the inner token stream.
      The brace Begin_entity...End_entity is recognized by the yacc parser
      at the appropriate position.
      (As general and parameter entities are used in different contexts
      (document vs. DTD), both kinds of entities can use the same brace
      Begin_entity...End_entity.)
    + TODO:
      The constraints for internal parameter entities are not yet checked.

  - Recursive references can be detected because entities must be opened
    before the 'next_token' method can be invoked.

----------------------------------------------------------------------
The lexer layer
----------------------------------------------------------------------

There are five main lexers, and a number of auxiliary lexers. The five
main lexers are:

- Document (function scan_document):
  Scans an XML document outside the DTD and outside the element instance.

- Content (function scan_content):
  Scans an element instance, but not within tags.

- Within_tag (function scan_within_tag):
  Scans within <...>, i.e.
a tag denoting an element instance.

- Document_type (function scan_document_type):
  Scans the material following "<!DOCTYPE".

- Declaration (function scan_declaration):
  Scans sequences of declarations.

Why several lexers? Because there are different lexical rules in these
five regions of an XML document.

Every lexer not only produces tokens, but also the name of the next lexer
to use. For example, if the Document lexer scans "<!DOCTYPE", it returns
Document_type as the lexer for the tokens that follow.

Consider, for instance, a document that ends directly after the DTD,
without a root element:

  <!DOCTYPE doc [
    ...
  ]>
  ∅

  This is illegal, and the presence of an empty Begin_entity/End_entity pair
  helps to recognize this.
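The (token, next-lexer) convention can be sketched as follows; the
constructors are simplified stand-ins (the real lexers are generated by
ocamllex and return PXP's full token type):

```ocaml
(* Sketch: every scanning function returns the scanned token together
   with the lexer to be used for the following input. *)
type lexid = Document | Document_type | Content
type token = Doctype_begin | Text of string

(* Stand-in for the Document lexer: *)
let scan_document (s : string) : token * lexid =
  if s = "<!DOCTYPE" then (Doctype_begin, Document_type)
  else (Text s, Document)

let () =
  assert (scan_document "<!DOCTYPE" = (Doctype_begin, Document_type));
  assert (scan_document "hello" = (Text "hello", Document))
```

The caller (the entity layer) simply threads the returned lexid into the
next scanning call, so lexical context switches never need lookahead.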