X-Git-Url: http://matita.cs.unibo.it/gitweb/?a=blobdiff_plain;f=helm%2FDEVEL%2Fpxp%2Fpxp%2Fm2parsergen%2FREADME;fp=helm%2FDEVEL%2Fpxp%2Fpxp%2Fm2parsergen%2FREADME;h=cccf7aa5557be8ce6cf217f7b91cfffd89180b4e;hb=c03d2c1fdab8d228cb88aaba5ca0f556318bebc5;hp=0000000000000000000000000000000000000000;hpb=758057e85325f94cd88583feb1fdf6b038e35055;p=helm.git diff --git a/helm/DEVEL/pxp/pxp/m2parsergen/README b/helm/DEVEL/pxp/pxp/m2parsergen/README new file mode 100644 index 000000000..cccf7aa55 --- /dev/null +++ b/helm/DEVEL/pxp/pxp/m2parsergen/README @@ -0,0 +1,319 @@ +---------------------------------------------------------------------- +m2parsergen +---------------------------------------------------------------------- + +This is a parser generator for top-down (or recursively descending) parsers. +The input file must be structured as follows: + +---------------------------------------- Begin of file + + + +%% + + + +%% + + + +%% + + + +---------------------------------------- End of file + +The two-character combination %% separates the various sections. The +text before the first %% and after the last %% will be copied verbatim +to the output file. + +Within the declarations and rules sections you must use /* ... */ as +comment braces. + +There are two types of declarations: + +%token Name + +declares that Name is a token without associated value, and + +%token <> Name + +declares that Name is a token with associated value (i.e. Name x). + +In contrast to ocamlyacc, you need not to specify a type. This is a +fundamental difference, because m2parsergen will not generate a type +declaration for a "token" type; you must do this yourself. + +You need not to declare start symbols; every grammar rule may be used +as start symbol. + +The rules look like: + +name_of_rule(arg1, arg2, ...): + label1:symbol1 label2:symbol2 ... {{ CODE }} +| label1:symbol1 label2:symbol2 ... {{ CODE }} +... +| label1:symbol1 label2:symbol2 ... {{ CODE }} + +The rules may have arguments (note that you must write the +parantheses, even if the rule does not have arguments). Here, arg1, +arg2, ... are the formal names of the arguments; you may refer to them +in OCaml code. + +Furthermore, the symbols may have labels (you can leave the labels +out). You can refer to the value associated with a symbol by its +label, i.e. there is an OCaml variable with the same name as the label +prescribes, and this variable contains the value. + +The OCaml code must be embraced by {{ and }}, and these separators +must not occur within the code. + +EXAMPLE: + +prefix_term(): + Plus_symbol Left_paren v1:prefix_term() Comma v2:prefix_term() Right_paren + {{ v1 + v2 }} +| Times_symbol Left_paren v1:prefix_term() Comma v2:prefix_term() Right_paren + {{ v1 * v2 }} +| n:Number + {{ n }} + +As you can see in the example, you must pass values for the arguments +if you call non-terminal symbols (here, the argument list is empty: ()). + +The generated parsers behave as follows: + +- A rule is applicable to a token sequence if the first token is + matched by the rule. + + In the example: prefix_term is applicable if the first token of a + sequence is either Plus_symbol, Times_symbol, or Number. + +- One branch of the applicable rule is selected: it is the first + branch that matches the first token. THE OTHER TOKENS DO NOT HAVE + ANY EFFECT ON BRANCH SELECTION! + + For instance, in the following rule the second branch is never + selected, because only the A is used to select the branch: + + a(): + A B {{ ... }} + | A C {{ ... }} + +- Once a branch is selected, it is checked whether the branch matches + the token sequence. If this check succeeds, the code section of the + branch is executed, and the resulting value is returned to the + caller. + If the check fails, the exception Parsing.Parse_error is raised. + Normally, this exception is not caught, and will force the parser + to stop. + + The check in detail: + + If the rule demands a terminal, there a must be exactly this + terminal at the corresponding location in the token sequence. + + If the rule demands a non-terminal, it is checked whether the rule + for to this non-terminal is applicable. If so, the branch + is selected, and recursively checked. If the rule is not applicable, + the check fails immediately. + +- THERE IS NO BACKTRACKING! + + Note that the following works (but the construction is resolved at + generation time): + + rule1() = + rule2() A B ... {{ ... }} + + rule2() = + C {{ ... }} + | D {{ ... }} + + In this case, the (only) branch of rule1 is selected if the next + token is C or D. + +--- + + + +*** Options and repetitions *** + +Symbols can be tagged as being optional, or to occur repeatedly: + +rule(): + Name whitespace()* Question_mark? + +- "*": The symbol matches zero or more occurrences. + +- "?": The symbol matches zero or one occurrence. + +This is done as follows: + +- terminal*: The maximum number of consecutive tokens are + matched. +- non-terminal*: The maximum number of the subsequences matching + are matched. Before another + subsequence is matched, it is checked whether the + rule for is applicable. If so, the + rule is invoked and must succeed (otherwise Parsing. + Parse_error). If not, the loop is exited. + +- terminal?: If the next token is , it is matched. If not, + no token is matched. + +- non-terminal?: It is checked whether the rule for + is applicable. If so, the rule is invoked, and + matches a sequence of tokens. If not, no token is + matched. + +You may refer to repeated or optional symbols by labels. In this case, +the label is associated with lists of values, or optional values, +respectively: + +rule(): + A lab:other()* lab':unlikely()? + {{ let n = List.length lab in ... + match lab' with + None -> ... + | Some v -> ... + }} + +A different scheme is applied if the symbol is a token without +associated value (%token Name, and NOT %token <> Name): + +rule(): + A lab:B* lab':C? + +Here, "lab" becomes an integer variable counting the number of Bs, and +"lab'" becomes a boolean variable denoting whether there is a C or not. + + +*** Early let-binding *** + +You may put some OCaml code directly after the first symbol of a +branch: + +rule(): + A $ {{ let-binding }} C D ... {{ ... }} + +The code brace {{ let-binding }} must be preceded by a dollar +sign. You can put "let ... = ... in" statements into this brace: + +rule1(): + n:A $ {{ let twice = 2 * n in }} rule2(twice) {{ ... }} + +This code is executed once the branch is selected. + + +*** Very early let-binding *** + +This is also possible: + +rule(): + $ {{ CODE }} + A + ... + +The CODE is executed right when the branch is selected, and before any +other happens. (Only for hacks!) + + + +*** Computed rules *** + +rule(): + A $ {{ let followup = ... some function ... in }} [ followup ]() + {{ ... }} + +Between [ and ], you can refer to the O'Caml name of *any* function. +Here, the function "followup" is bound in the let-binding. + + +*** Error handling *** + +If a branch is already selected, but the check fails whether the other +symbols of the branch match, it is possible to catch the resulting +exception and to find out at which position the failure has occurred. + +rule(): + x:A y:B z:C {{ ... }} ? {{ ERROR-CODE }} + +After a question mark, it is allowed to append another code +brace. This code is executed if the branch check fails (but not if the +branch is not selected nor if no branches are selected). The string +variable !yy_position contains the label of the symbol that caused the +failure (or it contains the empty string if the symbol does not have a +label). + +Example: + +rule(): + x:A y:B z:C {{ print_endline "SUCCESS" }} ? {{ print_endline !yy_position }} + +If the token sequence is A B C, "SUCCESS" will be printed. If the +sequence is A C, the second symbol fails, and "y" will be printed. If +the sequence is A B D, the third symbol fails, and "z" will be +printed. If the sequence is B, the rule will be never selected because +it is not applicable. + + + +*** Error recovery *** + +You may call the functions yy_current, yy_get_next, or one of the +parse_* functions in the error brace to recover from the error +(e.g. to move ahead until a certain token is reached). See below. + + + +*** How to call the parser *** + +The rules are rewritten into a OCaml let-binding: + +let rec parse_ ... = ... + and parse_ ... = ... + ... + and parse_ ... = ... +in + +i.e. there are lots of functions, and the name of the functions are +"parse_" plus the name of the rules. You can call every function. + +The first two arguments of the functions have a special meaning; the +other arguments are the arguments coming from the rule description: + +rule(a,b): + ... + +===> + +let rec parse_rule yy_current yy_get_next a b = ... + +The first argument, yy_current, is a function that returns the current +token. The second arguments, yy_get_next, is a function that switches +to the next token, and returns it. + +If the tokens are stored in a list, this may be a definition: + +let input = ref [ Token1; Token2; ... ] in +let yy_current() = List.hd !input in +let yy_get_next () = + input := List.tl !input; + List.hd !input + +When you call one of the parser functions, the current token must +already be loaded, i.e. yy_current returns the first token to match by +the function. + +After the functions has returned, the current token is the token +following the sequence of tokens that have been matched by the +function. + +The function returns the value computed by the OCaml code brace of the +rule (or the value of the error brace). + +If the rule is not applicable, the exception Not_found is raised. + +If the rule is applicable, but it does not match, the exception +Parsing.Parse_error is raised.