---------------------------------------------------------------------- m2parsergen ---------------------------------------------------------------------- This is a parser generator for top-down (or recursively descending) parsers. The input file must be structured as follows: ---------------------------------------- Begin of file %% %% %% ---------------------------------------- End of file The two-character combination %% separates the various sections. The text before the first %% and after the last %% will be copied verbatim to the output file. Within the declarations and rules sections you must use /* ... */ as comment braces. There are two types of declarations: %token Name declares that Name is a token without associated value, and %token <> Name declares that Name is a token with associated value (i.e. Name x). In contrast to ocamlyacc, you need not to specify a type. This is a fundamental difference, because m2parsergen will not generate a type declaration for a "token" type; you must do this yourself. You need not to declare start symbols; every grammar rule may be used as start symbol. The rules look like: name_of_rule(arg1, arg2, ...): label1:symbol1 label2:symbol2 ... {{ CODE }} | label1:symbol1 label2:symbol2 ... {{ CODE }} ... | label1:symbol1 label2:symbol2 ... {{ CODE }} The rules may have arguments (note that you must write the parantheses, even if the rule does not have arguments). Here, arg1, arg2, ... are the formal names of the arguments; you may refer to them in OCaml code. Furthermore, the symbols may have labels (you can leave the labels out). You can refer to the value associated with a symbol by its label, i.e. there is an OCaml variable with the same name as the label prescribes, and this variable contains the value. The OCaml code must be embraced by {{ and }}, and these separators must not occur within the code. EXAMPLE: prefix_term(): Plus_symbol Left_paren v1:prefix_term() Comma v2:prefix_term() Right_paren {{ v1 + v2 }} | Times_symbol Left_paren v1:prefix_term() Comma v2:prefix_term() Right_paren {{ v1 * v2 }} | n:Number {{ n }} As you can see in the example, you must pass values for the arguments if you call non-terminal symbols (here, the argument list is empty: ()). The generated parsers behave as follows: - A rule is applicable to a token sequence if the first token is matched by the rule. In the example: prefix_term is applicable if the first token of a sequence is either Plus_symbol, Times_symbol, or Number. - One branch of the applicable rule is selected: it is the first branch that matches the first token. THE OTHER TOKENS DO NOT HAVE ANY EFFECT ON BRANCH SELECTION! For instance, in the following rule the second branch is never selected, because only the A is used to select the branch: a(): A B {{ ... }} | A C {{ ... }} - Once a branch is selected, it is checked whether the branch matches the token sequence. If this check succeeds, the code section of the branch is executed, and the resulting value is returned to the caller. If the check fails, the exception Parsing.Parse_error is raised. Normally, this exception is not caught, and will force the parser to stop. The check in detail: If the rule demands a terminal, there a must be exactly this terminal at the corresponding location in the token sequence. If the rule demands a non-terminal, it is checked whether the rule for to this non-terminal is applicable. If so, the branch is selected, and recursively checked. If the rule is not applicable, the check fails immediately. - THERE IS NO BACKTRACKING! Note that the following works (but the construction is resolved at generation time): rule1() = rule2() A B ... {{ ... }} rule2() = C {{ ... }} | D {{ ... }} In this case, the (only) branch of rule1 is selected if the next token is C or D. --- *** Options and repetitions *** Symbols can be tagged as being optional, or to occur repeatedly: rule(): Name whitespace()* Question_mark? - "*": The symbol matches zero or more occurrences. - "?": The symbol matches zero or one occurrence. This is done as follows: - terminal*: The maximum number of consecutive tokens are matched. - non-terminal*: The maximum number of the subsequences matching are matched. Before another subsequence is matched, it is checked whether the rule for is applicable. If so, the rule is invoked and must succeed (otherwise Parsing. Parse_error). If not, the loop is exited. - terminal?: If the next token is , it is matched. If not, no token is matched. - non-terminal?: It is checked whether the rule for is applicable. If so, the rule is invoked, and matches a sequence of tokens. If not, no token is matched. You may refer to repeated or optional symbols by labels. In this case, the label is associated with lists of values, or optional values, respectively: rule(): A lab:other()* lab':unlikely()? {{ let n = List.length lab in ... match lab' with None -> ... | Some v -> ... }} A different scheme is applied if the symbol is a token without associated value (%token Name, and NOT %token <> Name): rule(): A lab:B* lab':C? Here, "lab" becomes an integer variable counting the number of Bs, and "lab'" becomes a boolean variable denoting whether there is a C or not. *** Early let-binding *** You may put some OCaml code directly after the first symbol of a branch: rule(): A $ {{ let-binding }} C D ... {{ ... }} The code brace {{ let-binding }} must be preceded by a dollar sign. You can put "let ... = ... in" statements into this brace: rule1(): n:A $ {{ let twice = 2 * n in }} rule2(twice) {{ ... }} This code is executed once the branch is selected. *** Very early let-binding *** This is also possible: rule(): $ {{ CODE }} A ... The CODE is executed right when the branch is selected, and before any other happens. (Only for hacks!) *** Computed rules *** rule(): A $ {{ let followup = ... some function ... in }} [ followup ]() {{ ... }} Between [ and ], you can refer to the O'Caml name of *any* function. Here, the function "followup" is bound in the let-binding. *** Error handling *** If a branch is already selected, but the check fails whether the other symbols of the branch match, it is possible to catch the resulting exception and to find out at which position the failure has occurred. rule(): x:A y:B z:C {{ ... }} ? {{ ERROR-CODE }} After a question mark, it is allowed to append another code brace. This code is executed if the branch check fails (but not if the branch is not selected nor if no branches are selected). The string variable !yy_position contains the label of the symbol that caused the failure (or it contains the empty string if the symbol does not have a label). Example: rule(): x:A y:B z:C {{ print_endline "SUCCESS" }} ? {{ print_endline !yy_position }} If the token sequence is A B C, "SUCCESS" will be printed. If the sequence is A C, the second symbol fails, and "y" will be printed. If the sequence is A B D, the third symbol fails, and "z" will be printed. If the sequence is B, the rule will be never selected because it is not applicable. *** Error recovery *** You may call the functions yy_current, yy_get_next, or one of the parse_* functions in the error brace to recover from the error (e.g. to move ahead until a certain token is reached). See below. *** How to call the parser *** The rules are rewritten into a OCaml let-binding: let rec parse_ ... = ... and parse_ ... = ... ... and parse_ ... = ... in i.e. there are lots of functions, and the name of the functions are "parse_" plus the name of the rules. You can call every function. The first two arguments of the functions have a special meaning; the other arguments are the arguments coming from the rule description: rule(a,b): ... ===> let rec parse_rule yy_current yy_get_next a b = ... The first argument, yy_current, is a function that returns the current token. The second arguments, yy_get_next, is a function that switches to the next token, and returns it. If the tokens are stored in a list, this may be a definition: let input = ref [ Token1; Token2; ... ] in let yy_current() = List.hd !input in let yy_get_next () = input := List.tl !input; List.hd !input When you call one of the parser functions, the current token must already be loaded, i.e. yy_current returns the first token to match by the function. After the functions has returned, the current token is the token following the sequence of tokens that have been matched by the function. The function returns the value computed by the OCaml code brace of the rule (or the value of the error brace). If the rule is not applicable, the exception Not_found is raised. If the rule is applicable, but it does not match, the exception Parsing.Parse_error is raised.