1 ----------------------------------------------------------------------
3 ----------------------------------------------------------------------
5 This is a parser generator for top-down (or recursively descending) parsers.
6 The input file must be structured as follows:
8 ---------------------------------------- Begin of file
10 <OCAML TEXT ("preamble")>
22 <OCAML TEXT ("postamble")>
24 ---------------------------------------- End of file
26 The two-character combination %% separates the various sections. The
27 text before the first %% and after the last %% will be copied verbatim
30 Within the declarations and rules sections you must use /* ... */ as
33 There are two types of declarations:
37 declares that Name is a token without associated value, and
41 declares that Name is a token with associated value (i.e. Name x).
43 In contrast to ocamlyacc, you need not to specify a type. This is a
44 fundamental difference, because m2parsergen will not generate a type
45 declaration for a "token" type; you must do this yourself.
47 You need not to declare start symbols; every grammar rule may be used
52 name_of_rule(arg1, arg2, ...):
53 label1:symbol1 label2:symbol2 ... {{ CODE }}
54 | label1:symbol1 label2:symbol2 ... {{ CODE }}
56 | label1:symbol1 label2:symbol2 ... {{ CODE }}
58 The rules may have arguments (note that you must write the
59 parantheses, even if the rule does not have arguments). Here, arg1,
60 arg2, ... are the formal names of the arguments; you may refer to them
63 Furthermore, the symbols may have labels (you can leave the labels
64 out). You can refer to the value associated with a symbol by its
65 label, i.e. there is an OCaml variable with the same name as the label
66 prescribes, and this variable contains the value.
68 The OCaml code must be embraced by {{ and }}, and these separators
69 must not occur within the code.
74 Plus_symbol Left_paren v1:prefix_term() Comma v2:prefix_term() Right_paren
76 | Times_symbol Left_paren v1:prefix_term() Comma v2:prefix_term() Right_paren
81 As you can see in the example, you must pass values for the arguments
82 if you call non-terminal symbols (here, the argument list is empty: ()).
84 The generated parsers behave as follows:
86 - A rule is applicable to a token sequence if the first token is
89 In the example: prefix_term is applicable if the first token of a
90 sequence is either Plus_symbol, Times_symbol, or Number.
92 - One branch of the applicable rule is selected: it is the first
93 branch that matches the first token. THE OTHER TOKENS DO NOT HAVE
94 ANY EFFECT ON BRANCH SELECTION!
96 For instance, in the following rule the second branch is never
97 selected, because only the A is used to select the branch:
103 - Once a branch is selected, it is checked whether the branch matches
104 the token sequence. If this check succeeds, the code section of the
105 branch is executed, and the resulting value is returned to the
107 If the check fails, the exception Parsing.Parse_error is raised.
108 Normally, this exception is not caught, and will force the parser
113 If the rule demands a terminal, there a must be exactly this
114 terminal at the corresponding location in the token sequence.
116 If the rule demands a non-terminal, it is checked whether the rule
117 for to this non-terminal is applicable. If so, the branch
118 is selected, and recursively checked. If the rule is not applicable,
119 the check fails immediately.
121 - THERE IS NO BACKTRACKING!
123 Note that the following works (but the construction is resolved at
127 rule2() A B ... {{ ... }}
133 In this case, the (only) branch of rule1 is selected if the next
140 *** Options and repetitions ***
142 Symbols can be tagged as being optional, or to occur repeatedly:
145 Name whitespace()* Question_mark?
147 - "*": The symbol matches zero or more occurrences.
149 - "?": The symbol matches zero or one occurrence.
151 This is done as follows:
153 - terminal*: The maximum number of consecutive tokens <terminal> are
155 - non-terminal*: The maximum number of the subsequences matching
156 <non-terminal> are matched. Before another
157 subsequence is matched, it is checked whether the
158 rule for <non-terminal> is applicable. If so, the
159 rule is invoked and must succeed (otherwise Parsing.
160 Parse_error). If not, the loop is exited.
162 - terminal?: If the next token is <terminal>, it is matched. If not,
165 - non-terminal?: It is checked whether the rule for <non-terminal>
166 is applicable. If so, the rule is invoked, and
167 matches a sequence of tokens. If not, no token is
170 You may refer to repeated or optional symbols by labels. In this case,
171 the label is associated with lists of values, or optional values,
175 A lab:other()* lab':unlikely()?
176 {{ let n = List.length lab in ...
182 A different scheme is applied if the symbol is a token without
183 associated value (%token Name, and NOT %token <> Name):
188 Here, "lab" becomes an integer variable counting the number of Bs, and
189 "lab'" becomes a boolean variable denoting whether there is a C or not.
192 *** Early let-binding ***
194 You may put some OCaml code directly after the first symbol of a
198 A $ {{ let-binding }} C D ... {{ ... }}
200 The code brace {{ let-binding }} must be preceded by a dollar
201 sign. You can put "let ... = ... in" statements into this brace:
204 n:A $ {{ let twice = 2 * n in }} rule2(twice) {{ ... }}
206 This code is executed once the branch is selected.
209 *** Very early let-binding ***
211 This is also possible:
218 The CODE is executed right when the branch is selected, and before any
219 other happens. (Only for hacks!)
223 *** Computed rules ***
226 A $ {{ let followup = ... some function ... in }} [ followup ]()
229 Between [ and ], you can refer to the O'Caml name of *any* function.
230 Here, the function "followup" is bound in the let-binding.
233 *** Error handling ***
235 If a branch is already selected, but the check fails whether the other
236 symbols of the branch match, it is possible to catch the resulting
237 exception and to find out at which position the failure has occurred.
240 x:A y:B z:C {{ ... }} ? {{ ERROR-CODE }}
242 After a question mark, it is allowed to append another code
243 brace. This code is executed if the branch check fails (but not if the
244 branch is not selected nor if no branches are selected). The string
245 variable !yy_position contains the label of the symbol that caused the
246 failure (or it contains the empty string if the symbol does not have a
252 x:A y:B z:C {{ print_endline "SUCCESS" }} ? {{ print_endline !yy_position }}
254 If the token sequence is A B C, "SUCCESS" will be printed. If the
255 sequence is A C, the second symbol fails, and "y" will be printed. If
256 the sequence is A B D, the third symbol fails, and "z" will be
257 printed. If the sequence is B, the rule will be never selected because
258 it is not applicable.
262 *** Error recovery ***
264 You may call the functions yy_current, yy_get_next, or one of the
265 parse_* functions in the error brace to recover from the error
266 (e.g. to move ahead until a certain token is reached). See below.
270 *** How to call the parser ***
272 The rules are rewritten into a OCaml let-binding:
274 let rec parse_<rule1> ... = ...
275 and parse_<rule2> ... = ...
277 and parse_<ruleN> ... = ...
280 i.e. there are lots of functions, and the name of the functions are
281 "parse_" plus the name of the rules. You can call every function.
283 The first two arguments of the functions have a special meaning; the
284 other arguments are the arguments coming from the rule description:
291 let rec parse_rule yy_current yy_get_next a b = ...
293 The first argument, yy_current, is a function that returns the current
294 token. The second arguments, yy_get_next, is a function that switches
295 to the next token, and returns it.
297 If the tokens are stored in a list, this may be a definition:
299 let input = ref [ Token1; Token2; ... ] in
300 let yy_current() = List.hd !input in
302 input := List.tl !input;
305 When you call one of the parser functions, the current token must
306 already be loaded, i.e. yy_current returns the first token to match by
309 After the functions has returned, the current token is the token
310 following the sequence of tokens that have been matched by the
313 The function returns the value computed by the OCaml code brace of the
314 rule (or the value of the error brace).
316 If the rule is not applicable, the exception Not_found is raised.
318 If the rule is applicable, but it does not match, the exception
319 Parsing.Parse_error is raised.