30 $\Rightarrow$ |
30 $\Rightarrow$ |
31 $\underbrace{\text{set of (parsed input, unparsed input)}}_{\text{output}}$ |
31 $\underbrace{\text{set of (parsed input, unparsed input)}}_{\text{output}}$ |
32 \end{center} |
32 \end{center} |
33 |
33 |
34 \noindent |
34 \noindent |
35 Given the extended effort we have spent implementing a lexer |
35 Given the extended effort we have spent implementing a lexer in order |
36 in order to generate list of tokens, it might be surprising that in |
36 to generate list of tokens, it might be surprising that in what |
37 what follows we shall often use strings as input. This is for making |
37 follows we shall often use strings as input, rather than list of |
38 the explanation more lucid. It does not make our previous work on |
38 tokens. This is for making the explanation more lucid. It does not |
39 lexers obsolete (remember they transform a string into a list of |
39 make our previous work on lexers obsolete (remember they transform a |
40 tokens). Lexers will still be needed for building a somewhat realistic |
40 string into a list of tokens). Lexers will still be needed for |
41 compiler. |
41 building a somewhat realistic compiler. |
42 |
42 |
43 But as said, parser combinators are relatively agnostic about what |
43 As mentioned above, parser combinators are relatively agnostic about what |
44 kind of input they process. In my Scala code I use the following |
44 kind of input they process. In my Scala code I use the following |
45 polymorphic types for parser combinators: |
45 polymorphic types for parser combinators: |
46 |
46 |
47 \begin{center} |
47 \begin{center} |
48 input:\;\; \texttt{I} \qquad output:\;\; \texttt{T} |
48 input:\;\; \texttt{I} \qquad output:\;\; \texttt{T} |
49 \end{center} |
49 \end{center} |
50 |
50 |
51 \noindent That is they take as input something of type \texttt{I} and |
51 \noindent That is they take as input something of type \texttt{I} and |
52 return a set of pairs of type \texttt{Set[(T, I)]}. Since the input needs |
52 return a set of pairs of type \texttt{Set[(T, I)]}. Since the input |
53 to be of ``sequence-kind'', I actually have to often write \texttt{I |
53 needs to be of ``sequence-kind'', I actually have to often write |
54 <\% Seq[\_]} for the input type. This ensures the input is a |
54 \texttt{I <\% Seq[\_]} for the input type. This ensures the input is a |
55 subtype of Scala sequences. The first component of the generated pairs |
55 subtype of Scala sequences. The first component of the generated pairs |
56 corresponds to what the parser combinator was able to process from the |
56 corresponds to what the parser combinator was able to process from the |
57 input and the second is the unprocessed part of the input (therefore |
57 input and the second is the unprocessed part, or leftover, of the |
58 the type of this unprocessed part is the same as the input). As we |
58 input (therefore the type of this unprocessed part is the same as the |
59 shall see shortly, a parser combinator might return more than one such |
59 input). A parser combinator might return more than one such pair; the |
60 pair; the idea is that there are potentially several ways of how to |
60 idea is that there are potentially several ways of how to parse the |
61 parse the input. As a concrete example, consider the string |
61 input. As a concrete example, consider the string |
62 |
62 |
63 \begin{center} |
63 \begin{center} |
64 \tt\Grid{iffoo\VS testbar} |
64 \tt\Grid{iffoo\VS testbar} |
65 \end{center} |
65 \end{center} |
66 |
66 |
72 $\left\{ \left(\texttt{\Grid{if}}\;,\; \texttt{\Grid{foo\VS testbar}}\right), |
72 $\left\{ \left(\texttt{\Grid{if}}\;,\; \texttt{\Grid{foo\VS testbar}}\right), |
73 \left(\texttt{\Grid{iffoo}}\;,\; \texttt{\Grid{\VS testbar}}\right) \right\}$ |
73 \left(\texttt{\Grid{iffoo}}\;,\; \texttt{\Grid{\VS testbar}}\right) \right\}$ |
74 \end{center} |
74 \end{center} |
75 |
75 |
76 \noindent where the first pair means the parser could recognise |
76 \noindent where the first pair means the parser could recognise |
77 \texttt{if} from the input and leaves the rest as `unprocessed' as the |
77 \texttt{if} from the input and leaves the \texttt{foo\VS testbar} as |
78 second component of the pair; in the other case it could recognise |
78 `unprocessed' part; in the other case it could recognise |
79 \texttt{iffoo} and leaves \texttt{\VS testbar} as unprocessed. If the |
79 \texttt{iffoo} and leaves \texttt{\VS testbar} as unprocessed. If the |
80 parser cannot recognise anything from the input at all, then parser |
80 parser cannot recognise anything from the input at all, then parser |
81 combinators just return the empty set $\{\}$. This will indicate |
81 combinators just return the empty set $\{\}$. This will indicate |
82 something ``went wrong''\ldots or more precisely, nothing could be |
82 something ``went wrong''\ldots or more precisely, nothing could be |
83 parsed. |
83 parsed. |
84 |
84 |
85 Also important to note is that the type \texttt{T} for the processed |
85 Also important to note is that the type \texttt{T} for the processed |
86 part is different from the input type. In the example above is just |
86 part is different from the input type \texttt{I} in the parse. In the |
87 happens to be the same. The reason for the difference is that in |
87 example above is just happens to be the same. The reason for the |
88 general we are interested in transforming our input into something |
88 potential difference is that in general we are interested in |
89 ``different''\ldots for example into a tree; or if we implement the |
89 transforming our input into something ``different''\ldots for example |
90 grammar for arithmetic expressions, we might be interested in the |
90 into a tree; or if we implement the grammar for arithmetic |
91 actual integer number the arithmetic expression, say \texttt{1 + 2 * |
91 expressions, we might be interested in the actual integer number the |
92 3}, stands for. In this way we can use parser combinators to |
92 arithmetic expression, say \texttt{1 + 2 * 3}, stands for. In this way |
93 implement relatively easily a calculator, for instance. |
93 we can use parser combinators to implement relatively easily a |
|
94 calculator, for instance. |
94 |
95 |
95 The main idea of parser combinators is that we can easily build parser |
96 The main idea of parser combinators is that we can easily build parser |
96 combinators out of smaller components following very closely the |
97 combinators out of smaller components following very closely the |
97 structure of a grammar. In order to implement this in an |
98 structure of a grammar. In order to implement this in an |
98 object-oriented programming language, like Scala, we need to specify |
99 object-oriented programming language, like Scala, we need to specify |
175 In this parser the input is of type \texttt{List[Token]}. The function |
175 In this parser the input is of type \texttt{List[Token]}. The function |
176 parse looks at the input \texttt{ts} and checks whether the first |
176 parse looks at the input \texttt{ts} and checks whether the first |
177 token is a \texttt{Num\_token} (let us assume our lexer generated |
177 token is a \texttt{Num\_token} (let us assume our lexer generated |
178 these tokens for numbers). But this parser does not just return this |
178 these tokens for numbers). But this parser does not just return this |
179 token (and the rest of the list), like the \texttt{CharParser} above, |
179 token (and the rest of the list), like the \texttt{CharParser} above, |
180 rather it extracts the string \texttt{s} from the token and converts it |
180 rather it extracts also the string \texttt{s} from the token and |
181 into an integer. The hope is that the lexer did its work well and this |
181 converts it into an integer. The hope is that the lexer did its work |
182 conversion always succeeds. The consequence of this is that the output |
182 well and this conversion always succeeds. The consequence of this is |
183 type for this parser is \texttt{Int}, not \texttt{Token}. Such a |
183 that the output type for this parser is \texttt{Int}, not |
184 conversion would be needed if we want to implement a simple calculator |
184 \texttt{Token}. Such a conversion would be needed if we want to |
185 program. |
185 implement a simple calculator program, because string-numbers need to |
|
186 be transformed into \texttt{Int}-numbers in order to do the |
|
187 calculations. |
186 |
188 |
187 |
189 |
188 These simple parsers that just look at the input and do a simple |
190 These simple parsers that just look at the input and do a simple |
189 transformation are often called \emph{atomic} parser combinators. |
191 transformation are often called \emph{atomic} parser combinators. |
190 More interesting are the parser combinators that build larger parsers |
192 More interesting are the parser combinators that build larger parsers |
212 \end{center} |
214 \end{center} |
213 |
215 |
214 \noindent The types of this parser combinator are again generic (we |
216 \noindent The types of this parser combinator are again generic (we |
215 have \texttt{I} for the input type, and \texttt{T} for the output |
217 have \texttt{I} for the input type, and \texttt{T} for the output |
216 type). The alternative parser builds a new parser out of two existing |
218 type). The alternative parser builds a new parser out of two existing |
217 parsers \texttt{p} and \texttt{q} given as arguments. Both parsers |
219 parsers \texttt{p} and \texttt{q} which are given as arguments. Both |
218 need to be able to process input of type \texttt{I} and return in |
220 parsers need to be able to process input of type \texttt{I} and return |
219 \texttt{parse} the same output type \texttt{Set[(T, |
221 in \texttt{parse} the same output type \texttt{Set[(T, |
220 I)]}.\footnote{There is an interesting detail of Scala, namely the |
222 I)]}.\footnote{There is an interesting detail of Scala, namely the |
221 \texttt{=>} in front of the types of \texttt{p} and \texttt{q}. They |
223 \texttt{=>} in front of the types of \texttt{p} and \texttt{q}. They |
222 will prevent the evaluation of the arguments before they are |
224 will prevent the evaluation of the arguments before they are |
223 used. This is often called \emph{lazy evaluation} of the |
225 used. This is often called \emph{lazy evaluation} of the |
224 arguments. We will explain this later.} The alternative parser |
226 arguments. We will explain this later.} The alternative parser runs |
225 should run the input with the first parser \texttt{p} (producing a set |
227 the input with the first parser \texttt{p} (producing a set of pairs) |
226 of pairs) and then run the same input with \texttt{q} (producing |
228 and then runs the same input with \texttt{q} (producing another set of |
227 another set of pairs). The result should be then just the union of |
229 pairs). The result should be then just the union of both sets, which |
228 both sets, which is the operation \texttt{++} in Scala. |
230 is the operation \texttt{++} in Scala. |
229 |
231 |
230 The alternative parser combinator allows us to construct a parser that |
232 The alternative parser combinator allows us to construct a parser that |
231 parses either a character \texttt{a} or \texttt{b} using the |
233 parses either a character \texttt{a} or \texttt{b} using the |
232 \texttt{CharParser} shown above. For this we can write |
234 \texttt{CharParser} shown above. For this we can write |
233 |
235 |
258 parse anything with \pcode{ccde}, therefore the empty |
260 parse anything with \pcode{ccde}, therefore the empty |
259 set is returned. |
261 set is returned. |
260 |
262 |
261 A bit more interesting is the \emph{sequence parser combinator}. Given |
263 A bit more interesting is the \emph{sequence parser combinator}. Given |
262 two parsers, say again, $p$ and $q$, we want to apply first the input |
264 two parsers, say again, $p$ and $q$, we want to apply first the input |
263 to $p$ producing a set of pairs; then apply $q$ to all the unparsed |
265 to $p$ producing a set of pairs; then apply $q$ to all the unparsed |
264 parts in the pairs; and then combine the results. Mathematically we would |
266 parts in the pairs; and then combine the results. Mathematically we would |
265 write something like this for the expected set of pairs: |
267 write something like this for the result set of pairs: |
266 |
268 |
267 \begin{center} |
269 \begin{center} |
268 \begin{tabular}{lcl} |
270 \begin{tabular}{lcl} |
269 $\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\,|\,$ & |
271 $\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\,|\,$ & |
270 $(\textit{output}_1, u_1) \in p(\text{input}) |
272 $(\textit{output}_1, u_1) \in p(\text{input}) |
271 \;\wedge\;$\\ |
273 \;\wedge\;$\\ |
272 && $(\textit{output}_2, u_2) \in q(u_1)\}$ |
274 && $(\textit{output}_2, u_2) \in q(u_1)\}$ |
273 \end{tabular} |
275 \end{tabular} |
274 \end{center} |
276 \end{center} |
275 |
277 |
276 \noindent Notice that the $p$ wil first be run on the input, producing |
|
277 pairs of the form $(\textit{output}_1, u_1)$ where the $u_1$ stands |
|
278 for the unprocessed, or leftover, parts. We want that $q$ runs on all |
|
279 these unprocessed parts $u_1$. This again will produce some |
|
280 processed part , $p$ and |
|
281 $q$, we apply both parsers to the input (remember parsers are |
|
282 functions) and combine the output (remember they are sets of pairs) |
|
283 |
|
284 \begin{center} |
|
285 $p(\text{input}) \cup q(\text{input})$ |
|
286 \end{center} |
|
287 |
|
288 \noindent In Scala we would implement alternative parser |
|
289 combinator as follows |
|
290 |
|
291 \begin{center} |
|
292 \begin{lstlisting}[language=Scala, numbers=none] |
|
293 class AltParser[I, T] |
|
294 (p: => Parser[I, T], |
|
295 q: => Parser[I, T]) extends Parser[I, T] { |
|
296 def parse(in: I) = p.parse(in) ++ q.parse(in) |
|
297 } |
|
298 \end{lstlisting} |
|
299 \end{center} |
|
300 |
|
301 \noindent The types of this parser combinator are again generic (we |
|
302 just have \texttt{I} for the input type, and \texttt{T} for the output |
|
303 type). The alternative parser builds a new parser out of two existing |
|
304 parsers \texttt{p} and \texttt{q}. Both need to be able to process |
|
305 input of type \texttt{I} and return the same output type |
|
306 \texttt{Set[(T, I)]}.\footnote{There is an interesting detail of |
|
307 Scala, namely the \texttt{=>} in front of the types of \texttt{p} |
|
308 and \texttt{q}. They will prevent the evaluation of the arguments |
|
309 before they are used. This is often called \emph{lazy evaluation} of |
|
310 the arguments. We will explain this later.} Therefore the output |
|
311 type of this parser is \texttt{T}. The alternative parser should run |
|
312 the input with the first parser \texttt{p} (producing a set of pairs) |
|
313 and then run the same input with \texttt{q} (producing another set of |
|
314 pairs). The result should be then just the union of both sets, which |
|
315 is the operation \texttt{++} in Scala. |
|
316 |
|
317 The alternative parser combinator already allows us to |
|
318 construct a parser that parses either a character \texttt{a} |
|
319 or \texttt{b}, as |
|
320 |
|
321 \begin{center} |
|
322 \begin{lstlisting}[language=Scala, numbers=none] |
|
323 new AltParser(CharParser('a'), CharParser('b')) |
|
324 \end{lstlisting} |
|
325 \end{center} |
|
326 |
|
327 \noindent Later on we will use again Scala mechanism for introducing some |
|
328 more readable shorthand notation for this, like \texttt{"a" | "b"}. Let us look in detail at what this parser combinator produces with some sample strings |
|
329 |
|
330 \begin{center} |
|
331 \begin{tabular}{rcl} |
|
332 input strings & & output\medskip\\ |
|
333 \texttt{\Grid{acde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{cde}})\right\}$\\ |
|
334 \texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{cde}})\right\}$\\ |
|
335 \texttt{\Grid{ccde}} & $\rightarrow$ & $\{\}$ |
|
336 \end{tabular} |
|
337 \end{center} |
|
338 |
|
339 \noindent We receive in the first two cases a successful |
|
340 output (that is a non-empty set). In each case, either |
|
341 \pcode{a} or \pcode{b} is in the processed part, and |
|
342 \pcode{cde} in the unprocessed part. Clearly this parser cannot |
|
343 parse anything in the string \pcode{ccde}, therefore the empty |
|
344 set. |
|
345 |
|
346 A bit more interesting is the \emph{sequence parser combinator}. Given |
|
347 two parsers, say again, $p$ and $q$, we want to apply first the input |
|
348 to $p$ producing a set of pairs; then apply $q$ to all the unparsed |
|
349 parts in the pairs; and then combine the results like |
|
350 |
|
351 \begin{center} |
|
352 \begin{tabular}{lcl} |
|
353 $\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\;|\;$ & |
|
354 $(\textit{output}_1, u_1) \in p(\text{input}) |
|
355 \;\wedge\;$\\ |
|
356 && $(\textit{output}_2, u_2) \in q(u_1)\}$ |
|
357 \end{tabular} |
|
358 \end{center} |
|
359 |
|
360 \noindent Notice that the $p$ will first be run on the input, |
278 \noindent Notice that the $p$ will first be run on the input, |
361 producing pairs of the form $\textit{output}_1$ and unprocessed part |
279 producing pairs of the form $(\textit{output}_1, u_1)$ where the $u_1$ |
362 $u_1$. This unprocessed part is fed into the second parser $q$. The |
280 stands for the unprocessed, or leftover, parts pf $p$. We want that |
363 overall result of the sequence parser combinator is pairs of the form |
281 $q$ runs on all these unprocessed parts $u_1$. Therefore these |
|
282 unprocessed parts are fed into the second parser $q$. The overall |
|
283 result of the sequence parser combinator is pairs of the form |
364 $((\textit{output}_1, \textit{output}_2), u_2)$. This means the |
284 $((\textit{output}_1, \textit{output}_2), u_2)$. This means the |
365 unprocessed part of both parsers is the unprocessed part the second |
285 unprocessed part of the sequqnce p[arser combinator is the unprocessed |
366 parser $q$ leaves as leftover. The processed parts of both |
286 part the second parser $q$ leaves as leftover. The processed parts of |
367 parsers is a pair consisting of the outputs of $p$ and $q$, namely |
287 of the component parsers is a pair consisting of the outputs of $p$ |
368 $(\textit{output}_1, \textit{output}_2)$. This behaviour can be |
288 and $q$, namely $(\textit{output}_1, \textit{output}_2)$. This |
369 implemented in Scala as follows: |
289 behaviour can be implemented in Scala as follows: |
370 |
290 |
371 \begin{center} |
291 \begin{center} |
372 \begin{lstlisting}[language=Scala,numbers=none] |
292 \begin{lstlisting}[language=Scala,numbers=none] |
373 class SeqParser[I, T, S] |
293 class SeqParser[I, T, S] |
374 (p: => Parser[I, T], |
294 (p: => Parser[I, T], |
393 component is the unprocessed part left over after running the second |
313 component is the unprocessed part left over after running the second |
394 parser \texttt{q}. Therefore the input type of the sequence parser |
314 parser \texttt{q}. Therefore the input type of the sequence parser |
395 combinator is as usual \texttt{I}, but the output type is |
315 combinator is as usual \texttt{I}, but the output type is |
396 |
316 |
397 \begin{center} |
317 \begin{center} |
398 \texttt{Set[((T, S), I)]} |
318 \texttt{(T, S)} |
399 \end{center} |
319 \end{center} |
400 |
320 |
401 \noindent |
321 \noindent |
402 If any of the runs of \textit{p} and \textit{q} fail, that is produce |
322 This means \texttt{parse} in the sequence parser combinator returns |
403 the empty set, then \texttt{parse} will also produce the empty set. |
323 sets of type \texttt{Set[((T, S), I)]}. Notice that we have |
404 Notice that we have now two output types for the sequence parser |
324 essentially two output types for the sequence parser combinator |
405 combinator, because in general \textit{p} and \textit{q} might produce |
325 (packaged in a pair), because in general \textit{p} and \textit{q} |
406 different things (for example first we recognise a number and then a |
326 might produce different things (for example first we recognise a |
407 string corresponding to an operator). |
327 number and then a string corresponding to an operator). If any of the |
|
328 runs of \textit{p} and \textit{q} fail, that is produce the empty set, |
|
329 then \texttt{parse} will also produce the empty set. |
408 |
330 |
409 With the shorthand notation we shall introduce later for the sequence |
331 With the shorthand notation we shall introduce later for the sequence |
410 parser combinator, we can write for example \pcode{"a" ~ "b"}, which |
332 parser combinator, we can write for example \pcode{"a" ~ "b"}, which |
411 is the parser combinator that first recognises the character |
333 is the parser combinator that first recognises the character |
412 \texttt{a} from a string and then \texttt{b}. Let us look again at |
334 \texttt{a} from a string and then \texttt{b}. Let us look again at |
508 \] |
430 \] |
509 |
431 |
510 \noindent where the unprocessed (empty) parts have been stripped away |
432 \noindent where the unprocessed (empty) parts have been stripped away |
511 from the pairs; everything where the second part was not empty has |
433 from the pairs; everything where the second part was not empty has |
512 been thrown away as well, because they represent |
434 been thrown away as well, because they represent |
513 ultimately-unsuccessful-parses. |
435 ultimately-unsuccessful-parses. The main point is that the sequence |
514 |
436 parser combinator returns pairs that can nest according to the |
515 |
437 nesting of the component parsers. |
516 Note carefully that constructing a parser such \pcode{"a" | ("a" ~ |
438 |
517 "b")} will result in a typing error. The intention is that we want |
439 |
518 to parse an \texttt{a}, or an \texttt{a} followed by a |
440 Consider also carefully that constructing a parser such \pcode{"a" | |
519 \texttt{b}. However, the first parser has as output type a single |
441 ("a" ~ "b")} will result in a typing error. The intention with this |
520 character (recall the type of \texttt{CharParser}), but the second |
442 parser is that we want to parse an \texttt{a}, or an \texttt{a} |
521 parser produces a pair of characters as output. The alternative parser |
443 followed by a \texttt{b}. However, the first parser has as output type |
522 is required to have both component parsers to have the same type---we |
444 a single character (recall the type of \texttt{CharParser}), but the |
523 need to be able to build the union of two sets, which means in Scala |
445 second parser produces a pair of characters as output. The alternative |
524 they need to be of the same type. Therefore the typing error. |
446 parser is required to have both component parsers to have the same |
525 We will see later how we can build |
447 type---we need to be able to build the union of two sets, which means |
|
448 in Scala they need to be of the same type. Since ther are not, there |
|
449 is a typing error in this example. We will see later how we can build |
526 this parser without the typing error. |
450 this parser without the typing error. |
527 |
451 |
528 The next parser combinator, called \emph{semantic action}, does not |
452 The next parser combinator, called \emph{semantic action}, does not |
529 actually combine smaller parsers, but applies a function to the result |
453 actually combine smaller parsers, but applies a function to the result |
530 of a parser. It is implemented in Scala as follows |
454 of a parser. It is implemented in Scala as follows |
539 } |
463 } |
540 \end{lstlisting} |
464 \end{lstlisting} |
541 \end{center} |
465 \end{center} |
542 |
466 |
543 |
467 |
544 \noindent This parser combinator takes a parser \texttt{p} |
468 \noindent This parser combinator takes a parser \texttt{p} (with input |
545 with output type \texttt{T} as one argument as well as a |
469 type \texttt{I} and output type \texttt{T}) as one argument but also a |
546 function \texttt{f} with type \texttt{T => S}. The parser |
470 function \texttt{f} (with type \texttt{T => S}). The parser \texttt{p} |
547 \texttt{p} produces sets of type \texttt{(T, I)}. The |
471 produces sets of type \texttt{Set[(T, I)]}. The semantic action |
548 semantic action combinator then applies the function |
472 combinator then applies the function \texttt{f} to all the `processed' |
549 \texttt{f} to all the parser outputs. Since this function is of |
473 parser outputs. Since this function is of type \texttt{T => S}, we |
550 type \texttt{T => S}, we obtain a parser with output type |
474 obtain a parser with output type \texttt{S}. Again Scala lets us |
551 \texttt{S}. Again Scala lets us introduce some shorthand |
475 introduce some shorthand notation for this parser |
552 notation for this parser combinator. Therefore we will write |
476 combinator. Therefore we will write \texttt{p ==> f} for it. |
553 \texttt{p ==> f} for it. |
|
554 |
477 |
555 What are semantic actions good for? Well, they allow you to transform |
478 What are semantic actions good for? Well, they allow you to transform |
556 the parsed input into a datastructure you can use for further |
479 the parsed input into datastructures you can use for further |
557 processing. A simple example would be to transform parsed characters |
480 processing. A simple example would be to transform parsed characters |
558 into ASCII numbers. Suppose we define a function \texttt{f} (from |
481 into ASCII numbers. Suppose we define a function \texttt{f} (from |
559 characters to ints) and use a \texttt{CharParser} for parsing |
482 characters to ints) and use a \texttt{CharParser} for parsing |
560 the character \texttt{c}. |
483 the character \texttt{c}. |
561 |
484 |
642 |
565 |
643 \subsubsection*{Shorthand notation for parser combinators} |
566 \subsubsection*{Shorthand notation for parser combinators} |
644 |
567 |
645 Before we proceed, let us just explain the shorthand notation for |
568 Before we proceed, let us just explain the shorthand notation for |
646 parser combinators. Like for regular expressions, the shorthand notation |
569 parser combinators. Like for regular expressions, the shorthand notation |
647 will make our life much easier when writing actual parsers. |
570 will make our life much easier when writing actual parsers. We can define |
|
571 some implicits which allow us to write \pcode{p | q}, \pcode{p ~ q} and |
|
572 \pcode{p ==> f} as well as to use plain strings for specifying simple |
|
573 string parsers. |
|
574 |
|
575 The idea is that this shorthand notation allows us to easily translate |
|
576 context-free grammars into code. For example recall our context-free |
|
577 grammar for palindromes: |
|
578 |
|
579 \begin{plstx}[margin=3cm] |
|
580 : \meta{P} ::= a\cdot \meta{P}\cdot a | b\cdot \meta{P}\cdot b | a | b | \epsilon\\ |
|
581 \end{plstx} |
|
582 |
|
583 \noindent |
|
584 Each alternative in this grammar translates into an alternative parser |
|
585 combinator. The $\cdot$ can be translated to a sequence parser |
|
586 combinator. The parsers for $a$, $b$ and $\epsilon$ can be simply |
|
587 written as \texttt{"a"}, \texttt{"b"} and \texttt{""}. |
|
588 |
648 |
589 |
649 \subsubsection*{How to build parsers using parser combinators?} |
590 \subsubsection*{How to build parsers using parser combinators?} |
650 |
591 |
651 The beauty of parser combinators is the ease with which they can be |
592 The beauty of parser combinators is the ease with which they can be |
652 implemented and how easy it is to translate context-free grammars into |
593 implemented and how easy it is to translate context-free grammars into |
653 code (though the grammars need to be non-left-recursive). To |
594 code (though the grammars need to be non-left-recursive). To |
654 demonstrate this recall our context-free grammar for palindromes: |
595 demonstrate this recall the grammar for palindromes from above. |
655 |
596 The first idea would be to translate it into the following code |
656 \begin{plstx}[margin=3cm] |
|
657 : \meta{P} ::= a\cdot \meta{P}\cdot a | b\cdot \meta{P}\cdot b | a | b | \epsilon\\ |
|
658 \end{plstx} |
|
659 |
|
660 \noindent |
|
661 Given the parser combinators for alternatives and sequeneces, this grammar should be |
|
662 straightforward to implement. The first idea would be |
|
663 |
597 |
664 \begin{center} |
598 \begin{center} |
665 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
599 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
666 lazy val Pal : Parser[String, String] = |
600 lazy val Pal : Parser[String, String] = |
667 (("a" ~ Pal ~ "a") | ("b" ~ Pal ~ "b") | "a" | "b" | "") |
601 (("a" ~ Pal ~ "a") | ("b" ~ Pal ~ "b") | "a" | "b" | "") |
668 \end{lstlisting} |
602 \end{lstlisting} |
669 \end{center} |
603 \end{center} |
670 |
604 |
671 \noindent |
605 \noindent |
672 Unfortunately, this does not work as it produces a typing error. The |
606 Unfortunately, this does not quite work yet as it produces a typing |
673 reason is that the parsers \texttt{"a"}, \texttt{"b"} and \texttt{""} |
607 error. The reason is that the parsers \texttt{"a"}, \texttt{"b"} and |
674 all produce strings and therefore can be put into an alternative |
608 \texttt{""} all produce strings as output type and therefore can be |
675 \texttt{...| "a" | "b" | ""}. But both \pcode{"a" ~ Pal ~ "a"} and |
609 put into an alternative \texttt{...| "a" | "b" | ""}. But both |
676 \pcode{"b" ~ Pal ~ "b"} produce pairs of the form |
610 \pcode{"a" ~ Pal ~ "a"} and \pcode{"b" ~ Pal ~ "b"} produce pairs of |
677 $(((\_, \_), \_), \_)$---that is how the sequence parser combinator |
611 the form $(((\_, \_), \_), \_)$---that is how the sequence parser |
678 nests results when \pcode{\~} is used between two components. The |
612 combinator nests results when \pcode{\~} is used between two |
679 solution is to use a semantic action that ``flattens'' these pairs and |
613 components. The solution is to use a semantic action that ``flattens'' |
680 appends the corresponding strings, like |
614 these pairs and appends the corresponding strings, like |
681 |
615 |
682 \begin{center} |
616 \begin{center} |
683 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
617 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
684 lazy val Pal : Parser[String, String] = |
618 lazy val Pal : Parser[String, String] = |
685 (("a" ~ Pal ~ "a") ==> { case ((x, y), z) => x + y + z } | |
619 (("a" ~ Pal ~ "a") ==> { case ((x, y), z) => x + y + z } | |