18 test when it is empty and ``sequentially'' take it apart. Strings and |
18 test when it is empty and ``sequentially'' take it apart. Strings and |
19 lists fit this bill. However, parser combinators also have their |
19 lists fit this bill. However, parser combinators also have their |
20 drawbacks. For example they require that the grammar to be parsed is |
20 drawbacks. For example they require that the grammar to be parsed is |
21 \emph{not} left-recursive and they are efficient only when the grammar |
21 \emph{not} left-recursive and they are efficient only when the grammar |
22 is unambiguous. It is the responsibility of the grammar designer to |
22 is unambiguous. It is the responsibility of the grammar designer to |
23 ensure these two properties. |
23 ensure these two properties hold. |
24 |
24 |
25 The general idea behind parser combinators is to transform the input |
25 The general idea behind parser combinators is to transform the input |
26 into sets of pairs, like so |
26 into sets of pairs, like so |
27 |
27 |
28 \begin{center} |
28 \begin{center} |
29 $\underbrace{\text{list of tokens}}_{\text{input}}$ |
29 $\underbrace{\text{list of tokens}}_{\text{input}}$ |
30 $\Rightarrow$ |
30 $\Rightarrow$ |
31 $\underbrace{\text{set of (parsed input, unparsed input)}}_{\text{output}}$ |
31 $\underbrace{\text{set of (parsed part, unprocessed part)}}_{\text{output}}$ |
32 \end{center} |
32 \end{center} |
33 |
33 |
34 \noindent |
34 \noindent |
35 Given the extended effort we have spent implementing a lexer in order |
35 Given the extended effort we have spent implementing a lexer in order |
36 to generate list of tokens, it might be surprising that in what |
36 to generate lists of tokens, it might be surprising that in what |
37 follows we shall often use strings as input, rather than list of |
37 follows we shall often use strings as input, rather than lists of |
38 tokens. This is for making the explanation more lucid. It does not |
38 tokens. This is for making the explanation more lucid and for quick |
39 make our previous work on lexers obsolete (remember they transform a |
39 examples. It does not make our previous work on lexers obsolete |
40 string into a list of tokens). Lexers will still be needed for |
40 (remember they transform a string into a list of tokens). Lexers will |
41 building a somewhat realistic compiler. |
41 still be needed for building a somewhat realistic compiler. |
42 |
42 |
43 As mentioned above, parser combinators are relatively agnostic about what |
43 As mentioned above, parser combinators are relatively agnostic about what |
44 kind of input they process. In my Scala code I use the following |
44 kind of input they process. In my Scala code I use the following |
45 polymorphic types for parser combinators: |
45 polymorphic types for parser combinators: |
46 |
46 |
49 \end{center} |
49 \end{center} |
50 |
50 |
51 \noindent That is they take as input something of type \texttt{I} and |
51 \noindent That is they take as input something of type \texttt{I} and |
52 return a set of pairs of type \texttt{Set[(T, I)]}. Since the input |
52 return a set of pairs of type \texttt{Set[(T, I)]}. Since the input |
53 needs to be of ``sequence-kind'', I actually have to often write |
53 needs to be of ``sequence-kind'', I actually have to often write |
54 \texttt{I <\% Seq[\_]} for the input type. This ensures the input is a |
54 \texttt{I <\% Seq[\_]} for the input type. This ensures the |
55 subtype of Scala sequences. The first component of the generated pairs |
55 input is a subtype of Scala sequences. The first component of the |
56 corresponds to what the parser combinator was able to process from the |
56 generated pairs corresponds to what the parser combinator was able to |
57 input and the second is the unprocessed part, or leftover, of the |
57 parse from the input and the second is the unprocessed, or |
58 input (therefore the type of this unprocessed part is the same as the |
58 leftover, part of the input (therefore the type of this unprocessed part is |
59 input). A parser combinator might return more than one such pair; the |
59 the same as the input). A parser combinator might return more than one |
60 idea is that there are potentially several ways of how to parse the |
60 such pair; the idea is that there are potentially several ways of how |
61 input. As a concrete example, consider the string |
61 to parse the input. As a concrete example, consider the string |
62 |
62 |
63 \begin{center} |
63 \begin{center} |
64 \tt\Grid{iffoo\VS testbar} |
64 \tt\Grid{iffoo\VS testbar} |
65 \end{center} |
65 \end{center} |
66 |
66 |
73 \left(\texttt{\Grid{iffoo}}\;,\; \texttt{\Grid{\VS testbar}}\right) \right\}$ |
73 \left(\texttt{\Grid{iffoo}}\;,\; \texttt{\Grid{\VS testbar}}\right) \right\}$ |
74 \end{center} |
74 \end{center} |
75 |
75 |
76 \noindent where the first pair means the parser could recognise |
76 \noindent where the first pair means the parser could recognise |
77 \texttt{if} from the input and leaves the \texttt{foo\VS testbar} as |
77 \texttt{if} from the input and leaves the \texttt{foo\VS testbar} as |
78 `unprocessed' part; in the other case it could recognise |
78 unprocessed part; in the other case it could recognise |
79 \texttt{iffoo} and leaves \texttt{\VS testbar} as unprocessed. If the |
79 \texttt{iffoo} and leaves \texttt{\VS testbar} as unprocessed. If the |
80 parser cannot recognise anything from the input at all, then parser |
80 parser cannot recognise anything from the input at all, then parser |
81 combinators just return the empty set $\{\}$. This will indicate |
81 combinators just return the empty set $\{\}$. This will indicate |
82 something ``went wrong''\ldots or more precisely, nothing could be |
82 something ``went wrong''\ldots or more precisely, nothing could be |
83 parsed. |
83 parsed. |
84 |
84 |
85 Also important to note is that the type \texttt{T} for the processed |
85 Also important to note is that the type \texttt{T} for the processed |
86 part is different from the input type \texttt{I} in the parse. In the |
86 part is different from the input type \texttt{I} in the parse. In the |
87 example above is just happens to be the same. The reason for the |
87 example above is just happens to be the same. The reason for the |
88 potential difference is that in general we are interested in |
88 difference is that in general we are interested in |
89 transforming our input into something ``different''\ldots for example |
89 transforming our input into something ``different''\ldots for example |
90 into a tree; or if we implement the grammar for arithmetic |
90 into a tree; or if we implement the grammar for arithmetic |
91 expressions, we might be interested in the actual integer number the |
91 expressions, we might be interested in the actual integer number the |
92 arithmetic expression, say \texttt{1 + 2 * 3}, stands for. In this way |
92 arithmetic expression, say \texttt{1 + 2 * 3}, stands for. In this way |
93 we can use parser combinators to implement relatively easily a |
93 we can use parser combinators to implement relatively easily a |
94 calculator, for instance. |
94 calculator, for instance. |
95 |
95 |
96 The main idea of parser combinators is that we can easily build parser |
96 The main idea of parser combinators is that we can easily build parser |
97 combinators out of smaller components following very closely the |
97 combinators out of smaller components following very closely the |
98 structure of a grammar. In order to implement this in an |
98 structure of a grammar. In order to implement this in a |
99 object-oriented programming language, like Scala, we need to specify |
99 functional/object-oriented programming language, like Scala, we need |
100 an abstract class for parser combinators. This abstract class states |
100 to specify an abstract class for parser combinators. In the abstract |
101 that the function \texttt{parse} takes an argument of type \texttt{I} |
101 class we specify that \texttt{I} is the \emph{input type} of the |
102 and returns a set of type \mbox{\texttt{Set[(T, I)]}}. |
102 parser combinator and that \texttt{T} is the \emph{ouput type}. This |
|
103 implies that the function \texttt{parse} takes an argument of type |
|
104 \texttt{I} and returns a set of type \mbox{\texttt{Set[(T, I)]}}. |
103 |
105 |
104 \begin{center} |
106 \begin{center} |
105 \begin{lstlisting}[language=Scala] |
107 \begin{lstlisting}[language=Scala] |
106 abstract class Parser[I, T] { |
108 abstract class Parser[I, T] { |
107 def parse(in: I) : Set[(T, I)] |
109 def parse(in: I) : Set[(T, I)] |
253 \end{tabular} |
255 \end{tabular} |
254 \end{center} |
256 \end{center} |
255 |
257 |
256 \noindent We receive in the first two cases a successful |
258 \noindent We receive in the first two cases a successful |
257 output (that is a non-empty set). In each case, either |
259 output (that is a non-empty set). In each case, either |
258 \pcode{a} or \pcode{b} is in the processed part, and |
260 \pcode{a} or \pcode{b} is in the parsed part, and |
259 \pcode{cde} in the unprocessed part. Clearly this parser cannot |
261 \pcode{cde} in the unprocessed part. Clearly this parser cannot |
260 parse anything with \pcode{ccde}, therefore the empty |
262 parse anything with \pcode{ccde}, therefore the empty |
261 set is returned. |
263 set is returned. |
262 |
264 |
263 A bit more interesting is the \emph{sequence parser combinator}. Given |
265 A bit more interesting is the \emph{sequence parser combinator}. Given |
264 two parsers, say again, $p$ and $q$, we want to apply first the input |
266 two parsers, say again, $p$ and $q$, we want to apply first the input |
265 to $p$ producing a set of pairs; then apply $q$ to all the unparsed |
267 to $p$ producing a set of pairs; then apply $q$ to all the unparsed |
266 parts in the pairs; and then combine the results. Mathematically we would |
268 parts in the pairs; and then combine the results. Mathematically we would |
267 write something like this for the result set of pairs: |
269 write something like this for the set of pairs: |
268 |
270 |
269 \begin{center} |
271 \begin{center} |
270 \begin{tabular}{lcl} |
272 \begin{tabular}{lcl} |
271 $\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\,|\,$ & |
273 $\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\,|\,$ & |
272 $(\textit{output}_1, u_1) \in p(\text{input}) |
274 $(\textit{output}_1, u_1) \in p(\text{input}) |
275 \end{tabular} |
277 \end{tabular} |
276 \end{center} |
278 \end{center} |
277 |
279 |
278 \noindent Notice that the $p$ will first be run on the input, |
280 \noindent Notice that the $p$ will first be run on the input, |
279 producing pairs of the form $(\textit{output}_1, u_1)$ where the $u_1$ |
281 producing pairs of the form $(\textit{output}_1, u_1)$ where the $u_1$ |
280 stands for the unprocessed, or leftover, parts pf $p$. We want that |
282 stands for the unprocessed, or leftover, parts of $p$. We want that |
281 $q$ runs on all these unprocessed parts $u_1$. Therefore these |
283 $q$ runs on all these unprocessed parts $u_1$. Therefore these |
282 unprocessed parts are fed into the second parser $q$. The overall |
284 unprocessed parts are fed into the second parser $q$. The overall |
283 result of the sequence parser combinator is pairs of the form |
285 result of the sequence parser combinator is pairs of the form |
284 $((\textit{output}_1, \textit{output}_2), u_2)$. This means the |
286 $((\textit{output}_1, \textit{output}_2), u_2)$. This means the |
285 unprocessed part of the sequqnce p[arser combinator is the unprocessed |
287 unprocessed part of the sequqnce parser combinator is the unprocessed |
286 part the second parser $q$ leaves as leftover. The processed parts of |
288 part the second parser $q$ leaves as leftover. The parsed parts of the |
287 of the component parsers is a pair consisting of the outputs of $p$ |
289 component parsers are combined in a pair, namely |
288 and $q$, namely $(\textit{output}_1, \textit{output}_2)$. This |
290 $(\textit{output}_1, \textit{output}_2)$. The reason is we want to |
289 behaviour can be implemented in Scala as follows: |
291 know what $p$ and $q$ were able to parse. This behaviour can be |
|
292 implemented in Scala as follows: |
290 |
293 |
291 \begin{center} |
294 \begin{center} |
292 \begin{lstlisting}[language=Scala,numbers=none] |
295 \begin{lstlisting}[language=Scala,numbers=none] |
293 class SeqParser[I, T, S] |
296 class SeqParser[I, T, S] |
294 (p: => Parser[I, T], |
297 (p: => Parser[I, T], |
300 } |
303 } |
301 \end{lstlisting} |
304 \end{lstlisting} |
302 \end{center} |
305 \end{center} |
303 |
306 |
304 \noindent This parser takes again as arguments two parsers, \texttt{p} |
307 \noindent This parser takes again as arguments two parsers, \texttt{p} |
305 and \texttt{q}. It implements \texttt{parse} as follows: let first run |
308 and \texttt{q}. It implements \texttt{parse} as follows: first run the |
306 the parser \texttt{p} on the input producing a set of pairs |
309 parser \texttt{p} on the input producing a set of pairs |
307 (\texttt{output1}, \texttt{u1}). The \texttt{u1} stands for the |
310 (\texttt{output1}, \texttt{u1}). The \texttt{u1} stands for the |
308 unprocessed parts left over by \texttt{p}. Let then \texttt{q} run on these |
311 unprocessed parts left over by \texttt{p} (recall that there can be |
309 unprocessed parts producing again a set of pairs. The output of the |
312 several such pairs). Let then \texttt{q} run on these unprocessed |
310 sequence parser combinator is then a set containing pairs where the |
313 parts producing again a set of pairs. The output of the sequence |
311 first components are again pairs, namely what the first parser could |
314 parser combinator is then a set containing pairs where the first |
312 parse together with what the second parser could parse; the second |
315 components are again pairs, namely what the first parser could parse |
313 component is the unprocessed part left over after running the second |
316 together with what the second parser could parse; the second component |
314 parser \texttt{q}. Therefore the input type of the sequence parser |
317 is the unprocessed part left over after running the second parser |
315 combinator is as usual \texttt{I}, but the output type is |
318 \texttt{q}. Note that the input type of the sequence parser combinator |
|
319 is as usual \texttt{I}, but the output type is |
316 |
320 |
317 \begin{center} |
321 \begin{center} |
318 \texttt{(T, S)} |
322 \texttt{(T, S)} |
319 \end{center} |
323 \end{center} |
320 |
324 |
321 \noindent |
325 \noindent |
322 This means \texttt{parse} in the sequence parser combinator returns |
326 Consequently, the function \texttt{parse} in the sequence parser |
323 sets of type \texttt{Set[((T, S), I)]}. Notice that we have |
327 combinator returns sets of type \texttt{Set[((T, S), I)]}. That means |
324 essentially two output types for the sequence parser combinator |
328 we have essentially two output types for the sequence parser |
325 (packaged in a pair), because in general \textit{p} and \textit{q} |
329 combinator (packaged in a pair), because in general \textit{p} and |
326 might produce different things (for example first we recognise a |
330 \textit{q} might produce different things (for example we recognise a |
327 number and then a string corresponding to an operator). If any of the |
331 number with \texttt{p} and then with \texttt{q} a string corresponding |
328 runs of \textit{p} and \textit{q} fail, that is produce the empty set, |
332 to an operator). If any of the runs of \textit{p} and \textit{q} |
329 then \texttt{parse} will also produce the empty set. |
333 fail, that is produce the empty set, then \texttt{parse} will also |
|
334 produce the empty set. |
330 |
335 |
331 With the shorthand notation we shall introduce later for the sequence |
336 With the shorthand notation we shall introduce later for the sequence |
332 parser combinator, we can write for example \pcode{"a" ~ "b"}, which |
337 parser combinator, we can write for example \pcode{"a" ~ "b"}, which |
333 is the parser combinator that first recognises the character |
338 is the parser combinator that first recognises the character |
334 \texttt{a} from a string and then \texttt{b}. Let us look again at |
339 \texttt{a} from a string and then \texttt{b}. Let us look again at |
335 three examples of how this parser combinator processes some strings: |
340 some examples of how this parser combinator processes some strings: |
336 |
341 |
337 \begin{center} |
342 \begin{center} |
338 \begin{tabular}{rcl} |
343 \begin{tabular}{rcl} |
339 input strings & & output\medskip\\ |
344 input strings & & output\medskip\\ |
340 \texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{cde}})\right\}$\\ |
345 \texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{cde}})\right\}$\\ |
437 nesting of the component parsers. |
442 nesting of the component parsers. |
438 |
443 |
439 |
444 |
440 Consider also carefully that constructing a parser such \pcode{"a" | |
445 Consider also carefully that constructing a parser such \pcode{"a" | |
441 ("a" ~ "b")} will result in a typing error. The intention with this |
446 ("a" ~ "b")} will result in a typing error. The intention with this |
442 parser is that we want to parse an \texttt{a}, or an \texttt{a} |
447 parser is that we want to parse either an \texttt{a}, or an \texttt{a} |
443 followed by a \texttt{b}. However, the first parser has as output type |
448 followed by a \texttt{b}. However, the first parser has as output type |
444 a single character (recall the type of \texttt{CharParser}), but the |
449 a single character (recall the type of \texttt{CharParser}), but the |
445 second parser produces a pair of characters as output. The alternative |
450 second parser produces a pair of characters as output. The alternative |
446 parser is required to have both component parsers to have the same |
451 parser is required to have both component parsers to have the same |
447 type---we need to be able to build the union of two sets, which means |
452 type---the reason is that we need to be able to build the union of two |
448 in Scala they need to be of the same type. Since ther are not, there |
453 sets, which requires in Scala that the sets have the same type. Since |
449 is a typing error in this example. We will see later how we can build |
454 they are not in this case, there is a typing error. We will see later |
450 this parser without the typing error. |
455 how we can build this parser without the typing error. |
451 |
456 |
452 The next parser combinator, called \emph{semantic action}, does not |
457 The next parser combinator, called \emph{semantic action}, does not |
453 actually combine smaller parsers, but applies a function to the result |
458 actually combine two smaller parsers, but applies a function to the result |
454 of a parser. It is implemented in Scala as follows |
459 of a parser. It is implemented in Scala as follows |
455 |
460 |
456 \begin{center} |
461 \begin{center} |
457 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
462 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
458 class FunParser[I, T, S] |
463 class FunParser[I, T, S] |
471 produces sets of type \texttt{Set[(T, I)]}. The semantic action |
476 produces sets of type \texttt{Set[(T, I)]}. The semantic action |
472 combinator then applies the function \texttt{f} to all the `processed' |
477 combinator then applies the function \texttt{f} to all the `processed' |
473 parser outputs. Since this function is of type \texttt{T => S}, we |
478 parser outputs. Since this function is of type \texttt{T => S}, we |
474 obtain a parser with output type \texttt{S}. Again Scala lets us |
479 obtain a parser with output type \texttt{S}. Again Scala lets us |
475 introduce some shorthand notation for this parser |
480 introduce some shorthand notation for this parser |
476 combinator. Therefore we will write \texttt{p ==> f} for it. |
481 combinator. Therefore we will write short \texttt{p ==> f} for it. |
477 |
482 |
478 What are semantic actions good for? Well, they allow you to transform |
483 What are semantic actions good for? Well, they allow you to transform |
479 the parsed input into datastructures you can use for further |
484 the parsed input into datastructures you can use for further |
480 processing. A simple example would be to transform parsed characters |
485 processing. A simple (contrived) example would be to transform parsed |
481 into ASCII numbers. Suppose we define a function \texttt{f} (from |
486 characters into ASCII numbers. Suppose we define a function \texttt{f} |
482 characters to ints) and use a \texttt{CharParser} for parsing |
487 (from characters to \texttt{Int}s) and use a \texttt{CharParser} for parsing |
483 the character \texttt{c}. |
488 the character \texttt{c}. |
|
489 |
484 |
490 |
485 \begin{center} |
491 \begin{center} |
486 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
492 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
487 val f = (c: Char) => c.toInt |
493 val f = (c: Char) => c.toInt |
488 val c = new CharParser('c') |
494 val c = new CharParser('c') |
503 In the first line we obtain the expected result \texttt{Set(('c', |
509 In the first line we obtain the expected result \texttt{Set(('c', |
504 "bd"))}, whereas the second produces \texttt{Set((99, "bd"))}---the |
510 "bd"))}, whereas the second produces \texttt{Set((99, "bd"))}---the |
505 character has been transformed into an ASCII number. |
511 character has been transformed into an ASCII number. |
506 |
512 |
507 A slightly less contrived example is about parsing numbers (recall |
513 A slightly less contrived example is about parsing numbers (recall |
508 \texttt{NumParser} above). However, we want to do this here for strings. |
514 \texttt{NumParser} above). However, we want to do this here for |
509 For this assume we have the following \texttt{RegexParser}. |
515 strings, not for tokens. For this assume we have the following |
|
516 (atomic) \texttt{RegexParser}. |
510 |
517 |
511 \begin{center} |
518 \begin{center} |
512 \begin{lstlisting}[language=Scala,xleftmargin=0mm, |
519 \begin{lstlisting}[language=Scala,xleftmargin=0mm, |
513 basicstyle=\small\ttfamily, numbers=none] |
520 basicstyle=\small\ttfamily, numbers=none] |
514 import scala.util.matching.Regex |
521 import scala.util.matching.Regex |
525 \noindent |
532 \noindent |
526 This parser takes a regex as argument and splits up a string into a |
533 This parser takes a regex as argument and splits up a string into a |
527 prefix and the rest according to this regex |
534 prefix and the rest according to this regex |
528 (\texttt{reg.findPrefixMatchOf} generates a match---in the successful |
535 (\texttt{reg.findPrefixMatchOf} generates a match---in the successful |
529 case---and the corresponding strings can be extracted with |
536 case---and the corresponding strings can be extracted with |
530 \texttt{matched} and \texttt{after}). Using this parser we can define a |
537 \texttt{matched} and \texttt{after}). The input and output type for |
531 \texttt{NumParser} for strings as follows: |
538 this parser is \texttt{String}. Using \texttt{RegexParser} we can |
|
539 define a \texttt{NumParser} for \texttt{Strings} to \texttt{Int} as |
|
540 follows: |
532 |
541 |
533 \begin{center} |
542 \begin{center} |
534 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
543 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
535 val NumParser = RegexParser("[0-9]+".r) |
544 val NumParser = RegexParser("[0-9]+".r) |
536 \end{lstlisting} |
545 \end{lstlisting} |
537 \end{center} |
546 \end{center} |
538 |
547 |
539 \noindent |
548 \noindent |
540 This parser will recognise a number at the beginning of a string, for |
549 This parser will recognise a number at the beginning of a string. For |
541 example |
550 example |
542 |
551 |
543 \begin{center} |
552 \begin{center} |
544 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
553 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
545 NumParser.parse("123abc") |
554 NumParser.parse("123abc") |
558 \end{lstlisting} |
567 \end{lstlisting} |
559 \end{center} |
568 \end{center} |
560 |
569 |
561 \noindent |
570 \noindent |
562 The function in the semantic action converts a string into an |
571 The function in the semantic action converts a string into an |
563 \texttt{Int}. Let us come back to semantic actions when we are going |
572 \texttt{Int}. Now \texttt{parse} generates \texttt{Set((123,abc))}, |
564 to implement actual context-free gammars. |
573 but this time \texttt{123} is an \texttt{Int}. Let us come back to |
|
574 semantic actions when we are going to implement actual context-free |
|
575 gammars. |
565 |
576 |
566 \subsubsection*{Shorthand notation for parser combinators} |
577 \subsubsection*{Shorthand notation for parser combinators} |
567 |
578 |
568 Before we proceed, let us just explain the shorthand notation for |
579 Before we proceed, let us just explain the shorthand notation for |
569 parser combinators. Like for regular expressions, the shorthand notation |
580 parser combinators. Like for regular expressions, the shorthand notation |
570 will make our life much easier when writing actual parsers. We can define |
581 will make our life much easier when writing actual parsers. We can define |
571 some implicits which allow us to write \pcode{p | q}, \pcode{p ~ q} and |
582 some implicits which allow us to write |
572 \pcode{p ==> f} as well as to use plain strings for specifying simple |
583 |
573 string parsers. |
584 \begin{center} |
|
585 \begin{tabular}{ll} |
|
586 \pcode{p | q} & alternative parser\\ |
|
587 \pcode{p ~ q} & sequence parser\\ |
|
588 \pcode{p ==> f} & semantic action parser |
|
589 \end{tabular} |
|
590 \end{center} |
|
591 |
|
592 \noindent |
|
593 as well as to use plain strings for specifying simple string parsers. |
574 |
594 |
575 The idea is that this shorthand notation allows us to easily translate |
595 The idea is that this shorthand notation allows us to easily translate |
576 context-free grammars into code. For example recall our context-free |
596 context-free grammars into code. For example recall our context-free |
577 grammar for palindromes: |
597 grammar for palindromes: |
578 |
598 |
579 \begin{plstx}[margin=3cm] |
599 \begin{plstx}[margin=3cm] |
580 : \meta{P} ::= a\cdot \meta{P}\cdot a | b\cdot \meta{P}\cdot b | a | b | \epsilon\\ |
600 : \meta{Pal} ::= a\cdot \meta{Pal}\cdot a | b\cdot \meta{Pal}\cdot b | a | b | \epsilon\\ |
581 \end{plstx} |
601 \end{plstx} |
582 |
602 |
583 \noindent |
603 \noindent |
584 Each alternative in this grammar translates into an alternative parser |
604 Each alternative in this grammar translates into an alternative parser |
585 combinator. The $\cdot$ can be translated to a sequence parser |
605 combinator. The $\cdot$ can be translated to a sequence parser |
605 \noindent |
625 \noindent |
606 Unfortunately, this does not quite work yet as it produces a typing |
626 Unfortunately, this does not quite work yet as it produces a typing |
607 error. The reason is that the parsers \texttt{"a"}, \texttt{"b"} and |
627 error. The reason is that the parsers \texttt{"a"}, \texttt{"b"} and |
608 \texttt{""} all produce strings as output type and therefore can be |
628 \texttt{""} all produce strings as output type and therefore can be |
609 put into an alternative \texttt{...| "a" | "b" | ""}. But both |
629 put into an alternative \texttt{...| "a" | "b" | ""}. But both |
610 \pcode{"a" ~ Pal ~ "a"} and \pcode{"b" ~ Pal ~ "b"} produce pairs of |
630 sequence parsers \pcode{"a" ~ Pal ~ "a"} and \pcode{"b" ~ Pal ~ "b"} |
611 the form $(((\_, \_), \_), \_)$---that is how the sequence parser |
631 produce pairs of the form |
612 combinator nests results when \pcode{\~} is used between two |
632 |
613 components. The solution is to use a semantic action that ``flattens'' |
633 \begin{center} |
614 these pairs and appends the corresponding strings, like |
634 (((\texttt{a}-part, \texttt{Pal}-part), \texttt{a}-part), unprocessed part) |
|
635 \end{center} |
|
636 |
|
637 \noindent That is how the |
|
638 sequence parser combinator nests results when \pcode{\~} is used |
|
639 between two components. The solution is to use a semantic action that |
|
640 ``flattens'' these pairs and appends the corresponding strings, like |
615 |
641 |
616 \begin{center} |
642 \begin{center} |
617 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
643 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
618 lazy val Pal : Parser[String, String] = |
644 lazy val Pal : Parser[String, String] = |
619 (("a" ~ Pal ~ "a") ==> { case ((x, y), z) => x + y + z } | |
645 (("a" ~ Pal ~ "a") ==> { case ((x, y), z) => x + y + z } | |
621 "a" | "b" | "") |
647 "a" | "b" | "") |
622 \end{lstlisting} |
648 \end{lstlisting} |
623 \end{center} |
649 \end{center} |
624 |
650 |
625 \noindent |
651 \noindent |
626 Now in all cases we have strings as output type for the parser variants. |
652 How does this work? Well, recall again what the pairs look like for |
627 The semantic action |
653 the parser \pcode{"a" ~ Pal ~ "a"}. The pattern in the semantic |
628 |
654 action matches the nested pairs (the \texttt{x} with the |
629 |
655 \texttt{a}-part and so on). Unfortunately when we have such nested |
630 \noindent |
656 pairs, Scala requires us to define the function using the |
631 Important to note is that we must define \texttt{Pal}-parser as a |
657 \pcode{case}-syntax |
632 \emph{lazy} value. |
658 |
|
659 \begin{center} |
|
660 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
|
661 { case ((x, y), z) => ... } |
|
662 \end{lstlisting} |
|
663 \end{center} |
|
664 |
|
665 \noindent |
|
666 If we have more sequence parser combinators or have them differently nested, |
|
667 then the pattern in the semantic action needs to be adjusted accordingly. |
|
668 The action we implement above is to concatenate all three strings, which |
|
669 means after the semantic action is applied the output type of the parser |
|
670 is \texttt{String}, which means it fits with the alternative parsers |
|
671 \texttt{...| "a" | "b" | ""}. |
|
672 |
|
673 If we run the parser above with \pcode{Pal.parse_all("abaaaba")} we obtain |
|
674 as result the \pcode{Set(abaaaba)}, which indicates that the string is a palindrom |
|
675 (an empty set would mean something is wrong). But also notice what the |
|
676 intermediate results are generated by \pcode{Pal.parse("abaaaba")} |
|
677 |
|
678 \begin{center} |
|
679 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
|
680 Set((abaaaba,""),(aba,aaba), (a,baaaba), ("",abaaaba)) |
|
681 \end{lstlisting} |
|
682 \end{center} |
|
683 |
|
684 \noindent |
|
685 That there are more than one output might be slightly unexpected, but |
|
686 can be explained as follows: the pairs represent all possible |
|
687 (partial) parses of the string \pcode{"abaaaba"}. The first pair above |
|
688 correesponds to a complete parse (all output is consumed) and this is |
|
689 what \pcode{Pal.parse_all} returns. The second pair is a small |
|
690 ``sub-palindrome'' that can also be parsed, but the parse fails with |
|
691 the rest \pcode{aaba}, which is therefore left as unprocessed. The |
|
692 third one is an attempt to parse the whole string with the |
|
693 single-character parser \pcode{a}. That of course only partially |
|
694 succeeds, by leaving \pcode{"baaaba"} as the unprocessed |
|
695 part. Finally, since we allow the empty string to be a palindrom we |
|
696 also obtain the last pair, where actually nothing is consumed from the |
|
697 input string. While all this works as intended, we need to be careful |
|
698 with this (especially with including the \pcode{""} parser in our |
|
699 grammar): if during parsing the set of parsing attempts gets too big, |
|
700 then the parsing process can become very slow as the potential |
|
701 candidates for applying rules can snowball. |
|
702 |
|
703 |
|
704 Important is also to note is that we must define the |
|
705 \texttt{Pal}-parser as a \emph{lazy} value in Scala. Look again at the |
|
706 code: \texttt{Pal} occurs on the right-hand side of the definition. If we had |
|
707 just written |
|
708 |
|
709 \begin{center} |
|
710 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
|
711 val Pal : Parser[String, String] = ...rhs... |
|
712 \end{lstlisting} |
|
713 \end{center} |
|
714 |
|
715 \noindent |
|
716 then Scala before making this assignemnt to \texttt{Pal} attempts to |
|
717 find out what the expression on the right-hand side evaluates to. This |
|
718 is straightforward in case of simple expressions \texttt{2 + 3}, but |
|
719 the expression above contains \texttt{Pal} in the right-hand |
|
720 side. Without \pcode{lazy} it would try to evaluate what \texttt{Pal} |
|
721 evaluates to and start a new recursion, which means it falls into an |
|
722 infinite loop. The definition of \texttt{Pal} is recursive and the |
|
723 \pcode{lazy} key-word prevents it from being fully evaluated. Therefore |
|
724 whenever we want to define a recursive parser we have to write |
|
725 |
|
726 \begin{center} |
|
727 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
|
728 lazy val SomeParser : Parser[...,...] = ...rhs... |
|
729 \end{lstlisting} |
|
730 \end{center} |
|
731 |
|
732 \noindent That was not necessary for our atomic parsers, like |
|
733 \texttt{RegexParser} or \texttt{CharParser}, because they are not recursive. |
|
734 Note that this is also the reason why we had to write |
|
735 |
|
736 \begin{center} |
|
737 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
|
738 class AltParser[I, T] |
|
739 (p: => Parser[I, T], |
|
740 q: => Parser[I, T]) extends Parser[I, T] {...} |
|
741 |
|
742 class SeqParser[I, T, S] |
|
743 (p: => Parser[I, T], |
|
744 q: => Parser[I, S]) extends Parser[I, (T, S)] {...} |
|
745 \end{lstlisting} |
|
746 \end{center} |
|
747 |
|
748 \noindent where the \texttt{\textbf{\textcolor{codepurple}{=>}}} in front of |
|
749 the argument types for \texttt{p} and \texttt{q} prevent Scala from |
|
750 evaluating the arguments. Normally, Scala would first evaluate what |
|
751 kind of parsers \texttt{p} and \texttt{q} are, and only then generate |
|
752 the alternative parser combinator, repsectively sequence parser |
|
753 combinator. Since the argumants can be recursive parsers, such as |
|
754 \texttt{Pal}, this would lead again to an infinite loop. |
|
755 |
|
756 As a final example in this section, let us consider the grammar for |
|
757 well-nested parentheses: |
|
758 |
|
759 \begin{plstx}[margin=3cm] |
|
760 : \meta{P} ::= (\cdot \meta{P}\cdot ) \cdot \meta{P} | \epsilon\\ |
|
761 \end{plstx} |
|
762 |
|
763 \noindent |
|
764 Let us assume we want to not just recognise strings of |
|
765 well-nested parentheses but also transfrom round parentheses |
|
766 into curly braces. We can do this by using a semantic |
|
767 action: |
|
768 |
|
769 \begin{center} |
|
770 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, |
|
771 xleftmargin=0mm, numbers=none] |
|
772 lazy val P : Parser[String, String] = |
|
773 "(" ~ P ~ ")" ~ P ==> { case (((_,x),_),y) => "{" + x + "}" + y } | "" |
|
774 \end{lstlisting} |
|
775 \end{center} |
|
776 |
|
777 \noindent |
|
778 Here we define a function where which ignores the parentheses in the |
|
779 pairs, but replaces them in the right places with curly braces when |
|
780 assembling the new string in the right-hand side. If we run |
|
781 \pcode{P.parse_all("(((()()))())")} we obtain |
|
782 \texttt{Set(\{\{\{\{\}\{\}\}\}\{\}\})} as expected. |
|
783 |
|
784 |
633 |
785 |
634 \subsubsection*{Implementing an Interpreter} |
786 \subsubsection*{Implementing an Interpreter} |
635 |
787 |
636 %\bigskip |
788 %\bigskip |
637 %takes advantage of the full generality---have a look |
789 %takes advantage of the full generality---have a look |