30 $\Rightarrow$ |
30 $\Rightarrow$ |
31 $\underbrace{\text{set of (parsed input, unparsed input)}}_{\text{output}}$ |
31 $\underbrace{\text{set of (parsed input, unparsed input)}}_{\text{output}}$ |
32 \end{center} |
32 \end{center} |
33 |
33 |
34 \noindent |
34 \noindent |
35 Given the extended effort we have spent in order to implement a lexer |
35 Given the extended effort we have spent implementing a lexer |
36 in order to generate list of tokens, it might be surprising that in |
36 in order to generate list of tokens, it might be surprising that in |
37 what follows we shall often use strings as input. This is for making |
37 what follows we shall often use strings as input. This is for making |
38 the explanation more lucid. It does not make our previous work on |
38 the explanation more lucid. It does not make our previous work on |
39 lexers obsolete (remember they transform a string into a list of |
39 lexers obsolete (remember they transform a string into a list of |
40 tokens). Lexers will still be needed for building a somewhat realistic |
40 tokens). Lexers will still be needed for building a somewhat realistic |
146 if (in.head == c) Set((c, in.tail)) else Set() |
146 if (in.head == c) Set((c, in.tail)) else Set() |
147 } |
147 } |
148 \end{lstlisting} |
148 \end{lstlisting} |
149 \end{center} |
149 \end{center} |
150 |
150 |
151 \noindent You can see the \texttt{parse} tests whether the |
151 \noindent You can see \texttt{parse} tests whether the |
152 first character of the input string \texttt{in} is equal to |
152 first character of the input string \texttt{in} is equal to |
153 \texttt{c}. If yes, then it splits the string into the recognised part |
153 \texttt{c}. If yes, then it splits the string into the recognised part |
154 \texttt{c} and the unprocessed part \texttt{in.tail}. In case |
154 \texttt{c} and the unprocessed part \texttt{in.tail}. In case |
155 \texttt{in} does not start with \texttt{c} then the parser returns the |
155 \texttt{in} does not start with \texttt{c} then the parser returns the |
156 empty set (in Scala \texttt{Set()}). Since this parser recognises |
156 empty set (in Scala \texttt{Set()}). Since this parser recognises |
172 \end{center} |
172 \end{center} |
173 |
173 |
174 \noindent |
174 \noindent |
175 In this parser the input is of type \texttt{List[Token]}. The function |
175 In this parser the input is of type \texttt{List[Token]}. The function |
176 parse looks at the input \texttt{ts} and checks whether the first |
176 parse looks at the input \texttt{ts} and checks whether the first |
177 token is a \texttt{Num\_token}. Let us assume our lexer generated |
177 token is a \texttt{Num\_token} (let us assume our lexer generated |
178 these tokens for numbers. But this parser does not just return this |
178 these tokens for numbers). But this parser does not just return this |
179 token (and the rest of the list), like the \texttt{CharParser} above, |
179 token (and the rest of the list), like the \texttt{CharParser} above, |
180 rather extracts the string \texttt{s} from the token and converts it |
180 rather it extracts the string \texttt{s} from the token and converts it |
181 into an integer. The hope is that the lexer did its work well and this |
181 into an integer. The hope is that the lexer did its work well and this |
182 conversion always succeeds. The consequence of this is that the output |
182 conversion always succeeds. The consequence of this is that the output |
183 type for this parser is \texttt{Int}, not \texttt{Token}. Such a |
183 type for this parser is \texttt{Int}, not \texttt{Token}. Such a |
184 conversion would be needed if we want to implement a simple calculator |
184 conversion would be needed if we want to implement a simple calculator |
185 program. |
185 program. |
273 \end{tabular} |
273 \end{tabular} |
274 \end{center} |
274 \end{center} |
275 |
275 |
276 \noindent Notice that the $p$ wil first be run on the input, producing |
276 \noindent Notice that the $p$ wil first be run on the input, producing |
277 pairs of the form $(\textit{output}_1, u_1)$ where the $u_1$ stands |
277 pairs of the form $(\textit{output}_1, u_1)$ where the $u_1$ stands |
278 for the unprocessed, or left-over, parts. We want that $q$ runs on all |
278 for the unprocessed, or leftover, parts. We want that $q$ runs on all |
279 these unprocessed parts $u_1$. This again will produce some |
279 these unprocessed parts $u_1$. This again will produce some |
280 processed part , $p$ and |
280 processed part , $p$ and |
281 $q$, we apply both parsers to the input (remember parsers are |
281 $q$, we apply both parsers to the input (remember parsers are |
282 functions) and combine the output (remember they are sets of pairs) |
282 functions) and combine the output (remember they are sets of pairs) |
283 |
283 |
323 new AltParser(CharParser('a'), CharParser('b')) |
323 new AltParser(CharParser('a'), CharParser('b')) |
324 \end{lstlisting} |
324 \end{lstlisting} |
325 \end{center} |
325 \end{center} |
326 |
326 |
327 \noindent Later on we will use again Scala mechanism for introducing some |
327 \noindent Later on we will use again Scala mechanism for introducing some |
328 more readable shorthand notation for this, like \texttt{"a" || "b"}. Let us look in detail at what this parser combinator produces with some sample strings |
328 more readable shorthand notation for this, like \texttt{"a" | "b"}. Let us look in detail at what this parser combinator produces with some sample strings |
329 |
329 |
330 \begin{center} |
330 \begin{center} |
331 \begin{tabular}{rcl} |
331 \begin{tabular}{rcl} |
332 input strings & & output\medskip\\ |
332 input strings & & output\medskip\\ |
333 \texttt{\Grid{acde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{cde}})\right\}$\\ |
333 \texttt{\Grid{acde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{cde}})\right\}$\\ |
361 producing pairs of the form $\textit{output}_1$ and unprocessed part |
361 producing pairs of the form $\textit{output}_1$ and unprocessed part |
362 $u_1$. This unprocessed part is fed into the second parser $q$. The |
362 $u_1$. This unprocessed part is fed into the second parser $q$. The |
363 overall result of the sequence parser combinator is pairs of the form |
363 overall result of the sequence parser combinator is pairs of the form |
364 $((\textit{output}_1, \textit{output}_2), u_2)$. This means the |
364 $((\textit{output}_1, \textit{output}_2), u_2)$. This means the |
365 unprocessed part of both parsers is the unprocessed part the second |
365 unprocessed part of both parsers is the unprocessed part the second |
366 parser $q$ produces leaves as left-over. The processed parts of both |
366 parser $q$ leaves as leftover. The processed parts of both |
367 parsers is a pair consisting of the outputs of $p$ and $q$, namely |
367 parsers is a pair consisting of the outputs of $p$ and $q$, namely |
368 $(\textit{output}_1, \textit{output}_2)$. This behaviour can be |
368 $(\textit{output}_1, \textit{output}_2)$. This behaviour can be |
369 implemented in Scala as follows: |
369 implemented in Scala as follows: |
370 |
370 |
371 \begin{center} |
371 \begin{center} |
429 \emph{not} a simple pair \texttt{(ab, cde)} as one might erroneously |
429 \emph{not} a simple pair \texttt{(ab, cde)} as one might erroneously |
430 expect. The parser returns the empty set in the other examples, |
430 expect. The parser returns the empty set in the other examples, |
431 because they do not fit with what the parser is supposed to parse. |
431 because they do not fit with what the parser is supposed to parse. |
432 |
432 |
433 |
433 |
434 A slightly more complicated parser is \pcode{("a" || "b") ~ "c"} which |
434 A slightly more complicated parser is \pcode{("a" | "b") ~ "c"} which |
435 parses as first character either an \texttt{a} or \texttt{b}, followed |
435 parses as first character either an \texttt{a} or \texttt{b}, followed |
436 by a \texttt{c}. This parser produces the following outputs. |
436 by a \texttt{c}. This parser produces the following outputs. |
437 |
437 |
438 \begin{center} |
438 \begin{center} |
439 \begin{tabular}{rcl} |
439 \begin{tabular}{rcl} |
460 |
460 |
461 |
461 |
462 \noindent The second and third example fail, because something is |
462 \noindent The second and third example fail, because something is |
463 ``missing'' in the sequence we are looking for. Also notice how the |
463 ``missing'' in the sequence we are looking for. Also notice how the |
464 results nest with sequences: the parsed part is a nested pair of the |
464 results nest with sequences: the parsed part is a nested pair of the |
465 form \pcode{((a, b), c)}. Two more examples: first consider the parser |
465 form \pcode{((a, b), c)}. If we nest the sequence parser differently, |
|
466 for example \pcode{"a" ~ ("b" ~ "c")}, then also our output pairs nest |
|
467 differently |
|
468 |
|
469 \begin{center} |
|
470 \begin{tabular}{rcl} |
|
471 input strings & & output\medskip\\ |
|
472 \texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}},(\texttt{\Grid{b}}, \texttt{\Grid{c}})), \texttt{\Grid{de}})\right\}$\\ |
|
473 \end{tabular} |
|
474 \end{center} |
|
475 |
|
476 \noindent |
|
477 Two more examples: first consider the parser |
466 \pcode{("a" ~ "a") ~ "a"} and the input \pcode{aaaa}: |
478 \pcode{("a" ~ "a") ~ "a"} and the input \pcode{aaaa}: |
467 |
479 |
468 \begin{center} |
480 \begin{center} |
469 \begin{tabular}{rcl} |
481 \begin{tabular}{rcl} |
470 input string & & output\medskip\\ |
482 input string & & output\medskip\\ |
499 from the pairs; everything where the second part was not empty has |
511 from the pairs; everything where the second part was not empty has |
500 been thrown away as well, because they represent |
512 been thrown away as well, because they represent |
501 ultimately-unsuccessful-parses. |
513 ultimately-unsuccessful-parses. |
502 |
514 |
503 |
515 |
504 Note carefully that constructing a parser such \pcode{"a" || ("a" ~ |
516 Note carefully that constructing a parser such \pcode{"a" | ("a" ~ |
505 "b")} will result in a typing error. The intention is that we want |
517 "b")} will result in a typing error. The intention is that we want |
506 to parse an \texttt{a}, or an \texttt{a} followed by a |
518 to parse an \texttt{a}, or an \texttt{a} followed by a |
507 \texttt{b}. However, the first parser has as output type a single |
519 \texttt{b}. However, the first parser has as output type a single |
508 character (recall the type of \texttt{CharParser}), but the second |
520 character (recall the type of \texttt{CharParser}), but the second |
509 parser produces a pair of characters as output. The alternative parser |
521 parser produces a pair of characters as output. The alternative parser |
510 is required to have both component parsers to have the same type---we |
522 is required to have both component parsers to have the same type---we |
511 need to be able to build the union of two sets, which means in Scala |
523 need to be able to build the union of two sets, which means in Scala |
512 they need to be of the same type. We will see later how we can build |
524 they need to be of the same type. Therefore the typing error. |
|
525 We will see later how we can build |
513 this parser without the typing error. |
526 this parser without the typing error. |
514 |
527 |
515 The next parser combinator, called \emph{semantic action}, does not |
528 The next parser combinator, called \emph{semantic action}, does not |
516 actually combine smaller parsers, but applies a function to the result |
529 actually combine smaller parsers, but applies a function to the result |
517 of a parser. It is implemented in Scala as follows |
530 of a parser. It is implemented in Scala as follows |
530 |
543 |
531 \noindent This parser combinator takes a parser \texttt{p} |
544 \noindent This parser combinator takes a parser \texttt{p} |
532 with output type \texttt{T} as one argument as well as a |
545 with output type \texttt{T} as one argument as well as a |
533 function \texttt{f} with type \texttt{T => S}. The parser |
546 function \texttt{f} with type \texttt{T => S}. The parser |
534 \texttt{p} produces sets of type \texttt{(T, I)}. The |
547 \texttt{p} produces sets of type \texttt{(T, I)}. The |
535 semantic action combinastor then applies the function |
548 semantic action combinator then applies the function |
536 \texttt{f} to all the parser outputs. Since this function is of |
549 \texttt{f} to all the parser outputs. Since this function is of |
537 type \texttt{T => S}, we obtain a parser with output type |
550 type \texttt{T => S}, we obtain a parser with output type |
538 \texttt{S}. Again Scala lets us introduce some shorthand |
551 \texttt{S}. Again Scala lets us introduce some shorthand |
539 notation for this parser combinator. Therefore we will write |
552 notation for this parser combinator. Therefore we will write |
540 \texttt{p ==> f} for it. |
553 \texttt{p ==> f} for it. |
541 |
554 |
542 What are semantic actions good for? Well, they allow is to transform |
555 What are semantic actions good for? Well, they allow you to transform |
543 the parsed input into a datastructure we can use for further |
556 the parsed input into a datastructure you can use for further |
544 processing. A simple example would be to transform parsed characters |
557 processing. A simple example would be to transform parsed characters |
545 into ASCII numbers. Suppose we define a function \texttt{f} (from |
558 into ASCII numbers. Suppose we define a function \texttt{f} (from |
546 characters to ints) and use a \texttt{CharParser} for the character \texttt{c}. |
559 characters to ints) and use a \texttt{CharParser} for parsing |
|
560 the character \texttt{c}. |
547 |
561 |
548 \begin{center} |
562 \begin{center} |
549 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
563 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
550 val f = (c: Char) => c.toInt |
564 val f = (c: Char) => c.toInt |
551 val c = new CharParser('c') |
565 val c = new CharParser('c') |
552 \end{lstlisting} |
566 \end{lstlisting} |
553 \end{center} |
567 \end{center} |
554 |
568 |
555 \noindent |
569 \noindent |
556 Then we can run the following parsers on the input \texttt{cbd}: |
570 We then can run the following two parsers on the input \texttt{cbd}: |
557 |
571 |
558 \begin{center} |
572 \begin{center} |
559 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
573 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
560 c.parse("cbd") |
574 c.parse("cbd") |
561 (c ==> f).parse("cbd") |
575 (c ==> f).parse("cbd") |
562 \end{lstlisting} |
576 \end{lstlisting} |
563 \end{center} |
577 \end{center} |
564 |
578 |
565 \noindent |
579 \noindent |
566 The first line we obtain the result \texttt{Set(('c', "bd"))}, whereas the second |
580 In the first line we obtain the expected result \texttt{Set(('c', |
567 produces \texttt{Set((99, "bd"))}---the character has been transformed into |
581 "bd"))}, whereas the second produces \texttt{Set((99, "bd"))}---the |
568 an ASCII number. |
582 character has been transformed into an ASCII number. |
569 |
583 |
570 A slightly less contrived example is about parsing numbers (recall |
584 A slightly less contrived example is about parsing numbers (recall |
571 \texttt{NumParser} above). However, we want to do this here for strings. |
585 \texttt{NumParser} above). However, we want to do this here for strings. |
572 For this assume we have the following \texttt{RegexParser}. |
586 For this assume we have the following \texttt{RegexParser}. |
573 |
587 |
588 \noindent |
602 \noindent |
589 This parser takes a regex as argument and splits up a string into a |
603 This parser takes a regex as argument and splits up a string into a |
590 prefix and the rest according to this regex |
604 prefix and the rest according to this regex |
591 (\texttt{reg.findPrefixMatchOf} generates a match---in the successful |
605 (\texttt{reg.findPrefixMatchOf} generates a match---in the successful |
592 case---and the corresponding strings can be extracted with |
606 case---and the corresponding strings can be extracted with |
593 \texttt{matched} and \texttt{after}). We can now define a |
607 \texttt{matched} and \texttt{after}). Using this parser we can define a |
594 \texttt{NumParser} for strings as follows: |
608 \texttt{NumParser} for strings as follows: |
595 |
609 |
596 \begin{center} |
610 \begin{center} |
597 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
611 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
598 val NumParser = RegexParser("[0-9]+".r) |
612 val NumParser = RegexParser("[0-9]+".r) |
609 \end{lstlisting} |
623 \end{lstlisting} |
610 \end{center} |
624 \end{center} |
611 |
625 |
612 \noindent |
626 \noindent |
613 produces \texttt{Set((123,abc))}. The problem is that \texttt{123} is |
627 produces \texttt{Set((123,abc))}. The problem is that \texttt{123} is |
614 still a string. We need to convert it into the corresponding |
628 still a string (the double-quotes are not printed by Scala). We need |
615 \texttt{Int}. We can do this as follows |
629 to convert it into the corresponding \texttt{Int}. We can do this as |
|
630 follows using a semantic action |
616 |
631 |
617 \begin{center} |
632 \begin{center} |
618 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
633 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
619 (NumParser ==> (s => s.toInt)).parse("123abc") |
634 (NumParser ==> (s => s.toInt)).parse("123abc") |
620 \end{lstlisting} |
635 \end{lstlisting} |
621 \end{center} |
636 \end{center} |
622 |
637 |
623 \noindent |
638 \noindent |
624 The semantic action in form of a function converts a string into an |
639 The function in the semantic action converts a string into an |
625 \texttt{Int}. Let us come back to semantic actions when we are going |
640 \texttt{Int}. Let us come back to semantic actions when we are going |
626 to implement a simple calculator. |
641 to implement actual context-free gammars. |
627 |
642 |
628 \subsubsection*{Shorthand notation for parser combinators} |
643 \subsubsection*{Shorthand notation for parser combinators} |
629 |
644 |
630 Before we proceed, let us just explain the shorthand notation for |
645 Before we proceed, let us just explain the shorthand notation for |
631 parser combinators. Like for regular expressions, the shorthand notation |
646 parser combinators. Like for regular expressions, the shorthand notation |