handouts/ho06.tex
changeset 589 0451b8b67f62
parent 588 a4646557016d
child 590 c6a1e19e9801
equal deleted inserted replaced
588:a4646557016d 589:0451b8b67f62
    30 $\Rightarrow$
    30 $\Rightarrow$
    31 $\underbrace{\text{set of (parsed input, unparsed input)}}_{\text{output}}$
    31 $\underbrace{\text{set of (parsed input, unparsed input)}}_{\text{output}}$
    32 \end{center} 
    32 \end{center} 
    33 
    33 
    34 \noindent
    34 \noindent
    35 Given the extended effort we have spent in order to implement a lexer
    35 Given the extended effort we have spent implementing a lexer
    36 in order to generate list of tokens, it might be surprising that in
    36 in order to generate list of tokens, it might be surprising that in
    37 what follows we shall often use strings as input. This is for making
    37 what follows we shall often use strings as input. This is for making
    38 the explanation more lucid. It does not make our previous work on
    38 the explanation more lucid. It does not make our previous work on
    39 lexers obsolete (remember they transform a string into a list of
    39 lexers obsolete (remember they transform a string into a list of
    40 tokens). Lexers will still be needed for building a somewhat realistic
    40 tokens). Lexers will still be needed for building a somewhat realistic
   146     if (in.head == c) Set((c, in.tail)) else Set()
   146     if (in.head == c) Set((c, in.tail)) else Set()
   147 }
   147 }
   148 \end{lstlisting}
   148 \end{lstlisting}
   149 \end{center}
   149 \end{center}
   150 
   150 
   151 \noindent You can see the \texttt{parse} tests whether the
   151 \noindent You can see \texttt{parse} tests whether the
   152 first character of the input string \texttt{in} is equal to
   152 first character of the input string \texttt{in} is equal to
   153 \texttt{c}. If yes, then it splits the string into the recognised part
   153 \texttt{c}. If yes, then it splits the string into the recognised part
   154 \texttt{c} and the unprocessed part \texttt{in.tail}. In case
   154 \texttt{c} and the unprocessed part \texttt{in.tail}. In case
   155 \texttt{in} does not start with \texttt{c} then the parser returns the
   155 \texttt{in} does not start with \texttt{c} then the parser returns the
   156 empty set (in Scala \texttt{Set()}). Since this parser recognises
   156 empty set (in Scala \texttt{Set()}). Since this parser recognises
   172 \end{center}
   172 \end{center}
   173 
   173 
   174 \noindent
   174 \noindent
   175 In this parser the input is of type \texttt{List[Token]}. The function
   175 In this parser the input is of type \texttt{List[Token]}. The function
   176 parse looks at the input \texttt{ts} and checks whether the first
   176 parse looks at the input \texttt{ts} and checks whether the first
   177 token is a \texttt{Num\_token}. Let us assume our lexer generated
   177 token is a \texttt{Num\_token} (let us assume our lexer generated
   178 these tokens for numbers. But this parser does not just return this
   178 these tokens for numbers). But this parser does not just return this
   179 token (and the rest of the list), like the \texttt{CharParser} above,
   179 token (and the rest of the list), like the \texttt{CharParser} above,
   180 rather extracts the string \texttt{s} from the token and converts it
   180 rather it extracts the string \texttt{s} from the token and converts it
   181 into an integer. The hope is that the lexer did its work well and this
   181 into an integer. The hope is that the lexer did its work well and this
   182 conversion always succeeds. The consequence of this is that the output
   182 conversion always succeeds. The consequence of this is that the output
   183 type for this parser is \texttt{Int}, not \texttt{Token}. Such a
   183 type for this parser is \texttt{Int}, not \texttt{Token}. Such a
   184 conversion would be needed if we want to implement a simple calculator
   184 conversion would be needed if we want to implement a simple calculator
   185 program.
   185 program.
   236 new AltParser(CharParser('a'), CharParser('b'))
   236 new AltParser(CharParser('a'), CharParser('b'))
   237 \end{lstlisting}
   237 \end{lstlisting}
   238 \end{center}
   238 \end{center}
   239 
   239 
   240 \noindent Later on we will use Scala mechanism for introducing some
   240 \noindent Later on we will use Scala mechanism for introducing some
   241 more readable shorthand notation for this, like \texttt{"a" ||
   241 more readable shorthand notation for this, like \texttt{"a" |
   242   "b"}. Let us look in detail at what this parser combinator produces
   242   "b"}. Let us look in detail at what this parser combinator produces
   243 with some sample strings
   243 with some sample strings
   244 
   244 
   245 \begin{center}
   245 \begin{center}
   246 \begin{tabular}{rcl}
   246 \begin{tabular}{rcl}
   273 \end{tabular}
   273 \end{tabular}
   274 \end{center}
   274 \end{center}
   275 
   275 
   276 \noindent Notice that the $p$ wil first be run on the input, producing
   276 \noindent Notice that the $p$ wil first be run on the input, producing
   277 pairs of the form $(\textit{output}_1, u_1)$ where the $u_1$ stands
   277 pairs of the form $(\textit{output}_1, u_1)$ where the $u_1$ stands
   278 for the unprocessed, or left-over, parts. We want that $q$ runs on all
   278 for the unprocessed, or leftover, parts. We want that $q$ runs on all
   279 these unprocessed parts $u_1$. This again will produce some
   279 these unprocessed parts $u_1$. This again will produce some
   280 processed part , $p$ and
   280 processed part , $p$ and
   281 $q$, we apply both parsers to the input (remember parsers are
   281 $q$, we apply both parsers to the input (remember parsers are
   282 functions) and combine the output (remember they are sets of pairs)
   282 functions) and combine the output (remember they are sets of pairs)
   283  
   283  
   323 new AltParser(CharParser('a'), CharParser('b'))
   323 new AltParser(CharParser('a'), CharParser('b'))
   324 \end{lstlisting}
   324 \end{lstlisting}
   325 \end{center}
   325 \end{center}
   326 
   326 
   327 \noindent Later on we will use again Scala mechanism for introducing some
   327 \noindent Later on we will use again Scala mechanism for introducing some
   328 more readable shorthand notation for this, like \texttt{"a" || "b"}. Let us look in detail at what this parser combinator produces with some sample strings 
   328 more readable shorthand notation for this, like \texttt{"a" | "b"}. Let us look in detail at what this parser combinator produces with some sample strings 
   329 
   329 
   330 \begin{center}
   330 \begin{center}
   331 \begin{tabular}{rcl}
   331 \begin{tabular}{rcl}
   332 input strings & & output\medskip\\
   332 input strings & & output\medskip\\
   333 \texttt{\Grid{acde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{cde}})\right\}$\\
   333 \texttt{\Grid{acde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{cde}})\right\}$\\
   361 producing pairs of the form $\textit{output}_1$ and unprocessed part
   361 producing pairs of the form $\textit{output}_1$ and unprocessed part
   362 $u_1$. This unprocessed part is fed into the second parser $q$. The
   362 $u_1$. This unprocessed part is fed into the second parser $q$. The
   363 overall result of the sequence parser combinator is pairs of the form
   363 overall result of the sequence parser combinator is pairs of the form
   364 $((\textit{output}_1, \textit{output}_2), u_2)$. This means the
   364 $((\textit{output}_1, \textit{output}_2), u_2)$. This means the
   365 unprocessed part of both parsers is the unprocessed part the second
   365 unprocessed part of both parsers is the unprocessed part the second
   366 parser $q$ produces leaves as left-over. The processed parts of both
   366 parser $q$ leaves as leftover. The processed parts of both
   367 parsers is a pair consisting of the outputs of $p$ and $q$, namely
   367 parsers is a pair consisting of the outputs of $p$ and $q$, namely
   368 $(\textit{output}_1, \textit{output}_2)$. This behaviour can be
   368 $(\textit{output}_1, \textit{output}_2)$. This behaviour can be
   369 implemented in Scala as follows:
   369 implemented in Scala as follows:
   370 
   370 
   371 \begin{center}
   371 \begin{center}
   429 \emph{not} a simple pair \texttt{(ab, cde)} as one might erroneously
   429 \emph{not} a simple pair \texttt{(ab, cde)} as one might erroneously
   430 expect.  The parser returns the empty set in the other examples,
   430 expect.  The parser returns the empty set in the other examples,
   431 because they do not fit with what the parser is supposed to parse.
   431 because they do not fit with what the parser is supposed to parse.
   432 
   432 
   433 
   433 
   434 A slightly more complicated parser is \pcode{("a" || "b") ~ "c"} which
   434 A slightly more complicated parser is \pcode{("a" | "b") ~ "c"} which
   435 parses as first character either an \texttt{a} or \texttt{b}, followed
   435 parses as first character either an \texttt{a} or \texttt{b}, followed
   436 by a \texttt{c}. This parser produces the following outputs.
   436 by a \texttt{c}. This parser produces the following outputs.
   437 
   437 
   438 \begin{center}
   438 \begin{center}
   439 \begin{tabular}{rcl}
   439 \begin{tabular}{rcl}
   460 
   460 
   461 
   461 
   462 \noindent The second and third example fail, because something is
   462 \noindent The second and third example fail, because something is
   463 ``missing'' in the sequence we are looking for. Also notice how the
   463 ``missing'' in the sequence we are looking for. Also notice how the
   464 results nest with sequences: the parsed part is a nested pair of the
   464 results nest with sequences: the parsed part is a nested pair of the
   465 form \pcode{((a, b), c)}. Two more examples: first consider the parser
   465 form \pcode{((a, b), c)}. If we nest the sequence parser differently,
       
   466 for example \pcode{"a" ~ ("b" ~ "c")}, then also our output pairs nest
       
   467 differently
       
   468 
       
   469 \begin{center}
       
   470 \begin{tabular}{rcl}
       
   471 input strings & & output\medskip\\
       
   472 \texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}},(\texttt{\Grid{b}}, \texttt{\Grid{c}})), \texttt{\Grid{de}})\right\}$\\
       
   473 \end{tabular}
       
   474 \end{center}
       
   475 
       
   476 \noindent
       
   477 Two more examples: first consider the parser
   466 \pcode{("a" ~ "a") ~ "a"} and the input \pcode{aaaa}:
   478 \pcode{("a" ~ "a") ~ "a"} and the input \pcode{aaaa}:
   467 
   479 
   468 \begin{center}
   480 \begin{center}
   469 \begin{tabular}{rcl}
   481 \begin{tabular}{rcl}
   470 input string & & output\medskip\\
   482 input string & & output\medskip\\
   499 from the pairs; everything where the second part was not empty has
   511 from the pairs; everything where the second part was not empty has
   500 been thrown away as well, because they represent
   512 been thrown away as well, because they represent
   501 ultimately-unsuccessful-parses.
   513 ultimately-unsuccessful-parses.
   502 
   514 
   503 
   515 
   504 Note carefully that constructing a parser such \pcode{"a" || ("a" ~
   516 Note carefully that constructing a parser such \pcode{"a" | ("a" ~
   505   "b")} will result in a typing error. The intention is that we want
   517   "b")} will result in a typing error. The intention is that we want
   506 to parse an \texttt{a}, or an \texttt{a} followed by a
   518 to parse an \texttt{a}, or an \texttt{a} followed by a
   507 \texttt{b}. However, the first parser has as output type a single
   519 \texttt{b}. However, the first parser has as output type a single
   508 character (recall the type of \texttt{CharParser}), but the second
   520 character (recall the type of \texttt{CharParser}), but the second
   509 parser produces a pair of characters as output. The alternative parser
   521 parser produces a pair of characters as output. The alternative parser
   510 is required to have both component parsers to have the same type---we
   522 is required to have both component parsers to have the same type---we
   511 need to be able to build the union of two sets, which means in Scala
   523 need to be able to build the union of two sets, which means in Scala
   512 they need to be of the same type.  We will see later how we can build
   524 they need to be of the same type.  Therefore the typing error.
       
   525 We will see later how we can build
   513 this parser without the typing error.
   526 this parser without the typing error.
   514 
   527 
   515 The next parser combinator, called \emph{semantic action}, does not
   528 The next parser combinator, called \emph{semantic action}, does not
   516 actually combine smaller parsers, but applies a function to the result
   529 actually combine smaller parsers, but applies a function to the result
   517 of a parser.  It is implemented in Scala as follows
   530 of a parser.  It is implemented in Scala as follows
   530 
   543 
   531 \noindent This parser combinator takes a parser \texttt{p}
   544 \noindent This parser combinator takes a parser \texttt{p}
   532 with output type \texttt{T} as one argument as well as a
   545 with output type \texttt{T} as one argument as well as a
   533 function \texttt{f} with type \texttt{T => S}. The parser
   546 function \texttt{f} with type \texttt{T => S}. The parser
   534 \texttt{p} produces sets of type \texttt{(T, I)}. The
   547 \texttt{p} produces sets of type \texttt{(T, I)}. The
   535 semantic action combinastor then applies the function
   548 semantic action combinator then applies the function
   536 \texttt{f} to all the parser outputs. Since this function is of
   549 \texttt{f} to all the parser outputs. Since this function is of
   537 type \texttt{T => S}, we obtain a parser with output type
   550 type \texttt{T => S}, we obtain a parser with output type
   538 \texttt{S}. Again Scala lets us introduce some shorthand
   551 \texttt{S}. Again Scala lets us introduce some shorthand
   539 notation for this parser combinator. Therefore we will write
   552 notation for this parser combinator. Therefore we will write
   540 \texttt{p ==> f} for it.
   553 \texttt{p ==> f} for it.
   541 
   554 
   542 What are semantic actions good for? Well, they allow is to transform
   555 What are semantic actions good for? Well, they allow you to transform
   543 the parsed input into a datastructure we can use for further
   556 the parsed input into a datastructure you can use for further
   544 processing. A simple example would be to transform parsed characters
   557 processing. A simple example would be to transform parsed characters
   545 into ASCII numbers. Suppose we define a function \texttt{f} (from
   558 into ASCII numbers. Suppose we define a function \texttt{f} (from
   546 characters to ints) and use a \texttt{CharParser} for the character \texttt{c}.
   559 characters to ints) and use a \texttt{CharParser} for parsing
       
   560 the character \texttt{c}.
   547 
   561 
   548 \begin{center}
   562 \begin{center}
   549 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
   563 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
   550 val f = (c: Char) => c.toInt
   564 val f = (c: Char) => c.toInt
   551 val c = new CharParser('c')
   565 val c = new CharParser('c')
   552 \end{lstlisting}
   566 \end{lstlisting}
   553 \end{center}
   567 \end{center}
   554 
   568 
   555 \noindent
   569 \noindent
   556 Then we can run the following parsers on the input \texttt{cbd}:
   570 We then can run the following two parsers on the input \texttt{cbd}:
   557 
   571 
   558 \begin{center}
   572 \begin{center}
   559 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
   573 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
   560 c.parse("cbd")
   574 c.parse("cbd")
   561 (c ==> f).parse("cbd")
   575 (c ==> f).parse("cbd")
   562 \end{lstlisting}
   576 \end{lstlisting}
   563 \end{center}
   577 \end{center}
   564 
   578 
   565 \noindent
   579 \noindent
   566 The first line we obtain the result \texttt{Set(('c', "bd"))}, whereas the second
   580 In the first line we obtain the expected result \texttt{Set(('c',
   567 produces \texttt{Set((99, "bd"))}---the character has been transformed into
   581   "bd"))}, whereas the second produces \texttt{Set((99, "bd"))}---the
   568 an ASCII number.
   582 character has been transformed into an ASCII number.
   569 
   583 
   570 A slightly less contrived example is about parsing numbers (recall
   584 A slightly less contrived example is about parsing numbers (recall
   571 \texttt{NumParser} above). However, we want to do this here for strings.
   585 \texttt{NumParser} above). However, we want to do this here for strings.
   572 For this assume we have the following \texttt{RegexParser}.
   586 For this assume we have the following \texttt{RegexParser}.
   573 
   587 
   588 \noindent
   602 \noindent
   589 This parser takes a regex as argument and splits up a string into a
   603 This parser takes a regex as argument and splits up a string into a
   590 prefix and the rest according to this regex
   604 prefix and the rest according to this regex
   591 (\texttt{reg.findPrefixMatchOf} generates a match---in the successful
   605 (\texttt{reg.findPrefixMatchOf} generates a match---in the successful
   592 case---and the corresponding strings can be extracted with
   606 case---and the corresponding strings can be extracted with
   593 \texttt{matched} and \texttt{after}). We can now define a
   607 \texttt{matched} and \texttt{after}). Using this parser we can define a
   594 \texttt{NumParser} for strings as follows:
   608 \texttt{NumParser} for strings as follows:
   595 
   609 
   596 \begin{center}
   610 \begin{center}
   597 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
   611 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
   598 val NumParser = RegexParser("[0-9]+".r)
   612 val NumParser = RegexParser("[0-9]+".r)
   609 \end{lstlisting}
   623 \end{lstlisting}
   610 \end{center}  
   624 \end{center}  
   611 
   625 
   612 \noindent
   626 \noindent
   613 produces \texttt{Set((123,abc))}. The problem is that \texttt{123} is
   627 produces \texttt{Set((123,abc))}. The problem is that \texttt{123} is
   614 still a string. We need to convert it into the corresponding
   628 still a string (the double-quotes are not printed by Scala). We need
   615 \texttt{Int}. We can do this as follows
   629 to convert it into the corresponding \texttt{Int}. We can do this as
       
   630 follows using a semantic action
   616 
   631 
   617 \begin{center}
   632 \begin{center}
   618 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
   633 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
   619 (NumParser ==> (s => s.toInt)).parse("123abc")
   634 (NumParser ==> (s => s.toInt)).parse("123abc")
   620 \end{lstlisting}
   635 \end{lstlisting}
   621 \end{center}  
   636 \end{center}  
   622 
   637 
   623 \noindent
   638 \noindent
   624 The semantic action in form of a function converts a string into an
   639 The function in the semantic action converts a string into an
   625 \texttt{Int}. Let us come back to semantic actions when we are going
   640 \texttt{Int}. Let us come back to semantic actions when we are going
   626 to implement a simple calculator.
   641 to implement actual context-free gammars.
   627 
   642 
   628 \subsubsection*{Shorthand notation for parser combinators}
   643 \subsubsection*{Shorthand notation for parser combinators}
   629 
   644 
   630 Before we proceed, let us just explain the shorthand notation for
   645 Before we proceed, let us just explain the shorthand notation for
   631 parser combinators. Like for regular expressions, the shorthand notation
   646 parser combinators. Like for regular expressions, the shorthand notation
   671    ("b" ~ Pal ~ "b") ==> { case ((x, y), z) => x + y + z } |
   686    ("b" ~ Pal ~ "b") ==> { case ((x, y), z) => x + y + z } |
   672     "a" | "b" | "")
   687     "a" | "b" | "")
   673 \end{lstlisting}
   688 \end{lstlisting}
   674 \end{center}
   689 \end{center}
   675 
   690 
       
   691 \noindent
       
   692 The semantic action
       
   693 
       
   694 
       
   695 \noindent
       
   696 Important to note is that we must define \texttt{Pal}-parser as a
       
   697 \emph{lazy} value.
   676 
   698 
   677 \subsubsection*{Implementing an Interpreter}
   699 \subsubsection*{Implementing an Interpreter}
   678 
   700 
   679 %\bigskip
   701 %\bigskip
   680 %takes advantage of the full generality---have a look
   702 %takes advantage of the full generality---have a look