handouts/ho06.tex
changeset 591 863e502f6a5c
parent 590 c6a1e19e9801
child 592 0d309fafa9f0
equal deleted inserted replaced
590:c6a1e19e9801 591:863e502f6a5c
    18 test when it is empty and ``sequentially'' take it apart. Strings and
    18 test when it is empty and ``sequentially'' take it apart. Strings and
    19 lists fit this bill. However, parser combinators also have their
    19 lists fit this bill. However, parser combinators also have their
    20 drawbacks. For example they require that the grammar to be parsed is
    20 drawbacks. For example they require that the grammar to be parsed is
    21 \emph{not} left-recursive and they are efficient only when the grammar
    21 \emph{not} left-recursive and they are efficient only when the grammar
    22 is unambiguous. It is the responsibility of the grammar designer to
    22 is unambiguous. It is the responsibility of the grammar designer to
    23 ensure these two properties.
    23 ensure these two properties hold.
    24 
    24 
    25 The general idea behind parser combinators is to transform the input
    25 The general idea behind parser combinators is to transform the input
    26 into sets of pairs, like so
    26 into sets of pairs, like so
    27 
    27 
    28 \begin{center}
    28 \begin{center}
    29 $\underbrace{\text{list of tokens}}_{\text{input}}$ 
    29 $\underbrace{\text{list of tokens}}_{\text{input}}$ 
    30 $\Rightarrow$
    30 $\Rightarrow$
    31 $\underbrace{\text{set of (parsed input, unparsed input)}}_{\text{output}}$
    31 $\underbrace{\text{set of (parsed part, unprocessed part)}}_{\text{output}}$
    32 \end{center} 
    32 \end{center} 
    33 
    33 
    34 \noindent
    34 \noindent
    35 Given the extended effort we have spent implementing a lexer in order
    35 Given the extended effort we have spent implementing a lexer in order
    36 to generate list of tokens, it might be surprising that in what
    36 to generate lists of tokens, it might be surprising that in what
    37 follows we shall often use strings as input, rather than list of
    37 follows we shall often use strings as input, rather than lists of
    38 tokens. This is for making the explanation more lucid. It does not
    38 tokens. This is for making the explanation more lucid and for quick
    39 make our previous work on lexers obsolete (remember they transform a
    39 examples. It does not make our previous work on lexers obsolete
    40 string into a list of tokens). Lexers will still be needed for
    40 (remember they transform a string into a list of tokens). Lexers will
    41 building a somewhat realistic compiler.
    41 still be needed for building a somewhat realistic compiler.
    42 
    42 
    43 As mentioned above, parser combinators are relatively agnostic about what
    43 As mentioned above, parser combinators are relatively agnostic about what
    44 kind of input they process. In my Scala code I use the following
    44 kind of input they process. In my Scala code I use the following
    45 polymorphic types for parser combinators:
    45 polymorphic types for parser combinators:
    46 
    46 
    49 \end{center}
    49 \end{center}
    50 
    50 
    51 \noindent That is they take as input something of type \texttt{I} and
    51 \noindent That is they take as input something of type \texttt{I} and
    52 return a set of pairs of type \texttt{Set[(T, I)]}. Since the input
    52 return a set of pairs of type \texttt{Set[(T, I)]}. Since the input
    53 needs to be of ``sequence-kind'', I actually have to often write
    53 needs to be of ``sequence-kind'', I actually have to often write
    54 \texttt{I <\% Seq[\_]} for the input type. This ensures the input is a
    54 \texttt{I <\% Seq[\_]} for the input type. This ensures the
    55 subtype of Scala sequences. The first component of the generated pairs
    55 input is a subtype of Scala sequences. The first component of the
    56 corresponds to what the parser combinator was able to process from the
    56 generated pairs corresponds to what the parser combinator was able to
    57 input and the second is the unprocessed part, or leftover, of the
    57 parse from the input and the second is the unprocessed, or
    58 input (therefore the type of this unprocessed part is the same as the
    58 leftover, part of the input (therefore the type of this unprocessed part is
    59 input). A parser combinator might return more than one such pair; the
    59 the same as the input). A parser combinator might return more than one
    60 idea is that there are potentially several ways of how to parse the
    60 such pair; the idea is that there are potentially several ways of how
    61 input.  As a concrete example, consider the string
    61 to parse the input.  As a concrete example, consider the string
    62 
    62 
    63 \begin{center}
    63 \begin{center}
    64 \tt\Grid{iffoo\VS testbar}
    64 \tt\Grid{iffoo\VS testbar}
    65 \end{center}
    65 \end{center}
    66 
    66 
    73            \left(\texttt{\Grid{iffoo}}\;,\; \texttt{\Grid{\VS testbar}}\right) \right\}$
    73            \left(\texttt{\Grid{iffoo}}\;,\; \texttt{\Grid{\VS testbar}}\right) \right\}$
    74 \end{center}
    74 \end{center}
    75 
    75 
    76 \noindent where the first pair means the parser could recognise
    76 \noindent where the first pair means the parser could recognise
    77 \texttt{if} from the input and leaves the \texttt{foo\VS testbar} as
    77 \texttt{if} from the input and leaves the \texttt{foo\VS testbar} as
    78 `unprocessed' part; in the other case it could recognise
    78 unprocessed part; in the other case it could recognise
    79 \texttt{iffoo} and leaves \texttt{\VS testbar} as unprocessed. If the
    79 \texttt{iffoo} and leaves \texttt{\VS testbar} as unprocessed. If the
    80 parser cannot recognise anything from the input at all, then parser
    80 parser cannot recognise anything from the input at all, then parser
    81 combinators just return the empty set $\{\}$. This will indicate
    81 combinators just return the empty set $\{\}$. This will indicate
    82 something ``went wrong''\ldots or more precisely, nothing could be
    82 something ``went wrong''\ldots or more precisely, nothing could be
    83 parsed.
    83 parsed.
    84 
    84 
    85 Also important to note is that the type \texttt{T} for the processed
    85 Also important to note is that the type \texttt{T} for the processed
    86 part is different from the input type \texttt{I} in the parse. In the
    86 part is different from the input type \texttt{I} in the parse. In the
    87 example above is just happens to be the same. The reason for the
    87 example above is just happens to be the same. The reason for the
    88 potential difference is that in general we are interested in
    88 difference is that in general we are interested in
    89 transforming our input into something ``different''\ldots for example
    89 transforming our input into something ``different''\ldots for example
    90 into a tree; or if we implement the grammar for arithmetic
    90 into a tree; or if we implement the grammar for arithmetic
    91 expressions, we might be interested in the actual integer number the
    91 expressions, we might be interested in the actual integer number the
    92 arithmetic expression, say \texttt{1 + 2 * 3}, stands for. In this way
    92 arithmetic expression, say \texttt{1 + 2 * 3}, stands for. In this way
    93 we can use parser combinators to implement relatively easily a
    93 we can use parser combinators to implement relatively easily a
    94 calculator, for instance.
    94 calculator, for instance.
    95 
    95 
    96 The main idea of parser combinators is that we can easily build parser
    96 The main idea of parser combinators is that we can easily build parser
    97 combinators out of smaller components following very closely the
    97 combinators out of smaller components following very closely the
    98 structure of a grammar. In order to implement this in an
    98 structure of a grammar. In order to implement this in a
    99 object-oriented programming language, like Scala, we need to specify
    99 functional/object-oriented programming language, like Scala, we need
   100 an abstract class for parser combinators. This abstract class states
   100 to specify an abstract class for parser combinators. In the abstract
   101 that the function \texttt{parse} takes an argument of type \texttt{I}
   101 class we specify that \texttt{I} is the \emph{input type} of the
   102 and returns a set of type \mbox{\texttt{Set[(T, I)]}}.
   102 parser combinator and that \texttt{T} is the \emph{ouput type}.  This
       
   103 implies that the function \texttt{parse} takes an argument of type
       
   104 \texttt{I} and returns a set of type \mbox{\texttt{Set[(T, I)]}}.
   103 
   105 
   104 \begin{center}
   106 \begin{center}
   105 \begin{lstlisting}[language=Scala]
   107 \begin{lstlisting}[language=Scala]
   106 abstract class Parser[I, T] {
   108 abstract class Parser[I, T] {
   107   def parse(in: I) : Set[(T, I)]
   109   def parse(in: I) : Set[(T, I)]
   111       yield head
   113       yield head
   112 }
   114 }
   113 \end{lstlisting}
   115 \end{lstlisting}
   114 \end{center}
   116 \end{center}
   115 
   117 
   116 \noindent It is the obligation in each instance (parser combinator) to
   118 \noindent It is the obligation in each instance of this class to
   117 supply an implementation for \texttt{parse}.  From this function we
   119 supply an implementation for \texttt{parse}.  From this function we
   118 can then ``centrally'' derive the function \texttt{parse\_all}, which
   120 can then ``centrally'' derive the function \texttt{parse\_all}, which
   119 just filters out all pairs whose second component is not empty (that
   121 just filters out all pairs whose second component is not empty (that
   120 is has still some unprocessed part). The reason is that at the end of
   122 is has still some unprocessed part). The reason is that at the end of
   121 the parsing we are only interested in the results where all the input
   123 the parsing we are only interested in the results where all the input
   253 \end{tabular}
   255 \end{tabular}
   254 \end{center}
   256 \end{center}
   255 
   257 
   256 \noindent We receive in the first two cases a successful
   258 \noindent We receive in the first two cases a successful
   257 output (that is a non-empty set). In each case, either
   259 output (that is a non-empty set). In each case, either
   258 \pcode{a} or \pcode{b} is in the processed part, and
   260 \pcode{a} or \pcode{b} is in the parsed part, and
   259 \pcode{cde} in the unprocessed part. Clearly this parser cannot
   261 \pcode{cde} in the unprocessed part. Clearly this parser cannot
   260 parse anything with \pcode{ccde}, therefore the empty
   262 parse anything with \pcode{ccde}, therefore the empty
   261 set is returned.
   263 set is returned.
   262 
   264 
   263 A bit more interesting is the \emph{sequence parser combinator}. Given
   265 A bit more interesting is the \emph{sequence parser combinator}. Given
   264 two parsers, say again, $p$ and $q$, we want to apply first the input
   266 two parsers, say again, $p$ and $q$, we want to apply first the input
   265 to $p$ producing a set of pairs; then apply $q$ to all the unparsed  
   267 to $p$ producing a set of pairs; then apply $q$ to all the unparsed  
   266 parts in the pairs; and then combine the results. Mathematically we would
   268 parts in the pairs; and then combine the results. Mathematically we would
   267 write something like this for the result set of pairs:
   269 write something like this for the set of pairs:
   268 
   270 
   269 \begin{center}
   271 \begin{center}
   270 \begin{tabular}{lcl}
   272 \begin{tabular}{lcl}
   271 $\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\,|\,$ & 
   273 $\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\,|\,$ & 
   272 $(\textit{output}_1, u_1) \in p(\text{input}) 
   274 $(\textit{output}_1, u_1) \in p(\text{input}) 
   275 \end{tabular}
   277 \end{tabular}
   276 \end{center}
   278 \end{center}
   277 
   279 
   278 \noindent Notice that the $p$ will first be run on the input,
   280 \noindent Notice that the $p$ will first be run on the input,
   279 producing pairs of the form $(\textit{output}_1, u_1)$ where the $u_1$
   281 producing pairs of the form $(\textit{output}_1, u_1)$ where the $u_1$
   280 stands for the unprocessed, or leftover, parts pf $p$. We want that
   282 stands for the unprocessed, or leftover, parts of $p$. We want that
   281 $q$ runs on all these unprocessed parts $u_1$. Therefore these
   283 $q$ runs on all these unprocessed parts $u_1$. Therefore these
   282 unprocessed parts are fed into the second parser $q$. The overall
   284 unprocessed parts are fed into the second parser $q$. The overall
   283 result of the sequence parser combinator is pairs of the form
   285 result of the sequence parser combinator is pairs of the form
   284 $((\textit{output}_1, \textit{output}_2), u_2)$. This means the
   286 $((\textit{output}_1, \textit{output}_2), u_2)$. This means the
   285 unprocessed part of the sequqnce p[arser combinator is the unprocessed
   287 unprocessed part of the sequqnce parser combinator is the unprocessed
   286 part the second parser $q$ leaves as leftover. The processed parts of
   288 part the second parser $q$ leaves as leftover. The parsed parts of the
   287 of the component parsers is a pair consisting of the outputs of $p$
   289 component parsers are combined in a pair, namely
   288 and $q$, namely $(\textit{output}_1, \textit{output}_2)$. This
   290 $(\textit{output}_1, \textit{output}_2)$. The reason is we want to
   289 behaviour can be implemented in Scala as follows:
   291 know what $p$ and $q$ were able to parse. This behaviour can be
       
   292 implemented in Scala as follows:
   290 
   293 
   291 \begin{center}
   294 \begin{center}
   292 \begin{lstlisting}[language=Scala,numbers=none]
   295 \begin{lstlisting}[language=Scala,numbers=none]
   293 class SeqParser[I, T, S]
   296 class SeqParser[I, T, S]
   294        (p: => Parser[I, T], 
   297        (p: => Parser[I, T], 
   300 }
   303 }
   301 \end{lstlisting}
   304 \end{lstlisting}
   302 \end{center}
   305 \end{center}
   303 
   306 
   304 \noindent This parser takes again as arguments two parsers, \texttt{p}
   307 \noindent This parser takes again as arguments two parsers, \texttt{p}
   305 and \texttt{q}. It implements \texttt{parse} as follows: let first run
   308 and \texttt{q}. It implements \texttt{parse} as follows: first run the
   306 the parser \texttt{p} on the input producing a set of pairs
   309 parser \texttt{p} on the input producing a set of pairs
   307 (\texttt{output1}, \texttt{u1}). The \texttt{u1} stands for the
   310 (\texttt{output1}, \texttt{u1}). The \texttt{u1} stands for the
   308 unprocessed parts left over by \texttt{p}. Let then \texttt{q} run on these
   311 unprocessed parts left over by \texttt{p} (recall that there can be
   309 unprocessed parts producing again a set of pairs. The output of the
   312 several such pairs). Let then \texttt{q} run on these unprocessed
   310 sequence parser combinator is then a set containing pairs where the
   313 parts producing again a set of pairs. The output of the sequence
   311 first components are again pairs, namely what the first parser could
   314 parser combinator is then a set containing pairs where the first
   312 parse together with what the second parser could parse; the second
   315 components are again pairs, namely what the first parser could parse
   313 component is the unprocessed part left over after running the second
   316 together with what the second parser could parse; the second component
   314 parser \texttt{q}. Therefore the input type of the sequence parser
   317 is the unprocessed part left over after running the second parser
   315 combinator is as usual \texttt{I}, but the output type is
   318 \texttt{q}. Note that the input type of the sequence parser combinator
       
   319 is as usual \texttt{I}, but the output type is
   316 
   320 
   317 \begin{center}
   321 \begin{center}
   318 \texttt{(T, S)}
   322 \texttt{(T, S)}
   319 \end{center}
   323 \end{center}
   320 
   324 
   321 \noindent
   325 \noindent
   322 This means \texttt{parse} in the sequence parser combinator returns
   326 Consequently, the function \texttt{parse} in the sequence parser
   323 sets of type \texttt{Set[((T, S), I)]}.  Notice that we have
   327 combinator returns sets of type \texttt{Set[((T, S), I)]}.  That means
   324 essentially two output types for the sequence parser combinator
   328 we have essentially two output types for the sequence parser
   325 (packaged in a pair), because in general \textit{p} and \textit{q}
   329 combinator (packaged in a pair), because in general \textit{p} and
   326 might produce different things (for example first we recognise a
   330 \textit{q} might produce different things (for example we recognise a
   327 number and then a string corresponding to an operator).  If any of the
   331 number with \texttt{p} and then with \texttt{q} a string corresponding
   328 runs of \textit{p} and \textit{q} fail, that is produce the empty set,
   332 to an operator).  If any of the runs of \textit{p} and \textit{q}
   329 then \texttt{parse} will also produce the empty set.
   333 fail, that is produce the empty set, then \texttt{parse} will also
       
   334 produce the empty set.
   330 
   335 
   331 With the shorthand notation we shall introduce later for the sequence
   336 With the shorthand notation we shall introduce later for the sequence
   332 parser combinator, we can write for example \pcode{"a" ~ "b"}, which
   337 parser combinator, we can write for example \pcode{"a" ~ "b"}, which
   333 is the parser combinator that first recognises the character
   338 is the parser combinator that first recognises the character
   334 \texttt{a} from a string and then \texttt{b}. Let us look again at
   339 \texttt{a} from a string and then \texttt{b}. Let us look again at
   335 three examples of how this parser combinator processes some strings:
   340 some examples of how this parser combinator processes some strings:
   336 
   341 
   337 \begin{center}
   342 \begin{center}
   338 \begin{tabular}{rcl}
   343 \begin{tabular}{rcl}
   339 input strings & & output\medskip\\
   344 input strings & & output\medskip\\
   340 \texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{cde}})\right\}$\\
   345 \texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{cde}})\right\}$\\
   383 
   388 
   384 \noindent The second and third example fail, because something is
   389 \noindent The second and third example fail, because something is
   385 ``missing'' in the sequence we are looking for. The first succeeds but
   390 ``missing'' in the sequence we are looking for. The first succeeds but
   386 notice how the results nest with sequences: the parsed part is a
   391 notice how the results nest with sequences: the parsed part is a
   387 nested pair of the form \pcode{((a, b), c)}. If we nest the sequence
   392 nested pair of the form \pcode{((a, b), c)}. If we nest the sequence
   388 parser differently, for example \pcode{"a" ~ ("b" ~ "c")}, then also
   393 parser differently, say \pcode{"a" ~ ("b" ~ "c")}, then also
   389 our output pairs nest differently
   394 our output pairs nest differently
   390 
   395 
   391 \begin{center}
   396 \begin{center}
   392 \begin{tabular}{rcl}
   397 \begin{tabular}{rcl}
   393 input strings & & output\medskip\\
   398 input strings & & output\medskip\\
   405 \texttt{\Grid{aaaa}} & $\rightarrow$ & 
   410 \texttt{\Grid{aaaa}} & $\rightarrow$ & 
   406 $\left\{(((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}})\right\}$\\
   411 $\left\{(((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}})\right\}$\\
   407 \end{tabular}
   412 \end{tabular}
   408 \end{center}
   413 \end{center}
   409 
   414 
   410 \noindent Notice how again the results nest deeper and deeper as pairs (the
   415 \noindent Notice again how the results nest deeper and deeper as pairs (the
   411 last \pcode{a} is in the unprocessed part). To consume everything of
   416 last \pcode{a} is in the unprocessed part). To consume everything of
   412 this string we can use the parser \pcode{(("a" ~ "a") ~ "a") ~
   417 this string we can use the parser \pcode{(("a" ~ "a") ~ "a") ~
   413   "a"}. Then the output is as follows:
   418   "a"}. Then the output is as follows:
   414 
   419 
   415 \begin{center}
   420 \begin{center}
   437 nesting of the component parsers.
   442 nesting of the component parsers.
   438 
   443 
   439 
   444 
   440 Consider also carefully that constructing a parser such \pcode{"a" |
   445 Consider also carefully that constructing a parser such \pcode{"a" |
   441   ("a" ~ "b")} will result in a typing error. The intention with this
   446   ("a" ~ "b")} will result in a typing error. The intention with this
   442 parser is that we want to parse an \texttt{a}, or an \texttt{a}
   447 parser is that we want to parse either an \texttt{a}, or an \texttt{a}
   443 followed by a \texttt{b}. However, the first parser has as output type
   448 followed by a \texttt{b}. However, the first parser has as output type
   444 a single character (recall the type of \texttt{CharParser}), but the
   449 a single character (recall the type of \texttt{CharParser}), but the
   445 second parser produces a pair of characters as output. The alternative
   450 second parser produces a pair of characters as output. The alternative
   446 parser is required to have both component parsers to have the same
   451 parser is required to have both component parsers to have the same
   447 type---we need to be able to build the union of two sets, which means
   452 type---the reason is that we need to be able to build the union of two
   448 in Scala they need to be of the same type.  Since ther are not, there
   453 sets, which requires in Scala that the sets have the same type.  Since
   449 is a typing error in this example.  We will see later how we can build
   454 they are not in this case, there is a typing error.  We will see later
   450 this parser without the typing error.
   455 how we can build this parser without the typing error.
   451 
   456 
   452 The next parser combinator, called \emph{semantic action}, does not
   457 The next parser combinator, called \emph{semantic action}, does not
   453 actually combine smaller parsers, but applies a function to the result
   458 actually combine two smaller parsers, but applies a function to the result
   454 of a parser.  It is implemented in Scala as follows
   459 of a parser.  It is implemented in Scala as follows
   455 
   460 
   456 \begin{center}
   461 \begin{center}
   457 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
   462 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
   458 class FunParser[I, T, S]
   463 class FunParser[I, T, S]
   471 produces sets of type \texttt{Set[(T, I)]}. The semantic action
   476 produces sets of type \texttt{Set[(T, I)]}. The semantic action
   472 combinator then applies the function \texttt{f} to all the `processed'
   477 combinator then applies the function \texttt{f} to all the `processed'
   473 parser outputs. Since this function is of type \texttt{T => S}, we
   478 parser outputs. Since this function is of type \texttt{T => S}, we
   474 obtain a parser with output type \texttt{S}. Again Scala lets us
   479 obtain a parser with output type \texttt{S}. Again Scala lets us
   475 introduce some shorthand notation for this parser
   480 introduce some shorthand notation for this parser
   476 combinator. Therefore we will write \texttt{p ==> f} for it.
   481 combinator. Therefore we will write short \texttt{p ==> f} for it.
   477 
   482 
   478 What are semantic actions good for? Well, they allow you to transform
   483 What are semantic actions good for? Well, they allow you to transform
   479 the parsed input into datastructures you can use for further
   484 the parsed input into datastructures you can use for further
   480 processing. A simple example would be to transform parsed characters
   485 processing. A simple (contrived) example would be to transform parsed
   481 into ASCII numbers. Suppose we define a function \texttt{f} (from
   486 characters into ASCII numbers. Suppose we define a function \texttt{f}
   482 characters to ints) and use a \texttt{CharParser} for parsing
   487 (from characters to \texttt{Int}s) and use a \texttt{CharParser} for parsing
   483 the character \texttt{c}.
   488 the character \texttt{c}.
       
   489 
   484 
   490 
   485 \begin{center}
   491 \begin{center}
   486 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
   492 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
   487 val f = (c: Char) => c.toInt
   493 val f = (c: Char) => c.toInt
   488 val c = new CharParser('c')
   494 val c = new CharParser('c')
   503 In the first line we obtain the expected result \texttt{Set(('c',
   509 In the first line we obtain the expected result \texttt{Set(('c',
   504   "bd"))}, whereas the second produces \texttt{Set((99, "bd"))}---the
   510   "bd"))}, whereas the second produces \texttt{Set((99, "bd"))}---the
   505 character has been transformed into an ASCII number.
   511 character has been transformed into an ASCII number.
   506 
   512 
   507 A slightly less contrived example is about parsing numbers (recall
   513 A slightly less contrived example is about parsing numbers (recall
   508 \texttt{NumParser} above). However, we want to do this here for strings.
   514 \texttt{NumParser} above). However, we want to do this here for
   509 For this assume we have the following \texttt{RegexParser}.
   515 strings, not for tokens.  For this assume we have the following
       
   516 (atomic) \texttt{RegexParser}.
   510 
   517 
   511 \begin{center}
   518 \begin{center}
   512   \begin{lstlisting}[language=Scala,xleftmargin=0mm,
   519   \begin{lstlisting}[language=Scala,xleftmargin=0mm,
   513     basicstyle=\small\ttfamily, numbers=none]
   520     basicstyle=\small\ttfamily, numbers=none]
   514 import scala.util.matching.Regex
   521 import scala.util.matching.Regex
   525 \noindent
   532 \noindent
   526 This parser takes a regex as argument and splits up a string into a
   533 This parser takes a regex as argument and splits up a string into a
   527 prefix and the rest according to this regex
   534 prefix and the rest according to this regex
   528 (\texttt{reg.findPrefixMatchOf} generates a match---in the successful
   535 (\texttt{reg.findPrefixMatchOf} generates a match---in the successful
   529 case---and the corresponding strings can be extracted with
   536 case---and the corresponding strings can be extracted with
   530 \texttt{matched} and \texttt{after}). Using this parser we can define a
   537 \texttt{matched} and \texttt{after}). The input and output type for
   531 \texttt{NumParser} for strings as follows:
   538 this parser is \texttt{String}. Using \texttt{RegexParser} we can
       
   539 define a \texttt{NumParser} for \texttt{Strings} to \texttt{Int} as
       
   540 follows:
   532 
   541 
   533 \begin{center}
   542 \begin{center}
   534 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
   543 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
   535 val NumParser = RegexParser("[0-9]+".r)
   544 val NumParser = RegexParser("[0-9]+".r)
   536 \end{lstlisting}
   545 \end{lstlisting}
   537 \end{center}
   546 \end{center}
   538 
   547 
   539 \noindent
   548 \noindent
   540 This parser will recognise a number at the beginning of a string, for
   549 This parser will recognise a number at the beginning of a string. For
   541 example
   550 example
   542 
   551 
   543 \begin{center}
   552 \begin{center}
   544 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
   553 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
   545 NumParser.parse("123abc")
   554 NumParser.parse("123abc")
   558 \end{lstlisting}
   567 \end{lstlisting}
   559 \end{center}  
   568 \end{center}  
   560 
   569 
   561 \noindent
   570 \noindent
   562 The function in the semantic action converts a string into an
   571 The function in the semantic action converts a string into an
   563 \texttt{Int}. Let us come back to semantic actions when we are going
   572 \texttt{Int}. Now \texttt{parse} generates \texttt{Set((123,abc))},
   564 to implement actual context-free gammars.
   573 but this time \texttt{123} is an \texttt{Int}. Let us come back to
       
   574 semantic actions when we are going to implement actual context-free
       
   575 gammars.
   565 
   576 
   566 \subsubsection*{Shorthand notation for parser combinators}
   577 \subsubsection*{Shorthand notation for parser combinators}
   567 
   578 
   568 Before we proceed, let us just explain the shorthand notation for
   579 Before we proceed, let us just explain the shorthand notation for
   569 parser combinators. Like for regular expressions, the shorthand notation
   580 parser combinators. Like for regular expressions, the shorthand notation
   570 will make our life much easier when writing actual parsers. We can define
   581 will make our life much easier when writing actual parsers. We can define
   571 some implicits which allow us to write \pcode{p | q}, \pcode{p ~ q} and
   582 some implicits which allow us to write
   572 \pcode{p ==> f} as well as to use plain strings for specifying simple
   583 
   573 string parsers.
   584 \begin{center}
       
   585 \begin{tabular}{ll}  
       
   586   \pcode{p | q} & alternative parser\\
       
   587   \pcode{p ~ q} & sequence parser\\ 
       
   588   \pcode{p ==> f} & semantic action parser
       
   589 \end{tabular}
       
   590 \end{center}
       
   591 
       
   592 \noindent
       
   593 as well as to use plain strings for specifying simple string parsers.
   574 
   594 
   575 The idea is that this shorthand notation allows us to easily translate
   595 The idea is that this shorthand notation allows us to easily translate
   576 context-free grammars into code. For example recall our context-free
   596 context-free grammars into code. For example recall our context-free
   577 grammar for palindromes:
   597 grammar for palindromes:
   578 
   598 
   579 \begin{plstx}[margin=3cm]
   599 \begin{plstx}[margin=3cm]
   580 : \meta{P} ::=  a\cdot \meta{P}\cdot a | b\cdot \meta{P}\cdot b | a | b | \epsilon\\
   600 : \meta{Pal} ::=  a\cdot \meta{Pal}\cdot a | b\cdot \meta{Pal}\cdot b | a | b | \epsilon\\
   581 \end{plstx}
   601 \end{plstx}
   582 
   602 
   583 \noindent
   603 \noindent
   584 Each alternative in this grammar translates into an alternative parser
   604 Each alternative in this grammar translates into an alternative parser
   585 combinator.  The $\cdot$ can be translated to a sequence parser
   605 combinator.  The $\cdot$ can be translated to a sequence parser
   590 \subsubsection*{How to build parsers using parser combinators?}
   610 \subsubsection*{How to build parsers using parser combinators?}
   591 
   611 
   592 The beauty of parser combinators is the ease with which they can be
   612 The beauty of parser combinators is the ease with which they can be
   593 implemented and how easy it is to translate context-free grammars into
   613 implemented and how easy it is to translate context-free grammars into
   594 code (though the grammars need to be non-left-recursive). To
   614 code (though the grammars need to be non-left-recursive). To
   595 demonstrate this recall the grammar for palindromes from above.
   615 demonstrate this consider again the grammar for palindromes from above.
   596 The first idea would be to translate it into the following code
   616 The first idea would be to translate it into the following code
   597 
   617 
   598 \begin{center}
   618 \begin{center}
   599 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
   619 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
   600 lazy val Pal : Parser[String, String] = 
   620 lazy val Pal : Parser[String, String] = 
   605 \noindent
   625 \noindent
   606 Unfortunately, this does not quite work yet as it produces a typing
   626 Unfortunately, this does not quite work yet as it produces a typing
   607 error. The reason is that the parsers \texttt{"a"}, \texttt{"b"} and
   627 error. The reason is that the parsers \texttt{"a"}, \texttt{"b"} and
   608 \texttt{""} all produce strings as output type and therefore can be
   628 \texttt{""} all produce strings as output type and therefore can be
   609 put into an alternative \texttt{...| "a" | "b" | ""}. But both
   629 put into an alternative \texttt{...| "a" | "b" | ""}. But both
   610 \pcode{"a" ~ Pal ~ "a"} and \pcode{"b" ~ Pal ~ "b"} produce pairs of
   630 sequence parsers \pcode{"a" ~ Pal ~ "a"} and \pcode{"b" ~ Pal ~ "b"}
   611 the form $(((\_, \_), \_), \_)$---that is how the sequence parser
   631 produce pairs of the form
   612 combinator nests results when \pcode{\~} is used between two
   632 
   613 components. The solution is to use a semantic action that ``flattens''
   633 \begin{center}
   614 these pairs and appends the corresponding strings, like
   634 (((\texttt{a}-part, \texttt{Pal}-part), \texttt{a}-part), unprocessed part)
       
   635 \end{center}
       
   636 
       
   637 \noindent That is how the
       
   638 sequence parser combinator nests results when \pcode{\~} is used
       
   639 between two components. The solution is to use a semantic action that
       
   640 ``flattens'' these pairs and appends the corresponding strings, like
   615 
   641 
   616 \begin{center}
   642 \begin{center}
   617 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
   643 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
   618 lazy val Pal : Parser[String, String] =  
   644 lazy val Pal : Parser[String, String] =  
   619   (("a" ~ Pal ~ "a") ==> { case ((x, y), z) => x + y + z } |
   645   (("a" ~ Pal ~ "a") ==> { case ((x, y), z) => x + y + z } |
   621     "a" | "b" | "")
   647     "a" | "b" | "")
   622 \end{lstlisting}
   648 \end{lstlisting}
   623 \end{center}
   649 \end{center}
   624 
   650 
   625 \noindent
   651 \noindent
   626 Now in all cases we have strings as output type for the parser variants.
   652 How does this work? Well, recall again what the pairs look like for
   627 The semantic action
   653 the parser \pcode{"a" ~ Pal ~ "a"}.  The pattern in the semantic
   628 
   654 action matches the nested pairs (the \texttt{x} with the
   629 
   655 \texttt{a}-part and so on).  Unfortunately when we have such nested
   630 \noindent
   656 pairs, Scala requires us to define the function using the
   631 Important to note is that we must define \texttt{Pal}-parser as a
   657 \pcode{case}-syntax
   632 \emph{lazy} value.
   658 
       
   659 \begin{center}
       
   660 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
       
   661 { case ((x, y), z) => ... }
       
   662 \end{lstlisting}
       
   663 \end{center}
       
   664 
       
   665 \noindent
       
   666 If we have more sequence parser combinators or have them differently nested,
       
   667 then the pattern in the semantic action needs to be adjusted accordingly.
       
   668 The action we implement above is to concatenate all three strings, which
       
   669 means after the semantic action is applied the output type of the parser 
       
   670 is \texttt{String}, which means it fits with the alternative parsers
       
   671 \texttt{...| "a" | "b" | ""}.
       
   672 
       
   673 If we run the parser above with \pcode{Pal.parse_all("abaaaba")} we obtain
       
   674 as result the \pcode{Set(abaaaba)}, which indicates that the string is a palindrom
       
   675 (an empty set would mean something is wrong). But also notice what the
       
   676 intermediate results are generated by \pcode{Pal.parse("abaaaba")}
       
   677 
       
   678 \begin{center}
       
   679 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
       
   680 Set((abaaaba,""),(aba,aaba), (a,baaaba), ("",abaaaba))
       
   681 \end{lstlisting}
       
   682 \end{center}
       
   683 
       
   684 \noindent
       
   685 That there are more than one output might be slightly unexpected, but
       
   686 can be explained as follows: the pairs represent all possible
       
   687 (partial) parses of the string \pcode{"abaaaba"}. The first pair above
       
   688 correesponds to a complete parse (all output is consumed) and this is
       
   689 what \pcode{Pal.parse_all} returns. The second pair is a small
       
   690 ``sub-palindrome'' that can also be parsed, but the parse fails with
       
   691 the rest \pcode{aaba}, which is therefore left as unprocessed. The
       
   692 third one is an attempt to parse the whole string with the
       
   693 single-character parser \pcode{a}. That of course only partially
       
   694 succeeds, by leaving \pcode{"baaaba"} as the unprocessed
       
   695 part. Finally, since we allow the empty string to be a palindrom we
       
   696 also obtain the last pair, where actually nothing is consumed from the
       
   697 input string. While all this works as intended, we need to be careful
       
   698 with this (especially with including the \pcode{""} parser in our
       
   699 grammar): if during parsing the set of parsing attempts gets too big,
       
   700 then the parsing process can become very slow as the potential
       
   701 candidates for applying rules can snowball.
       
   702 
       
   703 
       
   704 Important is also to note is that we must define the
       
   705 \texttt{Pal}-parser as a \emph{lazy} value in Scala. Look again at the
       
   706 code: \texttt{Pal} occurs on the right-hand side of the definition. If we had
       
   707 just written
       
   708 
       
   709 \begin{center}
       
   710 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
       
   711 val Pal : Parser[String, String] =  ...rhs...
       
   712 \end{lstlisting}
       
   713 \end{center}
       
   714 
       
   715 \noindent
       
   716 then Scala before making this assignemnt to \texttt{Pal} attempts to
       
   717 find out what the expression on the right-hand side evaluates to. This
       
   718 is straightforward in case of simple expressions \texttt{2 + 3}, but
       
   719 the expression above contains \texttt{Pal} in the right-hand
       
   720 side. Without \pcode{lazy} it would try to evaluate what \texttt{Pal}
       
   721 evaluates to and start a new recursion, which means it falls into an
       
   722 infinite loop. The definition of \texttt{Pal} is recursive and the
       
   723 \pcode{lazy} key-word prevents it from being fully evaluated. Therefore
       
   724 whenever we want to define a recursive parser we have to write
       
   725 
       
   726 \begin{center}
       
   727 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
       
   728 lazy val SomeParser : Parser[...,...] =  ...rhs...
       
   729 \end{lstlisting}
       
   730 \end{center}
       
   731 
       
   732 \noindent That was not necessary for our atomic parsers, like
       
   733 \texttt{RegexParser} or \texttt{CharParser}, because they are not recursive.
       
   734 Note that this is also the reason why we had to write
       
   735 
       
   736 \begin{center}
       
   737 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
       
   738 class AltParser[I, T]
       
   739        (p: => Parser[I, T], 
       
   740         q: => Parser[I, T]) extends Parser[I, T] {...}
       
   741 
       
   742 class SeqParser[I, T, S]
       
   743        (p: => Parser[I, T], 
       
   744         q: => Parser[I, S]) extends Parser[I, (T, S)] {...}
       
   745 \end{lstlisting}
       
   746 \end{center}
       
   747 
       
   748 \noindent where the \texttt{\textbf{\textcolor{codepurple}{=>}}} in front of
       
   749 the argument types for \texttt{p} and \texttt{q} prevent Scala from
       
   750 evaluating the arguments. Normally, Scala would first evaluate what
       
   751 kind of parsers \texttt{p} and \texttt{q} are, and only then generate
       
   752 the alternative parser combinator, repsectively sequence parser
       
   753 combinator. Since the argumants can be recursive parsers, such as
       
   754 \texttt{Pal}, this would lead again to an infinite loop.
       
   755 
       
   756 As a final example in this section, let us consider the grammar for
       
   757 well-nested parentheses:
       
   758 
       
   759 \begin{plstx}[margin=3cm]
       
   760 : \meta{P} ::=  (\cdot \meta{P}\cdot ) \cdot \meta{P} | \epsilon\\
       
   761 \end{plstx}
       
   762 
       
   763 \noindent
       
   764 Let us assume we want to not just recognise strings of
       
   765 well-nested parentheses but also transfrom round parentheses
       
   766 into curly braces. We can do this by using a semantic
       
   767 action:
       
   768 
       
   769 \begin{center}
       
   770   \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily,
       
   771     xleftmargin=0mm, numbers=none]
       
   772 lazy val P : Parser[String, String] = 
       
   773   "(" ~ P ~ ")" ~ P ==> { case (((_,x),_),y) => "{" + x + "}" + y } | ""
       
   774 \end{lstlisting}
       
   775 \end{center}
       
   776 
       
   777 \noindent
       
   778 Here we define a function where which ignores the parentheses in the
       
   779 pairs, but replaces them in the right places with curly braces when
       
   780 assembling the new string in the right-hand side. If we run
       
   781 \pcode{P.parse_all("(((()()))())")} we obtain
       
   782 \texttt{Set(\{\{\{\{\}\{\}\}\}\{\}\})} as expected.
       
   783 
       
   784 
   633 
   785 
   634 \subsubsection*{Implementing an Interpreter}
   786 \subsubsection*{Implementing an Interpreter}
   635 
   787 
   636 %\bigskip
   788 %\bigskip
   637 %takes advantage of the full generality---have a look
   789 %takes advantage of the full generality---have a look