handouts/ho06.tex
changeset 590 c6a1e19e9801
parent 589 0451b8b67f62
child 591 863e502f6a5c
equal deleted inserted replaced
589:0451b8b67f62 590:c6a1e19e9801
    30 $\Rightarrow$
    30 $\Rightarrow$
    31 $\underbrace{\text{set of (parsed input, unparsed input)}}_{\text{output}}$
    31 $\underbrace{\text{set of (parsed input, unparsed input)}}_{\text{output}}$
    32 \end{center} 
    32 \end{center} 
    33 
    33 
    34 \noindent
    34 \noindent
    35 Given the extended effort we have spent implementing a lexer
    35 Given the extended effort we have spent implementing a lexer in order
    36 in order to generate list of tokens, it might be surprising that in
    36 to generate list of tokens, it might be surprising that in what
    37 what follows we shall often use strings as input. This is for making
    37 follows we shall often use strings as input, rather than list of
    38 the explanation more lucid. It does not make our previous work on
    38 tokens. This is for making the explanation more lucid. It does not
    39 lexers obsolete (remember they transform a string into a list of
    39 make our previous work on lexers obsolete (remember they transform a
    40 tokens). Lexers will still be needed for building a somewhat realistic
    40 string into a list of tokens). Lexers will still be needed for
    41 compiler.
    41 building a somewhat realistic compiler.
    42 
    42 
    43 But as said, parser combinators are relatively agnostic about what
    43 As mentioned above, parser combinators are relatively agnostic about what
    44 kind of input they process. In my Scala code I use the following
    44 kind of input they process. In my Scala code I use the following
    45 polymorphic types for parser combinators:
    45 polymorphic types for parser combinators:
    46 
    46 
    47 \begin{center}
    47 \begin{center}
    48 input:\;\; \texttt{I}  \qquad output:\;\; \texttt{T}
    48 input:\;\; \texttt{I}  \qquad output:\;\; \texttt{T}
    49 \end{center}
    49 \end{center}
    50 
    50 
    51 \noindent That is they take as input something of type \texttt{I} and
    51 \noindent That is they take as input something of type \texttt{I} and
    52 return a set of pairs of type \texttt{Set[(T, I)]}. Since the input needs
    52 return a set of pairs of type \texttt{Set[(T, I)]}. Since the input
    53 to be of ``sequence-kind'', I actually have to often write \texttt{I
    53 needs to be of ``sequence-kind'', I actually have to often write
    54   <\% Seq[\_]} for the input type. This ensures the input is a
    54 \texttt{I <\% Seq[\_]} for the input type. This ensures the input is a
    55 subtype of Scala sequences. The first component of the generated pairs
    55 subtype of Scala sequences. The first component of the generated pairs
    56 corresponds to what the parser combinator was able to process from the
    56 corresponds to what the parser combinator was able to process from the
    57 input and the second is the unprocessed part of the input (therefore
    57 input and the second is the unprocessed part, or leftover, of the
    58 the type of this unprocessed part is the same as the input). As we
    58 input (therefore the type of this unprocessed part is the same as the
    59 shall see shortly, a parser combinator might return more than one such
    59 input). A parser combinator might return more than one such pair; the
    60 pair; the idea is that there are potentially several ways of how to
    60 idea is that there are potentially several ways of how to parse the
    61 parse the input.  As a concrete example, consider the string
    61 input.  As a concrete example, consider the string
    62 
    62 
    63 \begin{center}
    63 \begin{center}
    64 \tt\Grid{iffoo\VS testbar}
    64 \tt\Grid{iffoo\VS testbar}
    65 \end{center}
    65 \end{center}
    66 
    66 
    72 $\left\{ \left(\texttt{\Grid{if}}\;,\; \texttt{\Grid{foo\VS testbar}}\right), 
    72 $\left\{ \left(\texttt{\Grid{if}}\;,\; \texttt{\Grid{foo\VS testbar}}\right), 
    73            \left(\texttt{\Grid{iffoo}}\;,\; \texttt{\Grid{\VS testbar}}\right) \right\}$
    73            \left(\texttt{\Grid{iffoo}}\;,\; \texttt{\Grid{\VS testbar}}\right) \right\}$
    74 \end{center}
    74 \end{center}
    75 
    75 
    76 \noindent where the first pair means the parser could recognise
    76 \noindent where the first pair means the parser could recognise
    77 \texttt{if} from the input and leaves the rest as `unprocessed' as the
    77 \texttt{if} from the input and leaves the \texttt{foo\VS testbar} as
    78 second component of the pair; in the other case it could recognise
    78 `unprocessed' part; in the other case it could recognise
    79 \texttt{iffoo} and leaves \texttt{\VS testbar} as unprocessed. If the
    79 \texttt{iffoo} and leaves \texttt{\VS testbar} as unprocessed. If the
    80 parser cannot recognise anything from the input at all, then parser
    80 parser cannot recognise anything from the input at all, then parser
    81 combinators just return the empty set $\{\}$. This will indicate
    81 combinators just return the empty set $\{\}$. This will indicate
    82 something ``went wrong''\ldots or more precisely, nothing could be
    82 something ``went wrong''\ldots or more precisely, nothing could be
    83 parsed.
    83 parsed.
    84 
    84 
    85 Also important to note is that the type \texttt{T} for the processed
    85 Also important to note is that the type \texttt{T} for the processed
    86 part is different from the input type. In the example above is just
    86 part is different from the input type \texttt{I} in the parse. In the
    87 happens to be the same. The reason for the difference is that in
    87 example above is just happens to be the same. The reason for the
    88 general we are interested in transforming our input into something
    88 potential difference is that in general we are interested in
    89 ``different''\ldots for example into a tree; or if we implement the
    89 transforming our input into something ``different''\ldots for example
    90 grammar for arithmetic expressions, we might be interested in the
    90 into a tree; or if we implement the grammar for arithmetic
    91 actual integer number the arithmetic expression, say \texttt{1 + 2 *
    91 expressions, we might be interested in the actual integer number the
    92   3}, stands for. In this way we can use parser combinators to
    92 arithmetic expression, say \texttt{1 + 2 * 3}, stands for. In this way
    93 implement relatively easily a calculator, for instance.
    93 we can use parser combinators to implement relatively easily a
       
    94 calculator, for instance.
    94 
    95 
    95 The main idea of parser combinators is that we can easily build parser
    96 The main idea of parser combinators is that we can easily build parser
    96 combinators out of smaller components following very closely the
    97 combinators out of smaller components following very closely the
    97 structure of a grammar. In order to implement this in an
    98 structure of a grammar. In order to implement this in an
    98 object-oriented programming language, like Scala, we need to specify
    99 object-oriented programming language, like Scala, we need to specify
   101 and returns a set of type \mbox{\texttt{Set[(T, I)]}}.
   102 and returns a set of type \mbox{\texttt{Set[(T, I)]}}.
   102 
   103 
   103 \begin{center}
   104 \begin{center}
   104 \begin{lstlisting}[language=Scala]
   105 \begin{lstlisting}[language=Scala]
   105 abstract class Parser[I, T] {
   106 abstract class Parser[I, T] {
   106   def parse(ts: I): Set[(T, I)]
   107   def parse(in: I) : Set[(T, I)]
   107 
   108 
   108   def parse_all(ts: I): Set[T] =
   109   def parse_all(in: I) : Set[T] =
   109     for ((head, tail) <- parse(ts); if (tail.isEmpty)) 
   110     for ((head, tail) <- parse(in); if (tail.isEmpty)) 
   110       yield head
   111       yield head
   111 }
   112 }
   112 \end{lstlisting}
   113 \end{lstlisting}
   113 \end{center}
   114 \end{center}
   114 
   115 
   132   $s$ is the unprocessed part of the input string.
   133   $s$ is the unprocessed part of the input string.
   133 \item Otherwise return the empty set $\{\}$.	
   134 \item Otherwise return the empty set $\{\}$.	
   134 \end{itemize}
   135 \end{itemize}
   135 
   136 
   136 \noindent
   137 \noindent
   137 The input type of this simple parser combinator for characters is
   138 The input type of this simple parser combinator is \texttt{String} and
   138 \texttt{String} and the output type is \texttt{Char}. This means
   139 the output type is \texttt{Char}. This means \texttt{parse} returns
   139 \texttt{parse} returns  \mbox{\texttt{Set[(Char, String)]}}. 
   140 \mbox{\texttt{Set[(Char, String)]}}.  The code in Scala is as follows:
   140 The code in Scala is as follows:
       
   141 
   141 
   142 \begin{center}
   142 \begin{center}
   143 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
   143 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
   144 case class CharParser(c: Char) extends Parser[String, Char] {
   144 case class CharParser(c: Char) extends Parser[String, Char] {
   145   def parse(in: String) = 
   145   def parse(in: String) = 
   156 empty set (in Scala \texttt{Set()}). Since this parser recognises
   156 empty set (in Scala \texttt{Set()}). Since this parser recognises
   157 characters and just returns characters as the processed part, the
   157 characters and just returns characters as the processed part, the
   158 output type of the parser is \texttt{Char}.
   158 output type of the parser is \texttt{Char}.
   159 
   159 
   160 If we want to parse a list of tokens and interested in recognising a
   160 If we want to parse a list of tokens and interested in recognising a
   161 number token, we could write something like this
   161 number token, for example, we could write something like this
   162 
   162 
   163 \begin{center}
   163 \begin{center}
   164 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily,numbers=none]
   164 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily,numbers=none]
   165 case object NumParser extends Parser[List[Token], Int] {
   165 case object NumParser extends Parser[List[Token], Int] {
   166   def parse(ts: List[Token]) = ts match {
   166   def parse(ts: List[Token]) = ts match {
   175 In this parser the input is of type \texttt{List[Token]}. The function
   175 In this parser the input is of type \texttt{List[Token]}. The function
   176 parse looks at the input \texttt{ts} and checks whether the first
   176 parse looks at the input \texttt{ts} and checks whether the first
   177 token is a \texttt{Num\_token} (let us assume our lexer generated
   177 token is a \texttt{Num\_token} (let us assume our lexer generated
   178 these tokens for numbers). But this parser does not just return this
   178 these tokens for numbers). But this parser does not just return this
   179 token (and the rest of the list), like the \texttt{CharParser} above,
   179 token (and the rest of the list), like the \texttt{CharParser} above,
   180 rather it extracts the string \texttt{s} from the token and converts it
   180 rather it extracts also the string \texttt{s} from the token and
   181 into an integer. The hope is that the lexer did its work well and this
   181 converts it into an integer. The hope is that the lexer did its work
   182 conversion always succeeds. The consequence of this is that the output
   182 well and this conversion always succeeds. The consequence of this is
   183 type for this parser is \texttt{Int}, not \texttt{Token}. Such a
   183 that the output type for this parser is \texttt{Int}, not
   184 conversion would be needed if we want to implement a simple calculator
   184 \texttt{Token}. Such a conversion would be needed if we want to
   185 program.
   185 implement a simple calculator program, because string-numbers need to
       
   186 be transformed into \texttt{Int}-numbers in order to do the
       
   187 calculations.
   186 
   188 
   187 
   189 
   188 These simple parsers that just look at the input and do a simple
   190 These simple parsers that just look at the input and do a simple
   189 transformation are often called \emph{atomic} parser combinators.
   191 transformation are often called \emph{atomic} parser combinators.
   190 More interesting are the parser combinators that build larger parsers
   192 More interesting are the parser combinators that build larger parsers
   212 \end{center}
   214 \end{center}
   213 
   215 
   214 \noindent The types of this parser combinator are again generic (we
   216 \noindent The types of this parser combinator are again generic (we
   215 have \texttt{I} for the input type, and \texttt{T} for the output
   217 have \texttt{I} for the input type, and \texttt{T} for the output
   216 type). The alternative parser builds a new parser out of two existing
   218 type). The alternative parser builds a new parser out of two existing
   217 parsers \texttt{p} and \texttt{q} given as arguments.  Both parsers
   219 parsers \texttt{p} and \texttt{q} which are given as arguments.  Both
   218 need to be able to process input of type \texttt{I} and return in
   220 parsers need to be able to process input of type \texttt{I} and return
   219 \texttt{parse} the same output type \texttt{Set[(T,
   221 in \texttt{parse} the same output type \texttt{Set[(T,
   220   I)]}.\footnote{There is an interesting detail of Scala, namely the
   222   I)]}.\footnote{There is an interesting detail of Scala, namely the
   221   \texttt{=>} in front of the types of \texttt{p} and \texttt{q}. They
   223   \texttt{=>} in front of the types of \texttt{p} and \texttt{q}. They
   222   will prevent the evaluation of the arguments before they are
   224   will prevent the evaluation of the arguments before they are
   223   used. This is often called \emph{lazy evaluation} of the
   225   used. This is often called \emph{lazy evaluation} of the
   224   arguments. We will explain this later.}  The alternative parser
   226   arguments. We will explain this later.}  The alternative parser runs
   225 should run the input with the first parser \texttt{p} (producing a set
   227 the input with the first parser \texttt{p} (producing a set of pairs)
   226 of pairs) and then run the same input with \texttt{q} (producing
   228 and then runs the same input with \texttt{q} (producing another set of
   227 another set of pairs).  The result should be then just the union of
   229 pairs).  The result should be then just the union of both sets, which
   228 both sets, which is the operation \texttt{++} in Scala.
   230 is the operation \texttt{++} in Scala.
   229 
   231 
   230 The alternative parser combinator allows us to construct a parser that
   232 The alternative parser combinator allows us to construct a parser that
   231 parses either a character \texttt{a} or \texttt{b} using the
   233 parses either a character \texttt{a} or \texttt{b} using the
   232 \texttt{CharParser} shown above. For this we can write
   234 \texttt{CharParser} shown above. For this we can write
   233 
   235 
   238 \end{center}
   240 \end{center}
   239 
   241 
   240 \noindent Later on we will use Scala mechanism for introducing some
   242 \noindent Later on we will use Scala mechanism for introducing some
   241 more readable shorthand notation for this, like \texttt{"a" |
   243 more readable shorthand notation for this, like \texttt{"a" |
   242   "b"}. Let us look in detail at what this parser combinator produces
   244   "b"}. Let us look in detail at what this parser combinator produces
   243 with some sample strings
   245 with some sample strings.
   244 
   246 
   245 \begin{center}
   247 \begin{center}
   246 \begin{tabular}{rcl}
   248 \begin{tabular}{rcl}
   247 input strings & & output\medskip\\
   249 input strings & & output\medskip\\
   248 \texttt{\Grid{acde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{cde}})\right\}$\\
   250 \texttt{\Grid{acde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{cde}})\right\}$\\
   258 parse anything with \pcode{ccde}, therefore the empty
   260 parse anything with \pcode{ccde}, therefore the empty
   259 set is returned.
   261 set is returned.
   260 
   262 
   261 A bit more interesting is the \emph{sequence parser combinator}. Given
   263 A bit more interesting is the \emph{sequence parser combinator}. Given
   262 two parsers, say again, $p$ and $q$, we want to apply first the input
   264 two parsers, say again, $p$ and $q$, we want to apply first the input
   263 to $p$ producing a set of pairs; then apply $q$ to all the unparsed
   265 to $p$ producing a set of pairs; then apply $q$ to all the unparsed  
   264 parts in the pairs; and then combine the results. Mathematically we would
   266 parts in the pairs; and then combine the results. Mathematically we would
   265 write something like this for the expected set of pairs:
   267 write something like this for the result set of pairs:
   266 
   268 
   267 \begin{center}
   269 \begin{center}
   268 \begin{tabular}{lcl}
   270 \begin{tabular}{lcl}
   269 $\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\,|\,$ & 
   271 $\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\,|\,$ & 
   270 $(\textit{output}_1, u_1) \in p(\text{input}) 
   272 $(\textit{output}_1, u_1) \in p(\text{input}) 
   271 \;\wedge\;$\\
   273 \;\wedge\;$\\
   272 && $(\textit{output}_2, u_2) \in q(u_1)\}$
   274 && $(\textit{output}_2, u_2) \in q(u_1)\}$
   273 \end{tabular}
   275 \end{tabular}
   274 \end{center}
   276 \end{center}
   275 
   277 
   276 \noindent Notice that the $p$ wil first be run on the input, producing
       
   277 pairs of the form $(\textit{output}_1, u_1)$ where the $u_1$ stands
       
   278 for the unprocessed, or leftover, parts. We want that $q$ runs on all
       
   279 these unprocessed parts $u_1$. This again will produce some
       
   280 processed part , $p$ and
       
   281 $q$, we apply both parsers to the input (remember parsers are
       
   282 functions) and combine the output (remember they are sets of pairs)
       
   283  
       
   284 \begin{center}
       
   285 $p(\text{input}) \cup q(\text{input})$
       
   286 \end{center}
       
   287 
       
   288 \noindent In Scala we would implement alternative parser
       
   289 combinator as follows
       
   290 
       
   291 \begin{center}
       
   292 \begin{lstlisting}[language=Scala, numbers=none]
       
   293 class AltParser[I, T]
       
   294        (p: => Parser[I, T], 
       
   295         q: => Parser[I, T]) extends Parser[I, T] {
       
   296   def parse(in: I) = p.parse(in) ++ q.parse(in)
       
   297 }
       
   298 \end{lstlisting}
       
   299 \end{center}
       
   300 
       
   301 \noindent The types of this parser combinator are again generic (we
       
   302 just have \texttt{I} for the input type, and \texttt{T} for the output
       
   303 type). The alternative parser builds a new parser out of two existing
       
   304 parsers \texttt{p} and \texttt{q}.  Both need to be able to process
       
   305 input of type \texttt{I} and return the same output type
       
   306 \texttt{Set[(T, I)]}.\footnote{There is an interesting detail of
       
   307   Scala, namely the \texttt{=>} in front of the types of \texttt{p}
       
   308   and \texttt{q}. They will prevent the evaluation of the arguments
       
   309   before they are used. This is often called \emph{lazy evaluation} of
       
   310   the arguments. We will explain this later.}  Therefore the output
       
   311 type of this parser is \texttt{T}. The alternative parser should run
       
   312 the input with the first parser \texttt{p} (producing a set of pairs)
       
   313 and then run the same input with \texttt{q} (producing another set of
       
   314 pairs).  The result should be then just the union of both sets, which
       
   315 is the operation \texttt{++} in Scala.
       
   316 
       
   317 The alternative parser combinator already allows us to
       
   318 construct a parser that parses either a character \texttt{a}
       
   319 or \texttt{b}, as
       
   320 
       
   321 \begin{center}
       
   322 \begin{lstlisting}[language=Scala, numbers=none]
       
   323 new AltParser(CharParser('a'), CharParser('b'))
       
   324 \end{lstlisting}
       
   325 \end{center}
       
   326 
       
   327 \noindent Later on we will use again Scala mechanism for introducing some
       
   328 more readable shorthand notation for this, like \texttt{"a" | "b"}. Let us look in detail at what this parser combinator produces with some sample strings 
       
   329 
       
   330 \begin{center}
       
   331 \begin{tabular}{rcl}
       
   332 input strings & & output\medskip\\
       
   333 \texttt{\Grid{acde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{cde}})\right\}$\\
       
   334 \texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{cde}})\right\}$\\
       
   335 \texttt{\Grid{ccde}} & $\rightarrow$ & $\{\}$
       
   336 \end{tabular}
       
   337 \end{center}
       
   338 
       
   339 \noindent We receive in the first two cases a successful
       
   340 output (that is a non-empty set). In each case, either
       
   341 \pcode{a} or \pcode{b} is in the processed part, and
       
   342 \pcode{cde} in the unprocessed part. Clearly this parser cannot
       
   343 parse anything in the string \pcode{ccde}, therefore the empty
       
   344 set.
       
   345 
       
   346 A bit more interesting is the \emph{sequence parser combinator}. Given
       
   347 two parsers, say again, $p$ and $q$, we want to apply first the input
       
   348 to $p$ producing a set of pairs; then apply $q$ to all the unparsed
       
   349 parts in the pairs; and then combine the results like
       
   350 
       
   351 \begin{center}
       
   352 \begin{tabular}{lcl}
       
   353 $\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\;|\;$ & 
       
   354 $(\textit{output}_1, u_1) \in p(\text{input}) 
       
   355 \;\wedge\;$\\
       
   356 && $(\textit{output}_2, u_2) \in q(u_1)\}$
       
   357 \end{tabular}
       
   358 \end{center}
       
   359 
       
   360 \noindent Notice that the $p$ will first be run on the input,
   278 \noindent Notice that the $p$ will first be run on the input,
   361 producing pairs of the form $\textit{output}_1$ and unprocessed part
   279 producing pairs of the form $(\textit{output}_1, u_1)$ where the $u_1$
   362 $u_1$. This unprocessed part is fed into the second parser $q$. The
   280 stands for the unprocessed, or leftover, parts pf $p$. We want that
   363 overall result of the sequence parser combinator is pairs of the form
   281 $q$ runs on all these unprocessed parts $u_1$. Therefore these
       
   282 unprocessed parts are fed into the second parser $q$. The overall
       
   283 result of the sequence parser combinator is pairs of the form
   364 $((\textit{output}_1, \textit{output}_2), u_2)$. This means the
   284 $((\textit{output}_1, \textit{output}_2), u_2)$. This means the
   365 unprocessed part of both parsers is the unprocessed part the second
   285 unprocessed part of the sequqnce p[arser combinator is the unprocessed
   366 parser $q$ leaves as leftover. The processed parts of both
   286 part the second parser $q$ leaves as leftover. The processed parts of
   367 parsers is a pair consisting of the outputs of $p$ and $q$, namely
   287 of the component parsers is a pair consisting of the outputs of $p$
   368 $(\textit{output}_1, \textit{output}_2)$. This behaviour can be
   288 and $q$, namely $(\textit{output}_1, \textit{output}_2)$. This
   369 implemented in Scala as follows:
   289 behaviour can be implemented in Scala as follows:
   370 
   290 
   371 \begin{center}
   291 \begin{center}
   372 \begin{lstlisting}[language=Scala,numbers=none]
   292 \begin{lstlisting}[language=Scala,numbers=none]
   373 class SeqParser[I, T, S]
   293 class SeqParser[I, T, S]
   374        (p: => Parser[I, T], 
   294        (p: => Parser[I, T], 
   393 component is the unprocessed part left over after running the second
   313 component is the unprocessed part left over after running the second
   394 parser \texttt{q}. Therefore the input type of the sequence parser
   314 parser \texttt{q}. Therefore the input type of the sequence parser
   395 combinator is as usual \texttt{I}, but the output type is
   315 combinator is as usual \texttt{I}, but the output type is
   396 
   316 
   397 \begin{center}
   317 \begin{center}
   398 \texttt{Set[((T, S), I)]}
   318 \texttt{(T, S)}
   399 \end{center}
   319 \end{center}
   400 
   320 
   401 \noindent
   321 \noindent
   402 If any of the runs of \textit{p} and \textit{q} fail, that is produce
   322 This means \texttt{parse} in the sequence parser combinator returns
   403 the empty set, then \texttt{parse} will also produce the empty set.
   323 sets of type \texttt{Set[((T, S), I)]}.  Notice that we have
   404 Notice that we have now two output types for the sequence parser
   324 essentially two output types for the sequence parser combinator
   405 combinator, because in general \textit{p} and \textit{q} might produce
   325 (packaged in a pair), because in general \textit{p} and \textit{q}
   406 different things (for example first we recognise a number and then a
   326 might produce different things (for example first we recognise a
   407 string corresponding to an operator).
   327 number and then a string corresponding to an operator).  If any of the
       
   328 runs of \textit{p} and \textit{q} fail, that is produce the empty set,
       
   329 then \texttt{parse} will also produce the empty set.
   408 
   330 
   409 With the shorthand notation we shall introduce later for the sequence
   331 With the shorthand notation we shall introduce later for the sequence
   410 parser combinator, we can write for example \pcode{"a" ~ "b"}, which
   332 parser combinator, we can write for example \pcode{"a" ~ "b"}, which
   411 is the parser combinator that first recognises the character
   333 is the parser combinator that first recognises the character
   412 \texttt{a} from a string and then \texttt{b}. Let us look again at
   334 \texttt{a} from a string and then \texttt{b}. Let us look again at
   458 \end{tabular}
   380 \end{tabular}
   459 \end{center}
   381 \end{center}
   460 
   382 
   461 
   383 
   462 \noindent The second and third example fail, because something is
   384 \noindent The second and third example fail, because something is
   463 ``missing'' in the sequence we are looking for. Also notice how the
   385 ``missing'' in the sequence we are looking for. The first succeeds but
   464 results nest with sequences: the parsed part is a nested pair of the
   386 notice how the results nest with sequences: the parsed part is a
   465 form \pcode{((a, b), c)}. If we nest the sequence parser differently,
   387 nested pair of the form \pcode{((a, b), c)}. If we nest the sequence
   466 for example \pcode{"a" ~ ("b" ~ "c")}, then also our output pairs nest
   388 parser differently, for example \pcode{"a" ~ ("b" ~ "c")}, then also
   467 differently
   389 our output pairs nest differently
   468 
   390 
   469 \begin{center}
   391 \begin{center}
   470 \begin{tabular}{rcl}
   392 \begin{tabular}{rcl}
   471 input strings & & output\medskip\\
   393 input strings & & output\medskip\\
   472 \texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}},(\texttt{\Grid{b}}, \texttt{\Grid{c}})), \texttt{\Grid{de}})\right\}$\\
   394 \texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}},(\texttt{\Grid{b}}, \texttt{\Grid{c}})), \texttt{\Grid{de}})\right\}$\\
   508 \]
   430 \]
   509 
   431 
   510 \noindent where the unprocessed (empty) parts have been stripped away
   432 \noindent where the unprocessed (empty) parts have been stripped away
   511 from the pairs; everything where the second part was not empty has
   433 from the pairs; everything where the second part was not empty has
   512 been thrown away as well, because they represent
   434 been thrown away as well, because they represent
   513 ultimately-unsuccessful-parses.
   435 ultimately-unsuccessful-parses. The main point is that the sequence
   514 
   436 parser combinator returns pairs that can nest according to the
   515 
   437 nesting of the component parsers.
   516 Note carefully that constructing a parser such \pcode{"a" | ("a" ~
   438 
   517   "b")} will result in a typing error. The intention is that we want
   439 
   518 to parse an \texttt{a}, or an \texttt{a} followed by a
   440 Consider also carefully that constructing a parser such \pcode{"a" |
   519 \texttt{b}. However, the first parser has as output type a single
   441   ("a" ~ "b")} will result in a typing error. The intention with this
   520 character (recall the type of \texttt{CharParser}), but the second
   442 parser is that we want to parse an \texttt{a}, or an \texttt{a}
   521 parser produces a pair of characters as output. The alternative parser
   443 followed by a \texttt{b}. However, the first parser has as output type
   522 is required to have both component parsers to have the same type---we
   444 a single character (recall the type of \texttt{CharParser}), but the
   523 need to be able to build the union of two sets, which means in Scala
   445 second parser produces a pair of characters as output. The alternative
   524 they need to be of the same type.  Therefore the typing error.
   446 parser is required to have both component parsers to have the same
   525 We will see later how we can build
   447 type---we need to be able to build the union of two sets, which means
       
   448 in Scala they need to be of the same type.  Since ther are not, there
       
   449 is a typing error in this example.  We will see later how we can build
   526 this parser without the typing error.
   450 this parser without the typing error.
   527 
   451 
   528 The next parser combinator, called \emph{semantic action}, does not
   452 The next parser combinator, called \emph{semantic action}, does not
   529 actually combine smaller parsers, but applies a function to the result
   453 actually combine smaller parsers, but applies a function to the result
   530 of a parser.  It is implemented in Scala as follows
   454 of a parser.  It is implemented in Scala as follows
   539 }
   463 }
   540 \end{lstlisting}
   464 \end{lstlisting}
   541 \end{center}
   465 \end{center}
   542 
   466 
   543 
   467 
   544 \noindent This parser combinator takes a parser \texttt{p}
   468 \noindent This parser combinator takes a parser \texttt{p} (with input
   545 with output type \texttt{T} as one argument as well as a
   469 type \texttt{I} and output type \texttt{T}) as one argument but also a
   546 function \texttt{f} with type \texttt{T => S}. The parser
   470 function \texttt{f} (with type \texttt{T => S}). The parser \texttt{p}
   547 \texttt{p} produces sets of type \texttt{(T, I)}. The
   471 produces sets of type \texttt{Set[(T, I)]}. The semantic action
   548 semantic action combinator then applies the function
   472 combinator then applies the function \texttt{f} to all the `processed'
   549 \texttt{f} to all the parser outputs. Since this function is of
   473 parser outputs. Since this function is of type \texttt{T => S}, we
   550 type \texttt{T => S}, we obtain a parser with output type
   474 obtain a parser with output type \texttt{S}. Again Scala lets us
   551 \texttt{S}. Again Scala lets us introduce some shorthand
   475 introduce some shorthand notation for this parser
   552 notation for this parser combinator. Therefore we will write
   476 combinator. Therefore we will write \texttt{p ==> f} for it.
   553 \texttt{p ==> f} for it.
       
   554 
   477 
   555 What are semantic actions good for? Well, they allow you to transform
   478 What are semantic actions good for? Well, they allow you to transform
   556 the parsed input into a datastructure you can use for further
   479 the parsed input into datastructures you can use for further
   557 processing. A simple example would be to transform parsed characters
   480 processing. A simple example would be to transform parsed characters
   558 into ASCII numbers. Suppose we define a function \texttt{f} (from
   481 into ASCII numbers. Suppose we define a function \texttt{f} (from
   559 characters to ints) and use a \texttt{CharParser} for parsing
   482 characters to ints) and use a \texttt{CharParser} for parsing
   560 the character \texttt{c}.
   483 the character \texttt{c}.
   561 
   484 
   623 \end{lstlisting}
   546 \end{lstlisting}
   624 \end{center}  
   547 \end{center}  
   625 
   548 
   626 \noindent
   549 \noindent
   627 produces \texttt{Set((123,abc))}. The problem is that \texttt{123} is
   550 produces \texttt{Set((123,abc))}. The problem is that \texttt{123} is
   628 still a string (the double-quotes are not printed by Scala). We need
   551 still a string (the required double-quotes are not printed by
   629 to convert it into the corresponding \texttt{Int}. We can do this as
   552 Scala). We want to convert this string into the corresponding
   630 follows using a semantic action
   553 \texttt{Int}. We can do this as follows using a semantic action
   631 
   554 
   632 \begin{center}
   555 \begin{center}
   633 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
   556 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
   634 (NumParser ==> (s => s.toInt)).parse("123abc")
   557 (NumParser ==> (s => s.toInt)).parse("123abc")
   635 \end{lstlisting}
   558 \end{lstlisting}
   642 
   565 
   643 \subsubsection*{Shorthand notation for parser combinators}
   566 \subsubsection*{Shorthand notation for parser combinators}
   644 
   567 
   645 Before we proceed, let us just explain the shorthand notation for
   568 Before we proceed, let us just explain the shorthand notation for
   646 parser combinators. Like for regular expressions, the shorthand notation
   569 parser combinators. Like for regular expressions, the shorthand notation
   647 will make our life much easier when writing actual parsers.
   570 will make our life much easier when writing actual parsers. We can define
       
   571 some implicits which allow us to write \pcode{p | q}, \pcode{p ~ q} and
       
   572 \pcode{p ==> f} as well as to use plain strings for specifying simple
       
   573 string parsers.
       
   574 
       
   575 The idea is that this shorthand notation allows us to easily translate
       
   576 context-free grammars into code. For example recall our context-free
       
   577 grammar for palindromes:
       
   578 
       
   579 \begin{plstx}[margin=3cm]
       
   580 : \meta{P} ::=  a\cdot \meta{P}\cdot a | b\cdot \meta{P}\cdot b | a | b | \epsilon\\
       
   581 \end{plstx}
       
   582 
       
   583 \noindent
       
   584 Each alternative in this grammar translates into an alternative parser
       
   585 combinator.  The $\cdot$ can be translated to a sequence parser
       
   586 combinator. The parsers for $a$, $b$ and $\epsilon$ can be simply
       
   587 written as \texttt{"a"}, \texttt{"b"} and \texttt{""}.
       
   588 
   648 
   589 
   649 \subsubsection*{How to build parsers using parser combinators?}
   590 \subsubsection*{How to build parsers using parser combinators?}
   650 
   591 
   651 The beauty of parser combinators is the ease with which they can be
   592 The beauty of parser combinators is the ease with which they can be
   652 implemented and how easy it is to translate context-free grammars into
   593 implemented and how easy it is to translate context-free grammars into
   653 code (though the grammars need to be non-left-recursive). To
   594 code (though the grammars need to be non-left-recursive). To
   654 demonstrate this recall our context-free grammar for palindromes:
   595 demonstrate this recall the grammar for palindromes from above.
   655 
   596 The first idea would be to translate it into the following code
   656 \begin{plstx}[margin=3cm]
       
   657 : \meta{P} ::=  a\cdot \meta{P}\cdot a | b\cdot \meta{P}\cdot b | a | b | \epsilon\\
       
   658 \end{plstx}
       
   659 
       
   660 \noindent
       
   661 Given the parser combinators for alternatives and sequeneces, this grammar should be
       
   662 straightforward to implement. The first idea would be
       
   663 
   597 
   664 \begin{center}
   598 \begin{center}
   665 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
   599 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
   666 lazy val Pal : Parser[String, String] = 
   600 lazy val Pal : Parser[String, String] = 
   667   (("a" ~ Pal ~ "a") | ("b" ~ Pal ~ "b") | "a" | "b" | "")
   601   (("a" ~ Pal ~ "a") | ("b" ~ Pal ~ "b") | "a" | "b" | "")
   668 \end{lstlisting}
   602 \end{lstlisting}
   669 \end{center}
   603 \end{center}
   670 
   604 
   671 \noindent
   605 \noindent
   672 Unfortunately, this does not work as it produces a typing error. The
   606 Unfortunately, this does not quite work yet as it produces a typing
   673 reason is that the parsers \texttt{"a"}, \texttt{"b"} and \texttt{""}
   607 error. The reason is that the parsers \texttt{"a"}, \texttt{"b"} and
   674 all produce strings and therefore can be put into an alternative
   608 \texttt{""} all produce strings as output type and therefore can be
   675 \texttt{...| "a" | "b" | ""}. But both \pcode{"a" ~ Pal ~ "a"} and
   609 put into an alternative \texttt{...| "a" | "b" | ""}. But both
   676 \pcode{"b" ~ Pal ~ "b"} produce pairs of the form
   610 \pcode{"a" ~ Pal ~ "a"} and \pcode{"b" ~ Pal ~ "b"} produce pairs of
   677 $(((\_, \_), \_), \_)$---that is how the sequence parser combinator
   611 the form $(((\_, \_), \_), \_)$---that is how the sequence parser
   678 nests results when \pcode{\~} is used between two components. The
   612 combinator nests results when \pcode{\~} is used between two
   679 solution is to use a semantic action that ``flattens'' these pairs and
   613 components. The solution is to use a semantic action that ``flattens''
   680 appends the corresponding strings, like
   614 these pairs and appends the corresponding strings, like
   681 
   615 
   682 \begin{center}
   616 \begin{center}
   683 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
   617 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
   684 lazy val Pal : Parser[String, String] =  
   618 lazy val Pal : Parser[String, String] =  
   685   (("a" ~ Pal ~ "a") ==> { case ((x, y), z) => x + y + z } |
   619   (("a" ~ Pal ~ "a") ==> { case ((x, y), z) => x + y + z } |
   687     "a" | "b" | "")
   621     "a" | "b" | "")
   688 \end{lstlisting}
   622 \end{lstlisting}
   689 \end{center}
   623 \end{center}
   690 
   624 
   691 \noindent
   625 \noindent
       
   626 Now in all cases we have strings as output type for the parser variants.
   692 The semantic action
   627 The semantic action
   693 
   628 
   694 
   629 
   695 \noindent
   630 \noindent
   696 Important to note is that we must define \texttt{Pal}-parser as a
   631 Important to note is that we must define \texttt{Pal}-parser as a