handouts/ho06.tex
changeset 587 5ddedcd92d84
parent 586 451a95e1bc25
child 588 a4646557016d
equal deleted inserted replaced
586:451a95e1bc25 587:5ddedcd92d84
     7 \begin{document}
     7 \begin{document}
     8 
     8 
     9 \section*{Handout 6 (Parser Combinators)}
     9 \section*{Handout 6 (Parser Combinators)}
    10 
    10 
    11 This handout explains how \emph{parser combinators} work and how they
    11 This handout explains how \emph{parser combinators} work and how they
    12 can be implemented in Scala. Their distinguishing feature is that they
    12 can be implemented in Scala. Their most distinguishing feature is that
    13 are very easy to implement (admittedly it is only easy in a functional
    13 they are very easy to implement (admittedly it is only easy in a
    14 programming language). However, parser combinators require that the
    14 functional programming language).  Another good point of parser
    15 grammar to be parsed is \emph{not} left-recursive and they are
    15 combinators is that they can deal with any kind of input as long as
    16 efficient only when the grammar is unambiguous. It is the
    16 this input is of ``sequence-kind'', for example a string or a list of
    17 responsibility of the grammar designer to ensure these two properties.
    17 tokens. The only two properties of the input we need is to be able to
    18 
    18 test when it is empty and ``sequentially'' take it apart. Strings and
    19 Another good point of parser combinators is that they can deal with any kind
    19 lists fit this bill. However, parser combinators also have their
    20 of input as long as this input of ``sequence-kind'', for example a
    20 drawbacks. For example they require that the grammar to be parsed is
    21 string or a list of tokens. The general idea behind parser combinators
    21 \emph{not} left-recursive and they are efficient only when the grammar
    22 is to transform the input into sets of pairs, like so
    22 is unambiguous. It is the responsibility of the grammar designer to
       
    23 ensure these two properties.
       
    24 
       
    25 The general idea behind parser combinators is to transform the input
       
    26 into sets of pairs, like so
    23 
    27 
    24 \begin{center}
    28 \begin{center}
    25 $\underbrace{\text{list of tokens}}_{\text{input}}$ 
    29 $\underbrace{\text{list of tokens}}_{\text{input}}$ 
    26 $\Rightarrow$
    30 $\Rightarrow$
    27 $\underbrace{\text{set of (parsed input, unparsed input)}}_{\text{output}}$
    31 $\underbrace{\text{set of (parsed input, unparsed input)}}_{\text{output}}$
    28 \end{center} 
    32 \end{center} 
    29 
    33 
    30 \noindent As said, the input can be anything as long as it is a
    34 \noindent
    31 ``sequence''. The only property of the input we need is to be able to
    35 Given the extended effort we have spent in order to implement a lexer
    32 test when it is empty. Obviously we can do this for strings and lists.
    36 in order to generate list of tokens, it might be surprising that in
    33 For more lucidity we shall below often use strings as input in order
    37 what follows we shall often use strings as input. This is for making
    34 to illustrate matters. However, this does not make our previous work
    38 the explanation more lucid. It does not make our previous work on
    35 on lexers obsolete (remember they transform a string into a list of
    39 lexers obsolete (remember they transform a string into a list of
    36 tokens). Lexers will still be needed to build a somewhat realistic
    40 tokens). Lexers will still be needed for building a somewhat realistic
    37 compiler.
    41 compiler.
    38 
    42 
    39 
    43 But as said, parser combinators are relatively agnostic about what
    40 In my Scala code I use the following polymorphic types for parser combinators:
    44 kind of input they process. In my Scala code I use the following
       
    45 polymorphic types for parser combinators:
    41 
    46 
    42 \begin{center}
    47 \begin{center}
    43 input:\;\; \texttt{I}  \qquad output:\;\; \texttt{T}
    48 input:\;\; \texttt{I}  \qquad output:\;\; \texttt{T}
    44 \end{center}
    49 \end{center}
    45 
    50 
    46 \noindent that is they take as input something of type \texttt{I} and
    51 \noindent That is they take as input something of type \texttt{I} and
    47 return a set of pairs of \texttt{Set[(T, I)]}. Since the input needs
    52 return a set of pairs of type \texttt{Set[(T, I)]}. Since the input needs
    48 to be of ``sequence-kind'' I actually have to often write \texttt{I
    53 to be of ``sequence-kind'', I actually have to often write \texttt{I
    49   <\% Seq[\_]} for the input type in order to indicate it is a
    54   <\% Seq[\_]} for the input type. This ensures the input is a
    50 subtype of Scala sequences. The first component of the generated pairs
    55 subtype of Scala sequences. The first component of the generated pairs
    51 corresponds to what the parser combinator was able to process from the
    56 corresponds to what the parser combinator was able to process from the
    52 input and the second is the unprocessed part of the input (therefore
    57 input and the second is the unprocessed part of the input (therefore
    53 the type of this unprocessed part is the same as the input). As we
    58 the type of this unprocessed part is the same as the input). As we
    54 shall see shortly, a parser combinator might return more than one such
    59 shall see shortly, a parser combinator might return more than one such
    55 pair; the idea being that there are potentially several ways of how to
    60 pair; the idea is that there are potentially several ways of how to
    56 parse the input.  As a concrete example, consider the string
    61 parse the input.  As a concrete example, consider the string
    57 
    62 
    58 \begin{center}
    63 \begin{center}
    59 \tt\Grid{iffoo\VS testbar}
    64 \tt\Grid{iffoo\VS testbar}
    60 \end{center}
    65 \end{center}
    66 \begin{center}
    71 \begin{center}
    67 $\left\{ \left(\texttt{\Grid{if}}\;,\; \texttt{\Grid{foo\VS testbar}}\right), 
    72 $\left\{ \left(\texttt{\Grid{if}}\;,\; \texttt{\Grid{foo\VS testbar}}\right), 
    68            \left(\texttt{\Grid{iffoo}}\;,\; \texttt{\Grid{\VS testbar}}\right) \right\}$
    73            \left(\texttt{\Grid{iffoo}}\;,\; \texttt{\Grid{\VS testbar}}\right) \right\}$
    69 \end{center}
    74 \end{center}
    70 
    75 
    71 \noindent where the first pair means the parser could
    76 \noindent where the first pair means the parser could recognise
    72 recognise \texttt{if} from the input and leaves the rest as
    77 \texttt{if} from the input and leaves the rest as `unprocessed' as the
    73 `unprocessed' as the second component of the pair; in the
    78 second component of the pair; in the other case it could recognise
    74 other case it could recognise \texttt{iffoo} and leaves
    79 \texttt{iffoo} and leaves \texttt{\VS testbar} as unprocessed. If the
    75 \texttt{\VS testbar} as unprocessed. If the parser cannot
    80 parser cannot recognise anything from the input at all, then parser
    76 recognise anything from the input, then parser combinators just
    81 combinators just return the empty set $\{\}$. This will indicate
    77 return the empty set $\{\}$. This will indicate something
    82 something ``went wrong''\ldots or more precisely, nothing could be
    78 ``went wrong''\ldots or more precisely, nothing could be
       
    79 parsed.
    83 parsed.
    80 
    84 
    81 Also important to note is that the type \texttt{T} for the processed
    85 Also important to note is that the type \texttt{T} for the processed
    82 part is different from the input type. The reason is that in general
    86 part is different from the input type. In the example above is just
    83 we are interested in transform our input into something
    87 happens to be the same. The reason for the difference is that in
    84 ``different''\ldots for example into a tree, or if we implement the
    88 general we are interested in transforming our input into something
    85 grammar for arithmetic expressions we might be interested in the
    89 ``different''\ldots for example into a tree; or if we implement the
       
    90 grammar for arithmetic expressions, we might be interested in the
    86 actual integer number the arithmetic expression, say \texttt{1 + 2 *
    91 actual integer number the arithmetic expression, say \texttt{1 + 2 *
    87   3}, stands for. In this way we can use parser combinators to
    92   3}, stands for. In this way we can use parser combinators to
    88 implement relatively easily a calculator.
    93 implement relatively easily a calculator, for instance.
    89 
    94 
    90 The main idea of parser combinators is that we can easily build parser
    95 The main idea of parser combinators is that we can easily build parser
    91 combinators out of smaller components following very closely the
    96 combinators out of smaller components following very closely the
    92 structure of a grammar. In order to implement this in an
    97 structure of a grammar. In order to implement this in an
    93 object-oriented programming language, like Scala, we need to specify
    98 object-oriented programming language, like Scala, we need to specify
   128 \item Otherwise return the empty set $\{\}$.	
   133 \item Otherwise return the empty set $\{\}$.	
   129 \end{itemize}
   134 \end{itemize}
   130 
   135 
   131 \noindent
   136 \noindent
   132 The input type of this simple parser combinator for characters is
   137 The input type of this simple parser combinator for characters is
   133 \texttt{String} and the output type \mbox{\texttt{Set[(Char, String)]}}. 
   138 \texttt{String} and the output type is \texttt{Char}. This means
       
   139 \texttt{parse} returns  \mbox{\texttt{Set[(Char, String)]}}. 
   134 The code in Scala is as follows:
   140 The code in Scala is as follows:
   135 
   141 
   136 \begin{center}
   142 \begin{center}
   137 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
   143 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
   138 case class CharParser(c: Char) extends Parser[String, Char] {
   144 case class CharParser(c: Char) extends Parser[String, Char] {
   139   def parse(sb: String) = 
   145   def parse(in: String) = 
   140     if (sb.head == c) Set((c, sb.tail)) else Set()
   146     if (in.head == c) Set((c, in.tail)) else Set()
   141 }
   147 }
   142 \end{lstlisting}
   148 \end{lstlisting}
   143 \end{center}
   149 \end{center}
   144 
   150 
   145 \noindent You can see the \texttt{parse} function tests whether the
   151 \noindent You can see the \texttt{parse} tests whether the
   146 first character of the input string \texttt{sb} is equal to
   152 first character of the input string \texttt{in} is equal to
   147 \texttt{c}. If yes, then it splits the string into the recognised part
   153 \texttt{c}. If yes, then it splits the string into the recognised part
   148 \texttt{c} and the unprocessed part \texttt{sb.tail}. In case
   154 \texttt{c} and the unprocessed part \texttt{in.tail}. In case
   149 \texttt{sb} does not start with \texttt{c} then the parser returns the
   155 \texttt{in} does not start with \texttt{c} then the parser returns the
   150 empty set (in Scala \texttt{Set()}). Since this parser recognises
   156 empty set (in Scala \texttt{Set()}). Since this parser recognises
   151 characters and just returns characters as the processed part, the
   157 characters and just returns characters as the processed part, the
   152 output type of the parser is \texttt{Char}.
   158 output type of the parser is \texttt{Char}.
   153 
   159 
   154 If we want to parse a list of tokens and interested in recognising a
   160 If we want to parse a list of tokens and interested in recognising a
   172 these tokens for numbers. But this parser does not just return this
   178 these tokens for numbers. But this parser does not just return this
   173 token (and the rest of the list), like the \texttt{CharParser} above,
   179 token (and the rest of the list), like the \texttt{CharParser} above,
   174 rather extracts the string \texttt{s} from the token and converts it
   180 rather extracts the string \texttt{s} from the token and converts it
   175 into an integer. The hope is that the lexer did its work well and this
   181 into an integer. The hope is that the lexer did its work well and this
   176 conversion always succeeds. The consequence of this is that the output
   182 conversion always succeeds. The consequence of this is that the output
   177 type for this parser is \texttt{Int}. Such a conversion would be
   183 type for this parser is \texttt{Int}, not \texttt{Token}. Such a
   178 needed if we want to implement a simple calculator program.
   184 conversion would be needed if we want to implement a simple calculator
       
   185 program.
   179 
   186 
   180 
   187 
   181 These simple parsers that just look at the input and do a simple
   188 These simple parsers that just look at the input and do a simple
   182 transformation are often called \emph{atomic} parser combinators.
   189 transformation are often called \emph{atomic} parser combinators.
   183 More interesting are the parser combinators that build larger parsers
   190 More interesting are the parser combinators that build larger parsers
   184 out of smaller component parsers. For example the \emph{alternative
   191 out of smaller component parsers. There are three such parser
       
   192 combinators that can be implemented generically. The \emph{alternative
   185   parser combinator} is as follows: given two parsers, say, $p$ and
   193   parser combinator} is as follows: given two parsers, say, $p$ and
   186 $q$, we apply both parsers to the input (remember parsers are
   194 $q$, we apply both parsers to the input (remember parsers are
   187 functions) and combine the output (remember they are sets of pairs)
   195 functions) and combine the output (remember they are sets of pairs):
   188  
   196  
   189 \begin{center}
   197 \begin{center}
   190 $p(\text{input}) \cup q(\text{input})$
   198 $p(\text{input}) \cup q(\text{input})$
   191 \end{center}
   199 \end{center}
   192 
   200 
   193 \noindent In Scala we would implement alternative parser
   201 \noindent In Scala we can implement alternative parser
   194 combinator as follows
   202 combinator as follows
   195 
   203 
   196 \begin{center}
   204 \begin{center}
   197 \begin{lstlisting}[language=Scala, numbers=none]
   205 \begin{lstlisting}[language=Scala, numbers=none]
   198 class AltParser[I, T]
   206 class AltParser[I, T]
   199        (p: => Parser[I, T], 
   207        (p: => Parser[I, T], 
   200         q: => Parser[I, T]) extends Parser[I, T] {
   208         q: => Parser[I, T]) extends Parser[I, T] {
   201   def parse(sb: I) = p.parse(sb) ++ q.parse(sb)
   209   def parse(in: I) = p.parse(in) ++ q.parse(in)
       
   210 }
       
   211 \end{lstlisting}
       
   212 \end{center}
       
   213 
       
   214 \noindent The types of this parser combinator are again generic (we
       
   215 have \texttt{I} for the input type, and \texttt{T} for the output
       
   216 type). The alternative parser builds a new parser out of two existing
       
   217 parsers \texttt{p} and \texttt{q} given as arguments.  Both parsers
       
   218 need to be able to process input of type \texttt{I} and return in
       
   219 \texttt{parse} the same output type \texttt{Set[(T,
       
   220   I)]}.\footnote{There is an interesting detail of Scala, namely the
       
   221   \texttt{=>} in front of the types of \texttt{p} and \texttt{q}. They
       
   222   will prevent the evaluation of the arguments before they are
       
   223   used. This is often called \emph{lazy evaluation} of the
       
   224   arguments. We will explain this later.}  The alternative parser
       
   225 should run the input with the first parser \texttt{p} (producing a set
       
   226 of pairs) and then run the same input with \texttt{q} (producing
       
   227 another set of pairs).  The result should be then just the union of
       
   228 both sets, which is the operation \texttt{++} in Scala.
       
   229 
       
   230 The alternative parser combinator allows us to construct a parser that
       
   231 parses either a character \texttt{a} or \texttt{b} using the
       
   232 \texttt{CharParser} shown above. For this we can write
       
   233 
       
   234 \begin{center}
       
   235 \begin{lstlisting}[language=Scala, numbers=none]
       
   236 new AltParser(CharParser('a'), CharParser('b'))
       
   237 \end{lstlisting}
       
   238 \end{center}
       
   239 
       
   240 \noindent Later on we will use Scala mechanism for introducing some
       
   241 more readable shorthand notation for this, like \texttt{"a" ||
       
   242   "b"}. Let us look in detail at what this parser combinator produces
       
   243 with some sample strings
       
   244 
       
   245 \begin{center}
       
   246 \begin{tabular}{rcl}
       
   247 input strings & & output\medskip\\
       
   248 \texttt{\Grid{acde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{cde}})\right\}$\\
       
   249 \texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{cde}})\right\}$\\
       
   250 \texttt{\Grid{ccde}} & $\rightarrow$ & $\{\}$
       
   251 \end{tabular}
       
   252 \end{center}
       
   253 
       
   254 \noindent We receive in the first two cases a successful
       
   255 output (that is a non-empty set). In each case, either
       
   256 \pcode{a} or \pcode{b} is in the processed part, and
       
   257 \pcode{cde} in the unprocessed part. Clearly this parser cannot
       
   258 parse anything with \pcode{ccde}, therefore the empty
       
   259 set is returned.
       
   260 
       
   261 A bit more interesting is the \emph{sequence parser combinator}. Given
       
   262 two parsers, say again, $p$ and $q$, we want to apply first the input
       
   263 to $p$ producing a set of pairs; then apply $q$ to all the unparsed
       
   264 parts in the pairs; and then combine the results. Mathematically we would
       
   265 write something like this for the expected set of pairs:
       
   266 
       
   267 \begin{center}
       
   268 \begin{tabular}{lcl}
       
   269 $\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\,|\,$ & 
       
   270 $(\textit{output}_1, u_1) \in p(\text{input}) 
       
   271 \;\wedge\;$\\
       
   272 && $(\textit{output}_2, u_2) \in q(u_1)\}$
       
   273 \end{tabular}
       
   274 \end{center}
       
   275 
       
   276 \noindent Notice that the $p$ wil first be run on the input, producing
       
   277 pairs of the form $(\textit{output}_1, u_1)$ where the $u_1$ stands
       
   278 for the unprocessed, or left-over, parts. We want that $q$ runs on all
       
   279 these unprocessed parts $u_1$. This again will produce some
       
   280 processed part , $p$ and
       
   281 $q$, we apply both parsers to the input (remember parsers are
       
   282 functions) and combine the output (remember they are sets of pairs)
       
   283  
       
   284 \begin{center}
       
   285 $p(\text{input}) \cup q(\text{input})$
       
   286 \end{center}
       
   287 
       
   288 \noindent In Scala we would implement alternative parser
       
   289 combinator as follows
       
   290 
       
   291 \begin{center}
       
   292 \begin{lstlisting}[language=Scala, numbers=none]
       
   293 class AltParser[I, T]
       
   294        (p: => Parser[I, T], 
       
   295         q: => Parser[I, T]) extends Parser[I, T] {
       
   296   def parse(in: I) = p.parse(in) ++ q.parse(in)
   202 }
   297 }
   203 \end{lstlisting}
   298 \end{lstlisting}
   204 \end{center}
   299 \end{center}
   205 
   300 
   206 \noindent The types of this parser combinator are again generic (we
   301 \noindent The types of this parser combinator are again generic (we
   260 \;\wedge\;$\\
   355 \;\wedge\;$\\
   261 && $(\textit{output}_2, u_2) \in q(u_1)\}$
   356 && $(\textit{output}_2, u_2) \in q(u_1)\}$
   262 \end{tabular}
   357 \end{tabular}
   263 \end{center}
   358 \end{center}
   264 
   359 
   265 \noindent Notice that the $p$ wil first be run on the input, producing
   360 \noindent Notice that the $p$ will first be run on the input,
   266 pairs of the form $(\textit{output}_1, u_1)$ where the $u_1$ stands
   361 producing pairs of the form $\textit{output}_1$ and unprocessed part
   267 for the unprocessed, or left-over, parts. We want that $q$ runs on all
   362 $u_1$. This unprocessed part is fed into the second parser $q$. The
   268 these unprocessed parts $u_1$. This again will produce some
       
   269 processed part , $p$ and
       
   270 $q$, we apply both parsers to the input (remember parsers are
       
   271 functions) and combine the output (remember they are sets of pairs)
       
   272  
       
   273 \begin{center}
       
   274 $p(\text{input}) \cup q(\text{input})$
       
   275 \end{center}
       
   276 
       
   277 \noindent In Scala we would implement alternative parser
       
   278 combinator as follows
       
   279 
       
   280 \begin{center}
       
   281 \begin{lstlisting}[language=Scala, numbers=none]
       
   282 class AltParser[I, T]
       
   283        (p: => Parser[I, T], 
       
   284         q: => Parser[I, T]) extends Parser[I, T] {
       
   285   def parse(sb: I) = p.parse(sb) ++ q.parse(sb)
       
   286 }
       
   287 \end{lstlisting}
       
   288 \end{center}
       
   289 
       
   290 \noindent The types of this parser combinator are again generic (we
       
   291 just have \texttt{I} for the input type, and \texttt{T} for the output
       
   292 type). The alternative parser builds a new parser out of two existing
       
   293 parsers \texttt{p} and \texttt{q}.  Both need to be able to process
       
   294 input of type \texttt{I} and return the same output type
       
   295 \texttt{Set[(T, I)]}.\footnote{There is an interesting detail of
       
   296   Scala, namely the \texttt{=>} in front of the types of \texttt{p}
       
   297   and \texttt{q}. They will prevent the evaluation of the arguments
       
   298   before they are used. This is often called \emph{lazy evaluation} of
       
   299   the arguments. We will explain this later.}  Therefore the output
       
   300 type of this parser is \texttt{T}. The alternative parser should run
       
   301 the input with the first parser \texttt{p} (producing a set of pairs)
       
   302 and then run the same input with \texttt{q} (producing another set of
       
   303 pairs).  The result should be then just the union of both sets, which
       
   304 is the operation \texttt{++} in Scala.
       
   305 
       
   306 The alternative parser combinator already allows us to
       
   307 construct a parser that parses either a character \texttt{a}
       
   308 or \texttt{b}, as
       
   309 
       
   310 \begin{center}
       
   311 \begin{lstlisting}[language=Scala, numbers=none]
       
   312 new AltParser(CharParser('a'), CharParser('b'))
       
   313 \end{lstlisting}
       
   314 \end{center}
       
   315 
       
   316 \noindent Later on we will use again Scala mechanism for introducing some
       
   317 more readable shorthand notation for this, like \texttt{"a" || "b"}. Let us look in detail at what this parser combinator produces with some sample strings 
       
   318 
       
   319 \begin{center}
       
   320 \begin{tabular}{rcl}
       
   321 input strings & & output\medskip\\
       
   322 \texttt{\Grid{acde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{cde}})\right\}$\\
       
   323 \texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{cde}})\right\}$\\
       
   324 \texttt{\Grid{ccde}} & $\rightarrow$ & $\{\}$
       
   325 \end{tabular}
       
   326 \end{center}
       
   327 
       
   328 \noindent We receive in the first two cases a successful
       
   329 output (that is a non-empty set). In each case, either
       
   330 \pcode{a} or \pcode{b} is in the processed part, and
       
   331 \pcode{cde} in the unprocessed part. Clearly this parser cannot
       
   332 parse anything in the string \pcode{ccde}, therefore the empty
       
   333 set.
       
   334 
       
   335 A bit more interesting is the \emph{sequence parser combinator}. Given
       
   336 two parsers, say again, $p$ and $q$, we want to apply first the input
       
   337 to $p$ producing a set of pairs; then apply $q$ to all the unparsed
       
   338 parts in the pairs; and then combine the results like
       
   339 
       
   340 \begin{center}
       
   341 \begin{tabular}{lcl}
       
   342 $\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\;|\;$ & 
       
   343 $(\textit{output}_1, u_1) \in p(\text{input}) 
       
   344 \;\wedge\;$\\
       
   345 && $(\textit{output}_2, u_2) \in q(u_1)\}$
       
   346 \end{tabular}
       
   347 \end{center}
       
   348 
       
   349 \noindent Notice that the $p$ wil first be run on the input, producing
       
   350 pairs of the form $\textit{output}_1$ and unprocessed part $u_1$. The
       
   351 overall result of the sequence parser combinator is pairs of the form
   363 overall result of the sequence parser combinator is pairs of the form
   352 $((\textit{output}_1, \textit{output}_2), u_2)$. This means the
   364 $((\textit{output}_1, \textit{output}_2), u_2)$. This means the
   353 unprocessed parts of both parsers is the unprocessed parts the second
   365 unprocessed part of both parsers is the unprocessed part the second
   354 parser $q$ produces as left-over. The processed parts of both parsers
   366 parser $q$ produces leaves as left-over. The processed parts of both
   355 is just the pair of the outputs
   367 parsers is a pair consisting of the outputs of $p$ and $q$, namely
   356 $(\textit{output}_1, \textit{output}_2)$. This behaviour can be
   368 $(\textit{output}_1, \textit{output}_2)$. This behaviour can be
   357 implemented in Scala as follows:
   369 implemented in Scala as follows:
   358 
   370 
   359 \begin{center}
   371 \begin{center}
   360 \begin{lstlisting}[language=Scala,numbers=none]
   372 \begin{lstlisting}[language=Scala,numbers=none]
   361 class SeqParser[I, T, S]
   373 class SeqParser[I, T, S]
   362        (p: => Parser[I, T], 
   374        (p: => Parser[I, T], 
   363         q: => Parser[I, S]) extends Parser[I, (T, S)] {
   375         q: => Parser[I, S]) extends Parser[I, (T, S)] {
   364   def parse(sb: I) = 
   376   def parse(in: I) = 
   365     for ((output1, u1) <- p.parse(sb); 
   377     for ((output1, u1) <- p.parse(in); 
   366          (output2, u2) <- q.parse(u1)) 
   378          (output2, u2) <- q.parse(u1)) 
   367             yield ((output1, output2), u2)
   379             yield ((output1, output2), u2)
   368 }
   380 }
   369 \end{lstlisting}
   381 \end{lstlisting}
   370 \end{center}
   382 \end{center}
   371 
   383 
   372 \noindent This parser takes again as input two parsers, \texttt{p}
   384 \noindent This parser takes again as arguments two parsers, \texttt{p}
   373 and \texttt{q}. It implements \texttt{parse} as follows: let
   385 and \texttt{q}. It implements \texttt{parse} as follows: let first run
   374 first run the parser \texttt{p} on the input producing a set
   386 the parser \texttt{p} on the input producing a set of pairs
   375 of pairs (\texttt{output1}, \texttt{u1}). The \texttt{u1}
   387 (\texttt{output1}, \texttt{u1}). The \texttt{u1} stands for the
   376 stands for the unprocessed parts left over by \texttt{p}. Let
   388 unprocessed parts left over by \texttt{p}. Let then \texttt{q} run on these
   377 \texttt{q} run on these unprocessed parts producing again a
   389 unprocessed parts producing again a set of pairs. The output of the
   378 set of pairs. The output of the sequence parser combinator is
   390 sequence parser combinator is then a set containing pairs where the
   379 then a set containing pairs where the first components are
   391 first components are again pairs, namely what the first parser could
   380 again pairs, namely what the first parser could parse together
   392 parse together with what the second parser could parse; the second
   381 with what the second parser could parse; the second component
   393 component is the unprocessed part left over after running the second
   382 is the unprocessed part left over after running the second
   394 parser \texttt{q}. Therefore the input type of the sequence parser
   383 parser \texttt{q}. Therefore the input type of the sequence
   395 combinator is as usual \texttt{I}, but the output type is
   384 parser combinator is as usual \texttt{I}, but the output type
       
   385 is
       
   386 
   396 
   387 \begin{center}
   397 \begin{center}
   388 \texttt{Set[((T, S), I)]}
   398 \texttt{Set[((T, S), I)]}
   389 \end{center}
   399 \end{center}
   390 
   400 
   394 Notice that we have now two output types for the sequence parser
   404 Notice that we have now two output types for the sequence parser
   395 combinator, because in general \textit{p} and \textit{q} might produce
   405 combinator, because in general \textit{p} and \textit{q} might produce
   396 different things (for example first we recognise a number and then a
   406 different things (for example first we recognise a number and then a
   397 string corresponding to an operator).
   407 string corresponding to an operator).
   398 
   408 
   399 
   409 With the shorthand notation we shall introduce later for the sequence
   400 We have not yet looked at this in detail, but Scala allows us to
   410 parser combinator, we can write for example \pcode{"a" ~ "b"}, which
   401 provide some shorthand notation for the sequence parser combinator. We
   411 is the parser combinator that first recognises the character
   402 can write for example \pcode{"a" ~ "b"}, which is the parser
   412 \texttt{a} from a string and then \texttt{b}. Let us look again at
   403 combinator that first recognises the character \texttt{a} from a
   413 three examples of how this parser combinator processes some strings:
   404 string and then \texttt{b}. Let us look again at three examples of
       
   405 how this parser combinator processes strings:
       
   406 
   414 
   407 \begin{center}
   415 \begin{center}
   408 \begin{tabular}{rcl}
   416 \begin{tabular}{rcl}
   409 input strings & & output\medskip\\
   417 input strings & & output\medskip\\
   410 \texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{cde}})\right\}$\\
   418 \texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{cde}})\right\}$\\
   412 \texttt{\Grid{cccde}} & $\rightarrow$ & $\{\}$
   420 \texttt{\Grid{cccde}} & $\rightarrow$ & $\{\}$
   413 \end{tabular}
   421 \end{tabular}
   414 \end{center}
   422 \end{center}
   415 
   423 
   416 \noindent In the first line we have a successful parse, because the
   424 \noindent In the first line we have a successful parse, because the
   417 string started with \texttt{ab}, which is the prefix we are looking
   425 string starts with \texttt{ab}, which is the prefix we are looking
   418 for. But since the parsing combinator is constructed as sequence of
   426 for. But since the parsing combinator is constructed as sequence of
   419 the two simple (atomic) parsers for \texttt{a} and \texttt{b}, the
   427 the two simple (atomic) parsers for \texttt{a} and \texttt{b}, the
   420 result is a nested pair of the form \texttt{((a, b), cde)}. It is
   428 result is a nested pair of the form \texttt{((a, b), cde)}. It is
   421 \emph{not} a simple pair \texttt{(ab, cde)} as one might erroneously
   429 \emph{not} a simple pair \texttt{(ab, cde)} as one might erroneously
   422 expects.  The parser returns the empty set in the other examples,
   430 expect.  The parser returns the empty set in the other examples,
   423 because they do not fit with what the parser is supposed to parse.
   431 because they do not fit with what the parser is supposed to parse.
   424 
   432 
   425 
   433 
   426 A slightly more complicated parser is \pcode{("a" || "b") ~ "c"}
   434 A slightly more complicated parser is \pcode{("a" || "b") ~ "c"} which
   427 which parses as first character either an \texttt{a} or \texttt{b}
   435 parses as first character either an \texttt{a} or \texttt{b}, followed
   428 followed by a \texttt{c}. This parser produces the following outputs.
   436 by a \texttt{c}. This parser produces the following outputs.
   429 
   437 
   430 \begin{center}
   438 \begin{center}
   431 \begin{tabular}{rcl}
   439 \begin{tabular}{rcl}
   432 input strings & & output\medskip\\
   440 input strings & & output\medskip\\
   433 \texttt{\Grid{acde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{c}}), \texttt{\Grid{de}})\right\}$\\
   441 \texttt{\Grid{acde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{c}}), \texttt{\Grid{de}})\right\}$\\
   463 \texttt{\Grid{aaaa}} & $\rightarrow$ & 
   471 \texttt{\Grid{aaaa}} & $\rightarrow$ & 
   464 $\left\{(((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}})\right\}$\\
   472 $\left\{(((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}})\right\}$\\
   465 \end{tabular}
   473 \end{tabular}
   466 \end{center}
   474 \end{center}
   467 
   475 
   468 \noindent Notice how the results nests deeper and deeper as pairs (the
   476 \noindent Notice how again the results nest deeper and deeper as pairs (the
   469 last \pcode{a} is in the unprocessed part). To consume everything of
   477 last \pcode{a} is in the unprocessed part). To consume everything of
   470 this string we can use the parser \pcode{(("a" ~ "a") ~ "a") ~
   478 this string we can use the parser \pcode{(("a" ~ "a") ~ "a") ~
   471   "a"}. Then the output is as follows:
   479   "a"}. Then the output is as follows:
   472 
   480 
   473 \begin{center}
   481 \begin{center}
   478 \end{tabular}
   486 \end{tabular}
   479 \end{center}
   487 \end{center}
   480 
   488 
   481 \noindent This is an instance where the parser consumed
   489 \noindent This is an instance where the parser consumed
   482 completely the input, meaning the unprocessed part is just the
   490 completely the input, meaning the unprocessed part is just the
   483 empty string. So if we called \pcode{parse_all} instead of \pcode{parse}
   491 empty string. So if we called \pcode{parse_all}, instead of \pcode{parse},
   484 we would get back the result
   492 we would get back the result
   485 
   493 
   486 \[
   494 \[
   487 \left\{(((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}})\right\}
   495 \left\{(((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}})\right\}
   488 \]
   496 \]
   489 
   497 
   490 \noindent where the unprocessed (empty) parts have been stripped away
   498 \noindent where the unprocessed (empty) parts have been stripped away
   491 from the pairs; everything where the second part was not empty has
   499 from the pairs; everything where the second part was not empty has
   492 been thrown away as ultimately-unsuccessful-parses.
   500 been thrown away as well, because they represent
   493 
   501 ultimately-unsuccessful-parses.
   494 
   502 
   495 Note carefully that constructing a parser such \pcode{'a' ||
   503 
   496 ('a' ~ 'b')} will result in a typing error. The first
   504 Note carefully that constructing a parser such \pcode{"a" || ("a" ~
   497 parser has as output type a single character (recall the type
   505   "b")} will result in a typing error. The intention is that we want
   498 of \texttt{CharParser}), but the second parser produces a pair
   506 to parse an \texttt{a}, or an \texttt{a} followed by a
   499 of characters as output. The alternative parser is however
   507 \texttt{b}. However, the first parser has as output type a single
   500 required to have both component parsers to have the same type.
   508 character (recall the type of \texttt{CharParser}), but the second
   501 We will see later how we can build this parser without the
   509 parser produces a pair of characters as output. The alternative parser
   502 typing error.
   510 is required to have both component parsers to have the same type---we
   503 
   511 need to be able to build the union of two sets, which means in Scala
   504 The next parser combinator does not actually combine smaller
   512 they need to be of the same type.  We will see later how we can build
   505 parsers, but applies a function to the result of a parser.
   513 this parser without the typing error.
   506 It is implemented in Scala as follows
   514 
       
   515 The next parser combinator, called \emph{semantic action}, does not
       
   516 actually combine smaller parsers, but applies a function to the result
       
   517 of a parser.  It is implemented in Scala as follows
   507 
   518 
   508 \begin{center}
   519 \begin{center}
   509 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
   520 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
   510 class FunParser[I, T, S]
   521 class FunParser[I, T, S]
   511          (p: => Parser[I, T], 
   522          (p: => Parser[I, T], 
   512           f: T => S) extends Parser[I, S] {
   523           f: T => S) extends Parser[I, S] {
   513   def parse(sb: I) = 
   524   def parse(in: I) = 
   514     for ((head, tail) <- p.parse(sb)) yield (f(head), tail)
   525     for ((head, tail) <- p.parse(in)) yield (f(head), tail)
   515 }
   526 }
   516 \end{lstlisting}
   527 \end{lstlisting}
   517 \end{center}
   528 \end{center}
   518 
   529 
   519 
   530 
   520 \noindent This parser combinator takes a parser \texttt{p}
   531 \noindent This parser combinator takes a parser \texttt{p}
   521 with output type \texttt{T} as one argument as well as a
   532 with output type \texttt{T} as one argument as well as a
   522 function \texttt{f} with type \texttt{T => S}. The parser
   533 function \texttt{f} with type \texttt{T => S}. The parser
   523 \texttt{p} produces sets of type \texttt{(T, I)}. The
   534 \texttt{p} produces sets of type \texttt{(T, I)}. The
   524 \texttt{FunParser} combinator then applies the function
   535 semantic action combinastor then applies the function
   525 \texttt{f} to all the parser outputs. Since this function is of
   536 \texttt{f} to all the parser outputs. Since this function is of
   526 type \texttt{T => S}, we obtain a parser with output type
   537 type \texttt{T => S}, we obtain a parser with output type
   527 \texttt{S}. Again Scala lets us introduce some shorthand
   538 \texttt{S}. Again Scala lets us introduce some shorthand
   528 notation for this parser combinator. Therefore we will write
   539 notation for this parser combinator. Therefore we will write
   529 \texttt{p ==> f} for it.
   540 \texttt{p ==> f} for it.
       
   541 
       
   542 What are semantic actions good for? Ultimately, they allow is to
       
   543 transform the parsed input into a datastructure we can use for further
       
   544 processing. A simple example would be to transform parsed characters
       
   545 into ASCII numbers. Suppose we define a function \texttt{f} (from
       
   546 characters to ints) and a \texttt{CharParser}.
       
   547 
       
   548 \begin{center}
       
   549 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
       
   550 val f = (c: Char) => c.toInt
       
   551 val c = new CharParser('c')
       
   552 \end{lstlisting}
       
   553 \end{center}
       
   554 
       
   555 \noindent
       
   556 Then
       
   557 
       
   558 \begin{center}
       
   559 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
       
   560 c.parse("cbd")
       
   561 (c ==> f).parse("cbd")
       
   562 \end{lstlisting}
       
   563 \end{center}
       
   564 
       
   565 \noindent
       
   566 the first line produces \texttt{Set(('c', "bd"))}, whereas the second
       
   567 produces \texttt{Set((99, "bd"))}.
       
   568 
       
   569 \subsubsection*{Shorthand notation for parser combinators}
       
   570 
       
   571 Before we proceed, let us just explain the shorthand notation for
       
   572 parser combinators. Like for regular expressions, the shorthand notation
       
   573 will make our life much easier when writing actual parsers.
   530 
   574 
   531 \subsubsection*{How to build parsers using parser combinators?}
   575 \subsubsection*{How to build parsers using parser combinators?}
   532 
   576 
   533 \subsubsection*{Implementing an Interpreter}
   577 \subsubsection*{Implementing an Interpreter}
   534 
   578