handouts/ho06.tex
author Christian Urban <urbanc@in.tum.de>
Wed, 24 Oct 2018 20:37:37 +0100
changeset 587 5ddedcd92d84
parent 586 451a95e1bc25
child 588 a4646557016d
permissions -rw-r--r--
updated


\documentclass{article}
\usepackage{../style}
\usepackage{../langs}


\begin{document}

\section*{Handout 6 (Parser Combinators)}

This handout explains how \emph{parser combinators} work and how they
can be implemented in Scala. Their most distinguishing feature is that
they are very easy to implement (admittedly it is only easy in a
functional programming language).  Another good point of parser
combinators is that they can deal with any kind of input as long as
this input is of ``sequence-kind'', for example a string or a list of
tokens. The only two properties of the input we need is to be able to
test when it is empty and ``sequentially'' take it apart. Strings and
lists fit this bill. However, parser combinators also have their
drawbacks. For example they require that the grammar to be parsed is
\emph{not} left-recursive and they are efficient only when the grammar
is unambiguous. It is the responsibility of the grammar designer to
ensure these two properties.

The general idea behind parser combinators is to transform the input
into sets of pairs, like so

\begin{center}
$\underbrace{\text{list of tokens}}_{\text{input}}$ 
$\Rightarrow$
$\underbrace{\text{set of (parsed input, unparsed input)}}_{\text{output}}$
\end{center} 

\noindent
Given the extended effort we have spent in order to implement a lexer
in order to generate list of tokens, it might be surprising that in
what follows we shall often use strings as input. This is for making
the explanation more lucid. It does not make our previous work on
lexers obsolete (remember they transform a string into a list of
tokens). Lexers will still be needed for building a somewhat realistic
compiler.

But as said, parser combinators are relatively agnostic about what
kind of input they process. In my Scala code I use the following
polymorphic types for parser combinators:

\begin{center}
input:\;\; \texttt{I}  \qquad output:\;\; \texttt{T}
\end{center}

\noindent That is they take as input something of type \texttt{I} and
return a set of pairs of type \texttt{Set[(T, I)]}. Since the input needs
to be of ``sequence-kind'', I actually have to often write \texttt{I
  <\% Seq[\_]} for the input type. This ensures the input is a
subtype of Scala sequences. The first component of the generated pairs
corresponds to what the parser combinator was able to process from the
input and the second is the unprocessed part of the input (therefore
the type of this unprocessed part is the same as the input). As we
shall see shortly, a parser combinator might return more than one such
pair; the idea is that there are potentially several ways of how to
parse the input.  As a concrete example, consider the string

\begin{center}
\tt\Grid{iffoo\VS testbar}
\end{center}

\noindent We might have a parser combinator which tries to
interpret this string as a keyword (\texttt{if}) or as an
identifier (\texttt{iffoo}). Then the output will be the set

\begin{center}
$\left\{ \left(\texttt{\Grid{if}}\;,\; \texttt{\Grid{foo\VS testbar}}\right), 
           \left(\texttt{\Grid{iffoo}}\;,\; \texttt{\Grid{\VS testbar}}\right) \right\}$
\end{center}

\noindent where the first pair means the parser could recognise
\texttt{if} from the input and leaves the rest as `unprocessed' as the
second component of the pair; in the other case it could recognise
\texttt{iffoo} and leaves \texttt{\VS testbar} as unprocessed. If the
parser cannot recognise anything from the input at all, then parser
combinators just return the empty set $\{\}$. This will indicate
something ``went wrong''\ldots or more precisely, nothing could be
parsed.

Also important to note is that the type \texttt{T} for the processed
part is different from the input type. In the example above is just
happens to be the same. The reason for the difference is that in
general we are interested in transforming our input into something
``different''\ldots for example into a tree; or if we implement the
grammar for arithmetic expressions, we might be interested in the
actual integer number the arithmetic expression, say \texttt{1 + 2 *
  3}, stands for. In this way we can use parser combinators to
implement relatively easily a calculator, for instance.

The main idea of parser combinators is that we can easily build parser
combinators out of smaller components following very closely the
structure of a grammar. In order to implement this in an
object-oriented programming language, like Scala, we need to specify
an abstract class for parser combinators. This abstract class states
that the function \texttt{parse} takes an argument of type \texttt{I}
and returns a set of type \mbox{\texttt{Set[(T, I)]}}.

\begin{center}
\begin{lstlisting}[language=Scala]
abstract class Parser[I, T] {
  def parse(ts: I): Set[(T, I)]

  def parse_all(ts: I): Set[T] =
    for ((head, tail) <- parse(ts); if (tail.isEmpty)) 
      yield head
}
\end{lstlisting}
\end{center}

\noindent It is the obligation in each instance (parser combinator) to
supply an implementation for \texttt{parse}.  From this function we
can then ``centrally'' derive the function \texttt{parse\_all}, which
just filters out all pairs whose second component is not empty (that
is has still some unprocessed part). The reason is that at the end of
the parsing we are only interested in the results where all the input
has been consumed and no unprocessed part is left over.

One of the simplest parser combinators recognises just a
single character, say $c$, from the beginning of strings. Its
behaviour can be described as follows:

\begin{itemize}
\item If the head of the input string starts with a $c$, then return 
  the set
  \[\{(c, \textit{tail of}\; s)\}\]
  where \textit{tail of} 
  $s$ is the unprocessed part of the input string.
\item Otherwise return the empty set $\{\}$.	
\end{itemize}

\noindent
The input type of this simple parser combinator for characters is
\texttt{String} and the output type is \texttt{Char}. This means
\texttt{parse} returns  \mbox{\texttt{Set[(Char, String)]}}. 
The code in Scala is as follows:

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
case class CharParser(c: Char) extends Parser[String, Char] {
  def parse(in: String) = 
    if (in.head == c) Set((c, in.tail)) else Set()
}
\end{lstlisting}
\end{center}

\noindent You can see the \texttt{parse} tests whether the
first character of the input string \texttt{in} is equal to
\texttt{c}. If yes, then it splits the string into the recognised part
\texttt{c} and the unprocessed part \texttt{in.tail}. In case
\texttt{in} does not start with \texttt{c} then the parser returns the
empty set (in Scala \texttt{Set()}). Since this parser recognises
characters and just returns characters as the processed part, the
output type of the parser is \texttt{Char}.

If we want to parse a list of tokens and interested in recognising a
number token, we could write something like this

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily,numbers=none]
case object NumParser extends Parser[List[Token], Int] {
  def parse(ts: List[Token]) = ts match {
    case Num_token(s)::ts => Set((s.toInt, ts)) 
    case _ => Set ()
  }
}
\end{lstlisting}
\end{center}

\noindent
In this parser the input is of type \texttt{List[Token]}. The function
parse looks at the input \texttt{ts} and checks whether the first
token is a \texttt{Num\_token}. Let us assume our lexer generated
these tokens for numbers. But this parser does not just return this
token (and the rest of the list), like the \texttt{CharParser} above,
rather extracts the string \texttt{s} from the token and converts it
into an integer. The hope is that the lexer did its work well and this
conversion always succeeds. The consequence of this is that the output
type for this parser is \texttt{Int}, not \texttt{Token}. Such a
conversion would be needed if we want to implement a simple calculator
program.


These simple parsers that just look at the input and do a simple
transformation are often called \emph{atomic} parser combinators.
More interesting are the parser combinators that build larger parsers
out of smaller component parsers. There are three such parser
combinators that can be implemented generically. The \emph{alternative
  parser combinator} is as follows: given two parsers, say, $p$ and
$q$, we apply both parsers to the input (remember parsers are
functions) and combine the output (remember they are sets of pairs):
 
\begin{center}
$p(\text{input}) \cup q(\text{input})$
\end{center}

\noindent In Scala we can implement alternative parser
combinator as follows

\begin{center}
\begin{lstlisting}[language=Scala, numbers=none]
class AltParser[I, T]
       (p: => Parser[I, T], 
        q: => Parser[I, T]) extends Parser[I, T] {
  def parse(in: I) = p.parse(in) ++ q.parse(in)
}
\end{lstlisting}
\end{center}

\noindent The types of this parser combinator are again generic (we
have \texttt{I} for the input type, and \texttt{T} for the output
type). The alternative parser builds a new parser out of two existing
parsers \texttt{p} and \texttt{q} given as arguments.  Both parsers
need to be able to process input of type \texttt{I} and return in
\texttt{parse} the same output type \texttt{Set[(T,
  I)]}.\footnote{There is an interesting detail of Scala, namely the
  \texttt{=>} in front of the types of \texttt{p} and \texttt{q}. They
  will prevent the evaluation of the arguments before they are
  used. This is often called \emph{lazy evaluation} of the
  arguments. We will explain this later.}  The alternative parser
should run the input with the first parser \texttt{p} (producing a set
of pairs) and then run the same input with \texttt{q} (producing
another set of pairs).  The result should be then just the union of
both sets, which is the operation \texttt{++} in Scala.

The alternative parser combinator allows us to construct a parser that
parses either a character \texttt{a} or \texttt{b} using the
\texttt{CharParser} shown above. For this we can write

\begin{center}
\begin{lstlisting}[language=Scala, numbers=none]
new AltParser(CharParser('a'), CharParser('b'))
\end{lstlisting}
\end{center}

\noindent Later on we will use Scala mechanism for introducing some
more readable shorthand notation for this, like \texttt{"a" ||
  "b"}. Let us look in detail at what this parser combinator produces
with some sample strings

\begin{center}
\begin{tabular}{rcl}
input strings & & output\medskip\\
\texttt{\Grid{acde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{cde}})\right\}$\\
\texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{cde}})\right\}$\\
\texttt{\Grid{ccde}} & $\rightarrow$ & $\{\}$
\end{tabular}
\end{center}

\noindent We receive in the first two cases a successful
output (that is a non-empty set). In each case, either
\pcode{a} or \pcode{b} is in the processed part, and
\pcode{cde} in the unprocessed part. Clearly this parser cannot
parse anything with \pcode{ccde}, therefore the empty
set is returned.

A bit more interesting is the \emph{sequence parser combinator}. Given
two parsers, say again, $p$ and $q$, we want to apply first the input
to $p$ producing a set of pairs; then apply $q$ to all the unparsed
parts in the pairs; and then combine the results. Mathematically we would
write something like this for the expected set of pairs:

\begin{center}
\begin{tabular}{lcl}
$\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\,|\,$ & 
$(\textit{output}_1, u_1) \in p(\text{input}) 
\;\wedge\;$\\
&& $(\textit{output}_2, u_2) \in q(u_1)\}$
\end{tabular}
\end{center}

\noindent Notice that the $p$ wil first be run on the input, producing
pairs of the form $(\textit{output}_1, u_1)$ where the $u_1$ stands
for the unprocessed, or left-over, parts. We want that $q$ runs on all
these unprocessed parts $u_1$. This again will produce some
processed part , $p$ and
$q$, we apply both parsers to the input (remember parsers are
functions) and combine the output (remember they are sets of pairs)
 
\begin{center}
$p(\text{input}) \cup q(\text{input})$
\end{center}

\noindent In Scala we would implement alternative parser
combinator as follows

\begin{center}
\begin{lstlisting}[language=Scala, numbers=none]
class AltParser[I, T]
       (p: => Parser[I, T], 
        q: => Parser[I, T]) extends Parser[I, T] {
  def parse(in: I) = p.parse(in) ++ q.parse(in)
}
\end{lstlisting}
\end{center}

\noindent The types of this parser combinator are again generic (we
just have \texttt{I} for the input type, and \texttt{T} for the output
type). The alternative parser builds a new parser out of two existing
parsers \texttt{p} and \texttt{q}.  Both need to be able to process
input of type \texttt{I} and return the same output type
\texttt{Set[(T, I)]}.\footnote{There is an interesting detail of
  Scala, namely the \texttt{=>} in front of the types of \texttt{p}
  and \texttt{q}. They will prevent the evaluation of the arguments
  before they are used. This is often called \emph{lazy evaluation} of
  the arguments. We will explain this later.}  Therefore the output
type of this parser is \texttt{T}. The alternative parser should run
the input with the first parser \texttt{p} (producing a set of pairs)
and then run the same input with \texttt{q} (producing another set of
pairs).  The result should be then just the union of both sets, which
is the operation \texttt{++} in Scala.

The alternative parser combinator already allows us to
construct a parser that parses either a character \texttt{a}
or \texttt{b}, as

\begin{center}
\begin{lstlisting}[language=Scala, numbers=none]
new AltParser(CharParser('a'), CharParser('b'))
\end{lstlisting}
\end{center}

\noindent Later on we will use again Scala mechanism for introducing some
more readable shorthand notation for this, like \texttt{"a" || "b"}. Let us look in detail at what this parser combinator produces with some sample strings 

\begin{center}
\begin{tabular}{rcl}
input strings & & output\medskip\\
\texttt{\Grid{acde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{cde}})\right\}$\\
\texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{cde}})\right\}$\\
\texttt{\Grid{ccde}} & $\rightarrow$ & $\{\}$
\end{tabular}
\end{center}

\noindent We receive in the first two cases a successful
output (that is a non-empty set). In each case, either
\pcode{a} or \pcode{b} is in the processed part, and
\pcode{cde} in the unprocessed part. Clearly this parser cannot
parse anything in the string \pcode{ccde}, therefore the empty
set.

A bit more interesting is the \emph{sequence parser combinator}. Given
two parsers, say again, $p$ and $q$, we want to apply first the input
to $p$ producing a set of pairs; then apply $q$ to all the unparsed
parts in the pairs; and then combine the results like

\begin{center}
\begin{tabular}{lcl}
$\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\;|\;$ & 
$(\textit{output}_1, u_1) \in p(\text{input}) 
\;\wedge\;$\\
&& $(\textit{output}_2, u_2) \in q(u_1)\}$
\end{tabular}
\end{center}

\noindent Notice that the $p$ will first be run on the input,
producing pairs of the form $\textit{output}_1$ and unprocessed part
$u_1$. This unprocessed part is fed into the second parser $q$. The
overall result of the sequence parser combinator is pairs of the form
$((\textit{output}_1, \textit{output}_2), u_2)$. This means the
unprocessed part of both parsers is the unprocessed part the second
parser $q$ produces leaves as left-over. The processed parts of both
parsers is a pair consisting of the outputs of $p$ and $q$, namely
$(\textit{output}_1, \textit{output}_2)$. This behaviour can be
implemented in Scala as follows:

\begin{center}
\begin{lstlisting}[language=Scala,numbers=none]
class SeqParser[I, T, S]
       (p: => Parser[I, T], 
        q: => Parser[I, S]) extends Parser[I, (T, S)] {
  def parse(in: I) = 
    for ((output1, u1) <- p.parse(in); 
         (output2, u2) <- q.parse(u1)) 
            yield ((output1, output2), u2)
}
\end{lstlisting}
\end{center}

\noindent This parser takes again as arguments two parsers, \texttt{p}
and \texttt{q}. It implements \texttt{parse} as follows: let first run
the parser \texttt{p} on the input producing a set of pairs
(\texttt{output1}, \texttt{u1}). The \texttt{u1} stands for the
unprocessed parts left over by \texttt{p}. Let then \texttt{q} run on these
unprocessed parts producing again a set of pairs. The output of the
sequence parser combinator is then a set containing pairs where the
first components are again pairs, namely what the first parser could
parse together with what the second parser could parse; the second
component is the unprocessed part left over after running the second
parser \texttt{q}. Therefore the input type of the sequence parser
combinator is as usual \texttt{I}, but the output type is

\begin{center}
\texttt{Set[((T, S), I)]}
\end{center}

\noindent
If any of the runs of \textit{p} and \textit{q} fail, that is produce
the empty set, then \texttt{parse} will also produce the empty set.
Notice that we have now two output types for the sequence parser
combinator, because in general \textit{p} and \textit{q} might produce
different things (for example first we recognise a number and then a
string corresponding to an operator).

With the shorthand notation we shall introduce later for the sequence
parser combinator, we can write for example \pcode{"a" ~ "b"}, which
is the parser combinator that first recognises the character
\texttt{a} from a string and then \texttt{b}. Let us look again at
three examples of how this parser combinator processes some strings:

\begin{center}
\begin{tabular}{rcl}
input strings & & output\medskip\\
\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{cde}})\right\}$\\
\texttt{\Grid{bacde}} & $\rightarrow$ & $\{\}$\\
\texttt{\Grid{cccde}} & $\rightarrow$ & $\{\}$
\end{tabular}
\end{center}

\noindent In the first line we have a successful parse, because the
string starts with \texttt{ab}, which is the prefix we are looking
for. But since the parsing combinator is constructed as sequence of
the two simple (atomic) parsers for \texttt{a} and \texttt{b}, the
result is a nested pair of the form \texttt{((a, b), cde)}. It is
\emph{not} a simple pair \texttt{(ab, cde)} as one might erroneously
expect.  The parser returns the empty set in the other examples,
because they do not fit with what the parser is supposed to parse.


A slightly more complicated parser is \pcode{("a" || "b") ~ "c"} which
parses as first character either an \texttt{a} or \texttt{b}, followed
by a \texttt{c}. This parser produces the following outputs.

\begin{center}
\begin{tabular}{rcl}
input strings & & output\medskip\\
\texttt{\Grid{acde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{c}}), \texttt{\Grid{de}})\right\}$\\
\texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{c}}), \texttt{\Grid{de}})\right\}$\\
\texttt{\Grid{abde}} & $\rightarrow$ & $\{\}$
\end{tabular}
\end{center}

\noindent
Now consider the parser \pcode{("a" ~ "b") ~ "c"} which parses
\texttt{a}, \texttt{b}, \texttt{c} in sequence. This parser produces
the following outputs.

\begin{center}
\begin{tabular}{rcl}
input strings & & output\medskip\\
\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{(((\texttt{\Grid{a}},\texttt{\Grid{b}}), \texttt{\Grid{c}}), \texttt{\Grid{de}})\right\}$\\
\texttt{\Grid{abde}} & $\rightarrow$ & $\{\}$\\
\texttt{\Grid{bcde}} & $\rightarrow$ & $\{\}$
\end{tabular}
\end{center}


\noindent The second and third example fail, because something is
``missing'' in the sequence we are looking for. Also notice how the
results nest with sequences: the parsed part is a nested pair of the
form \pcode{((a, b), c)}. Two more examples: first consider the parser
\pcode{("a" ~ "a") ~ "a"} and the input \pcode{aaaa}:

\begin{center}
\begin{tabular}{rcl}
input string & & output\medskip\\
\texttt{\Grid{aaaa}} & $\rightarrow$ & 
$\left\{(((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}})\right\}$\\
\end{tabular}
\end{center}

\noindent Notice how again the results nest deeper and deeper as pairs (the
last \pcode{a} is in the unprocessed part). To consume everything of
this string we can use the parser \pcode{(("a" ~ "a") ~ "a") ~
  "a"}. Then the output is as follows:

\begin{center}
\begin{tabular}{rcl}
input string & & output\medskip\\
\texttt{\Grid{aaaa}} & $\rightarrow$ & 
$\left\{((((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{""})\right\}$\\
\end{tabular}
\end{center}

\noindent This is an instance where the parser consumed
completely the input, meaning the unprocessed part is just the
empty string. So if we called \pcode{parse_all}, instead of \pcode{parse},
we would get back the result

\[
\left\{(((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}})\right\}
\]

\noindent where the unprocessed (empty) parts have been stripped away
from the pairs; everything where the second part was not empty has
been thrown away as well, because they represent
ultimately-unsuccessful-parses.


Note carefully that constructing a parser such \pcode{"a" || ("a" ~
  "b")} will result in a typing error. The intention is that we want
to parse an \texttt{a}, or an \texttt{a} followed by a
\texttt{b}. However, the first parser has as output type a single
character (recall the type of \texttt{CharParser}), but the second
parser produces a pair of characters as output. The alternative parser
is required to have both component parsers to have the same type---we
need to be able to build the union of two sets, which means in Scala
they need to be of the same type.  We will see later how we can build
this parser without the typing error.

The next parser combinator, called \emph{semantic action}, does not
actually combine smaller parsers, but applies a function to the result
of a parser.  It is implemented in Scala as follows

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
class FunParser[I, T, S]
         (p: => Parser[I, T], 
          f: T => S) extends Parser[I, S] {
  def parse(in: I) = 
    for ((head, tail) <- p.parse(in)) yield (f(head), tail)
}
\end{lstlisting}
\end{center}


\noindent This parser combinator takes a parser \texttt{p}
with output type \texttt{T} as one argument as well as a
function \texttt{f} with type \texttt{T => S}. The parser
\texttt{p} produces sets of type \texttt{(T, I)}. The
semantic action combinastor then applies the function
\texttt{f} to all the parser outputs. Since this function is of
type \texttt{T => S}, we obtain a parser with output type
\texttt{S}. Again Scala lets us introduce some shorthand
notation for this parser combinator. Therefore we will write
\texttt{p ==> f} for it.

What are semantic actions good for? Ultimately, they allow is to
transform the parsed input into a datastructure we can use for further
processing. A simple example would be to transform parsed characters
into ASCII numbers. Suppose we define a function \texttt{f} (from
characters to ints) and a \texttt{CharParser}.

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
val f = (c: Char) => c.toInt
val c = new CharParser('c')
\end{lstlisting}
\end{center}

\noindent
Then

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
c.parse("cbd")
(c ==> f).parse("cbd")
\end{lstlisting}
\end{center}

\noindent
the first line produces \texttt{Set(('c', "bd"))}, whereas the second
produces \texttt{Set((99, "bd"))}.

\subsubsection*{Shorthand notation for parser combinators}

Before we proceed, let us just explain the shorthand notation for
parser combinators. Like for regular expressions, the shorthand notation
will make our life much easier when writing actual parsers.

\subsubsection*{How to build parsers using parser combinators?}

\subsubsection*{Implementing an Interpreter}

%\bigskip
%takes advantage of the full generality---have a look
%what it produces if we call it with the string \texttt{abc}
%
%\begin{center}
%\begin{tabular}{rcl}
%input string & & output\medskip\\
%\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
%\texttt{\Grid{bbc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
%\texttt{\Grid{aac}} & $\rightarrow$ & $\varnothing$
%\end{tabular}
%\end{center}



\end{document}

%%% Local Variables: 
%%% mode: latex  
%%% TeX-master: t
%%% End: