\documentclass{article}
\usepackage{../style}
\usepackage{../langs}
\begin{document}
\section*{Handout 6 (Parser Combinators)}
This handout explains how \emph{parser combinators} work and how they
can be implemented in Scala. Their distinguishing feature is that they
are very easy to implement (admittedly it is only easy in a functional
programming language). However, parser combinators require that the
grammar to be parsed is \emph{not} left-recursive and they are
efficient only when the grammar is unambiguous. It is the
responsibility of the grammar designer to ensure these two properties.
Another good point of parser combinators is that they can deal with any kind
of input as long as this input of ``sequence-kind'', for example a
string or a list of tokens. The general idea behind parser combinators
is to transform the input into sets of pairs, like so
\begin{center}
$\underbrace{\text{list of tokens}}_{\text{input}}$
$\Rightarrow$
$\underbrace{\text{set of (parsed input, unparsed input)}}_{\text{output}}$
\end{center}
\noindent As said, the input can be anything as long as it is a
``sequence''. The only property of the input we need is to be able to
test when it is empty. Obviously we can do this for strings and lists.
For more lucidity we shall below often use strings as input in order
to illustrate matters. However, this does not make our previous work
on lexers obsolete (remember they transform a string into a list of
tokens). Lexers will still be needed to build a somewhat realistic
compiler.
In my Scala code I use the following polymorphic types for parser combinators:
\begin{center}
input:\;\; \texttt{I} \qquad output:\;\; \texttt{T}
\end{center}
\noindent that is they take as input something of type \texttt{I} and
return a set of pairs of \texttt{Set[(T, I)]}. Since the input needs
to be of ``sequence-kind'' I actually have to often write \texttt{I
<\% Seq[\_]} for the input type in order to indicate it is a
subtype of Scala sequences. The first component of the generated pairs
corresponds to what the parser combinator was able to process from the
input and the second is the unprocessed part of the input (therefore
the type of this unprocessed part is the same as the input). As we
shall see shortly, a parser combinator might return more than one such
pair; the idea being that there are potentially several ways of how to
parse the input. As a concrete example, consider the string
\begin{center}
\tt\Grid{iffoo\VS testbar}
\end{center}
\noindent We might have a parser combinator which tries to
interpret this string as a keyword (\texttt{if}) or as an
identifier (\texttt{iffoo}). Then the output will be the set
\begin{center}
$\left\{ \left(\texttt{\Grid{if}}\;,\; \texttt{\Grid{foo\VS testbar}}\right),
\left(\texttt{\Grid{iffoo}}\;,\; \texttt{\Grid{\VS testbar}}\right) \right\}$
\end{center}
\noindent where the first pair means the parser could
recognise \texttt{if} from the input and leaves the rest as
`unprocessed' as the second component of the pair; in the
other case it could recognise \texttt{iffoo} and leaves
\texttt{\VS testbar} as unprocessed. If the parser cannot
recognise anything from the input, then parser combinators just
return the empty set $\{\}$. This will indicate something
``went wrong''\ldots or more precisely, nothing could be
parsed.
Also important to note is that the type \texttt{T} for the processed
part is different from the input type. The reason is that in general
we are interested in transform our input into something
``different''\ldots for example into a tree, or if we implement the
grammar for arithmetic expressions we might be interested in the
actual integer number the arithmetic expression, say \texttt{1 + 2 *
3}, stands for. In this way we can use parser combinators to
implement relatively easily a calculator.
The main idea of parser combinators is that we can easily build parser
combinators out of smaller components following very closely the
structure of a grammar. In order to implement this in an
object-oriented programming language, like Scala, we need to specify
an abstract class for parser combinators. This abstract class states
that the function \texttt{parse} takes an argument of type \texttt{I}
and returns a set of type \mbox{\texttt{Set[(T, I)]}}.
\begin{center}
\begin{lstlisting}[language=Scala]
abstract class Parser[I, T] {
def parse(ts: I): Set[(T, I)]
def parse_all(ts: I): Set[T] =
for ((head, tail) <- parse(ts); if (tail.isEmpty))
yield head
}
\end{lstlisting}
\end{center}
\noindent It is the obligation in each instance (parser combinator) to
supply an implementation for \texttt{parse}. From this function we
can then ``centrally'' derive the function \texttt{parse\_all}, which
just filters out all pairs whose second component is not empty (that
is has still some unprocessed part). The reason is that at the end of
the parsing we are only interested in the results where all the input
has been consumed and no unprocessed part is left over.
One of the simplest parser combinators recognises just a
single character, say $c$, from the beginning of strings. Its
behaviour can be described as follows:
\begin{itemize}
\item If the head of the input string starts with a $c$, then return
the set
\[\{(c, \textit{tail of}\; s)\}\]
where \textit{tail of}
$s$ is the unprocessed part of the input string.
\item Otherwise return the empty set $\{\}$.
\end{itemize}
\noindent
The input type of this simple parser combinator for characters is
\texttt{String} and the output type \mbox{\texttt{Set[(Char, String)]}}.
The code in Scala is as follows:
\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
case class CharParser(c: Char) extends Parser[String, Char] {
def parse(sb: String) =
if (sb.head == c) Set((c, sb.tail)) else Set()
}
\end{lstlisting}
\end{center}
\noindent You can see the \texttt{parse} function tests whether the
first character of the input string \texttt{sb} is equal to
\texttt{c}. If yes, then it splits the string into the recognised part
\texttt{c} and the unprocessed part \texttt{sb.tail}. In case
\texttt{sb} does not start with \texttt{c} then the parser returns the
empty set (in Scala \texttt{Set()}). Since this parser recognises
characters and just returns characters as the processed part, the
output type of the parser is \texttt{Char}.
If we want to parse a list of tokens and interested in recognising a
number token, we could write something like this
\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily,numbers=none]
case object NumParser extends Parser[List[Token], Int] {
def parse(ts: List[Token]) = ts match {
case Num_token(s)::ts => Set((s.toInt, ts))
case _ => Set ()
}
}
\end{lstlisting}
\end{center}
\noindent
In this parser the input is of type \texttt{List[Token]}. The function
parse looks at the input \texttt{ts} and checks whether the first
token is a \texttt{Num\_token}. Let us assume our lexer generated
these tokens for numbers. But this parser does not just return this
token (and the rest of the list), like the \texttt{CharParser} above,
rather extracts the string \texttt{s} from the token and converts it
into an integer. The hope is that the lexer did its work well and this
conversion always succeeds. The consequence of this is that the output
type for this parser is \texttt{Int}. Such a conversion would be
needed if we want to implement a simple calculator program.
These simple parsers that just look at the input and do a simple
transformation are often called \emph{atomic} parser combinators.
More interesting are the parser combinators that build larger parsers
out of smaller component parsers. For example the \emph{alternative
parser combinator} is as follows: given two parsers, say, $p$ and
$q$, we apply both parsers to the input (remember parsers are
functions) and combine the output (remember they are sets of pairs)
\begin{center}
$p(\text{input}) \cup q(\text{input})$
\end{center}
\noindent In Scala we would implement alternative parser
combinator as follows
\begin{center}
\begin{lstlisting}[language=Scala, numbers=none]
class AltParser[I, T]
(p: => Parser[I, T],
q: => Parser[I, T]) extends Parser[I, T] {
def parse(sb: I) = p.parse(sb) ++ q.parse(sb)
}
\end{lstlisting}
\end{center}
\noindent The types of this parser combinator are again generic (we
just have \texttt{I} for the input type, and \texttt{T} for the output
type). The alternative parser builds a new parser out of two existing
parsers \texttt{p} and \texttt{q}. Both need to be able to process
input of type \texttt{I} and return the same output type
\texttt{Set[(T, I)]}.\footnote{There is an interesting detail of
Scala, namely the \texttt{=>} in front of the types of \texttt{p}
and \texttt{q}. They will prevent the evaluation of the arguments
before they are used. This is often called \emph{lazy evaluation} of
the arguments. We will explain this later.} Therefore the output
type of this parser is \texttt{T}. The alternative parser should run
the input with the first parser \texttt{p} (producing a set of pairs)
and then run the same input with \texttt{q} (producing another set of
pairs). The result should be then just the union of both sets, which
is the operation \texttt{++} in Scala.
The alternative parser combinator already allows us to
construct a parser that parses either a character \texttt{a}
or \texttt{b}, as
\begin{center}
\begin{lstlisting}[language=Scala, numbers=none]
new AltParser(CharParser('a'), CharParser('b'))
\end{lstlisting}
\end{center}
\noindent Later on we will use again Scala mechanism for introducing some
more readable shorthand notation for this, like \texttt{"a" || "b"}. Let us look in detail at what this parser combinator produces with some sample strings
\begin{center}
\begin{tabular}{rcl}
input strings & & output\medskip\\
\texttt{\Grid{acde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{cde}})\right\}$\\
\texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{cde}})\right\}$\\
\texttt{\Grid{ccde}} & $\rightarrow$ & $\{\}$
\end{tabular}
\end{center}
\noindent We receive in the first two cases a successful
output (that is a non-empty set). In each case, either
\pcode{a} or \pcode{b} is in the processed part, and
\pcode{cde} in the unprocessed part. Clearly this parser cannot
parse anything in the string \pcode{ccde}, therefore the empty
set.
A bit more interesting is the \emph{sequence parser combinator}. Given
two parsers, say again, $p$ and $q$, we want to apply first the input
to $p$ producing a set of pairs; then apply $q$ to all the unparsed
parts in the pairs; and then combine the results like
\begin{center}
\begin{tabular}{lcl}
$\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\;|\;$ &
$(\textit{output}_1, u_1) \in p(\text{input})
\;\wedge\;$\\
&& $(\textit{output}_2, u_2) \in q(u_1)\}$
\end{tabular}
\end{center}
\noindent Notice that the $p$ wil first be run on the input, producing
pairs of the form $(\textit{output}_1, u_1)$ where the $u_1$ stands
for the unprocessed, or left-over, parts. We want that $q$ runs on all
these unprocessed parts $u_1$. This again will produce some
processed part , $p$ and
$q$, we apply both parsers to the input (remember parsers are
functions) and combine the output (remember they are sets of pairs)
\begin{center}
$p(\text{input}) \cup q(\text{input})$
\end{center}
\noindent In Scala we would implement alternative parser
combinator as follows
\begin{center}
\begin{lstlisting}[language=Scala, numbers=none]
class AltParser[I, T]
(p: => Parser[I, T],
q: => Parser[I, T]) extends Parser[I, T] {
def parse(sb: I) = p.parse(sb) ++ q.parse(sb)
}
\end{lstlisting}
\end{center}
\noindent The types of this parser combinator are again generic (we
just have \texttt{I} for the input type, and \texttt{T} for the output
type). The alternative parser builds a new parser out of two existing
parsers \texttt{p} and \texttt{q}. Both need to be able to process
input of type \texttt{I} and return the same output type
\texttt{Set[(T, I)]}.\footnote{There is an interesting detail of
Scala, namely the \texttt{=>} in front of the types of \texttt{p}
and \texttt{q}. They will prevent the evaluation of the arguments
before they are used. This is often called \emph{lazy evaluation} of
the arguments. We will explain this later.} Therefore the output
type of this parser is \texttt{T}. The alternative parser should run
the input with the first parser \texttt{p} (producing a set of pairs)
and then run the same input with \texttt{q} (producing another set of
pairs). The result should be then just the union of both sets, which
is the operation \texttt{++} in Scala.
The alternative parser combinator already allows us to
construct a parser that parses either a character \texttt{a}
or \texttt{b}, as
\begin{center}
\begin{lstlisting}[language=Scala, numbers=none]
new AltParser(CharParser('a'), CharParser('b'))
\end{lstlisting}
\end{center}
\noindent Later on we will use again Scala mechanism for introducing some
more readable shorthand notation for this, like \texttt{"a" || "b"}. Let us look in detail at what this parser combinator produces with some sample strings
\begin{center}
\begin{tabular}{rcl}
input strings & & output\medskip\\
\texttt{\Grid{acde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{cde}})\right\}$\\
\texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{cde}})\right\}$\\
\texttt{\Grid{ccde}} & $\rightarrow$ & $\{\}$
\end{tabular}
\end{center}
\noindent We receive in the first two cases a successful
output (that is a non-empty set). In each case, either
\pcode{a} or \pcode{b} is in the processed part, and
\pcode{cde} in the unprocessed part. Clearly this parser cannot
parse anything in the string \pcode{ccde}, therefore the empty
set.
A bit more interesting is the \emph{sequence parser combinator}. Given
two parsers, say again, $p$ and $q$, we want to apply first the input
to $p$ producing a set of pairs; then apply $q$ to all the unparsed
parts in the pairs; and then combine the results like
\begin{center}
\begin{tabular}{lcl}
$\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\;|\;$ &
$(\textit{output}_1, u_1) \in p(\text{input})
\;\wedge\;$\\
&& $(\textit{output}_2, u_2) \in q(u_1)\}$
\end{tabular}
\end{center}
\noindent Notice that the $p$ wil first be run on the input, producing
pairs of the form $\textit{output}_1$ and unprocessed part $u_1$. The
overall result of the sequence parser combinator is pairs of the form
$((\textit{output}_1, \textit{output}_2), u_2)$. This means the
unprocessed parts of both parsers is the unprocessed parts the second
parser $q$ produces as left-over. The processed parts of both parsers
is just the pair of the outputs
$(\textit{output}_1, \textit{output}_2)$. This behaviour can be
implemented in Scala as follows:
\begin{center}
\begin{lstlisting}[language=Scala,numbers=none]
class SeqParser[I, T, S]
(p: => Parser[I, T],
q: => Parser[I, S]) extends Parser[I, (T, S)] {
def parse(sb: I) =
for ((output1, u1) <- p.parse(sb);
(output2, u2) <- q.parse(u1))
yield ((output1, output2), u2)
}
\end{lstlisting}
\end{center}
\noindent This parser takes again as input two parsers, \texttt{p}
and \texttt{q}. It implements \texttt{parse} as follows: let
first run the parser \texttt{p} on the input producing a set
of pairs (\texttt{output1}, \texttt{u1}). The \texttt{u1}
stands for the unprocessed parts left over by \texttt{p}. Let
\texttt{q} run on these unprocessed parts producing again a
set of pairs. The output of the sequence parser combinator is
then a set containing pairs where the first components are
again pairs, namely what the first parser could parse together
with what the second parser could parse; the second component
is the unprocessed part left over after running the second
parser \texttt{q}. Therefore the input type of the sequence
parser combinator is as usual \texttt{I}, but the output type
is
\begin{center}
\texttt{Set[((T, S), I)]}
\end{center}
\noindent
If any of the runs of \textit{p} and \textit{q} fail, that is produce
the empty set, then \texttt{parse} will also produce the empty set.
Notice that we have now two output types for the sequence parser
combinator, because in general \textit{p} and \textit{q} might produce
different things (for example first we recognise a number and then a
string corresponding to an operator).
We have not yet looked at this in detail, but Scala allows us to
provide some shorthand notation for the sequence parser combinator. We
can write for example \pcode{"a" ~ "b"}, which is the parser
combinator that first recognises the character \texttt{a} from a
string and then \texttt{b}. Let us look again at three examples of
how this parser combinator processes strings:
\begin{center}
\begin{tabular}{rcl}
input strings & & output\medskip\\
\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{cde}})\right\}$\\
\texttt{\Grid{bacde}} & $\rightarrow$ & $\{\}$\\
\texttt{\Grid{cccde}} & $\rightarrow$ & $\{\}$
\end{tabular}
\end{center}
\noindent In the first line we have a successful parse, because the
string started with \texttt{ab}, which is the prefix we are looking
for. But since the parsing combinator is constructed as sequence of
the two simple (atomic) parsers for \texttt{a} and \texttt{b}, the
result is a nested pair of the form \texttt{((a, b), cde)}. It is
\emph{not} a simple pair \texttt{(ab, cde)} as one might erroneously
expects. The parser returns the empty set in the other examples,
because they do not fit with what the parser is supposed to parse.
A slightly more complicated parser is \pcode{("a" || "b") ~ "c"}
which parses as first character either an \texttt{a} or \texttt{b}
followed by a \texttt{c}. This parser produces the following outputs.
\begin{center}
\begin{tabular}{rcl}
input strings & & output\medskip\\
\texttt{\Grid{acde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{c}}), \texttt{\Grid{de}})\right\}$\\
\texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{c}}), \texttt{\Grid{de}})\right\}$\\
\texttt{\Grid{abde}} & $\rightarrow$ & $\{\}$
\end{tabular}
\end{center}
\noindent
Now consider the parser \pcode{("a" ~ "b") ~ "c"} which parses
\texttt{a}, \texttt{b}, \texttt{c} in sequence. This parser produces
the following outputs.
\begin{center}
\begin{tabular}{rcl}
input strings & & output\medskip\\
\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{(((\texttt{\Grid{a}},\texttt{\Grid{b}}), \texttt{\Grid{c}}), \texttt{\Grid{de}})\right\}$\\
\texttt{\Grid{abde}} & $\rightarrow$ & $\{\}$\\
\texttt{\Grid{bcde}} & $\rightarrow$ & $\{\}$
\end{tabular}
\end{center}
\noindent The second and third example fail, because something is
``missing'' in the sequence we are looking for. Also notice how the
results nest with sequences: the parsed part is a nested pair of the
form \pcode{((a, b), c)}. Two more examples: first consider the parser
\pcode{("a" ~ "a") ~ "a"} and the input \pcode{aaaa}:
\begin{center}
\begin{tabular}{rcl}
input string & & output\medskip\\
\texttt{\Grid{aaaa}} & $\rightarrow$ &
$\left\{(((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}})\right\}$\\
\end{tabular}
\end{center}
\noindent Notice how the results nests deeper and deeper as pairs (the
last \pcode{a} is in the unprocessed part). To consume everything of
this string we can use the parser \pcode{(("a" ~ "a") ~ "a") ~
"a"}. Then the output is as follows:
\begin{center}
\begin{tabular}{rcl}
input string & & output\medskip\\
\texttt{\Grid{aaaa}} & $\rightarrow$ &
$\left\{((((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{""})\right\}$\\
\end{tabular}
\end{center}
\noindent This is an instance where the parser consumed
completely the input, meaning the unprocessed part is just the
empty string. So if we called \pcode{parse_all} instead of \pcode{parse}
we would get back the result
\[
\left\{(((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}})\right\}
\]
\noindent where the unprocessed (empty) parts have been stripped away
from the pairs; everything where the second part was not empty has
been thrown away as ultimately-unsuccessful-parses.
Note carefully that constructing a parser such \pcode{'a' ||
('a' ~ 'b')} will result in a typing error. The first
parser has as output type a single character (recall the type
of \texttt{CharParser}), but the second parser produces a pair
of characters as output. The alternative parser is however
required to have both component parsers to have the same type.
We will see later how we can build this parser without the
typing error.
The next parser combinator does not actually combine smaller
parsers, but applies a function to the result of a parser.
It is implemented in Scala as follows
\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
class FunParser[I, T, S]
(p: => Parser[I, T],
f: T => S) extends Parser[I, S] {
def parse(sb: I) =
for ((head, tail) <- p.parse(sb)) yield (f(head), tail)
}
\end{lstlisting}
\end{center}
\noindent This parser combinator takes a parser \texttt{p}
with output type \texttt{T} as one argument as well as a
function \texttt{f} with type \texttt{T => S}. The parser
\texttt{p} produces sets of type \texttt{(T, I)}. The
\texttt{FunParser} combinator then applies the function
\texttt{f} to all the parser outputs. Since this function is of
type \texttt{T => S}, we obtain a parser with output type
\texttt{S}. Again Scala lets us introduce some shorthand
notation for this parser combinator. Therefore we will write
\texttt{p ==> f} for it.
\subsubsection*{How to build parsers using parser combinators?}
\subsubsection*{Implementing an Interpreter}
%\bigskip
%takes advantage of the full generality---have a look
%what it produces if we call it with the string \texttt{abc}
%
%\begin{center}
%\begin{tabular}{rcl}
%input string & & output\medskip\\
%\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
%\texttt{\Grid{bbc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
%\texttt{\Grid{aac}} & $\rightarrow$ & $\varnothing$
%\end{tabular}
%\end{center}
\end{document}
%%% Local Variables:
%%% mode: latex
%%% TeX-master: t
%%% End: