% !TEX program = xelatex
\documentclass{article}
\usepackage{../style}
\usepackage{../langs}
\usepackage{../grammar}
\usepackage{../graphics}

\begin{document}

\section*{Handout 6 (Parser Combinators)}

This handout explains how \emph{parser combinators} work and how they
can be implemented in Scala. Their most distinguishing feature is that
they are very easy to implement (admittedly it is only easy in a
functional programming language). Another good point of parser
combinators is that they can deal with any kind of input as long as
this input is of ``sequence-kind'', for example a string or a list of
tokens. The only two properties of the input we need are that we can
test whether it is empty and take it apart ``sequentially''. Strings and
lists fit this bill. However, parser combinators also have their
drawbacks. For example, they require that the grammar to be parsed is
\emph{not} left-recursive and they are efficient only when the grammar
is unambiguous. It is the responsibility of the grammar designer to
ensure these two properties hold.

The general idea behind parser combinators is to transform the input
into sets of pairs, like so

\begin{center}
$\underbrace{\text{list of tokens}}_{\text{input}}$
$\quad\Rightarrow\quad$
$\underbrace{\text{set of (parsed part, unprocessed part)}}_{\text{output}}$
\end{center}

\noindent
Given the considerable effort we have spent implementing a lexer in order
to generate lists of tokens, it might be surprising that in what
follows we shall often use strings as input, rather than lists of
tokens. This is to make the explanation more lucid and to keep the
examples simple. It does not make our previous work on lexers obsolete
(remember they transform a string into a list of tokens). Lexers will
still be needed for building a somewhat realistic compiler. See also
a question in the homework regarding this issue.

As mentioned above, parser combinators are relatively agnostic about what
kind of input they process. In my Scala code I use the following
polymorphic types for parser combinators:

\begin{center}
input:\;\; \texttt{I} \qquad output:\;\; \texttt{T}
\end{center}

\noindent That is, they take as input something of type \texttt{I} and
return a set of pairs of type \texttt{Set[(T, I)]}. Since the input
needs to be of ``sequence-kind'', I actually often have to write
\code{(using is: I => Seq[_])} for the input type. This ensures the
input is a subtype of Scala sequences.\footnote{This is a new feature
  in Scala 3 relating to type-classes; if you use Scala 2 you will have
  difficulties running my code.} The first component of the generated pairs
corresponds to what the parser combinator was able to parse from the
input and the second is the unprocessed, or leftover, part of the
input (therefore the type of this unprocessed part is the same as the
input). A parser combinator might return more than one such pair; the
idea is that there are potentially several ways of parsing the
input. As a concrete example, consider the string

\begin{center}
\tt\Grid{iffoo\VS testbar}
\end{center}

\noindent We might have a parser combinator which tries to
interpret this string as a keyword (\texttt{if}) or as an
identifier (\texttt{iffoo}). Then the output will be the set

\begin{center}
$\left\{ \left(\texttt{\Grid{if}}\;,\; \texttt{\Grid{foo\VS testbar}}\right),
  \left(\texttt{\Grid{iffoo}}\;,\; \texttt{\Grid{\VS testbar}}\right) \right\}$
\end{center}

\noindent where the first pair means the parser could recognise
\texttt{if} from the input and leave \texttt{foo\VS testbar} as the
unprocessed part; in the other case it could recognise
\texttt{iffoo} and leave \texttt{\VS testbar} as unprocessed. If the
parser cannot recognise anything from the input at all, then parser
combinators just return the empty set $\{\}$. This will indicate
something ``went wrong''\ldots or more precisely, nothing could be
parsed.

It is also important to note that the output type \texttt{T} for the
processed part can potentially be different from the input type
\texttt{I} of the parser. In the example above it just happens to be
the same. The reason for the difference is that in general we are
interested in transforming our input into something
``different''\ldots for example into a tree; or if we implement the
grammar for arithmetic expressions, we might be interested in the
actual integer number the arithmetic expression, say \texttt{1 + 2 *
3}, stands for. In this way we can use parser combinators to
implement relatively easily a calculator, for instance (we shall do
this later on).

The main driving force behind parser combinators is that we can easily
build parser combinators out of smaller components following very
closely the structure of a grammar. In order to implement this in a
functional/object-oriented programming language, like Scala, we need
to specify an abstract class for parser combinators. In the abstract
class we specify that \texttt{I} is the \emph{input type} of the
parser combinator and that \texttt{T} is the \emph{output type}. This
implies that the function \texttt{parse} takes an argument of type
\texttt{I} and returns a set of type \mbox{\texttt{Set[(T, I)]}}.

\begin{center}
\begin{lstlisting}[language=Scala]
abstract class Parser[I, T](using is: I => Seq[_]) {
  def parse(in: I): Set[(T, I)]

  def parse_all(in: I) : Set[T] =
    for ((hd, tl) <- parse(in);
         if is(tl).isEmpty) yield hd
}
\end{lstlisting}
\end{center}

\noindent Each instance of this class is obliged to
supply an implementation for \texttt{parse}. From this function we
can then ``centrally'' derive the function \texttt{parse\_all}, which
just filters out all pairs whose second component is not empty (that
is, still has some unprocessed part). The reason is that at the end of
the parsing we are only interested in the results where all the input
has been consumed and no unprocessed part is left over.

One of the simplest parser combinators recognises just a
single character, say $c$, from the beginning of strings. Its
behaviour can be described as follows:

\begin{itemize}
\item If the input string $s$ starts with a $c$, then return
  the set
  \[\{(c, \textit{tail-of-}s)\}\]
  where \textit{tail-of-}$s$ is the unprocessed part of the input string.
\item Otherwise return the empty set $\{\}$.
\end{itemize}

\noindent
The input type of this simple parser combinator is \texttt{String} and
the output type is \texttt{Char}. This means \texttt{parse} returns
\mbox{\texttt{Set[(Char, String)]}}. The code in Scala is as follows:

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
case class CharParser(c: Char) extends Parser[String, Char] {
  def parse(s: String) =
    if (s != "" && s.head == c) Set((c, s.tail)) else Set()
}
\end{lstlisting}
\end{center}

\noindent You can see that \texttt{parse} tests here whether the
first character of the input string \texttt{s} is equal to
\texttt{c}. If yes, then it splits the string into the recognised part
\texttt{c} and the unprocessed part \texttt{s.tail}. In case
\texttt{s} does not start with \texttt{c}, the parser returns the
empty set (in Scala \texttt{Set()}). Since this parser recognises
characters and just returns characters as the processed part, the
output type of the parser is \texttt{Char}.

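
To see \texttt{parse} and \texttt{parse\_all} in action, here is a
small illustrative snippet (assuming, as everywhere in this handout,
that the \texttt{using}-conversion from \texttt{String} to
\texttt{Seq} required by \texttt{Parser} is in scope); the expected
results are given in the comments:

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
// recognising a single 'h' at the beginning of a string
CharParser('h').parse("hello")      // Set(('h', "ello"))
CharParser('h').parse("world")      // Set()

// parse_all keeps only the results with nothing left over
CharParser('h').parse_all("h")      // Set('h')
CharParser('h').parse_all("hello")  // Set()
\end{lstlisting}
\end{center}
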
If we want to parse a list of tokens and are interested in recognising a
number token, for example, we could write something like this

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily,numbers=none]
case object NumParser extends Parser[List[Token], Int] {
  def parse(ts: List[Token]) = ts match {
    case Num_token(s)::rest => Set((s.toInt, rest))
    case _ => Set()
  }
}
\end{lstlisting}
\end{center}

\noindent
In this parser the input is of type \texttt{List[Token]}. The function
\texttt{parse} looks at the input \texttt{ts} and checks whether the first
token is a \texttt{Num\_token} (let us assume our lexer generated
these tokens for numbers). But this parser does not just return this
token (and the rest of the list), like the \texttt{CharParser} above;
rather it also extracts the string \texttt{s} from the token and
converts it into an integer. The hope is that the lexer did its work
well and this conversion always succeeds. The consequence of this is
that the output type for this parser is \texttt{Int}, not
\texttt{Token}. Such a conversion would be needed in our parser,
because when we encounter a number in our program, we want to do
some calculations based on integers, not strings (or tokens).

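
To make this code snippet self-contained, we of course need a token
datatype from the lexer. The handout does not fix one here; the
following is only a minimal sketch for illustration (the names
\texttt{Token} and \texttt{Num\_token} are assumptions---your lexer
will have its own token classes), together with the ``sequence-kind''
evidence that the \texttt{Parser} class requires for
\texttt{List[Token]}:

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
// hypothetical tokens produced by a lexer
abstract class Token
case class Num_token(s: String) extends Token   // e.g. Num_token("123")
case class Op_token(s: String) extends Token    // e.g. Op_token("+")

// evidence that List[Token] is of "sequence-kind"
given tokenListIsSeq: (List[Token] => Seq[_]) = tks => tks
\end{lstlisting}
\end{center}
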
These simple parsers that just look at the input and do a simple
transformation are often called \emph{atomic} parser combinators.
More interesting are the parser combinators that build larger parsers
out of smaller component parsers. There are three such parser
combinators that can be implemented generically. The \emph{alternative
  parser combinator} is as follows: given two parsers, say, $p$ and
$q$, we apply both parsers to the input (remember parsers are
functions) and combine the output (remember they are sets of pairs):

\begin{center}
$p(\text{input}) \cup q(\text{input})$
\end{center}

\noindent In Scala we can implement the alternative parser
combinator as follows

\begin{center}
\begin{lstlisting}[language=Scala, numbers=none]
class AltParser[I, T]
      (p: => Parser[I, T],
       q: => Parser[I, T])(using I => Seq[_]) extends Parser[I, T] {
  def parse(in: I) = p.parse(in) ++ q.parse(in)
}
\end{lstlisting}
\end{center}

\noindent The types of this parser combinator are again generic (we
have \texttt{I} for the input type, and \texttt{T} for the output
type). The alternative parser builds a new parser out of two existing
parsers \texttt{p} and \texttt{q} which are given as arguments. Both
parsers need to be able to process input of type \texttt{I} and return
in \texttt{parse} the same output type \texttt{Set[(T,
  I)]}.\footnote{There is an interesting detail of Scala, namely the
  \texttt{=>} in front of the types of \texttt{p} and \texttt{q}. These arrows
  will prevent the evaluation of the arguments before they are
  used. This is often called \emph{lazy evaluation} of the
  arguments. We will explain this later.} The alternative parser runs
the input with the first parser \texttt{p} (producing a set of pairs)
and then runs the same input with \texttt{q} (producing another set of
pairs). The result should then be just the union of both sets, which
is the operation \texttt{++} in Scala.

The alternative parser combinator allows us to construct a parser that
parses either a character \texttt{a} or \texttt{b} using the
\texttt{CharParser} shown above. For this we can write\footnote{Note
  that we cannot use a \texttt{case}-class for \texttt{AltParser}s
  because of the problem with laziness and Scala quirks. Hating
  \texttt{new} like the plague, we will work around this later with
  some syntax tricks. ;o)}

\begin{center}
\begin{lstlisting}[language=Scala, numbers=none]
new AltParser(CharParser('a'), CharParser('b'))
\end{lstlisting}
\end{center}

\noindent Later on we will use a Scala mechanism for introducing some
more readable shorthand notation for this, like \texttt{p"a" ||
  p"b"}. But first let us look in detail at what this parser combinator produces
with some sample strings.

\begin{center}
\begin{tabular}{rcl}
input strings & & output\medskip\\
\texttt{\Grid{acde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}},\; \texttt{\Grid{cde}})\right\}$\\
\texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}},\; \texttt{\Grid{cde}})\right\}$\\
\texttt{\Grid{ccde}} & $\rightarrow$ & $\{\}$
\end{tabular}
\end{center}

\noindent We receive in the first two cases a successful output (that
is, a non-empty set). In each case, either \pcode{a} or \pcode{b} is in
the parsed part, and \pcode{cde} in the unprocessed part. Clearly this
parser cannot parse anything of the form \pcode{ccde}, therefore the
empty set is returned in the last case. Observe that parser
combinators only look at the beginning of the given input: they do not
fish out something in the ``middle'' of the input.

A bit more interesting is the \emph{sequence parser combinator}. Given
two parsers, say again $p$ and $q$, we first apply $p$ to the input,
producing a set of pairs; then apply $q$ to all the unparsed
parts in the pairs; and then combine the results. Mathematically we would
write something like this for the resulting set of pairs:

\begin{center}
\begin{tabular}{lcl}
$\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\,|\,$ &
$(\textit{output}_1, u_1) \in p(\text{input})
\;\wedge\;$\\
&& $(\textit{output}_2, u_2) \in q(u_1)\}$
\end{tabular}
\end{center}

\noindent Notice that $p$ will first be run on the input,
producing pairs of the form $(\textit{output}_1, u_1)$ where the $u_1$
stands for the unprocessed, or leftover, parts of $p$. We want that
$q$ runs on all these unprocessed parts $u_1$. Therefore these
unprocessed parts are fed into the second parser $q$. The overall
result of the sequence parser combinator is pairs of the form
$((\textit{output}_1, \textit{output}_2), u_2)$. This means the
unprocessed part of the sequence parser combinator is the unprocessed
part that the second parser $q$ leaves as leftover. The parsed parts of the
component parsers are combined in a pair, namely
$(\textit{output}_1, \textit{output}_2)$. The reason is we want to
know what $p$ and $q$ were able to parse. This behaviour can be
implemented in Scala as follows:

\begin{center}
\begin{lstlisting}[language=Scala,numbers=none]
class SeqParser[I, T, S]
      (p: => Parser[I, T],
       q: => Parser[I, S])(using I => Seq[_]) extends Parser[I, (T, S)] {
  def parse(in: I) =
    for ((output1, u1) <- p.parse(in);
         (output2, u2) <- q.parse(u1))
      yield ((output1, output2), u2)
}
\end{lstlisting}
\end{center}

\noindent This parser takes again as arguments two parsers, \texttt{p}
and \texttt{q}. It implements \texttt{parse} as follows: first run the
parser \texttt{p} on the input, producing a set of pairs
(\texttt{output1}, \texttt{u1}). The \texttt{u1} stands for the
unprocessed parts left over by \texttt{p} (recall that there can be
several such pairs). Let then \texttt{q} run on these unprocessed
parts, producing again a set of pairs. The output of the sequence
parser combinator is then a set containing pairs where the first
components are again pairs, namely what the first parser could parse
together with what the second parser could parse; the second component
is the unprocessed part left over after running the second parser
\texttt{q}. Note that the input type of the sequence parser combinator
is as usual \texttt{I}, but the output type is

\begin{center}
\texttt{(T, S)}
\end{center}

\noindent
Consequently, the function \texttt{parse} in the sequence parser
combinator returns sets of type \texttt{Set[((T, S), I)]}. That means
we have essentially two output types for the sequence parser
combinator (packaged in a pair), because in general \textit{p} and
\textit{q} might produce different things (for example we recognise a
number with \texttt{p} and then with \texttt{q} a string corresponding
to an operator). If any of the runs of \textit{p} and \textit{q}
fails, that is, produces the empty set, then \texttt{parse} will also
produce the empty set.

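
Before we come to the shorthand notation, it might help to see how the
classes above are combined directly; this is just an illustrative
snippet with the expected results in the comments:

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
// sequencing two character parsers (the shorthand p"a" ~ p"b" used
// below corresponds to essentially this construction)
val abParser = new SeqParser(CharParser('a'), CharParser('b'))

abParser.parse("abcde")   // Set((('a','b'), "cde"))
abParser.parse("bacde")   // Set()
\end{lstlisting}
\end{center}
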
With the shorthand notation we shall introduce later for the sequence
parser combinator, we can write for example \pcode{p"a" ~ p"b"}, which
is the parser combinator that first recognises the character
\texttt{a} from a string and then \texttt{b}. (Actually, we will be
able to write just \pcode{p"ab"} for such parsers, but it is good to
understand first what happens behind the scenes.) Let us look again
at some examples of how the sequence parser combinator processes some
strings:

\begin{center}
\begin{tabular}{rcl}
input strings & & output\medskip\\
\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}),\; \texttt{\Grid{cde}})\right\}$\\
\texttt{\Grid{bacde}} & $\rightarrow$ & $\{\}$\\
\texttt{\Grid{cccde}} & $\rightarrow$ & $\{\}$
\end{tabular}
\end{center}

\noindent In the first line we have a successful parse, because the
string starts with \texttt{ab}, which is the prefix we are looking
for. But since the parser combinator is constructed as a sequence of
the two simple (atomic) parsers for \texttt{a} and \texttt{b}, the
result is a nested pair of the form \texttt{((a, b), cde)}. It is
\emph{not} a simple pair \texttt{(ab, cde)} as one might erroneously
expect. The parser returns the empty set in the other examples,
because they do not fit with what the parser is supposed to parse.


A slightly more complicated parser is \pcode{(p"a" || p"b") ~ p"c"} which
parses as first character either an \texttt{a} or a \texttt{b}, followed
by a \texttt{c}. This parser produces the following outputs.

\begin{center}
\begin{tabular}{rcl}
input strings & & output\medskip\\
\texttt{\Grid{acde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{c}}),\; \texttt{\Grid{de}})\right\}$\\
\texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{c}}),\; \texttt{\Grid{de}})\right\}$\\
\texttt{\Grid{abde}} & $\rightarrow$ & $\{\}$
\end{tabular}
\end{center}

\noindent
Now consider the parser \pcode{(p"a" ~ p"b") ~ p"c"} which parses
\texttt{a}, \texttt{b}, \texttt{c} in sequence. This parser produces
the following outputs.

\begin{center}
\begin{tabular}{rcl}
input strings & & output\medskip\\
\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{(((\texttt{\Grid{a}},\texttt{\Grid{b}}), \texttt{\Grid{c}}),\; \texttt{\Grid{de}})\right\}$\\
\texttt{\Grid{abde}} & $\rightarrow$ & $\{\}$\\
\texttt{\Grid{bcde}} & $\rightarrow$ & $\{\}$
\end{tabular}
\end{center}

\noindent The second and third examples fail, because something is
``missing'' in the sequence we are looking for. The first succeeds but
notice how the results nest with sequences: the parsed part is a
nested pair of the form \pcode{((a, b), c)}. If we nest the sequence
parser differently, say \pcode{p"a" ~ (p"b" ~ p"c")}, then also
our output pairs nest differently

\begin{center}
\begin{tabular}{rcl}
input strings & & output\medskip\\
\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}},(\texttt{\Grid{b}}, \texttt{\Grid{c}})),\; \texttt{\Grid{de}})\right\}$\\
\end{tabular}
\end{center}

\noindent
Two more examples: first consider the parser
\pcode{(p"a" ~ p"a") ~ p"a"} and the input \pcode{aaaa}:

\begin{center}
\begin{tabular}{rcl}
input string & & output\medskip\\
\texttt{\Grid{aaaa}} & $\rightarrow$ &
$\left\{(((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}})\right\}$\\
\end{tabular}
\end{center}

\noindent Notice again how the results nest deeper and deeper as pairs (the
last \pcode{a} is in the unprocessed part). To consume everything of
this string we can use the parser \pcode{((p"a" ~ p"a") ~ p"a") ~
  p"a"}. Then the output is as follows:

\begin{center}
\begin{tabular}{rcl}
input string & & output\medskip\\
\texttt{\Grid{aaaa}} & $\rightarrow$ &
$\left\{((((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{""})\right\}$\\
\end{tabular}
\end{center}

\noindent This is an instance where the parser consumed
the input completely, meaning the unprocessed part is just the
empty string. So if we called \pcode{parse_all}, instead of \pcode{parse},
we would get back the result

\[
\left\{(((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}})\right\}
\]

\noindent where the unprocessed (empty) parts have been stripped away
from the pairs; everything where the second part was not empty has
been thrown away as well, because such pairs represent
ultimately unsuccessful parses. The main point is that the sequence
parser combinator returns pairs that can nest according to the
nesting of the component parsers.

Consider also carefully that constructing a parser such as

\begin{center}
\pcode{p"a" || (p"a" ~ p"b")}
\end{center}

\noindent
will result in a typing error. The intention with this
parser is that we want to parse either an \texttt{a}, or an \texttt{a}
followed by a \texttt{b}. However, the first parser has as output type
a single character (recall the type of \texttt{CharParser}), but the
second parser produces a pair of characters as output. The alternative
parser combinator requires both component parsers to have the same
output type---the reason is that we need to be able to build the union of two
sets, which requires in Scala that the sets have the same type. Since
this is not the case here, there is a typing error. We will see later
how we can build this parser without the typing error.

The next parser combinator, called \emph{semantic action} or \emph{map-parser}, does not
actually combine two smaller parsers, but applies a function to the result
of a parser. It is implemented in Scala as follows

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
class MapParser[I, T, S]
      (p: => Parser[I, T],
       f: T => S)(using I => Seq[_]) extends Parser[I, S] {
  def parse(in: I) =
    for ((hd, tl) <- p.parse(in)) yield (f(hd), tl)
}
\end{lstlisting}
\end{center}


\noindent This parser combinator takes a parser \texttt{p} (with input
type \texttt{I} and output type \texttt{T}) as one argument but also a
function \texttt{f} (with type \texttt{T => S}). The parser \texttt{p}
produces sets of type \texttt{Set[(T, I)]}. The semantic action
combinator then applies the function \texttt{f} to all the `processed'
parser outputs. Since this function is of type \texttt{T => S}, we
obtain a parser with output type \texttt{S}. Again Scala lets us
introduce some shorthand notation for this parser
combinator. Therefore we will write \texttt{p.map(f)} for it.

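
Coming back to the parser \pcode{p"a" || (p"a" ~ p"b")} that caused
the typing error above: with the map-parser we can repair it by
mapping both alternatives to a common output type, for example
\texttt{String}. The following is only one possible sketch (written
with the raw classes, since the shorthand notation is introduced
below):

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
// both alternatives now have output type String
val aOrAB: Parser[String, String] =
  new AltParser(
    new MapParser(CharParser('a'), (c: Char) => c.toString),
    new MapParser(new SeqParser(CharParser('a'), CharParser('b')),
                  (ab: (Char, Char)) => ab._1.toString + ab._2.toString))

aOrAB.parse("abc")   // Set(("a", "bc"), ("ab", "c"))
\end{lstlisting}
\end{center}
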
What are semantic actions good for? Well, they allow you to transform
the parsed input into datastructures you can use for further
processing. A simple (contrived) example would be to transform parsed
characters into ASCII numbers. Suppose we define a function \texttt{f}
(from characters to \texttt{Int}s) and use a \texttt{CharParser} for parsing
the character \texttt{c}.


\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
val f = (c: Char) => c.toInt
val c = new CharParser('c')
\end{lstlisting}
\end{center}

\noindent
We can then run the following two parsers on the input \texttt{cbd}:

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
c.parse("cbd")
c.map(f).parse("cbd")
\end{lstlisting}
\end{center}

\noindent
In the first line we obtain the expected result \texttt{Set(('c',
"bd"))}, whereas the second produces \texttt{Set((99, "bd"))}---the
character has been transformed into an ASCII number.

A slightly less contrived example is about parsing numbers (recall
\texttt{NumParser} above). However, we want to do this here for
strings, not for tokens. For this assume we have the following
(atomic) \texttt{RegexParser}.

\begin{center}
\begin{lstlisting}[language=Scala,xleftmargin=0mm,
basicstyle=\small\ttfamily, numbers=none]
import scala.util.matching.Regex

case class RegexParser(reg: Regex) extends Parser[String, String] {
  def parse(in: String) = reg.findPrefixMatchOf(in) match {
    case None => Set()
    case Some(m) => Set((m.matched, m.after.toString))
  }
}
\end{lstlisting}
\end{center}

\noindent
This parser takes a regex as argument and splits up a string into a
prefix and the rest according to this regex
(\texttt{reg.findPrefixMatchOf} generates a match---in the successful
case---and the corresponding strings can be extracted with
\texttt{matched} and \texttt{after}). The input and output type for
this parser is \texttt{String}. Using \texttt{RegexParser} we can
define a \texttt{NumParser} for strings as follows:

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
val NumParser = RegexParser("[0-9]+".r)
\end{lstlisting}
\end{center}

\noindent
This parser will recognise a number at the beginning of a string. For
example

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
NumParser.parse("123abc")
\end{lstlisting}
\end{center}

\noindent
produces \texttt{Set((123,abc))}. The problem is that \texttt{123} is
still a string (the expected double-quotes are not printed by
Scala). We want to convert this string into the corresponding
\texttt{Int}. We can do this as follows using a semantic action

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
NumParser.map{s => s.toInt}.parse("123abc")
\end{lstlisting}
\end{center}

\noindent
The function in the semantic action converts a string into an
\texttt{Int}. Now \texttt{parse} generates \texttt{Set((123,abc))},
but this time \texttt{123} is an \texttt{Int}. Think carefully about what
the input and output type of the parser is without the semantic action
and what it is with the semantic action (the type of the function can
already tell you). Let us come back to semantic actions when we are
going to implement actual context-free grammars.

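
For later use it is convenient to give the resulting number parser a
name: the arithmetic examples towards the end of this handout refer to
a parser called \texttt{NumParserInt}. A definition along the
following lines would fit (this is a sketch; the accompanying lecture
code may differ in details):

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
// recognises a number at the start of a string and returns it as an Int
val NumParserInt: Parser[String, Int] = NumParser.map(_.toInt)
\end{lstlisting}
\end{center}
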
\subsubsection*{Shorthand notation for parser combinators}

Before we proceed, let us just explain the shorthand notation for
parser combinators. Like for regular expressions, the shorthand
notation will make our life much easier when writing actual
parsers. We can define some extensions\footnote{In Scala 2 these were
  generically called ``implicits''.} which allow us to write

\begin{center}
\begin{tabular}{ll}
\pcode{p || q} & alternative parser\\
\pcode{p ~ q} & sequence parser\\
\pcode{p.map(f)} & semantic action parser
\end{tabular}
\end{center}

\noindent
We will also use the \texttt{p}-string-interpolation for specifying simple string parsers.

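
These definitions are not spelled out in this handout, but the
following gives a rough idea of how such extensions and the
\texttt{p}-string-interpolation can be set up in Scala 3. It is only a
sketch: the name \texttt{StrParser} is an assumption for illustration
and the code accompanying the lectures may differ in details.

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
// an atomic parser for a literal string prefix
case class StrParser(s: String) extends Parser[String, String] {
  def parse(in: String) =
    if (in.startsWith(s)) Set((s, in.drop(s.length))) else Set()
}

// the p-string-interpolation: p"abc" builds StrParser("abc")
extension (sc: StringContext)
  def p(args: Any*) = StrParser(sc.s(args*))

// shorthand notation for the three generic parser combinators
extension [I, T](p: => Parser[I, T])(using I => Seq[_]) {
  def ||(q: => Parser[I, T]) = new AltParser[I, T](p, q)
  def ~[S](q: => Parser[I, S]) = new SeqParser[I, T, S](p, q)
  def map[S](f: T => S) = new MapParser[I, T, S](p, f)
}
\end{lstlisting}
\end{center}

\noindent
With definitions along these lines, expressions such as \pcode{p"a" ||
p"b"} and \pcode{p"a" ~ p"b"} elaborate into the \texttt{AltParser} and
\texttt{SeqParser} classes from above, and the by-name arguments keep
recursive grammar definitions from looping (see the discussion of
\pcode{lazy} below).
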
The idea is that this shorthand notation allows us to easily translate
context-free grammars into code. For example recall our context-free
grammar for palindromes:

\begin{plstx}[margin=3cm]
: \meta{Pal} ::= a\cdot \meta{Pal}\cdot a | b\cdot \meta{Pal}\cdot b | a | b | \epsilon\\
\end{plstx}

\noindent
Each alternative in this grammar translates into an alternative parser
combinator. The $\cdot$ can be translated to a sequence parser
combinator. The parsers for $a$, $b$ and $\epsilon$ can simply be
written as \texttt{p"a"}, \texttt{p"b"} and \texttt{p""}.


\subsubsection*{How to build more interesting parsers using parser combinators?}

The beauty of parser combinators is the ease with which they can be
implemented and how easy it is to translate context-free grammars into
code (though the grammars need to be non-left-recursive). To
demonstrate this consider again the grammar for palindromes from above.
The first idea would be to translate it into the following code

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
lazy val Pal : Parser[String, String] =
  ((p"a" ~ Pal ~ p"a") || (p"b" ~ Pal ~ p"b") || p"a" || p"b" || p"")
\end{lstlisting}
\end{center}

\noindent
Unfortunately, this does not quite work yet as it produces a typing
error. The reason is that the parsers \texttt{p"a"}, \texttt{p"b"} and
\texttt{p""} all produce strings as output type and therefore can be
put into an alternative \texttt{...|| p"a" || p"b" || p""}. But both
sequence parsers \pcode{p"a" ~ Pal ~ p"a"} and \pcode{p"b" ~ Pal ~ p"b"}
produce pairs of the form

\begin{center}
(((\texttt{a}-part, \texttt{Pal}-part), \texttt{a}-part), unprocessed part)
\end{center}

\noindent That is how the
sequence parser combinator nests results when \pcode{\~} is used
between two components. The solution is to use a semantic action that
``flattens'' these pairs and appends the corresponding strings, like

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
lazy val Pal : Parser[String, String] =
  ((p"a" ~ Pal ~ p"a").map{ case ((x, y), z) => x + y + z } ||
   (p"b" ~ Pal ~ p"b").map{ case ((x, y), z) => x + y + z } ||
   p"a" || p"b" || p"")
\end{lstlisting}
\end{center}

\noindent
How does this work? Well, recall again what the pairs look like for
the parser \pcode{p"a" ~ Pal ~ p"a"}. The pattern in the semantic
action matches the nested pairs (the \texttt{x} with the
\texttt{a}-part and so on). Unfortunately when we have such nested
pairs, Scala requires us to define the function using the
\pcode{case}-syntax

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
{ case ((x, y), z) => ... }
\end{lstlisting}
\end{center}

\noindent
If we have more sequence parser combinators or have them nested differently,
then the pattern in the semantic action needs to be adjusted accordingly.
The action we implement above is to concatenate all three strings, which
means after the semantic action is applied the output type of the parser
is \texttt{String}, which means it fits with the alternative parsers
\texttt{...|| p"a" || p"b" || p""}.

If we run the parser above with \pcode{Pal.parse_all("abaaaba")} we obtain
as result the set \pcode{Set(abaaaba)}, which indicates that the string is a palindrome
(an empty set would mean something is wrong). But also notice what
intermediate results are generated by \pcode{Pal.parse("abaaaba")}

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
Set((abaaaba,""),(aba,aaba), (a,baaaba), ("",abaaaba))
\end{lstlisting}
\end{center}

\noindent
That there is more than one output might be slightly unexpected, but
can be explained as follows: the pairs represent all possible
(partial) parses of the string \pcode{"abaaaba"}. The first pair above
corresponds to a complete parse (all input is consumed) and this is
what \pcode{Pal.parse_all} returns. The second pair is a small
``sub-palindrome'' that can also be parsed, but the parse fails with
the rest \pcode{aaba}, which is therefore left as unprocessed. The
third one is an attempt to parse the whole string with the
single-character parser \pcode{a}. That of course only partially
succeeds, by leaving \pcode{"baaaba"} as the unprocessed
part. Finally, since we allow the empty string to be a palindrome we
also obtain the last pair, where actually nothing is consumed from the
input string. While all this works as intended, we need to be careful
with this (especially with including the \pcode{""} parser in our
grammar): if during parsing the set of parsing attempts gets too big,
then the parsing process can become very slow as the potential
candidates for applying rules can snowball.

It is also important to note that we must define the
\texttt{Pal}-parser as a \emph{lazy} value in Scala. Look again at the
code: \texttt{Pal} occurs on the right-hand side of the definition. If we had
just written

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
val Pal : Parser[String, String] = ...rhs...
\end{lstlisting}
\end{center}

\noindent
then Scala, before making this assignment to \texttt{Pal}, attempts to
find out what the expression on the right-hand side evaluates to. This
is straightforward in the case of simple expressions like \texttt{2 + 3}, but
the expression above contains \texttt{Pal} in the right-hand
side. Without \pcode{lazy} it would try to evaluate what \texttt{Pal}
evaluates to and start a new recursion, which means it falls into an
infinite loop. The definition of \texttt{Pal} is recursive and the
\pcode{lazy} key-word prevents it from being fully evaluated. Therefore
whenever we want to define a recursive parser we have to write

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
lazy val SomeParser : Parser[...,...] = ...rhs...
\end{lstlisting}
\end{center}

\noindent That was not necessary for our atomic parsers, like
\texttt{RegexParser} or \texttt{CharParser}, because they are not recursive.
Note that this is also the reason why we had to write

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
class AltParser[I, T]
      (p: => Parser[I, T],
       q: => Parser[I, T]) ... extends Parser[I, T] {...}

class SeqParser[I, T, S]
      (p: => Parser[I, T],
       q: => Parser[I, S]) ... extends Parser[I, (T, S)] {...}
\end{lstlisting}
\end{center}

\noindent where the \texttt{\textbf{\textcolor{codepurple}{=>}}} in front of
the argument types for \texttt{p} and \texttt{q} prevents Scala from
evaluating the arguments. Normally, Scala would first evaluate what
kind of parsers \texttt{p} and \texttt{q} are, and only then generate
the alternative parser combinator, respectively sequence parser
combinator. Since the arguments can be recursive parsers, such as
\texttt{Pal}, this would lead again to an infinite loop.

As a final example in this section, let us consider the grammar for
well-nested parentheses:

\begin{plstx}[margin=3cm]
: \meta{P} ::= (\cdot \meta{P}\cdot ) \cdot \meta{P} | \epsilon\\
\end{plstx}

\noindent
Let us assume we want to not just recognise strings of
well-nested parentheses but also transform round parentheses
into curly braces. We can do this by using a semantic
action:

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily,
                   xleftmargin=0mm, numbers=none]
lazy val P : Parser[String, String] =
  (p"(" ~ P ~ p")" ~ P).map{ case (((_,x),_),y) => "{" + x + "}" + y } || p""
\end{lstlisting}
\end{center}

\noindent
Here we define a function which ignores the parentheses in the
pairs, but replaces them in the right places with curly braces when
assembling the new string in the right-hand side. If we run
\pcode{P.parse_all("(((()()))())")} we obtain
\texttt{Set(\{\{\{\{\}\{\}\}\}\{\}\})} as expected.


\subsubsection*{Implementing an Interpreter}

The first step before implementing an interpreter for a full-blown
language is to implement a simple calculator for arithmetic
expressions. Suppose our arithmetic expressions are given by the
grammar:

\begin{plstx}[margin=3cm,one per line]
: \meta{E} ::= \meta{E} \cdot + \cdot \meta{E}
             | \meta{E} \cdot - \cdot \meta{E}
             | \meta{E} \cdot * \cdot \meta{E}
             | ( \cdot \meta{E} \cdot )
             | Number \\
\end{plstx}

\noindent
Naturally we want to implement the grammar in such a way that we can
calculate what the result of, for example, \texttt{4*2+3} is---we are
interested in an \texttt{Int} rather than a string. This means every
component parser needs to have as output type \texttt{Int} and when we
assemble the intermediate results, strings like \texttt{"+"},
\texttt{"*"} and so on, need to be translated into the appropriate
Scala operations of adding, multiplying and so on. Being inspired by
the parser for well-nested parentheses above and ignoring the fact
that we want $*$ to take precedence over $+$ and $-$, we might want to
write something like

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
lazy val E: Parser[String, Int] =
  ((E ~ p"+" ~ E).map{ case ((x, y), z) => x + z} ||
   (E ~ p"-" ~ E).map{ case ((x, y), z) => x - z} ||
   (E ~ p"*" ~ E).map{ case ((x, y), z) => x * z} ||
   (p"(" ~ E ~ p")").map{ case ((x, y), z) => y} ||
   NumParserInt)
\end{lstlisting}
\end{center}

\noindent
Consider again carefully how the semantic actions pick out the correct
arguments for the calculation. In case of plus, we need \texttt{x} and
\texttt{z}, because they correspond to the results of the component
parser \texttt{E}. We can just add \texttt{x + z} in order to obtain
an \texttt{Int} because the output type of \texttt{E} is
\texttt{Int}. Similarly with subtraction and multiplication. In
contrast, in the fourth clause we need to return \texttt{y}, because it
is the result enclosed inside the parentheses. The information about
the parentheses, roughly speaking, we just throw away.

So far so good. The problem arises when we try to call \pcode{parse_all} with the
expression \texttt{"1+2+3"}. Let's try it

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
E.parse_all("1+2+3")
\end{lstlisting}
\end{center}

\noindent
\ldots and we wait and wait and \ldots still wait. What is the
problem? Actually, the parser just fell into an infinite loop! The
reason is that the above grammar is left-recursive and recall that our
parser combinators cannot deal with such left-recursive
grammars. Fortunately, every left-recursive context-free grammar can be
transformed into a non-left-recursive grammar that still recognises
the same strings. This allows us to design the following grammar

\begin{plstx}[margin=3cm]
: \meta{E} ::= \meta{T} \cdot + \cdot \meta{E} | \meta{T} \cdot - \cdot \meta{E} | \meta{T}\\
: \meta{T} ::= \meta{F} \cdot * \cdot \meta{T} | \meta{F}\\
: \meta{F} ::= ( \cdot \meta{E} \cdot ) | Number\\
\end{plstx}

\noindent
Recall what left-recursive means from Handout 5 and make sure you see
why this grammar is \emph{non} left-recursive. This version of the grammar
also deals with the fact that $*$ should have a higher precedence than $+$ and $-$. This does not
affect which strings this grammar can recognise, but it does affect the order
in which an arithmetic expression is evaluated. We can translate this grammar into
parser combinators as follows:


\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
lazy val E: Parser[String, Int] =
  (T ~ p"+" ~ E).map{ case ((x, y), z) => x + z } ||
  (T ~ p"-" ~ E).map{ case ((x, y), z) => x - z } || T
lazy val T: Parser[String, Int] =
  (F ~ p"*" ~ T).map{ case ((x, y), z) => x * z } || F
lazy val F: Parser[String, Int] =
  (p"(" ~ E ~ p")").map{ case ((x, y), z) => y } || NumParserInt
\end{lstlisting}
\end{center}

\noindent
Let us try out some examples:

\begin{center}
\begin{tabular}{rcl}
input strings & & output of \pcode{parse_all}\medskip\\
\texttt{\Grid{1+2+3}} & $\rightarrow$ & \texttt{Set(6)}\\
\texttt{\Grid{4*2+3}} & $\rightarrow$ & \texttt{Set(11)}\\
\texttt{\Grid{4*(2+3)}} & $\rightarrow$ & \texttt{Set(20)}\\
\texttt{\Grid{(4)*((2+3))}} & $\rightarrow$ & \texttt{Set(20)}\\
\texttt{\Grid{4/2+3}} & $\rightarrow$ & \texttt{Set()}\\
\texttt{\Grid{1\VS +\VS 2\VS +\VS 3}} & $\rightarrow$ & \texttt{Set()}\\
\end{tabular}
\end{center}

\noindent
Note that we call \pcode{parse_all}, not \pcode{parse}. The examples
should be quite self-explanatory. The last two examples do not produce
any integer result because our parser does not define what to do in
case of division (this could be easily added), but also has no idea what to
do with whitespaces. To deal with them is the task of the lexer! Yes,
we could deal with them inside the grammar, but that would render many
grammars unintelligible, including this one.\footnote{If you
  think an easy solution is to extend the notion of what a number
  should be, then think again---you still would have to deal with
  cases like \texttt{\Grid{(\VS (\VS 2+3)\VS )}}. Just think of the mess
  you would have in a grammar for a full-blown language where there are
  numerous such cases.}

\end{document}

%%% Local Variables:
%%% mode: latex
%%% TeX-master: t
%%% End: