\documentclass{article}+ −
\usepackage{../style}+ −
\usepackage{../langs}+ −
+ −
+ −
\begin{document}+ −
+ −
\section*{Handout 6 (Parser Combinators)}+ −
+ −
In what follows we explain \emph{parser combinators}. Their+ −
distinguishing feature is that they are very easy to+ −
implement. However, they only work when the grammar to be+ −
parsed is \emph{not} left-recursive and they are efficient+ −
only when the grammar is unambiguous. It is the responsibility+ −
of the grammar designer to ensure these two properties.+ −
+ −
Parser combinators can deal with any kind of input as long as+ −
this input is a kind of sequence, for example a string or a+ −
list of tokens. The general idea behind parser combinators is+ −
to transform the input into sets of pairs, like so+ −
+ −
\begin{center}+ −
$\underbrace{\text{list of tokens}}_{\text{input}}$ + −
$\Rightarrow$+ −
$\underbrace{\text{set of (parsed input, unparsed input)}}_{\text{output}}$+ −
\end{center} + −
+ −
\noindent In Scala parser combinators are functions of type+ −
+ −
\begin{center}+ −
\texttt{I $\Rightarrow$ Set[(T, I)]}+ −
\end{center}+ −
+ −
\noindent that is they take as input something of type+ −
\texttt{I} and return a set of pairs. The first component of+ −
these pairs corresponds to what the parser combinator was able+ −
to process from the input and the second is the unprocessed+ −
part of the input. As we shall see shortly, a parser+ −
combinator might return more than one such pair, the idea+ −
being that there are potentially several ways of how to+ −
interpret the input. To simplify matters we will first look at+ −
the case where the input to the parser combinators is just+ −
strings. As a concrete example, consider the string+ −
+ −
\begin{center}+ −
\tt\Grid{iffoo\VS testbar}+ −
\end{center}+ −
+ −
\noindent We might have a parser combinator which tries to+ −
interpret this string as a keyword (\texttt{if}) or as an+ −
identifier (\texttt{iffoo}). Then the output will be the set+ −
+ −
\begin{center}+ −
$\left\{ \left(\texttt{\Grid{if}}\;,\; \texttt{\Grid{foo\VS testbar}}\right), + −
\left(\texttt{\Grid{iffoo}}\;,\; \texttt{\Grid{\VS testbar}}\right) \right\}$+ −
\end{center}+ −
+ −
\noindent where the first pair means the parser could+ −
recognise \texttt{if} from the input and leaves the rest as+ −
`unprocessed' as the second component of the pair; in the+ −
other case it could recognise \texttt{iffoo} and leaves+ −
\texttt{\VS testbar} as unprocessed. If the parser cannot+ −
recognise anything from the input, then parser combinators just+ −
return the empty set $\{\}$. This will indicate something+ −
``went wrong''\ldots or more precisely, nothing could be+ −
parsed.+ −
+ −
The idea of parser combinators is that we can easily build+ −
parser combinators out of smaller components following very+ −
closely the structure of a grammar. In order to implement this+ −
in an object-oriented programming language, like Scala, we+ −
need to specify an abstract class for parser combinators. This+ −
abstract class requires the implementation of the function+ −
\texttt{parse} taking an argument of type \texttt{I} and+ −
returns a set of type \mbox{\texttt{Set[(T, I)]}}.+ −
+ −
\begin{center}+ −
\begin{lstlisting}[language=Scala]+ −
abstract class Parser[I, T] {+ −
def parse(ts: I): Set[(T, I)]+ −
+ −
def parse_all(ts: I): Set[T] =+ −
for ((head, tail) <- parse(ts); if (tail.isEmpty)) + −
yield head+ −
}+ −
\end{lstlisting}+ −
\end{center}+ −
+ −
\noindent From the function \texttt{parse} we can then+ −
``centrally'' derive the function \texttt{parse\_all}, which+ −
just filters out all pairs whose second component is not empty+ −
(that is has still some unprocessed part). The reason is that+ −
at the end of the parsing we are only interested in the results+ −
where all the input has been consumed and no unprocessed part+ −
is left over.+ −
+ −
One of the simplest parser combinators recognises just a+ −
character, say $c$, from the beginning of strings. Its+ −
behaviour can be described as follows:+ −
+ −
\begin{itemize}+ −
\item if the head of the input string starts with a $c$, then returns + −
the set $\{(c, \textit{tail of}\; s)\}$, where \textit{tail of} + −
$s$ is the unprocessed part of the input string+ −
\item otherwise return the empty set $\{\}$ + −
\end{itemize}+ −
+ −
\noindent+ −
The input type of this simple parser combinator for characters is+ −
\texttt{String} and the output type \mbox{\texttt{Set[(Char, String)]}}. + −
The code in Scala is as follows:+ −
+ −
\begin{center}+ −
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]+ −
case class CharParser(c: Char) extends Parser[String, Char] {+ −
def parse(sb: String) = + −
if (sb.head == c) Set((c, sb.tail)) else Set()+ −
}+ −
\end{lstlisting}+ −
\end{center}+ −
+ −
\noindent The \texttt{parse} function tests whether the first+ −
character of the input string \texttt{sb} is equal to+ −
\texttt{c}. If yes, then it splits the string into the+ −
recognised part \texttt{c} and the unprocessed part+ −
\texttt{sb.tail}. In case \texttt{sb} does not start with+ −
\texttt{c} then the parser returns the empty set (in Scala+ −
\texttt{Set()}).+ −
+ −
+ −
More interesting are the parser combinators that build larger+ −
parsers out of smaller component parsers. For example the+ −
alternative parser combinator is as follows: given two+ −
parsers, say, $p$ and $q$, we apply both parsers to the+ −
input (remember parsers are functions) and combine the output+ −
(remember they are sets)+ −
+ −
\begin{center}+ −
$p(\text{input}) \cup q(\text{input})$+ −
\end{center}+ −
+ −
\noindent In Scala we would implement alternative parser+ −
combinator as follows+ −
+ −
\begin{center}+ −
\begin{lstlisting}[language=Scala, numbers=none]+ −
class AltParser[I, T]+ −
(p: => Parser[I, T], + −
q: => Parser[I, T]) extends Parser[I, T] {+ −
def parse(sb: I) = p.parse(sb) ++ q.parse(sb)+ −
}+ −
\end{lstlisting}+ −
\end{center}+ −
+ −
\noindent The types of this parser combinator are polymorphic+ −
(we just have \texttt{I} for the input type, and \texttt{T}+ −
for the output type). The alternative parser builds a new+ −
parser out of two existing parsers \texttt{p} and \texttt{q}.+ −
Both need to be able to process input of type \texttt{I} and+ −
return the same output type \texttt{Set[(T,+ −
I)]}.\footnote{There is an interesting detail of Scala, namely+ −
the \texttt{=>} in front of the types of \texttt{p} and+ −
\texttt{q}. They will prevent the evaluation of the arguments+ −
before they are used. This is often called \emph{lazy+ −
evaluation} of the arguments. We will explain this later.} The alternative parser should+ −
run the input with the first parser \texttt{p} (producing a+ −
set of outputs) and then run the same input with \texttt{q}.+ −
The result should be then just the union of both sets, which+ −
is the operation \texttt{++} in Scala.+ −
+ −
The alternative parser combinator already allows us to+ −
construct a parser that parses either a character \texttt{a}+ −
or \texttt{b}, as+ −
+ −
\begin{center}+ −
\begin{lstlisting}[language=Scala, numbers=none]+ −
new AltParser(CharParser('a'), CharParser('b'))+ −
\end{lstlisting}+ −
\end{center}+ −
+ −
\noindent Scala allows us to introduce some more readable+ −
shorthand notation for this, like \texttt{'a' || 'b'}. We can+ −
call this parser combinator with the strings+ −
+ −
\begin{center}+ −
\begin{tabular}{rcl}+ −
input strings & & output\medskip\\+ −
\texttt{\Grid{ac}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{c}})\right\}$\\+ −
\texttt{\Grid{bc}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{c}})\right\}$\\+ −
\texttt{\Grid{cc}} & $\rightarrow$ & $\{\}$+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent We receive in the first two cases a successful+ −
output (that is a non-empty set). In each case, either+ −
\pcode{a} or \pcode{b} is in the processed part, and+ −
\pcode{c} in the unprocessed part. Clearly this parser cannot+ −
parse anything in the string \pcode{cc}, therefore the empty+ −
set.+ −
+ −
A bit more interesting is the \emph{sequence parser+ −
combinator}. Given two parsers, say, $p$ and $q$, apply first+ −
the input to $p$ producing a set of pairs; then apply $q$ to+ −
all the unparsed parts; then combine the results like+ −
+ −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\;|\;$ & + −
$(\textit{output}_1, u_1) \in p(\text{input}) + −
\;\wedge\;$\\+ −
&& $\;(\textit{output}_2, u_2) \in q(u_1)\}$+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent+ −
This can be implemented in Scala as follows:+ −
+ −
\begin{center}+ −
\begin{lstlisting}[language=Scala,numbers=none]+ −
class SeqParser[I, T, S]+ −
(p: => Parser[I, T], + −
q: => Parser[I, S]) extends Parser[I, (T, S)] {+ −
def parse(sb: I) = + −
for ((output1, u1) <- p.parse(sb); + −
(output2, u2) <- q.parse(u1)) + −
yield ((output1, output2), u2)+ −
}+ −
\end{lstlisting}+ −
\end{center}+ −
+ −
\noindent This parser takes as input two parsers, \texttt{p}+ −
and \texttt{q}. It implements \texttt{parse} as follows: let+ −
first run the parser \texttt{p} on the input producing a set+ −
of pairs (\texttt{output1}, \texttt{u1}). The \texttt{u1}+ −
stands for the unprocessed parts left over by \texttt{p}. Let+ −
\texttt{q} run on these unprocessed parts producing again a+ −
set of pairs. The output of the sequence parser combinator is+ −
then a set containing pairs where the first components are+ −
again pairs, namely what the first parser could parse together+ −
with what the second parser could parse; the second component+ −
is the unprocessed part left over after running the second+ −
parser \texttt{q}. Therefore the input type of the sequence+ −
parser combinator is as usual \texttt{I}, but the output type+ −
is+ −
+ −
\begin{center}+ −
\texttt{Set[((T, S), I)]}+ −
\end{center}+ −
+ −
Scala allows us to provide some shorthand notation for the+ −
sequence parser combinator. We can write for example+ −
\pcode{'a' ~ 'b'}, which is the parser combinator that+ −
first consumes the character \texttt{a} from a string and then+ −
\texttt{b}. Three examples of this parser combinator are as+ −
follows:+ −
+ −
\begin{center}+ −
\begin{tabular}{rcl}+ −
input strings & & output\medskip\\+ −
\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\+ −
\texttt{\Grid{bac}} & $\rightarrow$ & $\{\}$\\+ −
\texttt{\Grid{ccc}} & $\rightarrow$ & $\{\}$+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent A slightly more complicated parser is \pcode{('a'+ −
|| 'b') ~ 'b'} which parses as first character either an+ −
\texttt{a} or \texttt{b} followed by a \texttt{b}. This parser+ −
produces the following outputs.+ −
+ −
\begin{center}+ −
\begin{tabular}{rcl}+ −
input strings & & output\medskip\\+ −
\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\+ −
\texttt{\Grid{bbc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\+ −
\texttt{\Grid{aac}} & $\rightarrow$ & $\{\}$+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent Two more examples: first consider the parser \pcode{('a' ~+ −
'a') ~ 'a'} and the input \pcode{aaaa}:+ −
+ −
\begin{center}+ −
\begin{tabular}{rcl}+ −
input string & & output\medskip\\+ −
\texttt{\Grid{aaaa}} & $\rightarrow$ & + −
$\left\{(((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}})\right\}$\\+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent Notice how the results nest deeper and deeper as+ −
pairs (the last \pcode{a} is in the unprocessed part). To+ −
consume everything of this string we can use the parser+ −
\pcode{(('a' ~'a') ~ 'a') ~ 'a'}. Then the output is as+ −
follows:+ −
+ −
\begin{center}+ −
\begin{tabular}{rcl}+ −
input string & & output\medskip\\+ −
\texttt{\Grid{aaaa}} & $\rightarrow$ & + −
$\left\{((((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{""})\right\}$\\+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent This is an instance where the parser consumed+ −
completely the input, meaning the unprocessed part is just the+ −
empty string.+ −
+ −
+ −
Note carefully that constructing a parser such \pcode{'a' ||+ −
('a' ~ 'b')} will result in a typing error. The first+ −
parser has as output type a single character (recall the type+ −
of \texttt{CharParser}), but the second parser produces a pair+ −
of characters as output. The alternative parser is however+ −
required to have both component parsers to have the same type.+ −
We will see later how we can build this parser without the+ −
typing error.+ −
+ −
The next parser combinator does not actually combine smaller+ −
parsers, but applies a function to the result of a parser.+ −
It is implemented in Scala as follows+ −
+ −
\begin{center}+ −
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]+ −
class FunParser[I, T, S]+ −
(p: => Parser[I, T], + −
f: T => S) extends Parser[I, S] {+ −
def parse(sb: I) = + −
for ((head, tail) <- p.parse(sb)) yield (f(head), tail)+ −
}+ −
\end{lstlisting}+ −
\end{center}+ −
+ −
+ −
\noindent This parser combinator takes a parser \texttt{p}+ −
with output type \texttt{T} as one argument as well as a+ −
function \texttt{f} with type \texttt{T => S}. The parser+ −
\texttt{p} produces sets of type \texttt{(T, I)}. The+ −
\texttt{FunParser} combinator then applies the function+ −
\texttt{f} to all the parser outputs. Since this function is of+ −
type \texttt{T => S}, we obtain a parser with output type+ −
\texttt{S}. Again Scala lets us introduce some shorthand+ −
notation for this parser combinator. Therefore we will write+ −
\texttt{p ==> f} for it.+ −
+ −
\subsubsection*{How to build parsers using parser combinators?}+ −
+ −
\subsubsection*{Implementing an Interpreter}+ −
+ −
%\bigskip+ −
%takes advantage of the full generality---have a look+ −
%what it produces if we call it with the string \texttt{abc}+ −
%+ −
%\begin{center}+ −
%\begin{tabular}{rcl}+ −
%input string & & output\medskip\\+ −
%\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\+ −
%\texttt{\Grid{bbc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\+ −
%\texttt{\Grid{aac}} & $\rightarrow$ & $\varnothing$+ −
%\end{tabular}+ −
%\end{center}+ −
+ −
+ −
+ −
\end{document}+ −
+ −
%%% Local Variables: + −
%%% mode: latex + −
%%% TeX-master: t+ −
%%% End: + −