\documentclass{article}\usepackage{../style}\usepackage{../langs}\begin{document}\section*{Handout 6 (Parser Combinators)}In what follows we explain \emph{parser combinators}. Theirdistinguishing feature is that they are very easy toimplement. However, they only work when the grammar to beparsed is \emph{not} left-recursive and they are efficientonly when the grammar is unambiguous. It is the responsibilityof the grammar designer to ensure these two properties.Parser combinators can deal with any kind of input as long asthis input is a kind of sequence, for example a string or alist of tokens. The general idea behind parser combinators isto transform the input into sets of pairs, like so\begin{center}$\underbrace{\text{list of tokens}}_{\text{input}}$ $\Rightarrow$$\underbrace{\text{set of (parsed input, unparsed input)}}_{\text{output}}$\end{center} \noindent In Scala parser combinators are functions of type\begin{center}\texttt{I $\Rightarrow$ Set[(T, I)]}\end{center}\noindent that is they take as input something of type\texttt{I} and return a set of pairs. The first component ofthese pairs corresponds to what the parser combinator was ableto process from the input and the second is the unprocessedpart of the input. As we shall see shortly, a parsercombinator might return more than one such pair, the ideabeing that there are potentially several ways of how tointerpret the input. To simplify matters we will first look atthe case where the input to the parser combinators is juststrings. As a concrete example, consider the string\begin{center}\tt\Grid{iffoo\VS testbar}\end{center}\noindent We might have a parser combinator which tries tointerpret this string as a keyword (\texttt{if}) or as anidentifier (\texttt{iffoo}). Then the output will be the set\begin{center}$\left\{ \left(\texttt{\Grid{if}}\;,\; \texttt{\Grid{foo\VS testbar}}\right), \left(\texttt{\Grid{iffoo}}\;,\; \texttt{\Grid{\VS testbar}}\right) \right\}$\end{center}\noindent where the first pair means the parser couldrecognise \texttt{if} from the input and leaves the rest as`unprocessed' as the second component of the pair; in theother case it could recognise \texttt{iffoo} and leaves\texttt{\VS testbar} as unprocessed. If the parser cannotrecognise anything from the input, then parser combinators justreturn the empty set $\{\}$. This will indicate something``went wrong''\ldots or more precisely, nothing could beparsed.The idea of parser combinators is that we can easily buildparser combinators out of smaller components following veryclosely the structure of a grammar. In order to implement thisin an object-oriented programming language, like Scala, weneed to specify an abstract class for parser combinators. Thisabstract class requires the implementation of the function\texttt{parse} taking an argument of type \texttt{I} andreturns a set of type \mbox{\texttt{Set[(T, I)]}}.\begin{center}\begin{lstlisting}[language=Scala]abstract class Parser[I, T] { def parse(ts: I): Set[(T, I)] def parse_all(ts: I): Set[T] = for ((head, tail) <- parse(ts); if (tail.isEmpty)) yield head}\end{lstlisting}\end{center}\noindent From the function \texttt{parse} we can then``centrally'' derive the function \texttt{parse\_all}, whichjust filters out all pairs whose second component is not empty(that is has still some unprocessed part). The reason is thatat the end of the parsing we are only interested in the resultswhere all the input has been consumed and no unprocessed partis left over.One of the simplest parser combinators recognises just acharacter, say $c$, from the beginning of strings. Itsbehaviour can be described as follows:\begin{itemize}\item if the head of the input string starts with a $c$, then returns the set $\{(c, \textit{tail of}\; s)\}$, where \textit{tail of} $s$ is the unprocessed part of the input string\item otherwise return the empty set $\{\}$ \end{itemize}\noindentThe input type of this simple parser combinator for characters is\texttt{String} and the output type \mbox{\texttt{Set[(Char, String)]}}. The code in Scala is as follows:\begin{center}\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]case class CharParser(c: Char) extends Parser[String, Char] { def parse(sb: String) = if (sb.head == c) Set((c, sb.tail)) else Set()}\end{lstlisting}\end{center}\noindent The \texttt{parse} function tests whether the firstcharacter of the input string \texttt{sb} is equal to\texttt{c}. If yes, then it splits the string into therecognised part \texttt{c} and the unprocessed part\texttt{sb.tail}. In case \texttt{sb} does not start with\texttt{c} then the parser returns the empty set (in Scala\texttt{Set()}).More interesting are the parser combinators that build largerparsers out of smaller component parsers. For example thealternative parser combinator is as follows: given twoparsers, say, $p$ and $q$, we apply both parsers to theinput (remember parsers are functions) and combine the output(remember they are sets)\begin{center}$p(\text{input}) \cup q(\text{input})$\end{center}\noindent In Scala we would implement alternative parsercombinator as follows\begin{center}\begin{lstlisting}[language=Scala, numbers=none]class AltParser[I, T] (p: => Parser[I, T], q: => Parser[I, T]) extends Parser[I, T] { def parse(sb: I) = p.parse(sb) ++ q.parse(sb)}\end{lstlisting}\end{center}\noindent The types of this parser combinator are polymorphic(we just have \texttt{I} for the input type, and \texttt{T}for the output type). The alternative parser builds a newparser out of two existing parsers \texttt{p} and \texttt{q}.Both need to be able to process input of type \texttt{I} andreturn the same output type \texttt{Set[(T,I)]}.\footnote{There is an interesting detail of Scala, namelythe \texttt{=>} in front of the types of \texttt{p} and\texttt{q}. They will prevent the evaluation of the argumentsbefore they are used. This is often called \emph{lazyevaluation} of the arguments. We will explain this later.} The alternative parser shouldrun the input with the first parser \texttt{p} (producing aset of outputs) and then run the same input with \texttt{q}.The result should be then just the union of both sets, whichis the operation \texttt{++} in Scala.The alternative parser combinator already allows us toconstruct a parser that parses either a character \texttt{a}or \texttt{b}, as\begin{center}\begin{lstlisting}[language=Scala, numbers=none]new AltParser(CharParser('a'), CharParser('b'))\end{lstlisting}\end{center}\noindent Scala allows us to introduce some more readableshorthand notation for this, like \texttt{'a' || 'b'}. We cancall this parser combinator with the strings\begin{center}\begin{tabular}{rcl}input strings & & output\medskip\\\texttt{\Grid{ac}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{c}})\right\}$\\\texttt{\Grid{bc}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{c}})\right\}$\\\texttt{\Grid{cc}} & $\rightarrow$ & $\{\}$\end{tabular}\end{center}\noindent We receive in the first two cases a successfuloutput (that is a non-empty set). In each case, either\pcode{a} or \pcode{b} is in the processed part, and\pcode{c} in the unprocessed part. Clearly this parser cannotparse anything in the string \pcode{cc}, therefore the emptyset.A bit more interesting is the \emph{sequence parsercombinator}. Given two parsers, say, $p$ and $q$, apply firstthe input to $p$ producing a set of pairs; then apply $q$ toall the unparsed parts; then combine the results like\begin{center}\begin{tabular}{lcl}$\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\;|\;$ & $(\textit{output}_1, u_1) \in p(\text{input}) \;\wedge\;$\\&& $\;(\textit{output}_2, u_2) \in q(u_1)\}$\end{tabular}\end{center}\noindentThis can be implemented in Scala as follows:\begin{center}\begin{lstlisting}[language=Scala,numbers=none]class SeqParser[I, T, S] (p: => Parser[I, T], q: => Parser[I, S]) extends Parser[I, (T, S)] { def parse(sb: I) = for ((output1, u1) <- p.parse(sb); (output2, u2) <- q.parse(u1)) yield ((output1, output2), u2)}\end{lstlisting}\end{center}\noindent This parser takes as input two parsers, \texttt{p}and \texttt{q}. It implements \texttt{parse} as follows: letfirst run the parser \texttt{p} on the input producing a setof pairs (\texttt{output1}, \texttt{u1}). The \texttt{u1}stands for the unprocessed parts left over by \texttt{p}. Let\texttt{q} run on these unprocessed parts producing again aset of pairs. The output of the sequence parser combinator isthen a set containing pairs where the first components areagain pairs, namely what the first parser could parse togetherwith what the second parser could parse; the second componentis the unprocessed part left over after running the secondparser \texttt{q}. Therefore the input type of the sequenceparser combinator is as usual \texttt{I}, but the output typeis\begin{center}\texttt{Set[((T, S), I)]}\end{center}Scala allows us to provide some shorthand notation for thesequence parser combinator. We can write for example\pcode{'a' ~ 'b'}, which is the parser combinator thatfirst consumes the character \texttt{a} from a string and then\texttt{b}. Three examples of this parser combinator are asfollows:\begin{center}\begin{tabular}{rcl}input strings & & output\medskip\\\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\\texttt{\Grid{bac}} & $\rightarrow$ & $\{\}$\\\texttt{\Grid{ccc}} & $\rightarrow$ & $\{\}$\end{tabular}\end{center}\noindent A slightly more complicated parser is \pcode{('a'|| 'b') ~ 'b'} which parses as first character either an\texttt{a} or \texttt{b} followed by a \texttt{b}. This parserproduces the following outputs.\begin{center}\begin{tabular}{rcl}input strings & & output\medskip\\\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\\texttt{\Grid{bbc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\\texttt{\Grid{aac}} & $\rightarrow$ & $\{\}$\end{tabular}\end{center}\noindent Two more examples: first consider the parser \pcode{('a' ~'a') ~ 'a'} and the input \pcode{aaaa}:\begin{center}\begin{tabular}{rcl}input string & & output\medskip\\\texttt{\Grid{aaaa}} & $\rightarrow$ & $\left\{(((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}})\right\}$\\\end{tabular}\end{center}\noindent Notice how the results nest deeper and deeper aspairs (the last \pcode{a} is in the unprocessed part). Toconsume everything of this string we can use the parser\pcode{(('a' ~'a') ~ 'a') ~ 'a'}. Then the output is asfollows:\begin{center}\begin{tabular}{rcl}input string & & output\medskip\\\texttt{\Grid{aaaa}} & $\rightarrow$ & $\left\{((((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{""})\right\}$\\\end{tabular}\end{center}\noindent This is an instance where the parser consumedcompletely the input, meaning the unprocessed part is just theempty string.Note carefully that constructing a parser such \pcode{'a' ||('a' ~ 'b')} will result in a typing error. The firstparser has as output type a single character (recall the typeof \texttt{CharParser}), but the second parser produces a pairof characters as output. The alternative parser is howeverrequired to have both component parsers to have the same type.We will see later how we can build this parser without thetyping error.The next parser combinator does not actually combine smallerparsers, but applies a function to the result of a parser.It is implemented in Scala as follows\begin{center}\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]class FunParser[I, T, S] (p: => Parser[I, T], f: T => S) extends Parser[I, S] { def parse(sb: I) = for ((head, tail) <- p.parse(sb)) yield (f(head), tail)}\end{lstlisting}\end{center}\noindent This parser combinator takes a parser \texttt{p}with output type \texttt{T} as one argument as well as afunction \texttt{f} with type \texttt{T => S}. The parser\texttt{p} produces sets of type \texttt{(T, I)}. The\texttt{FunParser} combinator then applies the function\texttt{f} to all the parser outputs. Since this function is oftype \texttt{T => S}, we obtain a parser with output type\texttt{S}. Again Scala lets us introduce some shorthandnotation for this parser combinator. Therefore we will write\texttt{p ==> f} for it.\subsubsection*{How to build parsers using parser combinators?}\subsubsection*{Implementing an Interpreter}%\bigskip%takes advantage of the full generality---have a look%what it produces if we call it with the string \texttt{abc}%%\begin{center}%\begin{tabular}{rcl}%input string & & output\medskip\\%\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\%\texttt{\Grid{bbc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\%\texttt{\Grid{aac}} & $\rightarrow$ & $\varnothing$%\end{tabular}%\end{center}\end{document}%%% Local Variables: %%% mode: latex %%% TeX-master: t%%% End: