% !TEX program = xelatex\documentclass{article}\usepackage{../style}\usepackage{../langs}\usepackage{../grammar}\usepackage{../graphics}\begin{document}\section*{Handout 6 (Parser Combinators)}This handout explains how \emph{parser combinators} work and how theycan be implemented in Scala. Their most distinguishing feature is thatthey are very easy to implement (admittedly it is only easy in afunctional programming language). Another good point of parsercombinators is that they can deal with any kind of input as long asthis input is of ``sequence-kind'', for example a string or a list oftokens. The only two properties of the input we need is to be able totest when it is empty and ``sequentially'' take it apart. Strings andlists fit this bill. However, parser combinators also have theirdrawbacks. For example they require that the grammar to be parsed is\emph{not} left-recursive and they are efficient only when the grammaris unambiguous. It is the responsibility of the grammar designer toensure these two properties hold.The general idea behind parser combinators is to transform the inputinto sets of pairs, like so\begin{center}$\underbrace{\text{list of tokens}}_{\text{input}}$ $\quad\Rightarrow\quad$$\underbrace{\text{set of (parsed part, unprocessed part)}}_{\text{output}}$\end{center} \noindentGiven the extended effort we have spent implementing a lexer in orderto generate lists of tokens, it might be surprising that in whatfollows we shall often use strings as input, rather than lists oftokens. This is for making the explanation more lucid and ensure theexamples are simple. It does not make our previous work on lexers obsolete(remember they transform a string into a list of tokens). Lexers willstill be needed for building a somewhat realistic compiler. See alsoa question in the homework regarding this issue.As mentioned above, parser combinators are relatively agnostic about whatkind of input they process. In my Scala code I use the followingpolymorphic types for parser combinators:\begin{center}input:\;\; \texttt{I} \qquad output:\;\; \texttt{T}\end{center}\noindent That is they take as input something of type \texttt{I} andreturn a set of pairs of type \texttt{Set[(T, I)]}. Since the inputneeds to be of ``sequence-kind'', I actually have to often write\code{(using is: I => Seq[_])} for the input type. This ensures theinput is a subtype of Scala sequences.\footnote{This is a new feature in Scala 3 and is about type-classes, meaning if you use Scala 2 you will have difficulties with running my code.} The first component of the generated pairscorresponds to what the parser combinator was able to parse from theinput and the second is the unprocessed, or leftover, part of theinput (therefore the type of this unprocessed part is the same as theinput). A parser combinator might return more than one such pair; theidea is that there are potentially several ways of how to parse theinput. As a concrete example, consider the string\begin{center}\tt\Grid{iffoo\VS testbar}\end{center}\noindent We might have a parser combinator which tries tointerpret this string as a keyword (\texttt{if}) or as anidentifier (\texttt{iffoo}). Then the output will be the set\begin{center}$\left\{ \left(\texttt{\Grid{if}}\;,\; \texttt{\Grid{foo\VS testbar}}\right), \left(\texttt{\Grid{iffoo}}\;,\; \texttt{\Grid{\VS testbar}}\right) \right\}$\end{center}\noindent where the first pair means the parser could recognise\texttt{if} from the input and leaves the \texttt{foo\VS testbar} asunprocessed part; in the other case it could recognise\texttt{iffoo} and leaves \texttt{\VS testbar} as unprocessed. If theparser cannot recognise anything from the input at all, then parsercombinators just return the empty set $\{\}$. This will indicatesomething ``went wrong''\ldots or more precisely, nothing could beparsed.Also important to note is that the output type \texttt{T} for theprocessed part can potentially be different from the input type\texttt{I} in the parser. In the example above is just happens to bethe same. The reason for the difference is that in general we areinterested in transforming our input into something``different''\ldots for example into a tree; or if we implement thegrammar for arithmetic expressions, we might be interested in theactual integer number the arithmetic expression, say \texttt{1 + 2 * 3}, stands for. In this way we can use parser combinators toimplement relatively easily a calculator, for instance (we shall dothis later on).The main driving force behind parser combinators is that we can easilybuild parser combinators out of smaller components following veryclosely the structure of a grammar. In order to implement this in afunctional/object-oriented programming language, like Scala, we needto specify an abstract class for parser combinators. In the abstractclass we specify that \texttt{I} is the \emph{input type} of theparser combinator and that \texttt{T} is the \emph{output type}. Thisimplies that the function \texttt{parse} takes an argument of type\texttt{I} and returns a set of type \mbox{\texttt{Set[(T, I)]}}.\footnote{In the actual code on KEATS there is some kerfuffle about correct type-bounds and using clauses. Ignore this part of the implementation for the moment---it is not needed for understanding how the code works.}\begin{center}\begin{lstlisting}[language=Scala]abstract class Parser[I, T] { def parse(in: I): Set[(T, I)] def parse_all(in: I) : Set[T] = for ((hd, tl) <- parse(in); if tl.isEmpty) yield hd}\end{lstlisting}\end{center}\noindent It is the obligation in each instance of this class tosupply an implementation for \texttt{parse}. From this function wecan then ``centrally'' derive the function \texttt{parse\_all}, whichjust filters out all pairs whose second component is not empty (thatis has still some unprocessed part). The reason is that at the end ofthe parsing we are only interested in the results where all the inputhas been consumed and no unprocessed part is left over.One of the simplest parser combinators recognises just asingle character, say $c$, from the beginning of strings. Itsbehaviour can be described as follows:\begin{itemize}\item If the head of the input string $s$ starts with a $c$, then return the set \[\{(c, \textit{tail-of-}s)\}\] where \textit{tail of} $s$ is the unprocessed part of the input string.\item Otherwise return the empty set $\{\}$. \end{itemize}\noindentThe input type of this simple parser combinator is \texttt{String} andthe output type is \texttt{Char}. This means \texttt{parse} returns\mbox{\texttt{Set[(Char, String)]}}. The code in Scala is as follows:\begin{center}\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]case class CharParser(c: Char) extends Parser[String, Char] { def parse(s: String) = if (s != "" && s.head == c) Set((c, s.tail)) else Set()}\end{lstlisting}\end{center}\noindent You can see \texttt{parse} tests here whether thefirst character of the input string \texttt{s} is equal to\texttt{c}. If yes, then it splits the string into the recognised part\texttt{c} and the unprocessed part \texttt{s.tail}. In case\texttt{s} does not start with \texttt{c} then the parser returns theempty set (in Scala \texttt{Set()}). Since this parser recognisescharacters and just returns characters as the processed part, theoutput type of the parser is \texttt{Char}.If we want to parse a list of tokens and interested in recognising anumber token, for example, we could write something like this\begin{center}\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily,numbers=none]case object NumParser extends Parser[List[Token], Int] { def parse(ts: List[Token]) = ts match { case Num_token(s)::rest => Set((s.toInt, rest)) case _ => Set () }}\end{lstlisting}\end{center}\noindentIn this parser the input is of type \texttt{List[Token]}. The functionparse looks at the input \texttt{ts} and checks whether the firsttoken is a \texttt{Num\_token} (let us assume our lexer generatedthese tokens for numbers). But this parser does not just return thistoken (and the rest of the list), like the \texttt{CharParser} above,rather it extracts also the string \texttt{s} from the token andconverts it into an integer. The hope is that the lexer did its workwell and this conversion always succeeds. The consequence of this isthat the output type for this parser is \texttt{Int}, not\texttt{Token}. Such a conversion would be needed in our parser,because when we encounter a number in our program, we want to dosome calculations based on integers, not strings (or tokens). These simple parsers that just look at the input and do a simpletransformation are often called \emph{atomic} parser combinators.More interesting are the parser combinators that build larger parsersout of smaller component parsers. There are three such parsercombinators that can be implemented generically. The \emph{alternative parser combinator} is as follows: given two parsers, say, $p$ and$q$, we apply both parsers to the input (remember parsers arefunctions) and combine the output (remember they are sets of pairs):\begin{center}$p(\text{input}) \cup q(\text{input})$\end{center}\noindent In Scala we can implement alternative parsercombinator as follows\begin{center}\begin{lstlisting}[language=Scala, numbers=none]class AltParser[I, T] (p: => Parser[I, T], q: => Parser[I, T]) extends Parser[I, T] { def parse(in: I) = p.parse(in) ++ q.parse(in) } \end{lstlisting}\end{center}\noindent The types of this parser combinator are again generic (wehave \texttt{I} for the input type, and \texttt{T} for the outputtype). The alternative parser builds a new parser out of two existingparsers \texttt{p} and \texttt{q} which are given as arguments. Bothparsers need to be able to process input of type \texttt{I} and returnin \texttt{parse} the same output type \texttt{Set[(T, I)]}.\footnote{There is an interesting detail of Scala, namely the \texttt{=>} in front of the types of \texttt{p} and \texttt{q}. These arrows will prevent the evaluation of the arguments before they are used. This is often called \emph{lazy evaluation} of the arguments. We will explain this later.} The alternative parser runsthe input with the first parser \texttt{p} (producing a set of pairs)and then runs the same input with \texttt{q} (producing another set ofpairs). The result should be then just the union of both sets, whichis the operation \texttt{++} in Scala.The alternative parser combinator allows us to construct a parser thatparses either a character \texttt{a} or \texttt{b} using the\texttt{CharParser} shown above. For this we can write\footnote{Note that we cannot use a \texttt{case}-class for \texttt{AltParser}s because of the problem with laziness and Scala quirks. Hating \texttt{new} like the plague, we will work around this later with some syntax tricks. ;o)}\begin{center}\begin{lstlisting}[language=Scala, numbers=none]new AltParser(CharParser('a'), CharParser('b'))\end{lstlisting}\end{center}\noindent Later on we will use Scala mechanism for introducing somemore readable shorthand notation for this, like \texttt{p"a" || p"b"}. But first let us look in detail at what this parser combinator produceswith some sample strings.\begin{center}\begin{tabular}{rcl}input strings & & output\medskip\\\texttt{\Grid{acde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}},\; \texttt{\Grid{cde}})\right\}$\\\texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}},\; \texttt{\Grid{cde}})\right\}$\\\texttt{\Grid{ccde}} & $\rightarrow$ & $\{\}$\end{tabular}\end{center}\noindent We receive in the first two cases a successful output (thatis a non-empty set). In each case, either \pcode{a} or \pcode{b} is inthe parsed part, and \pcode{cde} in the unprocessed part. Clearly thisparser cannot parse anything of the form \pcode{ccde}, therefore theempty set is returned in the last case. Observe that parsercombinators only look at the beginning of the given input: they do notfish out something in the ``middle'' of the input.A bit more interesting is the \emph{sequence parser combinator}. Giventwo parsers, say again, $p$ and $q$, we want to apply first the inputto $p$ producing a set of pairs; then apply $q$ to all the unparsed parts in the pairs; and then combine the results. Mathematically we wouldwrite something like this for the set of pairs:\begin{center}\begin{tabular}{lcl}$\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\,|\,$ & $(\textit{output}_1, u_1) \in p(\text{input}) \;\wedge\;$\\&& $(\textit{output}_2, u_2) \in q(u_1)\}$\end{tabular}\end{center}\noindent Notice that the $p$ will first be run on the input,producing pairs of the form $(\textit{output}_1, u_1)$ where the $u_1$stands for the unprocessed, or leftover, parts of $p$. We want that$q$ runs on all these unprocessed parts $u_1$. Therefore theseunprocessed parts are fed into the second parser $q$. The overallresult of the sequence parser combinator is pairs of the form$((\textit{output}_1, \textit{output}_2), u_2)$. This means theunprocessed part of the sequence parser combinator is the unprocessedpart the second parser $q$ leaves as leftover. The parsed parts of thecomponent parsers are combined in a pair, namely$(\textit{output}_1, \textit{output}_2)$. The reason is we want toknow what $p$ and $q$ were able to parse. This behaviour can beimplemented in Scala as follows:\begin{center}\begin{lstlisting}[language=Scala,numbers=none]class SeqParser[I, T, S] (p: => Parser[I, T], q: => Parser[I, S]) extends Parser[I, (T, S)] { def parse(in: I) = for ((output1, u1) <- p.parse(in); (output2, u2) <- q.parse(u1)) yield ((output1, output2), u2)}\end{lstlisting}\end{center}\noindent This parser takes again as arguments two parsers, \texttt{p}and \texttt{q}. It implements \texttt{parse} as follows: first run theparser \texttt{p} on the input producing a set of pairs(\texttt{output1}, \texttt{u1}). The \texttt{u1} stands for theunprocessed parts left over by \texttt{p} (recall that there can beseveral such pairs). Let then \texttt{q} run on these unprocessedparts producing again a set of pairs. The output of the sequenceparser combinator is then a set containing pairs where the firstcomponents are again pairs, namely what the first parser could parsetogether with what the second parser could parse; the second componentis the unprocessed part left over after running the second parser\texttt{q}. Note that the input type of the sequence parser combinatoris as usual \texttt{I}, but the output type is\begin{center}\texttt{(T, S)}\end{center}\noindentConsequently, the function \texttt{parse} in the sequence parsercombinator returns sets of type \texttt{Set[((T, S), I)]}. That meanswe have essentially two output types for the sequence parsercombinator (packaged in a pair), because in general \textit{p} and\textit{q} might produce different things (for example we recognise anumber with \texttt{p} and then with \texttt{q} a string correspondingto an operator). If any of the runs of \textit{p} and \textit{q}fail, that is produce the empty set, then \texttt{parse} will alsoproduce the empty set.With the shorthand notation we shall introduce later for the sequenceparser combinator, we can write for example \pcode{p"a" ~ p"b"}, whichis the parser combinator that first recognises the character\texttt{a} from a string and then \texttt{b}. (Actually, we will beable to write just \pcode{p"ab"} for such parsers, but it is good to understand first what happens behind the scenes.) Let us look again at some examples of how the sequence parser combinator processes some strings:\begin{center}\begin{tabular}{rcl}input strings & & output\medskip\\\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}),\; \texttt{\Grid{cde}})\right\}$\\\texttt{\Grid{bacde}} & $\rightarrow$ & $\{\}$\\\texttt{\Grid{cccde}} & $\rightarrow$ & $\{\}$\end{tabular}\end{center}\noindent In the first line we have a successful parse, because thestring starts with \texttt{ab}, which is the prefix we are lookingfor. But since the parsing combinator is constructed as sequence ofthe two simple (atomic) parsers for \texttt{a} and \texttt{b}, theresult is a nested pair of the form \texttt{((a, b), cde)}. It is\emph{not} a simple pair \texttt{(ab, cde)} as one might erroneouslyexpect. The parser returns the empty set in the other examples,because they do not fit with what the parser is supposed to parse.A slightly more complicated parser is \pcode{(p"a" || p"b") ~ p"c"} whichparses as first character either an \texttt{a} or \texttt{b}, followedby a \texttt{c}. This parser produces the following outputs.\begin{center}\begin{tabular}{rcl}input strings & & output\medskip\\\texttt{\Grid{acde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{c}}),\; \texttt{\Grid{de}})\right\}$\\\texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{c}}),\; \texttt{\Grid{de}})\right\}$\\\texttt{\Grid{abde}} & $\rightarrow$ & $\{\}$\end{tabular}\end{center}\noindentNow consider the parser \pcode{(p"a" ~ p"b") ~ p"c"} which parses\texttt{a}, \texttt{b}, \texttt{c} in sequence. This parser producesthe following outputs.\begin{center}\begin{tabular}{rcl}input strings & & output\medskip\\\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{(((\texttt{\Grid{a}},\texttt{\Grid{b}}), \texttt{\Grid{c}}),\; \texttt{\Grid{de}})\right\}$\\\texttt{\Grid{abde}} & $\rightarrow$ & $\{\}$\\\texttt{\Grid{bcde}} & $\rightarrow$ & $\{\}$\end{tabular}\end{center}\noindent The second and third example fail, because something is``missing'' in the sequence we are looking for. The first succeeds butnotice how the results nest with sequences: the parsed part is anested pair of the form \pcode{((a, b), c)}. If we nest the sequenceparser differently, say \pcode{p"a" ~ (p"b" ~ p"c")}, then alsoour output pairs nest differently\begin{center}\begin{tabular}{rcl}input strings & & output\medskip\\\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}},(\texttt{\Grid{b}}, \texttt{\Grid{c}})),\; \texttt{\Grid{de}})\right\}$\\\end{tabular}\end{center}\noindentTwo more examples: first consider the parser\pcode{(p"a" ~ p"a") ~ p"a"} and the input \pcode{aaaa}:\begin{center}\begin{tabular}{rcl}input string & & output\medskip\\\texttt{\Grid{aaaa}} & $\rightarrow$ & $\left\{(((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}})\right\}$\\\end{tabular}\end{center}\noindent Notice again how the results nest deeper and deeper as pairs (thelast \pcode{a} is in the unprocessed part). To consume everything ofthis string we can use the parser \pcode{((p"a" ~ p"a") ~ p"a") ~ p"a"}. Then the output is as follows:\begin{center}\begin{tabular}{rcl}input string & & output\medskip\\\texttt{\Grid{aaaa}} & $\rightarrow$ & $\left\{((((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{""})\right\}$\\\end{tabular}\end{center}\noindent This is an instance where the parser consumedcompletely the input, meaning the unprocessed part is just theempty string. So if we called \pcode{parse_all}, instead of \pcode{parse},we would get back the result\[\left\{(((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}})\right\}\]\noindent where the unprocessed (empty) parts have been stripped awayfrom the pairs; everything where the second part was not empty hasbeen thrown away as well, because they representultimately-unsuccessful-parses. The main point is that the sequenceparser combinator returns pairs that can nest according to thenesting of the component parsers.Consider also carefully that constructing a parser such\begin{center}\pcode{p"a" || (p"a" ~ p"b")}\end{center}\noindentwill result in a typing error. The intention with thisparser is that we want to parse either an \texttt{a}, or an \texttt{a}followed by a \texttt{b}. However, the first parser has as output typea single character (recall the type of \texttt{CharParser}), but thesecond parser produces a pair of characters as output. The alternativeparser is required to have both component parsers to have the sametype---the reason is that we need to be able to build the union of twosets, which requires in Scala that the sets have the same type. Sincethey are not in this case, there is a typing error. We will see laterhow we can build this parser without the typing error.The next parser combinator, called \emph{semantic action} or \emph{map-parser}, does notactually combine two smaller parsers, but applies a function to the resultof a parser. It is implemented in Scala as follows\begin{center}\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]class MapParser[I, T, S] (p: => Parser[I, T], f: T => S) extends Parser[I, S] { def parse(in: I) = for ((hd, tl) <- p.parse(in)) yield (f(hd), tl)}\end{lstlisting}\end{center}\noindent This parser combinator takes a parser \texttt{p} (with inputtype \texttt{I} and output type \texttt{T}) as one argument but also afunction \texttt{f} (with type \texttt{T => S}). The parser \texttt{p}produces sets of type \texttt{Set[(S, I)]}. The semantic actioncombinator then applies the function \texttt{f} to all the `processed'parser outputs. Since this function is of type \texttt{T => S}, weobtain a parser with output type \texttt{S}. Again Scala lets usintroduce some shorthand notation for this parsercombinator. Therefore we will write short \texttt{p.map(f)} for it.What are semantic actions good for? Well, they allow you to transformthe parsed input into datastructures you can use for furtherprocessing. A simple (contrived) example would be to transform parsedcharacters into ASCII numbers. Suppose we define a function \texttt{f}(from characters to \texttt{Int}s) and use a \texttt{CharParser} for parsingthe character \texttt{c}.\begin{center}\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]val f = (c: Char) => c.toIntval c = new CharParser('c')\end{lstlisting}\end{center}\noindentWe then can run the following two parsers on the input \texttt{cbd}:\begin{center}\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]c.parse("cbd")c.map(f).parse("cbd")\end{lstlisting}\end{center}\noindentIn the first line we obtain the expected result \texttt{Set(('c', "bd"))}, whereas the second produces \texttt{Set((99, "bd"))}---thecharacter has been transformed into an ASCII number.A slightly less contrived example is about parsing numbers (recall\texttt{NumParser} above). However, we want to do this here forstrings, not for tokens. For this assume we have the following(atomic) \texttt{RegexParser}.\begin{center} \begin{lstlisting}[language=Scala,xleftmargin=0mm, basicstyle=\small\ttfamily, numbers=none]import scala.util.matching.Regexcase class RegexParser(reg: Regex) extends Parser[String, String] { def parse(in: String) = reg.findPrefixMatchOf(in) match { case None => Set() case Some(m) => Set((m.matched, m.after.toString)) }}\end{lstlisting}\end{center}\noindentThis parser takes a regex as argument and splits up a string into aprefix and the rest according to this regex(\texttt{reg.findPrefixMatchOf} generates a match---in the successfulcase---and the corresponding strings can be extracted with\texttt{matched} and \texttt{after}). The input and output type forthis parser is \texttt{String}. Using \texttt{RegexParser} we candefine a \texttt{NumParser} for \texttt{Strings} to \texttt{Int} asfollows:\begin{center}\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]val NumParser = RegexParser("[0-9]+".r)\end{lstlisting}\end{center}\noindentThis parser will recognise a number at the beginning of a string. Forexample\begin{center}\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]NumParser.parse("123abc")\end{lstlisting}\end{center} \noindentproduces \texttt{Set((123,abc))}. The problem is that \texttt{123} isstill a string (the expected double-quotes are not printed byScala). We want to convert this string into the corresponding\texttt{Int}. We can do this as follows using a semantic action\begin{center}\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]NumParser.map{s => s.toInt}.parse("123abc")\end{lstlisting}\end{center} \noindentThe function in the semantic action converts a string into an\texttt{Int}. Now \texttt{parse} generates \texttt{Set((123,abc))},but this time \texttt{123} is an \texttt{Int}. Think carefully whatthe input and output type of the parser is without the semantic actionand what with the semantic action (the type of the function canalready tell you). Let us come back to semantic actions when we aregoing to implement actual context-free grammars.\subsubsection*{Shorthand notation for parser combinators}Before we proceed, let us just explain the shorthand notation forparser combinators. Like for regular expressions, the shorthandnotation will make our life much easier when writing actualparsers. We can define some extensions\footnote{In Scala 2 this was generically called as ``implicits''.} which allow us to write\begin{center}\begin{tabular}{ll} \pcode{p || q} & alternative parser\\ \pcode{p ~ q} & sequence parser\\ \pcode{p.map(f)} & semantic action parser\end{tabular}\end{center}\noindentWe will also use the \texttt{p}-string-interpolation for specifying simple string parsers.The idea is that this shorthand notation allows us to easily translatecontext-free grammars into code. For example recall our context-freegrammar for palindromes:\begin{plstx}[margin=3cm]: \meta{Pal} ::= a\cdot \meta{Pal}\cdot a | b\cdot \meta{Pal}\cdot b | a | b | \epsilon\\\end{plstx}\noindentEach alternative in this grammar translates into an alternative parsercombinator. The $\cdot$ can be translated to a sequence parsercombinator. The parsers for $a$, $b$ and $\epsilon$ can be simplywritten as \texttt{p"a"}, \texttt{p"b"} and \texttt{p""}.\subsubsection*{How to build more interesting parsers using parser combinators?}The beauty of parser combinators is the ease with which they can beimplemented and how easy it is to translate context-free grammars intocode (though the grammars need to be non-left-recursive). Todemonstrate this consider again the grammar for palindromes from above.The first idea would be to translate it into the following code\begin{center}\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]lazy val Pal : Parser[String, String] = ((p"a" ~ Pal ~ p"a") || (p"b" ~ Pal ~ p"b") || p"a" || p"b" || p"")\end{lstlisting}\end{center}\noindentUnfortunately, this does not quite work yet as it produces a typingerror. The reason is that the parsers \texttt{p"a"}, \texttt{p"b"} and\texttt{p""} all produce strings as output type and therefore can beput into an alternative \texttt{...|| p"a" || p"b" || p""}. But bothsequence parsers \pcode{p"a" ~ Pal ~ p"a"} and \pcode{p"b" ~ Pal ~ p"b"}produce pairs of the form\begin{center}(((\texttt{a}-part, \texttt{Pal}-part), \texttt{a}-part), unprocessed part)\end{center}\noindent That is how thesequence parser combinator nests results when \pcode{\~} is usedbetween two components. The solution is to use a semantic action that``flattens'' these pairs and appends the corresponding strings, like\begin{center}\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]lazy val Pal : Parser[String, String] = ((p"a" ~ Pal ~ p"a").map{ case ((x, y), z) => x + y + z } || (p"b" ~ Pal ~ p"b").map{ case ((x, y), z) => x + y + z } || p"a" || p"b" || p"")\end{lstlisting}\end{center}\noindentHow does this work? Well, recall again what the pairs look like forthe parser \pcode{p"a" ~ Pal ~ p"a"}. The pattern in the semanticaction matches the nested pairs (the \texttt{x} with the\texttt{a}-part and so on). Unfortunately when we have such nestedpairs, Scala requires us to define the function using the\pcode{case}-syntax\begin{center}\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]{ case ((x, y), z) => ... }\end{lstlisting}\end{center}\noindentIf we have more sequence parser combinators or have them differently nested,then the pattern in the semantic action needs to be adjusted accordingly.The action we implement above is to concatenate all three strings, whichmeans after the semantic action is applied the output type of the parser is \texttt{String}, which means it fits with the alternative parsers\texttt{...|| p"a" || p"b" || p""}.If we run the parser above with \pcode{Pal.parse_all("abaaaba")} we obtainas result the \pcode{Set(abaaaba)}, which indicates that the string is a palindrome(an empty set would mean something is wrong). But also notice what theintermediate results are generated by \pcode{Pal.parse("abaaaba")}\begin{center}\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]Set((abaaaba,""),(aba,aaba), (a,baaaba), ("",abaaaba))\end{lstlisting}\end{center}\noindentThat there are more than one output might be slightly unexpected, butcan be explained as follows: the pairs represent all possible(partial) parses of the string \pcode{"abaaaba"}. The first pair abovecorresponds to a complete parse (all output is consumed) and this iswhat \pcode{Pal.parse_all} returns. The second pair is a small``sub-palindrome'' that can also be parsed, but the parse fails withthe rest \pcode{aaba}, which is therefore left as unprocessed. Thethird one is an attempt to parse the whole string with thesingle-character parser \pcode{a}. That of course only partiallysucceeds, by leaving \pcode{"baaaba"} as the unprocessedpart. Finally, since we allow the empty string to be a palindrome wealso obtain the last pair, where actually nothing is consumed from theinput string. While all this works as intended, we need to be carefulwith this (especially with including the \pcode{""} parser in ourgrammar): if during parsing the set of parsing attempts gets too big,then the parsing process can become very slow as the potentialcandidates for applying rules can snowball.Important is also to note is that we must define the\texttt{Pal}-parser as a \emph{lazy} value in Scala. Look again at thecode: \texttt{Pal} occurs on the right-hand side of the definition. If we hadjust written\begin{center}\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]val Pal : Parser[String, String] = ...rhs...\end{lstlisting}\end{center}\noindentthen Scala before making this assignment to \texttt{Pal} attempts tofind out what the expression on the right-hand side evaluates to. Thisis straightforward in case of simple expressions \texttt{2 + 3}, butthe expression above contains \texttt{Pal} in the right-handside. Without \pcode{lazy} it would try to evaluate what \texttt{Pal}evaluates to and start a new recursion, which means it falls into aninfinite loop. The definition of \texttt{Pal} is recursive and the\pcode{lazy} key-word prevents it from being fully evaluated. Thereforewhenever we want to define a recursive parser we have to write\begin{center}\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]lazy val SomeParser : Parser[...,...] = ...rhs...\end{lstlisting}\end{center}\noindent That was not necessary for our atomic parsers, like\texttt{RegexParser} or \texttt{CharParser}, because they are not recursive.Note that this is also the reason why we had to write\begin{center}\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]class AltParser[I, T] (p: => Parser[I, T], q: => Parser[I, T]) ... extends Parser[I, T] {...}class SeqParser[I, T, S] (p: => Parser[I, T], q: => Parser[I, S]) ... extends Parser[I, (T, S)] {...}\end{lstlisting}\end{center}\noindent where the \texttt{\textbf{\textcolor{codepurple}{=>}}} in front ofthe argument types for \texttt{p} and \texttt{q} prevent Scala fromevaluating the arguments. Normally, Scala would first evaluate whatkind of parsers \texttt{p} and \texttt{q} are, and only then generatethe alternative parser combinator, respectively sequence parsercombinator. Since the arguments can be recursive parsers, such as\texttt{Pal}, this would lead again to an infinite loop.As a final example in this section, let us consider the grammar forwell-nested parentheses:\begin{plstx}[margin=3cm]: \meta{P} ::= (\cdot \meta{P}\cdot ) \cdot \meta{P} | \epsilon\\\end{plstx}\noindentLet us assume we want to not just recognise strings ofwell-nested parentheses but also transform round parenthesesinto curly braces. We can do this by using a semanticaction:\begin{center} \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, xleftmargin=0mm, numbers=none]lazy val P : Parser[String, String] = (p"(" ~ P ~ p")" ~ P).map{ case (((_,x),_),y) => "{" + x + "}" + y } || p""\end{lstlisting}\end{center}\noindentHere we define a function where which ignores the parentheses in thepairs, but replaces them in the right places with curly braces whenassembling the new string in the right-hand side. If we run\pcode{P.parse_all("(((()()))())")} we obtain\texttt{Set(\{\{\{\{\}\{\}\}\}\{\}\})} as expected.\subsubsection*{Implementing an Interpreter}The first step before implementing an interpreter for a full-blownlanguage is to implement a simple calculator for arithmeticexpressions. Suppose our arithmetic expressions are given by thegrammar:\begin{plstx}[margin=3cm,one per line]: \meta{E} ::= \meta{E} \cdot + \cdot \meta{E} | \meta{E} \cdot - \cdot \meta{E} | \meta{E} \cdot * \cdot \meta{E} | ( \cdot \meta{E} \cdot ) | Number \\\end{plstx}\noindentNaturally we want to implement the grammar in such a way that we cancalculate what the result of, for example, \texttt{4*2+3} is---we areinterested in an \texttt{Int} rather than a string. This means everycomponent parser needs to have as output type \texttt{Int} and when weassemble the intermediate results, strings like \texttt{"+"},\texttt{"*"} and so on, need to be translated into the appropriateScala operation of adding, multiplying and so on. Being inspired bythe parser for well-nested parentheses above and ignoring the factthat we want $*$ to take precedence over $+$ and $-$, we might want towrite something like\begin{center}\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]lazy val E: Parser[String, Int] = ((E ~ p"+" ~ E).map{ case ((x, y), z) => x + z} || (E ~ p"-" ~ E).map{ case ((x, y), z) => x - z} || (E ~ p"*" ~ E).map{ case ((x, y), z) => x * z} || (p"(" ~ E ~ p")").map{ case ((x, y), z) => y} || NumParserInt)\end{lstlisting}\end{center}\noindentConsider again carefully how the semantic actions pick out the correctarguments for the calculation. In case of plus, we need \texttt{x} and\texttt{z}, because they correspond to the results of the componentparser \texttt{E}. We can just add \texttt{x + z} in order to obtainan \texttt{Int} because the output type of \texttt{E} is\texttt{Int}. Similarly with subtraction and multiplication. Incontrast in the fourth clause we need to return \texttt{y}, because itis the result enclosed inside the parentheses. The information aboutparentheses, roughly speaking, we just throw away.So far so good. The problem arises when we try to call \pcode{parse_all} with theexpression \texttt{"1+2+3"}. Lets try it\begin{center}\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]E.parse_all("1+2+3")\end{lstlisting}\end{center}\noindent\ldots and we wait and wait and \ldots still wait. What is theproblem? Actually, the parser just fell into an infinite loop! Thereason is that the above grammar is left-recursive and recall that ourparser combinators cannot deal with such left-recursivegrammars. Fortunately, every left-recursive context-free grammar can betransformed into a non-left-recursive grammars that still recognisesthe same strings. This allows us to design the following grammar\begin{plstx}[margin=3cm] : \meta{E} ::= \meta{T} \cdot + \cdot \meta{E} | \meta{T} \cdot - \cdot \meta{E} | \meta{T}\\: \meta{T} ::= \meta{F} \cdot * \cdot \meta{T} | \meta{F}\\: \meta{F} ::= ( \cdot \meta{E} \cdot ) | Number\\\end{plstx}\noindentRecall what left-recursive means from Handout 5 and make sure you seewhy this grammar is \emph{non} left-recursive. This version of the grammaralso deals with the fact that $*$ should have a higher precedence than $+$ and $-$. This does notaffect which strings this grammar can recognise, but in which order we are goingto evaluate any arithmetic expression. We can translate this grammar intoparsing combinators as follows:\begin{center}\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]lazy val E: Parser[String, Int] = (T ~ p"+" ~ E).map{ case ((x, y), z) => x + z } || (T ~ p"-" ~ E).map{ case ((x, y), z) => x - z } || T lazy val T: Parser[String, Int] = (F ~ p"*" ~ T).map{ case ((x, y), z) => x * z } || Flazy val F: Parser[String, Int] = (p"(" ~ E ~ p")").map{ case ((x, y), z) => y } || NumParserInt\end{lstlisting}\end{center}\noindentLet us try out some examples:\begin{center}\begin{tabular}{rcl} input strings & & output of \pcode{parse_all}\medskip\\ \texttt{\Grid{1+2+3}} & $\rightarrow$ & \texttt{Set(6)}\\ \texttt{\Grid{4*2+3}} & $\rightarrow$ & \texttt{Set(11)}\\ \texttt{\Grid{4*(2+3)}} & $\rightarrow$ & \texttt{Set(20)}\\ \texttt{\Grid{(4)*((2+3))}} & $\rightarrow$ & \texttt{Set(20)}\\ \texttt{\Grid{4/2+3}} & $\rightarrow$ & \texttt{Set()}\\ \texttt{\Grid{1\VS +\VS 2\VS +\VS 3}} & $\rightarrow$ & \texttt{Set()}\\ \end{tabular}\end{center}\noindentNote that we call \pcode{parse_all}, not \pcode{parse}. The examplesshould be quite self-explanatory. The last two example do not produceany integer result because our parser does not define what to do incase of division (could be easily added), but also has no idea what todo with whitespaces. To deal with them is the task of the lexer! Yes,we can deal with them inside the grammar, but that would render manygrammars becoming unintelligible, including this one.\footnote{If you think an easy solution is to extend the notion of what a number should be, then think again---you still would have to deal with cases like \texttt{\Grid{(\VS (\VS 2+3)\VS )}}. Just think of the mess you would have in a grammar for a full-blown language where there are numerous such cases.}\end{document}%%% Local Variables: %%% mode: latex %%% TeX-master: t%%% End: