afl-material: handouts/ho06.tex@0dfa2923a7c6


% !TEX program = xelatex
\documentclass{article}
\usepackage{../style}
\usepackage{../langs}
\usepackage{../grammar}
\usepackage{../graphics}

\begin{document}

\section*{Handout 6 (Parser Combinators)}

This handout explains how \emph{parser combinators} work and how they
can be implemented in Scala. Their most distinguishing feature is that
they are very easy to implement (admittedly it is only easy in a
functional programming language).  Another good point of parser
combinators is that they can deal with any kind of input as long as
this input is of ``sequence-kind'', for example a string or a list of
tokens. The only two properties of the input we need is to be able to
test when it is empty and ``sequentially'' take it apart. Strings and
lists fit this bill. However, parser combinators also have their
drawbacks. For example they require that the grammar to be parsed is
\emph{not} left-recursive and they are efficient only when the grammar
is unambiguous. It is the responsibility of the grammar designer to
ensure these two properties hold.

The general idea behind parser combinators is to transform the input
into sets of pairs, like so

\begin{center}
$\underbrace{\text{list of tokens}}_{\text{input}}$ 
$\quad\Rightarrow\quad$
$\underbrace{\text{set of (parsed part, unprocessed part)}}_{\text{output}}$
\end{center} 

\noindent
Given the extended effort we have spent implementing a lexer in order
to generate lists of tokens, it might be surprising that in what
follows we shall often use strings as input, rather than lists of
tokens. This is for making the explanation more lucid and ensure the
examples are simple. It does not make our previous work on lexers obsolete
(remember they transform a string into a list of tokens). Lexers will
still be needed for building a somewhat realistic compiler. See also
a question in the homework regarding this issue.

As mentioned above, parser combinators are relatively agnostic about what
kind of input they process. In my Scala code I use the following
polymorphic types for parser combinators:

\begin{center}
input:\;\; \texttt{I}  \qquad output:\;\; \texttt{T}
\end{center}

\noindent That is they take as input something of type \texttt{I} and
return a set of pairs of type \texttt{Set[(T, I)]}. Since the input
needs to be of ``sequence-kind'', I actually have to often write
\code{(using is: I => Seq[_])} for the input type. This ensures the
input is a subtype of Scala sequences.\footnote{This is a new feature
  in Scala 3 and is about type-classes, meaning if you use Scala 2 you will have difficulties
  with running my code.} The first component of the generated pairs
corresponds to what the parser combinator was able to parse from the
input and the second is the unprocessed, or leftover, part of the
input (therefore the type of this unprocessed part is the same as the
input). A parser combinator might return more than one such pair; the
idea is that there are potentially several ways of how to parse the
input.  As a concrete example, consider the string

\begin{center}
\tt\Grid{iffoo\VS testbar}
\end{center}

\noindent We might have a parser combinator which tries to
interpret this string as a keyword (\texttt{if}) or as an
identifier (\texttt{iffoo}). Then the output will be the set

\begin{center}
$\left\{ \left(\texttt{\Grid{if}}\;,\; \texttt{\Grid{foo\VS testbar}}\right), 
           \left(\texttt{\Grid{iffoo}}\;,\; \texttt{\Grid{\VS testbar}}\right) \right\}$
\end{center}

\noindent where the first pair means the parser could recognise
\texttt{if} from the input and leaves the \texttt{foo\VS testbar} as
unprocessed part; in the other case it could recognise
\texttt{iffoo} and leaves \texttt{\VS testbar} as unprocessed. If the
parser cannot recognise anything from the input at all, then parser
combinators just return the empty set $\{\}$. This will indicate
something ``went wrong''\ldots or more precisely, nothing could be
parsed.

Also important to note is that the output type \texttt{T} for the
processed part can potentially be different from the input type
\texttt{I} in the parser. In the example above is just happens to be
the same. The reason for the difference is that in general we are
interested in transforming our input into something
``different''\ldots for example into a tree; or if we implement the
grammar for arithmetic expressions, we might be interested in the
actual integer number the arithmetic expression, say \texttt{1 + 2 *
  3}, stands for. In this way we can use parser combinators to
implement relatively easily a calculator, for instance (we shall do
this later on).

The main driving force behind parser combinators is that we can easily
build parser combinators out of smaller components following very
closely the structure of a grammar. In order to implement this in a
functional/object-oriented programming language, like Scala, we need
to specify an abstract class for parser combinators. In the abstract
class we specify that \texttt{I} is the \emph{input type} of the
parser combinator and that \texttt{T} is the \emph{output type}.  This
implies that the function \texttt{parse} takes an argument of type
\texttt{I} and returns a set of type \mbox{\texttt{Set[(T,
    I)]}}.\footnote{In the actual code on KEATS there is some
  kerfuffle about correct type-bounds and using clauses. Ignore this
  part of the implementation for the moment---it is not needed for
  understanding how the code works.}

\begin{center}
\begin{lstlisting}[language=Scala]
abstract class Parser[I, T]  {
  def parse(in: I): Set[(T, I)]  

  def parse_all(in: I) : Set[T] =
    for ((hd, tl) <- parse(in); 
        if tl.isEmpty) yield hd
}
\end{lstlisting}
\end{center}

\noindent It is the obligation in each instance of this class to
supply an implementation for \texttt{parse}.  From this function we
can then ``centrally'' derive the function \texttt{parse\_all}, which
just filters out all pairs whose second component is not empty (that
is has still some unprocessed part). The reason is that at the end of
the parsing we are only interested in the results where all the input
has been consumed and no unprocessed part is left over.

One of the simplest parser combinators recognises just a
single character, say $c$, from the beginning of strings. Its
behaviour can be described as follows:

\begin{itemize}
\item If the head of the input string $s$ starts with a $c$, then return 
  the set
  \[\{(c, \textit{tail-of-}s)\}\]
  where \textit{tail of} 
  $s$ is the unprocessed part of the input string.
\item Otherwise return the empty set $\{\}$.	
\end{itemize}

\noindent
The input type of this simple parser combinator is \texttt{String} and
the output type is \texttt{Char}. This means \texttt{parse} returns
\mbox{\texttt{Set[(Char, String)]}}.  The code in Scala is as follows:

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
case class CharParser(c: Char) extends Parser[String, Char] {
  def parse(s: String) = 
    if (s != "" && s.head == c) Set((c, s.tail)) else Set()
}
\end{lstlisting}
\end{center}

\noindent You can see \texttt{parse} tests here whether the
first character of the input string \texttt{s} is equal to
\texttt{c}. If yes, then it splits the string into the recognised part
\texttt{c} and the unprocessed part \texttt{s.tail}. In case
\texttt{s} does not start with \texttt{c} then the parser returns the
empty set (in Scala \texttt{Set()}). Since this parser recognises
characters and just returns characters as the processed part, the
output type of the parser is \texttt{Char}.

If we want to parse a list of tokens and interested in recognising a
number token, for example, we could write something like this

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily,numbers=none]
case object NumParser extends Parser[List[Token], Int] {
  def parse(ts: List[Token]) = ts match {
    case Num_token(s)::rest => Set((s.toInt, rest)) 
    case _ => Set ()
  }
}
\end{lstlisting}
\end{center}

\noindent
In this parser the input is of type \texttt{List[Token]}. The function
parse looks at the input \texttt{ts} and checks whether the first
token is a \texttt{Num\_token} (let us assume our lexer generated
these tokens for numbers). But this parser does not just return this
token (and the rest of the list), like the \texttt{CharParser} above,
rather it extracts also the string \texttt{s} from the token and
converts it into an integer. The hope is that the lexer did its work
well and this conversion always succeeds. The consequence of this is
that the output type for this parser is \texttt{Int}, not
\texttt{Token}. Such a conversion would be needed in our parser,
because when we encounter a number in our program, we want to do
some calculations based on integers, not strings (or tokens). 


These simple parsers that just look at the input and do a simple
transformation are often called \emph{atomic} parser combinators.
More interesting are the parser combinators that build larger parsers
out of smaller component parsers. There are three such parser
combinators that can be implemented generically. The \emph{alternative
  parser combinator} is as follows: given two parsers, say, $p$ and
$q$, we apply both parsers to the input (remember parsers are
functions) and combine the output (remember they are sets of pairs):
 
\begin{center}
$p(\text{input}) \cup q(\text{input})$
\end{center}

\noindent In Scala we can implement alternative parser
combinator as follows

\begin{center}
\begin{lstlisting}[language=Scala, numbers=none]
class AltParser[I, T]
         (p: => Parser[I, T], 
          q: => Parser[I, T]) extends Parser[I, T] {
  def parse(in: I) = p.parse(in) ++ q.parse(in)   
}    
\end{lstlisting}
\end{center}

\noindent The types of this parser combinator are again generic (we
have \texttt{I} for the input type, and \texttt{T} for the output
type). The alternative parser builds a new parser out of two existing
parsers \texttt{p} and \texttt{q} which are given as arguments.  Both
parsers need to be able to process input of type \texttt{I} and return
in \texttt{parse} the same output type \texttt{Set[(T,
  I)]}.\footnote{There is an interesting detail of Scala, namely the
  \texttt{=>} in front of the types of \texttt{p} and \texttt{q}. These arrows
  will prevent the evaluation of the arguments before they are
  used. This is often called \emph{lazy evaluation} of the
  arguments. We will explain this later.}  The alternative parser runs
the input with the first parser \texttt{p} (producing a set of pairs)
and then runs the same input with \texttt{q} (producing another set of
pairs).  The result should be then just the union of both sets, which
is the operation \texttt{++} in Scala.

The alternative parser combinator allows us to construct a parser that
parses either a character \texttt{a} or \texttt{b} using the
\texttt{CharParser} shown above. For this we can write\footnote{Note
  that we cannot use a \texttt{case}-class for \texttt{AltParser}s
  because of the problem with laziness and Scala quirks. Hating
  \texttt{new} like the plague, we will work around this later with
  some syntax tricks. ;o)}

\begin{center}
\begin{lstlisting}[language=Scala, numbers=none]
new AltParser(CharParser('a'), CharParser('b'))
\end{lstlisting}
\end{center}

\noindent Later on we will use Scala mechanism for introducing some
more readable shorthand notation for this, like \texttt{p"a" ||
  p"b"}. But first let us look in detail at what this parser combinator produces
with some sample strings.

\begin{center}
\begin{tabular}{rcl}
input strings & & output\medskip\\
\texttt{\Grid{acde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}},\; \texttt{\Grid{cde}})\right\}$\\
\texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}},\; \texttt{\Grid{cde}})\right\}$\\
\texttt{\Grid{ccde}} & $\rightarrow$ & $\{\}$
\end{tabular}
\end{center}

\noindent We receive in the first two cases a successful output (that
is a non-empty set). In each case, either \pcode{a} or \pcode{b} is in
the parsed part, and \pcode{cde} in the unprocessed part. Clearly this
parser cannot parse anything of the form \pcode{ccde}, therefore the
empty set is returned in the last case. Observe that parser
combinators only look at the beginning of the given input: they do not
fish out something in the ``middle'' of the input.

A bit more interesting is the \emph{sequence parser combinator}. Given
two parsers, say again, $p$ and $q$, we want to apply first the input
to $p$ producing a set of pairs; then apply $q$ to all the unparsed  
parts in the pairs; and then combine the results. Mathematically we would
write something like this for the set of pairs:

\begin{center}
\begin{tabular}{lcl}
$\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\,|\,$ & 
$(\textit{output}_1, u_1) \in p(\text{input}) 
\;\wedge\;$\\
&& $(\textit{output}_2, u_2) \in q(u_1)\}$
\end{tabular}
\end{center}

\noindent Notice that the $p$ will first be run on the input,
producing pairs of the form $(\textit{output}_1, u_1)$ where the $u_1$
stands for the unprocessed, or leftover, parts of $p$. We want that
$q$ runs on all these unprocessed parts $u_1$. Therefore these
unprocessed parts are fed into the second parser $q$. The overall
result of the sequence parser combinator is pairs of the form
$((\textit{output}_1, \textit{output}_2), u_2)$. This means the
unprocessed part of the sequence parser combinator is the unprocessed
part the second parser $q$ leaves as leftover. The parsed parts of the
component parsers are combined in a pair, namely
$(\textit{output}_1, \textit{output}_2)$. The reason is we want to
know what $p$ and $q$ were able to parse. This behaviour can be
implemented in Scala as follows:

\begin{center}
\begin{lstlisting}[language=Scala,numbers=none]
class SeqParser[I, T, S]
       (p: => Parser[I, T], 
        q: => Parser[I, S]) extends Parser[I, (T, S)] {
  def parse(in: I) = 
    for ((output1, u1) <- p.parse(in); 
         (output2, u2) <- q.parse(u1)) 
            yield ((output1, output2), u2)
}
\end{lstlisting}
\end{center}

\noindent This parser takes again as arguments two parsers, \texttt{p}
and \texttt{q}. It implements \texttt{parse} as follows: first run the
parser \texttt{p} on the input producing a set of pairs
(\texttt{output1}, \texttt{u1}). The \texttt{u1} stands for the
unprocessed parts left over by \texttt{p} (recall that there can be
several such pairs).  Let then \texttt{q} run on these unprocessed
parts producing again a set of pairs. The output of the sequence
parser combinator is then a set containing pairs where the first
components are again pairs, namely what the first parser could parse
together with what the second parser could parse; the second component
is the unprocessed part left over after running the second parser
\texttt{q}. Note that the input type of the sequence parser combinator
is as usual \texttt{I}, but the output type is

\begin{center}
\texttt{(T, S)}
\end{center}

\noindent
Consequently, the function \texttt{parse} in the sequence parser
combinator returns sets of type \texttt{Set[((T, S), I)]}.  That means
we have essentially two output types for the sequence parser
combinator (packaged in a pair), because in general \textit{p} and
\textit{q} might produce different things (for example we recognise a
number with \texttt{p} and then with \texttt{q} a string corresponding
to an operator).  If any of the runs of \textit{p} and \textit{q}
fail, that is produce the empty set, then \texttt{parse} will also
produce the empty set.

With the shorthand notation we shall introduce later for the sequence
parser combinator, we can write for example \pcode{p"a" ~ p"b"}, which
is the parser combinator that first recognises the character
\texttt{a} from a string and then \texttt{b}. (Actually, we will be
able to write just \pcode{p"ab"} for such parsers, but it is good to
  understand first what happens behind the scenes.) Let us look again
  at some examples of how the sequence parser combinator processes some
  strings:

\begin{center}
\begin{tabular}{rcl}
input strings & & output\medskip\\
\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}),\; \texttt{\Grid{cde}})\right\}$\\
\texttt{\Grid{bacde}} & $\rightarrow$ & $\{\}$\\
\texttt{\Grid{cccde}} & $\rightarrow$ & $\{\}$
\end{tabular}
\end{center}

\noindent In the first line we have a successful parse, because the
string starts with \texttt{ab}, which is the prefix we are looking
for. But since the parsing combinator is constructed as sequence of
the two simple (atomic) parsers for \texttt{a} and \texttt{b}, the
result is a nested pair of the form \texttt{((a, b), cde)}. It is
\emph{not} a simple pair \texttt{(ab, cde)} as one might erroneously
expect.  The parser returns the empty set in the other examples,
because they do not fit with what the parser is supposed to parse.


A slightly more complicated parser is \pcode{(p"a" || p"b") ~ p"c"} which
parses as first character either an \texttt{a} or \texttt{b}, followed
by a \texttt{c}. This parser produces the following outputs.

\begin{center}
\begin{tabular}{rcl}
input strings & & output\medskip\\
\texttt{\Grid{acde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{c}}),\; \texttt{\Grid{de}})\right\}$\\
\texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{c}}),\; \texttt{\Grid{de}})\right\}$\\
\texttt{\Grid{abde}} & $\rightarrow$ & $\{\}$
\end{tabular}
\end{center}

\noindent
Now consider the parser \pcode{(p"a" ~ p"b") ~ p"c"} which parses
\texttt{a}, \texttt{b}, \texttt{c} in sequence. This parser produces
the following outputs.

\begin{center}
\begin{tabular}{rcl}
input strings & & output\medskip\\
\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{(((\texttt{\Grid{a}},\texttt{\Grid{b}}), \texttt{\Grid{c}}),\; \texttt{\Grid{de}})\right\}$\\
\texttt{\Grid{abde}} & $\rightarrow$ & $\{\}$\\
\texttt{\Grid{bcde}} & $\rightarrow$ & $\{\}$
\end{tabular}
\end{center}


\noindent The second and third example fail, because something is
``missing'' in the sequence we are looking for. The first succeeds but
notice how the results nest with sequences: the parsed part is a
nested pair of the form \pcode{((a, b), c)}. If we nest the sequence
parser differently, say \pcode{p"a" ~ (p"b" ~ p"c")}, then also
our output pairs nest differently

\begin{center}
\begin{tabular}{rcl}
input strings & & output\medskip\\
\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}},(\texttt{\Grid{b}}, \texttt{\Grid{c}})),\; \texttt{\Grid{de}})\right\}$\\
\end{tabular}
\end{center}

\noindent
Two more examples: first consider the parser
\pcode{(p"a" ~ p"a") ~ p"a"} and the input \pcode{aaaa}:

\begin{center}
\begin{tabular}{rcl}
input string & & output\medskip\\
\texttt{\Grid{aaaa}} & $\rightarrow$ & 
$\left\{(((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}})\right\}$\\
\end{tabular}
\end{center}

\noindent Notice again how the results nest deeper and deeper as pairs (the
last \pcode{a} is in the unprocessed part). To consume everything of
this string we can use the parser \pcode{((p"a" ~ p"a") ~ p"a") ~
  p"a"}. Then the output is as follows:

\begin{center}
\begin{tabular}{rcl}
input string & & output\medskip\\
\texttt{\Grid{aaaa}} & $\rightarrow$ & 
$\left\{((((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{""})\right\}$\\
\end{tabular}
\end{center}

\noindent This is an instance where the parser consumed
completely the input, meaning the unprocessed part is just the
empty string. So if we called \pcode{parse_all}, instead of \pcode{parse},
we would get back the result

\[
\left\{(((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}})\right\}
\]

\noindent where the unprocessed (empty) parts have been stripped away
from the pairs; everything where the second part was not empty has
been thrown away as well, because they represent
ultimately-unsuccessful-parses. The main point is that the sequence
parser combinator returns pairs that can nest according to the
nesting of the component parsers.


Consider also carefully that constructing a parser such

\begin{center}
\pcode{p"a" || (p"a" ~ p"b")}
\end{center}

\noindent
will result in a typing error. The intention with this
parser is that we want to parse either an \texttt{a}, or an \texttt{a}
followed by a \texttt{b}. However, the first parser has as output type
a single character (recall the type of \texttt{CharParser}), but the
second parser produces a pair of characters as output. The alternative
parser is required to have both component parsers to have the same
type---the reason is that we need to be able to build the union of two
sets, which requires in Scala that the sets have the same type.  Since
they are not in this case, there is a typing error.  We will see later
how we can build this parser without the typing error.

The next parser combinator, called \emph{semantic action} or \emph{map-parser}, does not
actually combine two smaller parsers, but applies a function to the result
of a parser.  It is implemented in Scala as follows

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
class MapParser[I, T, S]
        (p: => Parser[I, T], 
         f: T => S) extends Parser[I, S] {
  def parse(in: I) =
       for ((hd, tl) <- p.parse(in)) yield (f(hd), tl)
}
\end{lstlisting}
\end{center}


\noindent This parser combinator takes a parser \texttt{p} (with input
type \texttt{I} and output type \texttt{T}) as one argument but also a
function \texttt{f} (with type \texttt{T => S}). The parser \texttt{p}
produces sets of type \texttt{Set[(S, I)]}. The semantic action
combinator then applies the function \texttt{f} to all the `processed'
parser outputs. Since this function is of type \texttt{T => S}, we
obtain a parser with output type \texttt{S}. Again Scala lets us
introduce some shorthand notation for this parser
combinator. Therefore we will write short \texttt{p.map(f)} for it.

What are semantic actions good for? Well, they allow you to transform
the parsed input into datastructures you can use for further
processing. A simple (contrived) example would be to transform parsed
characters into ASCII numbers. Suppose we define a function \texttt{f}
(from characters to \texttt{Int}s) and use a \texttt{CharParser} for parsing
the character \texttt{c}.


\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
val f = (c: Char) => c.toInt
val c = new CharParser('c')
\end{lstlisting}
\end{center}

\noindent
We then can run the following two parsers on the input \texttt{cbd}:

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
c.parse("cbd")
c.map(f).parse("cbd")
\end{lstlisting}
\end{center}

\noindent
In the first line we obtain the expected result \texttt{Set(('c',
  "bd"))}, whereas the second produces \texttt{Set((99, "bd"))}---the
character has been transformed into an ASCII number.

A slightly less contrived example is about parsing numbers (recall
\texttt{NumParser} above). However, we want to do this here for
strings, not for tokens.  For this assume we have the following
(atomic) \texttt{RegexParser}.

\begin{center}
  \begin{lstlisting}[language=Scala,xleftmargin=0mm,
    basicstyle=\small\ttfamily, numbers=none]
import scala.util.matching.Regex
    
case class RegexParser(reg: Regex) extends Parser[String, String] {
  def parse(in: String) = reg.findPrefixMatchOf(in) match {
    case None => Set()
    case Some(m) => Set((m.matched, m.after.toString))  
  }
}
\end{lstlisting}
\end{center}

\noindent
This parser takes a regex as argument and splits up a string into a
prefix and the rest according to this regex
(\texttt{reg.findPrefixMatchOf} generates a match---in the successful
case---and the corresponding strings can be extracted with
\texttt{matched} and \texttt{after}). The input and output type for
this parser is \texttt{String}. Using \texttt{RegexParser} we can
define a \texttt{NumParser} for \texttt{Strings} to \texttt{Int} as
follows:

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
val NumParser = RegexParser("[0-9]+".r)
\end{lstlisting}
\end{center}

\noindent
This parser will recognise a number at the beginning of a string. For
example

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
NumParser.parse("123abc")
\end{lstlisting}
\end{center}  

\noindent
produces \texttt{Set((123,abc))}. The problem is that \texttt{123} is
still a string (the expected double-quotes are not printed by
Scala). We want to convert this string into the corresponding
\texttt{Int}. We can do this as follows using a semantic action

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
NumParser.map{s => s.toInt}.parse("123abc")
\end{lstlisting}
\end{center}  

\noindent
The function in the semantic action converts a string into an
\texttt{Int}. Now \texttt{parse} generates \texttt{Set((123,abc))},
but this time \texttt{123} is an \texttt{Int}. Think carefully what
the input and output type of the parser is without the semantic action
and what with the semantic action (the type of the function can
already tell you). Let us come back to semantic actions when we are
going to implement actual context-free grammars.

\subsubsection*{Shorthand notation for parser combinators}

Before we proceed, let us just explain the shorthand notation for
parser combinators. Like for regular expressions, the shorthand
notation will make our life much easier when writing actual
parsers. We can define some extensions\footnote{In Scala 2 this was
  generically called as ``implicits''.} which allow us to write

\begin{center}
\begin{tabular}{ll}  
  \pcode{p || q} & alternative parser\\
  \pcode{p ~ q} & sequence parser\\ 
  \pcode{p.map(f)} & semantic action parser
\end{tabular}
\end{center}

\noindent
We will also use the \texttt{p}-string-interpolation for specifying simple string parsers.

The idea is that this shorthand notation allows us to easily translate
context-free grammars into code. For example recall our context-free
grammar for palindromes:

\begin{plstx}[margin=3cm]
: \meta{Pal} ::=  a\cdot \meta{Pal}\cdot a | b\cdot \meta{Pal}\cdot b | a | b | \epsilon\\
\end{plstx}

\noindent
Each alternative in this grammar translates into an alternative parser
combinator.  The $\cdot$ can be translated to a sequence parser
combinator. The parsers for $a$, $b$ and $\epsilon$ can be simply
written as \texttt{p"a"}, \texttt{p"b"} and \texttt{p""}.


\subsubsection*{How to build more interesting parsers using parser combinators?}

The beauty of parser combinators is the ease with which they can be
implemented and how easy it is to translate context-free grammars into
code (though the grammars need to be non-left-recursive). To
demonstrate this consider again the grammar for palindromes from above.
The first idea would be to translate it into the following code

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
lazy val Pal : Parser[String, String] = 
  ((p"a" ~ Pal ~ p"a") || (p"b" ~ Pal ~ p"b") || p"a" || p"b" || p"")
\end{lstlisting}
\end{center}

\noindent
Unfortunately, this does not quite work yet as it produces a typing
error. The reason is that the parsers \texttt{p"a"}, \texttt{p"b"} and
\texttt{p""} all produce strings as output type and therefore can be
put into an alternative \texttt{...|| p"a" || p"b" || p""}. But both
sequence parsers \pcode{p"a" ~ Pal ~ p"a"} and \pcode{p"b" ~ Pal ~ p"b"}
produce pairs of the form

\begin{center}
(((\texttt{a}-part, \texttt{Pal}-part), \texttt{a}-part), unprocessed part)
\end{center}

\noindent That is how the
sequence parser combinator nests results when \pcode{\~} is used
between two components. The solution is to use a semantic action that
``flattens'' these pairs and appends the corresponding strings, like

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
lazy val Pal : Parser[String, String] =  
  ((p"a" ~ Pal ~ p"a").map{ case ((x, y), z) => x + y + z } ||
   (p"b" ~ Pal ~ p"b").map{ case ((x, y), z) => x + y + z } ||
    p"a" || p"b" || p"")
\end{lstlisting}
\end{center}

\noindent
How does this work? Well, recall again what the pairs look like for
the parser \pcode{p"a" ~ Pal ~ p"a"}.  The pattern in the semantic
action matches the nested pairs (the \texttt{x} with the
\texttt{a}-part and so on).  Unfortunately when we have such nested
pairs, Scala requires us to define the function using the
\pcode{case}-syntax

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
{ case ((x, y), z) => ... }
\end{lstlisting}
\end{center}

\noindent
If we have more sequence parser combinators or have them differently nested,
then the pattern in the semantic action needs to be adjusted accordingly.
The action we implement above is to concatenate all three strings, which
means after the semantic action is applied the output type of the parser 
is \texttt{String}, which means it fits with the alternative parsers
\texttt{...|| p"a" || p"b" || p""}.

If we run the parser above with \pcode{Pal.parse_all("abaaaba")} we obtain
as result the \pcode{Set(abaaaba)}, which indicates that the string is a palindrome
(an empty set would mean something is wrong). But also notice what the
intermediate results are generated by \pcode{Pal.parse("abaaaba")}

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
Set((abaaaba,""),(aba,aaba), (a,baaaba), ("",abaaaba))
\end{lstlisting}
\end{center}

\noindent
That there are more than one output might be slightly unexpected, but
can be explained as follows: the pairs represent all possible
(partial) parses of the string \pcode{"abaaaba"}. The first pair above
corresponds to a complete parse (all output is consumed) and this is
what \pcode{Pal.parse_all} returns. The second pair is a small
``sub-palindrome'' that can also be parsed, but the parse fails with
the rest \pcode{aaba}, which is therefore left as unprocessed. The
third one is an attempt to parse the whole string with the
single-character parser \pcode{a}. That of course only partially
succeeds, by leaving \pcode{"baaaba"} as the unprocessed
part. Finally, since we allow the empty string to be a palindrome we
also obtain the last pair, where actually nothing is consumed from the
input string. While all this works as intended, we need to be careful
with this (especially with including the \pcode{""} parser in our
grammar): if during parsing the set of parsing attempts gets too big,
then the parsing process can become very slow as the potential
candidates for applying rules can snowball.


Important is also to note is that we must define the
\texttt{Pal}-parser as a \emph{lazy} value in Scala. Look again at the
code: \texttt{Pal} occurs on the right-hand side of the definition. If we had
just written

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
val Pal : Parser[String, String] =  ...rhs...
\end{lstlisting}
\end{center}

\noindent
then Scala before making this assignment to \texttt{Pal} attempts to
find out what the expression on the right-hand side evaluates to. This
is straightforward in case of simple expressions \texttt{2 + 3}, but
the expression above contains \texttt{Pal} in the right-hand
side. Without \pcode{lazy} it would try to evaluate what \texttt{Pal}
evaluates to and start a new recursion, which means it falls into an
infinite loop. The definition of \texttt{Pal} is recursive and the
\pcode{lazy} key-word prevents it from being fully evaluated. Therefore
whenever we want to define a recursive parser we have to write

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
lazy val SomeParser : Parser[...,...] =  ...rhs...
\end{lstlisting}
\end{center}

\noindent That was not necessary for our atomic parsers, like
\texttt{RegexParser} or \texttt{CharParser}, because they are not recursive.
Note that this is also the reason why we had to write

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
class AltParser[I, T]
       (p: => Parser[I, T], 
        q: => Parser[I, T]) ... extends Parser[I, T] {...}

class SeqParser[I, T, S]
       (p: => Parser[I, T], 
        q: => Parser[I, S]) ... extends Parser[I, (T, S)] {...}
\end{lstlisting}
\end{center}

\noindent where the \texttt{\textbf{\textcolor{codepurple}{=>}}} in front of
the argument types for \texttt{p} and \texttt{q} prevent Scala from
evaluating the arguments. Normally, Scala would first evaluate what
kind of parsers \texttt{p} and \texttt{q} are, and only then generate
the alternative parser combinator, respectively sequence parser
combinator. Since the arguments can be recursive parsers, such as
\texttt{Pal}, this would lead again to an infinite loop.

As a final example in this section, let us consider the grammar for
well-nested parentheses:

\begin{plstx}[margin=3cm]
: \meta{P} ::=  (\cdot \meta{P}\cdot ) \cdot \meta{P} | \epsilon\\
\end{plstx}

\noindent
Let us assume we want to not just recognise strings of
well-nested parentheses but also transform round parentheses
into curly braces. We can do this by using a semantic
action:

\begin{center}
  \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily,
    xleftmargin=0mm, numbers=none]
lazy val P : Parser[String, String] = 
  (p"(" ~ P ~ p")" ~ P).map{ case (((_,x),_),y) => "{" + x + "}" + y } || p""
\end{lstlisting}
\end{center}

\noindent
Here we define a function where which ignores the parentheses in the
pairs, but replaces them in the right places with curly braces when
assembling the new string in the right-hand side. If we run
\pcode{P.parse_all("(((()()))())")} we obtain
\texttt{Set(\{\{\{\{\}\{\}\}\}\{\}\})} as expected.



\subsubsection*{Implementing an Interpreter}

The first step before implementing an interpreter for a full-blown
language is to implement a simple calculator for arithmetic
expressions. Suppose our arithmetic expressions are given by the
grammar:

\begin{plstx}[margin=3cm,one per line]
: \meta{E} ::= \meta{E} \cdot + \cdot \meta{E} 
   | \meta{E} \cdot - \cdot \meta{E} 
   | \meta{E} \cdot * \cdot \meta{E} 
   | ( \cdot \meta{E} \cdot )
   | Number \\
\end{plstx}

\noindent
Naturally we want to implement the grammar in such a way that we can
calculate what the result of, for example, \texttt{4*2+3} is---we are
interested in an \texttt{Int} rather than a string. This means every
component parser needs to have as output type \texttt{Int} and when we
assemble the intermediate results, strings like \texttt{"+"},
\texttt{"*"} and so on, need to be translated into the appropriate
Scala operation of adding, multiplying and so on.  Being inspired by
the parser for well-nested parentheses above and ignoring the fact
that we want $*$ to take precedence over $+$ and $-$, we might want to
write something like

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
lazy val E: Parser[String, Int] = 
  ((E ~ p"+" ~ E).map{ case ((x, y), z) => x + z} ||
   (E ~ p"-" ~ E).map{ case ((x, y), z) => x - z} ||
   (E ~ p"*" ~ E).map{ case ((x, y), z) => x * z} ||
   (p"(" ~ E ~ p")").map{ case ((x, y), z) => y} ||
   NumParserInt)
\end{lstlisting}
\end{center}

\noindent
Consider again carefully how the semantic actions pick out the correct
arguments for the calculation. In case of plus, we need \texttt{x} and
\texttt{z}, because they correspond to the results of the component
parser \texttt{E}. We can just add \texttt{x + z} in order to obtain
an \texttt{Int} because the output type of \texttt{E} is
\texttt{Int}.  Similarly with subtraction and multiplication. In
contrast in the fourth clause we need to return \texttt{y}, because it
is the result enclosed inside the parentheses. The information about
parentheses, roughly speaking, we just throw away.

So far so good. The problem arises when we try to call \pcode{parse_all} with the
expression \texttt{"1+2+3"}. Lets try it

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
E.parse_all("1+2+3")
\end{lstlisting}
\end{center}

\noindent
\ldots and we wait and wait and \ldots still wait. What is the
problem? Actually, the parser just fell into an infinite loop! The
reason is that the above grammar is left-recursive and recall that our
parser combinators cannot deal with such left-recursive
grammars. Fortunately, every left-recursive context-free grammar can be
transformed into a non-left-recursive grammars that still recognises
the same strings. This allows us to design the following grammar

\begin{plstx}[margin=3cm]
  : \meta{E} ::=  \meta{T} \cdot + \cdot \meta{E} |  \meta{T} \cdot - \cdot \meta{E} | \meta{T}\\
: \meta{T} ::=  \meta{F} \cdot * \cdot \meta{T} | \meta{F}\\
: \meta{F} ::= ( \cdot \meta{E} \cdot ) | Number\\
\end{plstx}

\noindent
Recall what left-recursive means from Handout 5 and make sure you see
why this grammar is \emph{non} left-recursive. This version of the grammar
also deals with the fact that $*$ should have a higher precedence than $+$ and $-$. This does not
affect which strings this grammar can recognise, but in which order we are going
to evaluate any arithmetic expression. We can translate this grammar into
parsing combinators as follows:


\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
lazy val E: Parser[String, Int] = 
  (T ~ p"+" ~ E).map{ case ((x, y), z) => x + z } ||
  (T ~ p"-" ~ E).map{ case ((x, y), z) => x - z } || T 
lazy val T: Parser[String, Int] = 
  (F ~ p"*" ~ T).map{ case ((x, y), z) => x * z } || F
lazy val F: Parser[String, Int] = 
  (p"(" ~ E ~ p")").map{ case ((x, y), z) => y } || NumParserInt
\end{lstlisting}
\end{center}

\noindent
Let us try out some examples:

\begin{center}
\begin{tabular}{rcl}
  input strings & & output of \pcode{parse_all}\medskip\\
  \texttt{\Grid{1+2+3}} & $\rightarrow$ & \texttt{Set(6)}\\
  \texttt{\Grid{4*2+3}} & $\rightarrow$ & \texttt{Set(11)}\\
  \texttt{\Grid{4*(2+3)}} & $\rightarrow$ & \texttt{Set(20)}\\
  \texttt{\Grid{(4)*((2+3))}} & $\rightarrow$ & \texttt{Set(20)}\\
  \texttt{\Grid{4/2+3}} & $\rightarrow$ & \texttt{Set()}\\
  \texttt{\Grid{1\VS +\VS 2\VS +\VS 3}} & $\rightarrow$ & \texttt{Set()}\\                      
\end{tabular}
\end{center}

\noindent
Note that we call \pcode{parse_all}, not \pcode{parse}.  The examples
should be quite self-explanatory. The last two example do not produce
any integer result because our parser does not define what to do in
case of division (could be easily added), but also has no idea what to
do with whitespaces. To deal with them is the task of the lexer! Yes,
we can deal with them inside the grammar, but that would render many
grammars becoming unintelligible, including this one.\footnote{If you
  think an easy solution is to extend the notion of what a number
  should be, then think again---you still would have to deal with
  cases like \texttt{\Grid{(\VS (\VS 2+3)\VS )}}. Just think of the mess 
  you would have in a grammar for a full-blown language where there are 
  numerous such cases.}

\end{document}

%%% Local Variables: 
%%% mode: latex  
%%% TeX-master: t
%%% End:
author	Christian Urban <christian.urban@kcl.ac.uk>
	Fri, 18 Oct 2024 05:45:14 +0100 (3 months ago)
changeset 969	0dfa2923a7c6
parent 941	66adcae6c762
permissions	-rw-r--r--