handouts/ho06.tex
author Christian Urban <christian dot urban at kcl dot ac dot uk>
Sun, 02 Oct 2016 15:07:21 +0100
changeset 434 8664ff87cd77
parent 392 2d0a59127694
child 584 529460073b24
permissions -rw-r--r--
updated

\documentclass{article}
\usepackage{../style}
\usepackage{../langs}


\begin{document}

\section*{Handout 6 (Parser Combinators)}

In what follows we explain \emph{parser combinators}. Their
distinguishing feature is that they are very easy to
implement. However, they only work when the grammar to be
parsed is \emph{not} left-recursive and they are efficient
only when the grammar is unambiguous. It is the responsibility
of the grammar designer to ensure these two properties.

Parser combinators can deal with any kind of input as long as
this input is a kind of sequence, for example a string or a
list of tokens. The general idea behind parser combinators is
to transform the input into sets of pairs, like so

\begin{center}
$\underbrace{\text{list of tokens}}_{\text{input}}$ 
$\Rightarrow$
$\underbrace{\text{set of (parsed input, unparsed input)}}_{\text{output}}$
\end{center} 

\noindent In Scala parser combinators are functions of type

\begin{center}
\texttt{I $\Rightarrow$ Set[(T, I)]}
\end{center}

\noindent that is they take as input something of type
\texttt{I} and return a set of pairs. The first component of
these pairs corresponds to what the parser combinator was able
to process from the input and the second is the unprocessed
part of the input. As we shall see shortly, a parser
combinator might return more than one such pair, the idea
being that there are potentially several ways of how to
interpret the input. To simplify matters we will first look at
the case where the input to the parser combinators is just
strings. As a concrete example, consider the string

\begin{center}
\tt\Grid{iffoo\VS testbar}
\end{center}

\noindent We might have a parser combinator which tries to
interpret this string as a keyword (\texttt{if}) or as an
identifier (\texttt{iffoo}). Then the output will be the set

\begin{center}
$\left\{ \left(\texttt{\Grid{if}}\;,\; \texttt{\Grid{foo\VS testbar}}\right), 
           \left(\texttt{\Grid{iffoo}}\;,\; \texttt{\Grid{\VS testbar}}\right) \right\}$
\end{center}

\noindent where the first pair means the parser could
recognise \texttt{if} from the input and leaves the rest as
`unprocessed' as the second component of the pair; in the
other case it could recognise \texttt{iffoo} and leaves
\texttt{\VS testbar} as unprocessed. If the parser cannot
recognise anything from the input, then parser combinators just
return the empty set $\{\}$. This will indicate something
``went wrong''\ldots or more precisely, nothing could be
parsed.

The idea of parser combinators is that we can easily build
parser combinators out of smaller components following very
closely the structure of a grammar. In order to implement this
in an object-oriented programming language, like Scala, we
need to specify an abstract class for parser combinators. This
abstract class requires the implementation of the function
\texttt{parse} taking an argument of type \texttt{I} and
returns a set of type \mbox{\texttt{Set[(T, I)]}}.

\begin{center}
\begin{lstlisting}[language=Scala]
abstract class Parser[I, T] {
  def parse(ts: I): Set[(T, I)]

  def parse_all(ts: I): Set[T] =
    for ((head, tail) <- parse(ts); if (tail.isEmpty)) 
      yield head
}
\end{lstlisting}
\end{center}

\noindent From the function \texttt{parse} we can then
``centrally'' derive the function \texttt{parse\_all}, which
just filters out all pairs whose second component is not empty
(that is has still some unprocessed part). The reason is that
at the end of the parsing we are only interested in the results
where all the input has been consumed and no unprocessed part
is left over.

One of the simplest parser combinators recognises just a
character, say $c$, from the beginning of strings. Its
behaviour can be described as follows:

\begin{itemize}
\item if the head of the input string starts with a $c$, then returns 
	the set $\{(c, \textit{tail of}\; s)\}$, where \textit{tail of} 
	$s$ is the unprocessed part of the input string
\item otherwise return the empty set $\{\}$	
\end{itemize}

\noindent
The input type of this simple parser combinator for characters is
\texttt{String} and the output type \mbox{\texttt{Set[(Char, String)]}}. 
The code in Scala is as follows:

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
case class CharParser(c: Char) extends Parser[String, Char] {
  def parse(sb: String) = 
    if (sb.head == c) Set((c, sb.tail)) else Set()
}
\end{lstlisting}
\end{center}

\noindent The \texttt{parse} function tests whether the first
character of the input string \texttt{sb} is equal to
\texttt{c}. If yes, then it splits the string into the
recognised part \texttt{c} and the unprocessed part
\texttt{sb.tail}. In case \texttt{sb} does not start with
\texttt{c} then the parser returns the empty set (in Scala
\texttt{Set()}).


More interesting are the parser combinators that build larger
parsers out of smaller component parsers. For example the
alternative parser combinator is as follows: given two
parsers, say, $p$ and $q$, we apply both parsers to the
input (remember parsers are functions) and combine the output
(remember they are sets)
 
\begin{center}
$p(\text{input}) \cup q(\text{input})$
\end{center}

\noindent In Scala we would implement alternative parser
combinator as follows

\begin{center}
\begin{lstlisting}[language=Scala, numbers=none]
class AltParser[I, T]
       (p: => Parser[I, T], 
        q: => Parser[I, T]) extends Parser[I, T] {
  def parse(sb: I) = p.parse(sb) ++ q.parse(sb)
}
\end{lstlisting}
\end{center}

\noindent The types of this parser combinator are polymorphic
(we just have \texttt{I} for the input type, and \texttt{T}
for the output type). The alternative parser builds a new
parser out of two existing parsers \texttt{p} and \texttt{q}.
Both need to be able to process input of type \texttt{I} and
return the same output type \texttt{Set[(T,
I)]}.\footnote{There is an interesting detail of Scala, namely
the \texttt{=>} in front of the types of \texttt{p} and
\texttt{q}. They will prevent the evaluation of the arguments
before they are used. This is often called \emph{lazy
evaluation} of the arguments. We will explain this later.} The alternative parser should
run the input with the first parser \texttt{p} (producing a
set of outputs) and then run the same input with \texttt{q}.
The result should be then just the union of both sets, which
is the operation \texttt{++} in Scala.

The alternative parser combinator already allows us to
construct a parser that parses either a character \texttt{a}
or \texttt{b}, as

\begin{center}
\begin{lstlisting}[language=Scala, numbers=none]
new AltParser(CharParser('a'), CharParser('b'))
\end{lstlisting}
\end{center}

\noindent Scala allows us to introduce some more readable
shorthand notation for this, like \texttt{'a' || 'b'}. We can
call this parser combinator with the strings

\begin{center}
\begin{tabular}{rcl}
input strings & & output\medskip\\
\texttt{\Grid{ac}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{c}})\right\}$\\
\texttt{\Grid{bc}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{c}})\right\}$\\
\texttt{\Grid{cc}} & $\rightarrow$ & $\{\}$
\end{tabular}
\end{center}

\noindent We receive in the first two cases a successful
output (that is a non-empty set). In each case, either
\pcode{a} or \pcode{b} is in the processed part, and
\pcode{c} in the unprocessed part. Clearly this parser cannot
parse anything in the string \pcode{cc}, therefore the empty
set.

A bit more interesting is the \emph{sequence parser
combinator}. Given two parsers, say, $p$ and $q$, apply first
the input to $p$ producing a set of pairs; then apply $q$ to
all the unparsed parts; then combine the results like

\begin{center}
\begin{tabular}{lcl}
$\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\;|\;$ & 
$(\textit{output}_1, u_1) \in p(\text{input}) 
\;\wedge\;$\\
&& $\;(\textit{output}_2, u_2) \in q(u_1)\}$
\end{tabular}
\end{center}

\noindent
This can be implemented in Scala as follows:

\begin{center}
\begin{lstlisting}[language=Scala,numbers=none]
class SeqParser[I, T, S]
       (p: => Parser[I, T], 
        q: => Parser[I, S]) extends Parser[I, (T, S)] {
  def parse(sb: I) = 
    for ((output1, u1) <- p.parse(sb); 
         (output2, u2) <- q.parse(u1)) 
            yield ((output1, output2), u2)
}
\end{lstlisting}
\end{center}

\noindent This parser takes as input two parsers, \texttt{p}
and \texttt{q}. It implements \texttt{parse} as follows: let
first run the parser \texttt{p} on the input producing a set
of pairs (\texttt{output1}, \texttt{u1}). The \texttt{u1}
stands for the unprocessed parts left over by \texttt{p}. Let
\texttt{q} run on these unprocessed parts producing again a
set of pairs. The output of the sequence parser combinator is
then a set containing pairs where the first components are
again pairs, namely what the first parser could parse together
with what the second parser could parse; the second component
is the unprocessed part left over after running the second
parser \texttt{q}. Therefore the input type of the sequence
parser combinator is as usual \texttt{I}, but the output type
is

\begin{center}
\texttt{Set[((T, S), I)]}
\end{center}

Scala allows us to provide some shorthand notation for the
sequence parser combinator. We can write for example
\pcode{'a' ~ 'b'}, which is the parser combinator that
first consumes the character \texttt{a} from a string and then
\texttt{b}. Three examples of this parser combinator are as
follows:

\begin{center}
\begin{tabular}{rcl}
input strings & & output\medskip\\
\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
\texttt{\Grid{bac}} & $\rightarrow$ & $\{\}$\\
\texttt{\Grid{ccc}} & $\rightarrow$ & $\{\}$
\end{tabular}
\end{center}

\noindent A slightly more complicated parser is \pcode{('a'
|| 'b') ~ 'b'} which parses as first character either an
\texttt{a} or \texttt{b} followed by a \texttt{b}. This parser
produces the following outputs.

\begin{center}
\begin{tabular}{rcl}
input strings & & output\medskip\\
\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
\texttt{\Grid{bbc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
\texttt{\Grid{aac}} & $\rightarrow$ & $\{\}$
\end{tabular}
\end{center}

\noindent Two more examples: first consider the parser \pcode{('a' ~
'a') ~ 'a'} and the input \pcode{aaaa}:

\begin{center}
\begin{tabular}{rcl}
input string & & output\medskip\\
\texttt{\Grid{aaaa}} & $\rightarrow$ & 
$\left\{(((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}})\right\}$\\
\end{tabular}
\end{center}

\noindent Notice how the results nest deeper and deeper as
pairs (the last \pcode{a} is in the unprocessed part). To
consume everything of this string we can use the parser
\pcode{(('a' ~'a') ~ 'a') ~ 'a'}. Then the output is as
follows:

\begin{center}
\begin{tabular}{rcl}
input string & & output\medskip\\
\texttt{\Grid{aaaa}} & $\rightarrow$ & 
$\left\{((((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{""})\right\}$\\
\end{tabular}
\end{center}

\noindent This is an instance where the parser consumed
completely the input, meaning the unprocessed part is just the
empty string.


Note carefully that constructing a parser such \pcode{'a' ||
('a' ~ 'b')} will result in a typing error. The first
parser has as output type a single character (recall the type
of \texttt{CharParser}), but the second parser produces a pair
of characters as output. The alternative parser is however
required to have both component parsers to have the same type.
We will see later how we can build this parser without the
typing error.

The next parser combinator does not actually combine smaller
parsers, but applies a function to the result of a parser.
It is implemented in Scala as follows

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
class FunParser[I, T, S]
         (p: => Parser[I, T], 
          f: T => S) extends Parser[I, S] {
  def parse(sb: I) = 
    for ((head, tail) <- p.parse(sb)) yield (f(head), tail)
}
\end{lstlisting}
\end{center}


\noindent This parser combinator takes a parser \texttt{p}
with output type \texttt{T} as one argument as well as a
function \texttt{f} with type \texttt{T => S}. The parser
\texttt{p} produces sets of type \texttt{(T, I)}. The
\texttt{FunParser} combinator then applies the function
\texttt{f} to all the parser outputs. Since this function is of
type \texttt{T => S}, we obtain a parser with output type
\texttt{S}. Again Scala lets us introduce some shorthand
notation for this parser combinator. Therefore we will write
\texttt{p ==> f} for it.

\subsubsection*{How to build parsers using parser combinators?}

\subsubsection*{Implementing an Interpreter}

%\bigskip
%takes advantage of the full generality---have a look
%what it produces if we call it with the string \texttt{abc}
%
%\begin{center}
%\begin{tabular}{rcl}
%input string & & output\medskip\\
%\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
%\texttt{\Grid{bbc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
%\texttt{\Grid{aac}} & $\rightarrow$ & $\varnothing$
%\end{tabular}
%\end{center}



\end{document}

%%% Local Variables: 
%%% mode: latex  
%%% TeX-master: t
%%% End: