handouts/ho06.tex
author Christian Urban <christian dot urban at kcl dot ac dot uk>
Wed, 25 Nov 2015 15:59:32 +0000
changeset 385 7f8516ff408d
parent 297 5c51839c88fd
child 386 31295bb945c6
permissions -rw-r--r--
updated

\documentclass{article}
\usepackage{../style}
\usepackage{../langs}


\begin{document}

\section*{Handout 6 (Parser Combinators)}

In what follows we explain \emph{parser combinators}, because
they are very easy to implement. However, they only work when
the grammar to be parsed is \emph{not} left-recursive and they
are efficient only when the grammar is unambiguous. It is the
responsibility of the grammar designer to ensure these two
properties.

Parser combinators can deal with any kind of input as long as
this input is a kind of sequence, for example a string or a
list of tokens. If the input are lists of tokens, then parser
combinators transform them into sets of pairs, like so

\begin{center}
$\underbrace{\text{list of tokens}}_{\text{input}}$ 
$\Rightarrow$
$\underbrace{\text{set of (parsed input, unparsed input)}}_{\text{output}}$
\end{center} 

\noindent In Scala parser combinators will therefore be
functions of type

\begin{center}
\texttt{I $\Rightarrow$ Set[(T, I)]}
\end{center}

\noindent that is they take as input something of type
\texttt{I} and return a set of pairs. The first component of
these pairs corresponds to what the parser combinator was able
to process from the input and the second is the unprocessed
part of the input. As we shall see shortly, a parser
combinator might return more than one such pair, with the idea
that there are potentially several ways how to interpret the
input. To simplify matters we will first look at the case
where the input to the parser combinators is just strings. As
a concrete example, consider the case where the input is the
string

\begin{center}
\tt\Grid{iffoo\VS testbar}
\end{center}

\noindent We might have a parser combinator which tries to
interpret this string as a keyword (\texttt{if}) or an
identifier (\texttt{iffoo}). Then the output will be the set

\begin{center}
$\left\{ \left(\texttt{\Grid{if}}\,,\, \texttt{\Grid{foo\VS testbar}}\right), 
           \left(\texttt{\Grid{iffoo}}\,,\, \texttt{\Grid{\VS testbar}}\right) \right\}$
\end{center}

\noindent where the first pair means the parser could
recognise \texttt{if} from the input and leaves the rest as
`unprocessed' as the second component of the pair; in the
other case it could recognise \texttt{iffoo} and leaves
\texttt{\VS testbar} as unprocessed. If the parser cannot
recognise anything from the input then parser combinators just
return the empty set $\{\}$. This will indicate something
``went wrong''\ldots or more precisely, nothing could be
parsed.

The main attraction of parser combinators is that we can
easily build parser combinators out of smaller components
following very closely the structure of a grammar. In order to
implement this in an object-oriented programming language,
like Scala, we need to specify an abstract class for parser
combinators. This abstract class requires the implementation
of the function \texttt{parse} taking an argument of type
\texttt{I} and returns a set of type \mbox{\texttt{Set[(T,
I)]}}.

\begin{center}
\begin{lstlisting}[language=Scala]
abstract class Parser[I, T] {
  def parse(ts: I): Set[(T, I)]

  def parse_all(ts: I): Set[T] =
    for ((head, tail) <- parse(ts); if (tail.isEmpty)) 
      yield head
}
\end{lstlisting}
\end{center}

\noindent From the function \texttt{parse} we can then
``centrally'' derive the function \texttt{parse\_all}, which
just filters out all pairs whose second component is not empty
(that is has still some unprocessed part). The reason is that
at the end of parsing we are only interested in the results
where all the input has been consumed and no unprocessed part
is left.

One of the simplest parser combinators recognises just a
character, say $c$, from the beginning of strings. Its
behaviour can be described as follows:

\begin{itemize}
\item if the head of the input string starts with a $c$, it returns 
	the set $\{(c, \textit{tail of}\; s)\}$, where \textit{tail of} 
	$s$ is the unprocessed part of the input string
\item otherwise it returns the empty set $\{\}$	
\end{itemize}

\noindent
The input type of this simple parser combinator for characters is
\texttt{String} and the output type \mbox{\texttt{Set[(Char, String)]}}. 
The code in Scala is as follows:

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
case class CharParser(c: Char) extends Parser[String, Char] {
  def parse(sb: String) = 
    if (sb.head == c) Set((c, sb.tail)) else Set()
}
\end{lstlisting}
\end{center}

\noindent The \texttt{parse} function tests whether the first
character of the input string \texttt{sb} is equal to
\texttt{c}. If yes, then it splits the string into the
recognised part \texttt{c} and the unprocessed part
\texttt{sb.tail}. In case \texttt{sb} does not start with
\texttt{c} then the parser returns the empty set (in Scala
\texttt{Set()}).


More interesting are the parser combinators that build larger
parsers out of smaller component parsers. For example the
alternative parser combinator is as follows: given two
parsers, say, $p$ and $q$, then we apply both parsers to the
input (remember parsers are functions) and combine the input
 
\begin{center}
$p(\text{input}) \cup q(\text{input})$
\end{center}

\noindent
In Scala we would implement alternative parsers as follows

\begin{center}
\begin{lstlisting}[language=Scala, numbers=none]
class AltParser[I, T]
       (p: => Parser[I, T], 
        q: => Parser[I, T]) extends Parser[I, T] {
  def parse(sb: I) = p.parse(sb) ++ q.parse(sb)
}
\end{lstlisting}
\end{center}

\noindent The types of this parser combinator are polymorphic
(we just have \texttt{I} for the input type, and \texttt{T}
for the output type). The alternative parser builds a new
parser out of two existing parser combinator \texttt{p} and
\texttt{q}. Both need to be able to process input of type
\texttt{I} and return the same output type \texttt{Set[(T,
I)]}. (There is an interesting detail of Scala, namely the
\texttt{=>} in front of the types of \texttt{p} and
\texttt{q}. They will prevent the evaluation of the arguments
before they are used. This is often called \emph{lazy
evaluation} of the arguments.) The alternative parser should
run the input with the first parser \texttt{p} (producing a
set of outputs) and then run the same input with \texttt{q}.
The result should be then just the union of both sets, which
is the operation \texttt{++} in Scala.

This parser combinator already allows us to construct a parser
that either a character \texttt{a} or \texttt{b}, as

\begin{center}
\begin{lstlisting}[language=Scala, numbers=none]
new AltParser(CharParser('a'), CharParser('b'))
\end{lstlisting}
\end{center}

\noindent Scala allows us to introduce some more readable
shorthand notation for this, like \texttt{'a' || 'b'}. We can
call this parser combinator with the strings

\begin{center}
\begin{tabular}{rcl}
input strings & & output\medskip\\
\texttt{\Grid{ac}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{c}})\right\}$\\
\texttt{\Grid{bc}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{c}})\right\}$\\
\texttt{\Grid{cc}} & $\rightarrow$ & $\{\}$
\end{tabular}
\end{center}

\noindent We receive in the first two cases a successful
output (that is a non-empty set). In each case, either
\pcode{a} or \pcode{b} are the processed parts, and \pcode{c}
is the unprocessed part. Clearly this parser cannot parse
anything with the string \pcode{cc}.

A bit more interesting is the \emph{sequence parser
combinator}. Given two parsers, say, $p$ and $q$,
apply first the input to $p$ producing a set of pairs;
then apply $q$ to the unparsed parts; then combine the 
results like

\begin{center}
\begin{tabular}{lcl}
$\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\;|\;$ & 
$(\textit{output}_1, u_1) \in p(\text{input}) 
\;\wedge\;$\\
&& $(\textit{output}_2, u_2) \in q(u_1)\}$
\end{tabular}
\end{center}

\noindent
This can be implemented in Scala as follows:

\begin{center}
\begin{lstlisting}[language=Scala,numbers=none]
class SeqParser[I, T, S]
       (p: => Parser[I, T], 
        q: => Parser[I, S]) extends Parser[I, (T, S)] {
  def parse(sb: I) = 
    for ((head1, tail1) <- p.parse(sb); 
         (head2, tail2) <- q.parse(tail1)) 
            yield ((head1, head2), tail2)
}
\end{lstlisting}
\end{center}

\noindent This parser takes as input two parsers, \texttt{p}
and \texttt{q}. It implements \texttt{parse} as follows: let
first run the parser \texttt{p} on the input producing a set
of pairs (\texttt{head1}, \texttt{tail1}). The \texttt{tail1}
stands for the unprocessed parts left over by \texttt{p}. Let
\texttt{q} run on these unprocessed parts producing again a
set of pairs. The output of the sequence parser combinator is
then a set containing pairs where the first components are
again pairs, namely what the first parser could parse together
with what the second parser could parse; the second component
is the unprocessed part left over after running the second
parser \texttt{q}. Therefore the input type of the sequence
parser combinator is as usual \texttt{I}, but the output type
is

\begin{center}
\texttt{Set[((T, S), I)]}
\end{center}

Scala allows us to provide some shorthand notation for the
sequence parser combinator. So we can write for example
\texttt{'a' $\sim$ 'b'}, which is the parser combinator that
first consumes the character \texttt{a} from a string and then
\texttt{b}. Three examples of this parser combinator are as
follows:

\begin{center}
\begin{tabular}{rcl}
input strings & & output\medskip\\
\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
\texttt{\Grid{bac}} & $\rightarrow$ & $\{\}$\\
\texttt{\Grid{ccc}} & $\rightarrow$ & $\{\}$
\end{tabular}
\end{center}

\noindent A slightly more complicated parser is \texttt{('a'
|| 'b') $\sim$ 'b'} which parses as first character either an
\texttt{a} or \texttt{b} followed by a \texttt{b}. This parser
produces the following results.

\begin{center}
\begin{tabular}{rcl}
input strings & & output\medskip\\
\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
\texttt{\Grid{bbc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
\texttt{\Grid{aac}} & $\rightarrow$ & $\{\}$
\end{tabular}
\end{center}

\noindent Two more examples: first consider the parser \pcode{('a' ~
'a') ~ 'a'} and the input \pcode{aaaa}:

\begin{center}
\begin{tabular}{rcl}
input string & & output\medskip\\
\texttt{\Grid{aaaa}} & $\rightarrow$ & 
$\left\{(((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}})\right\}$\\
\end{tabular}
\end{center}

\noindent Notice how the results nest deeper as pairs (the
last \pcode{a} is in the unprocessed part). To consume
everything we can use the parser \pcode{(('a' ~'a') ~ 'a') ~
'a'}. Then the output is as follows:

\begin{center}
\begin{tabular}{rcl}
input string & & output\medskip\\
\texttt{\Grid{aaaa}} & $\rightarrow$ & 
$\left\{((((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{""})\right\}$\\
\end{tabular}
\end{center}

\noindent This is an instance where the parser consumed
completely the input, meaning the unprocessed part is just the
empty string.


Note carefully that constructing a parser such \texttt{'a' ||
('a' $\sim$ 'b')} will result in a tying error. The first
parser has as output type a single character (recall the type
of \texttt{CharParser}), but the second parser produces a pair
of characters as output. The alternative parser is however
required to have both component parsers to have the same type.
We will see later how we can build this parser without the
typing error.

The next parser combinator does not actually combine smaller
parsers, but applies a function to the result of the parser.
It is implemented in Scala as follows

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
class FunParser[I, T, S]
         (p: => Parser[I, T], 
          f: T => S) extends Parser[I, S] {
  def parse(sb: I) = 
    for ((head, tail) <- p.parse(sb)) yield (f(head), tail)
}
\end{lstlisting}
\end{center}


\noindent
This parser combinator takes a parser \texttt{p} with output type \texttt{T} as
input as well as a function \texttt{f} with type \texttt{T => S}. The parser \texttt{p}
produces sets of type \texttt{(T, I)}. The \texttt{FunParser} combinator then
applies the function \texttt{f} to all the parer outputs. Since this function
is of type \texttt{T => S}, we obtain a parser with output type \texttt{S}.
Again Scala lets us introduce some shorthand notation for this parser combinator. 
Therefore we will write \texttt{p ==> f} for it.

%\bigskip
%takes advantage of the full generality---have a look
%what it produces if we call it with the string \texttt{abc}
%
%\begin{center}
%\begin{tabular}{rcl}
%input string & & output\medskip\\
%\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
%\texttt{\Grid{bbc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
%\texttt{\Grid{aac}} & $\rightarrow$ & $\varnothing$
%\end{tabular}
%\end{center}



\end{document}

%%% Local Variables: 
%%% mode: latex  
%%% TeX-master: t
%%% End: