handouts/ho05.tex
author Christian Urban <urbanc@in.tum.de>
Tue, 25 Apr 2017 12:33:16 +0100
changeset 486 8178fcf377dc
parent 459 780486571e38
child 545 76a98ed71a2a
permissions -rw-r--r--
updated

\documentclass{article}
\usepackage{../style}
\usepackage{../langs}
\usepackage{../grammar}

\begin{document}

\section*{Handout 5 (Grammars \& Parser)}

While regular expressions are very useful for lexing and for
recognising many patterns in strings (like email addresses),
they have their limitations. For example there is no regular
expression that can recognise the language $a^nb^n$. Another
example for which there exists no regular expression is the
language of well-parenthesised expressions. In languages like
Lisp, which use parentheses rather extensively, it might be of
interest whether the following two expressions are
well-parenthesised (the left one is, the right one is not):

\begin{center}
$(((()()))())$  \hspace{10mm} $(((()()))()))$
\end{center}

\noindent Not being able to solve such recognition problems is
a serious limitation. In order to solve such recognition
problems, we need more powerful techniques than regular
expressions. We will in particular look at \emph{context-free
languages}. They include the regular languages as the picture
below shows:


\begin{center}
\begin{tikzpicture}
[rect/.style={draw=black!50, 
 top color=white,bottom color=black!20, 
 rectangle, very thick, rounded corners}]

\draw (0,0) node [rect, text depth=30mm, text width=46mm] {\small all languages};
\draw (0,-0.4) node [rect, text depth=20mm, text width=44mm] {\small decidable languages};
\draw (0,-0.65) node [rect, text depth=13mm] {\small context sensitive languages};
\draw (0,-0.84) node [rect, text depth=7mm, text width=35mm] {\small context-free languages};
\draw (0,-1.05) node [rect] {\small regular languages};
\end{tikzpicture}
\end{center}

\noindent Context-free languages play an important role in
`day-to-day' text processing and in programming languages.
Context-free languages are usually specified by grammars. For
example a grammar for well-parenthesised expressions is

\begin{plstx}[margin=3cm]
: \meta{P} ::=  ( \cdot  \meta{P} \cdot ) \cdot \meta{P}
             | \epsilon\\ 
\end{plstx}

\noindent 
or a grammar for recognising strings consisting of ones is

\begin{plstx}[margin=3cm]
: \meta{O} ::= 1 \cdot  \meta{O} 
             | 1\\
\end{plstx}

In general grammars consist of finitely many rules built up
from \emph{terminal symbols} (usually lower-case letters) and
\emph{non-terminal symbols} (upper-case letters inside \meta{\mbox{}}). Rules have
the shape

\begin{plstx}[margin=3cm]
: \meta{NT} ::= rhs\\
\end{plstx}
 
\noindent where on the left-hand side is a single non-terminal
and on the right a string consisting of both terminals and
non-terminals including the $\epsilon$-symbol for indicating
the empty string. We use the convention to separate components
on the right hand-side by using the $\cdot$ symbol, as in the
grammar for well-parenthesised expressions. We also use the
convention to use $|$ as a shorthand notation for several
rules. For example

\begin{plstx}[margin=3cm]
: \meta{NT} ::= rhs_1
   | rhs_2\\
\end{plstx}

\noindent means that the non-terminal \meta{NT} can be replaced by
either $\textit{rhs}_1$ or $\textit{rhs}_2$. If there are more
than one non-terminal on the left-hand side of the rules, then
we need to indicate what is the \emph{starting} symbol of the
grammar. For example the grammar for arithmetic expressions
can be given as follows

\begin{plstx}[margin=3cm,one per line]
\mbox{\rm (1)}: \meta{E} ::= \meta{N}\\
\mbox{\rm (2)}: \meta{E} ::= \meta{E} \cdot + \cdot \meta{E}\\
\mbox{\rm (3)}: \meta{E} ::= \meta{E} \cdot - \cdot \meta{E}\\
\mbox{\rm (4)}: \meta{E} ::= \meta{E} \cdot * \cdot \meta{E}\\
\mbox{\rm (5)}: \meta{E} ::= ( \cdot \meta{E} \cdot )\\
\mbox{\rm (6\ldots)}: \meta{N} ::= \meta{N} \cdot \meta{N} 
  \mid 0 \mid 1 \mid \ldots \mid 9\\
\end{plstx}

\noindent where \meta{E} is the starting symbol. A
\emph{derivation} for a grammar starts with the starting
symbol of the grammar and in each step replaces one
non-terminal by a right-hand side of a rule. A derivation ends
with a string in which only terminal symbols are left. For
example a derivation for the string $(1 + 2) + 3$ is as
follows:

\begin{center}
\begin{tabular}{lll@{\hspace{2cm}}l}
\meta{E} & $\rightarrow$ & $\meta{E}+\meta{E}$          & by (2)\\
       & $\rightarrow$ & $(\meta{E})+\meta{E}$     & by (5)\\
       & $\rightarrow$ & $(\meta{E}+\meta{E})+\meta{E}$   & by (2)\\
       & $\rightarrow$ & $(\meta{E}+\meta{E})+\meta{N}$   & by (1)\\
       & $\rightarrow$ & $(\meta{E}+\meta{E})+3$   & by (6\dots)\\
       & $\rightarrow$ & $(\meta{N}+\meta{E})+3$   & by (1)\\
       & $\rightarrow^+$ & $(1+2)+3$ & by (1, 6\ldots)\\
\end{tabular} 
\end{center}

\noindent where on the right it is indicated which 
grammar rule has been applied. In the last step we
merged several steps into one.

The \emph{language} of a context-free grammar $G$
with start symbol $S$ is defined as the set of strings
derivable by a derivation, that is

\begin{center}
$\{c_1\ldots c_n \;|\; S \rightarrow^* c_1\ldots c_n \;\;\text{with all} \; c_i \;\text{being non-terminals}\}$
\end{center}

\noindent
A \emph{parse-tree} encodes how a string is derived with the starting symbol on 
top and each non-terminal containing a subtree for how it is replaced in a derivation.
The parse tree for the string $(1 + 23)+4$ is as follows:

\begin{center}
\begin{tikzpicture}[level distance=8mm, black]
  \node {$E$}
    child {node {$E$} 
       child {node {$($}}
       child {node {$E$}       
         child {node {$E$} child {node {$N$} child {node {$1$}}}}
         child {node {$+$}}
         child {node {$E$} 
            child {node {$N$} child {node {$2$}}}
            child {node {$N$} child {node {$3$}}}
            } 
        }
       child {node {$)$}}
     }
     child {node {$+$}}
     child {node {$E$}
        child {node {$N$} child {node {$4$}}}
     };
\end{tikzpicture}
\end{center}

\noindent We are often interested in these parse-trees since
they encode the structure of how a string is derived by a
grammar. Before we come to the problem of constructing such
parse-trees, we need to consider the following two properties
of grammars. A grammar is \emph{left-recursive} if there is a
derivation starting from a non-terminal, say $NT$ which leads
to a string which again starts with $NT$. This means a
derivation of the form.

\begin{center}
$NT \rightarrow \ldots \rightarrow NT \cdot \ldots$
\end{center}

\noindent It can be easily seen that the grammar above for
arithmetic expressions is left-recursive: for example the
rules $E \rightarrow E\cdot + \cdot E$ and $N \rightarrow
N\cdot N$ show that this grammar is left-recursive. But note
that left-recursiveness can involve more than one step in the
derivation. The problem with left-recursive grammars is that
some algorithms cannot cope with them: they fall into a loop.
Fortunately every left-recursive grammar can be transformed
into one that is not left-recursive, although this
transformation might make the grammar less ``human-readable''.
For example if we want to give a non-left-recursive grammar
for numbers we might specify

\begin{center}
$N \;\;\rightarrow\;\; 0\;|\;\ldots\;|\;9\;|\;1\cdot N\;|\;2\cdot N\;|\;\ldots\;|\;9\cdot N$
\end{center}

\noindent Using this grammar we can still derive every number
string, but we will never be able to derive a string of the
form $N \to \ldots \to N \cdot \ldots$.

The other property we have to watch out for is when a grammar
is \emph{ambiguous}. A grammar is said to be ambiguous if
there are two parse-trees for one string. Again the grammar
for arithmetic expressions shown above is ambiguous. While the
shown parse tree for the string $(1 + 23) + 4$ is unique, this
is not the case in general. For example there are two parse
trees for the string $1 + 2 + 3$, namely

\begin{center}
\begin{tabular}{c@{\hspace{10mm}}c}
\begin{tikzpicture}[level distance=8mm, black]
  \node {$E$}
    child {node {$E$} child {node {$N$} child {node {$1$}}}}
    child {node {$+$}}
    child {node {$E$}
       child {node {$E$} child {node {$N$} child {node {$2$}}}}
       child {node {$+$}}
       child {node {$E$} child {node {$N$} child {node {$3$}}}}
    }
    ;
\end{tikzpicture} 
&
\begin{tikzpicture}[level distance=8mm, black]
  \node {$E$}
    child {node {$E$}
       child {node {$E$} child {node {$N$} child {node {$1$}}}}
       child {node {$+$}}
       child {node {$E$} child {node {$N$} child {node {$2$}}}} 
    }
    child {node {$+$}}
    child {node {$E$} child {node {$N$} child {node {$3$}}}}
    ;
\end{tikzpicture}
\end{tabular} 
\end{center}

\noindent In particular in programming languages we will try
to avoid ambiguous grammars because two different parse-trees
for a string mean a program can be interpreted in two
different ways. In such cases we have to somehow make sure the
two different ways do not matter, or disambiguate the grammar
in some other way (for example making the $+$
left-associative). Unfortunately already the problem of
deciding whether a grammar is ambiguous or not is in general
undecidable. But in simple instance (the ones we deal in this
module) one can usually see when a grammar is ambiguous.


\subsection*{Parser Combinators}

Let us now turn to the problem of generating a parse-tree for
a grammar and string. In what follows we explain \emph{parser
combinators}, because they are easy to implement and closely
resemble grammar rules. Imagine that a grammar describes the
strings of natural numbers, such as the grammar $N$ shown
above. For all such strings we want to generate the
parse-trees or later on we actually want to extract the
meaning of these strings, that is the concrete integers
``behind'' these strings. In Scala the parser combinators will
be functions of type

\begin{center}
\texttt{I $\Rightarrow$ Set[(T, I)]}
\end{center}

\noindent that is they take as input something of type
\texttt{I}, typically a list of tokens or a string, and return
a set of pairs. The first component of these pairs corresponds
to what the parser combinator was able to process from the
input and the second is the unprocessed part of the input. As
we shall see shortly, a parser combinator might return more
than one such pair, with the idea that there are potentially
several ways how to interpret the input. As a concrete
example, consider the case where the input is of type string,
say the string

\begin{center}
\tt\Grid{iffoo\VS testbar}
\end{center}

\noindent We might have a parser combinator which tries to
interpret this string as a keyword (\texttt{if}) or an
identifier (\texttt{iffoo}). Then the output will be the set

\begin{center}
$\left\{ \left(\texttt{\Grid{if}}\,,\, \texttt{\Grid{foo\VS testbar}}\right), 
           \left(\texttt{\Grid{iffoo}}\,,\, \texttt{\Grid{\VS testbar}}\right) \right\}$
\end{center}

\noindent where the first pair means the parser could
recognise \texttt{if} from the input and leaves the rest as
`unprocessed' as the second component of the pair; in the
other case it could recognise \texttt{iffoo} and leaves
\texttt{\VS testbar} as unprocessed. If the parser cannot
recognise anything from the input then parser combinators just
return the empty set $\{\}$. This will indicate
something ``went wrong''.

The main attraction is that we can easily build parser combinators out of smaller components
following very closely the structure of a grammar. In order to implement this in an object
oriented programming language, like Scala, we need to specify an abstract class for parser 
combinators. This abstract class requires the implementation of the function
\texttt{parse} taking an argument of type \texttt{I} and returns a set of type  
\mbox{\texttt{Set[(T, I)]}}.

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
abstract class Parser[I, T] {
  def parse(ts: I): Set[(T, I)]

  def parse_all(ts: I): Set[T] =
    for ((head, tail) <- parse(ts); if (tail.isEmpty)) 
      yield head
}
\end{lstlisting}
\end{center}

\noindent
From the function \texttt{parse} we can then ``centrally'' derive the function \texttt{parse\_all},
which just filters out all pairs whose second component is not empty (that is has still some
unprocessed part). The reason is that at the end of parsing we are only interested in the
results where all the input has been consumed and no unprocessed part is left.

One of the simplest parser combinators recognises just a character, say $c$, 
from the beginning of strings. Its behaviour is as follows:

\begin{itemize}
\item if the head of the input string starts with a $c$, it returns 
	the set $\{(c, \textit{tail of}\; s)\}$
\item otherwise it returns the empty set $\varnothing$	
\end{itemize}

\noindent
The input type of this simple parser combinator for characters is
\texttt{String} and the output type \mbox{\texttt{Set[(Char, String)]}}. 
The code in Scala is as follows:

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
case class CharParser(c: Char) extends Parser[String, Char] {
  def parse(sb: String) = 
    if (sb.head == c) Set((c, sb.tail)) else Set()
}
\end{lstlisting}
\end{center}

\noindent
The \texttt{parse} function tests whether the first character of the 
input string \texttt{sb} is equal to \texttt{c}. If yes, then it splits the
string into the recognised part \texttt{c} and the unprocessed part
\texttt{sb.tail}. In case \texttt{sb} does not start with \texttt{c} then
the parser returns the empty set (in Scala \texttt{Set()}).

More interesting are the parser combinators that build larger parsers
out of smaller component parsers. For example the alternative 
parser combinator is as follows.

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
class AltParser[I, T]
       (p: => Parser[I, T], 
        q: => Parser[I, T]) extends Parser[I, T] {
  def parse(sb: I) = p.parse(sb) ++ q.parse(sb)
}
\end{lstlisting}
\end{center}

\noindent
The types of this parser combinator are polymorphic (we just have \texttt{I}
for the input type, and \texttt{T} for the output type). The alternative parser
builds a new parser out of two existing parser combinator \texttt{p} and \texttt{q}.
Both need to be able to process input of type \texttt{I} and return the same
output type \texttt{Set[(T, I)]}. (There is an interesting detail of Scala, namely the 
\texttt{=>} in front of the types of \texttt{p} and \texttt{q}. They will prevent the
evaluation of the arguments before they are used. This is often called 
\emph{lazy evaluation} of the arguments.) The alternative parser should run
the input with the first parser \texttt{p} (producing a set of outputs) and then
run the same input with \texttt{q}. The result should be then just the union
of both sets, which is the operation \texttt{++} in Scala.

This parser combinator already allows us to construct a parser that either 
a character \texttt{a} or \texttt{b}, as

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
new AltParser(CharParser('a'), CharParser('b'))
\end{lstlisting}
\end{center}

\noindent
Scala allows us to introduce some more readable shorthand notation for this, like \texttt{'a' || 'b'}. 
We can call this parser combinator with the strings

\begin{center}
\begin{tabular}{rcl}
input string & & output\medskip\\
\texttt{\Grid{ac}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{c}})\right\}$\\
\texttt{\Grid{bc}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{c}})\right\}$\\
\texttt{\Grid{cc}} & $\rightarrow$ & $\varnothing$
\end{tabular}
\end{center}

\noindent
We receive in the first two cases a successful output (that is a non-empty set).

A bit more interesting is the \emph{sequence parser combinator} implemented in
Scala as follows:

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
class SeqParser[I, T, S]
       (p: => Parser[I, T], 
        q: => Parser[I, S]) extends Parser[I, (T, S)] {
  def parse(sb: I) = 
    for ((head1, tail1) <- p.parse(sb); 
         (head2, tail2) <- q.parse(tail1)) 
            yield ((head1, head2), tail2)
}
\end{lstlisting}
\end{center}

\noindent
This parser takes as input two parsers, \texttt{p} and \texttt{q}. It implements \texttt{parse} 
as follows: let first run the parser \texttt{p} on the input producing a set of pairs (\texttt{head1}, \texttt{tail1}).
The \texttt{tail1} stands for the unprocessed parts left over by \texttt{p}. 
Let \texttt{q} run on these unprocessed parts
producing again a set of pairs. The output of the sequence parser combinator is then a set
containing pairs where the first components are again pairs, namely what the first parser could parse
together with what the second parser could parse; the second component is the unprocessed
part left over after running the second parser \texttt{q}. Therefore the input type of
the sequence parser combinator is as usual \texttt{I}, but the output type is

\begin{center}
\texttt{Set[((T, S), I)]}
\end{center}

Scala allows us to provide some
shorthand notation for the sequence parser combinator. So we can write for 
example \texttt{'a'  $\sim$ 'b'}, which is the
parser combinator that first consumes the character \texttt{a} from a string and then \texttt{b}.
Calling this parser combinator with the strings

\begin{center}
\begin{tabular}{rcl}
input string & & output\medskip\\
\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
\texttt{\Grid{bac}} & $\rightarrow$ & $\varnothing$\\
\texttt{\Grid{ccc}} & $\rightarrow$ & $\varnothing$
\end{tabular}
\end{center}

\noindent
A slightly more complicated parser is \texttt{('a'  || 'b') $\sim$ 'b'} which parses as first character either
an \texttt{a} or \texttt{b} followed by a \texttt{b}. This parser produces the following results.

\begin{center}
\begin{tabular}{rcl}
input string & & output\medskip\\
\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
\texttt{\Grid{bbc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
\texttt{\Grid{aac}} & $\rightarrow$ & $\varnothing$
\end{tabular}
\end{center}

Note carefully that constructing the parser \texttt{'a' || ('a' $\sim$ 'b')} will result in a tying error.
The first parser has as output type a single character (recall the type of \texttt{CharParser}),
but the second parser produces a pair of characters as output. The alternative parser is however
required to have both component parsers to have the same type. We will see later how we can 
build this parser without the typing error.

The next parser combinator does not actually combine smaller parsers, but applies
a function to the result of the parser. It is implemented in Scala as follows

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
class FunParser[I, T, S]
         (p: => Parser[I, T], 
          f: T => S) extends Parser[I, S] {
  def parse(sb: I) = 
    for ((head, tail) <- p.parse(sb)) yield (f(head), tail)
}
\end{lstlisting}
\end{center}


\noindent
This parser combinator takes a parser \texttt{p} with output type \texttt{T} as
input as well as a function \texttt{f} with type \texttt{T => S}. The parser \texttt{p}
produces sets of type \texttt{(T, I)}. The \texttt{FunParser} combinator then
applies the function \texttt{f} to all the parer outputs. Since this function
is of type \texttt{T => S}, we obtain a parser with output type \texttt{S}.
Again Scala lets us introduce some shorthand notation for this parser combinator. 
Therefore we will write \texttt{p ==> f} for it.

%\bigskip
%takes advantage of the full generality---have a look
%what it produces if we call it with the string \texttt{abc}
%
%\begin{center}
%\begin{tabular}{rcl}
%input string & & output\medskip\\
%\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
%\texttt{\Grid{bbc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
%\texttt{\Grid{aac}} & $\rightarrow$ & $\varnothing$
%\end{tabular}
%\end{center}



\end{document}

%%% Local Variables: 
%%% mode: latex  
%%% TeX-master: t
%%% End: