handouts/ho06.tex
changeset 183 b17eff695c7f
parent 177 53def1fbf472
child 217 cd6066f1056a
--- a/handouts/ho06.tex	Mon Nov 11 00:34:14 2013 +0000
+++ b/handouts/ho06.tex	Mon Nov 11 13:48:34 2013 +0000
@@ -85,7 +85,7 @@
 \makeatother	
 
 \newcommand\Vspace[1][.3em]{%
-  \mbox{\kern.06em\vrle height.3ex}%
+  \mbox{\kern.06em\vrule height.3ex}%
   \vbox{\hrule width#1}%
   \hbox{\vrule height.3ex}}
 
@@ -96,9 +96,9 @@
 \section*{Handout 6}
 
 While regular expressions are very useful for lexing and for recognising
-many patterns (like email addresses), they have their limitations. For
+many patterns in strings (like email addresses), they have their limitations. For
 example there is no regular expression that can recognise the language 
-$a^nb^n$. Another example is the language of well-parenthesised 
+$a^nb^n$. Another example for which there exists no regular expression is the language of well-parenthesised 
 expressions.  In languages like Lisp, which use parentheses rather
 extensively, it might be of interest whether the following two expressions
 are well-parenthesised (the left one is, the right one is not):
@@ -107,6 +107,8 @@
 $(((()()))())$  \hspace{10mm} $(((()()))()))$
 \end{center}
 
+\noindent
+Not being able to solve such recognition problems is a serious limitation.
 In order to solve such recognition problems, we need more powerful 
 techniques than regular expressions. We will in particular look at \emph{context-free
 languages}. They include the regular languages as the picture below shows:
@@ -134,8 +136,9 @@
 \end{center}
  
 \noindent
-In general grammars consist of finitely many rules built up from terminal symbols (usually lower-case letters)
-and non-terminal symbols (upper-case letters).  Rules have the shape
+In general grammars consist of finitely many rules built up from \emph{terminal symbols} 
+(usually lower-case letters) and \emph{non-terminal symbols} (upper-case letters).  Rules 
+have the shape
 
 \begin{center}
 $NT \;\;\rightarrow\;\; \textit{rhs}$
@@ -236,7 +239,7 @@
 \end{center}
 
 \noindent
-It can be easily seems that the grammar above for arithmetic expressions is left-recursive:
+It can be easily seen that the grammar above for arithmetic expressions is left-recursive:
 for example the rules $E \rightarrow E\cdot + \cdot E$ and $N \rightarrow N\cdot N$ 
 show that this grammar is left-recursive. Some algorithms cannot cope with left-recursive 
 grammars. Fortunately every left-recursive grammar can be transformed into one that is
@@ -252,10 +255,11 @@
 Using this grammar we can still derive every number string, but we will never be able 
 to derive a string of the form $\ldots \rightarrow N \cdot \ldots$.
 
-The other property we have to watch out is when a grammar is
+The other property we have to watch out for is when a grammar is
 \emph{ambiguous}. A grammar is said to be ambiguous if there are two parse-trees
 for one string. Again the grammar for arithmetic expressions shown above is ambiguous.
-While the shown parse tree for the string $(1 + 23) + 4$ is unique, there are two parse
+While the shown parse tree for the string $(1 + 23) + 4$ is unique, this is not the case in
+general. For example there are two parse
 trees for the string $1 + 2 + 3$, namely
 
 \begin{center}
@@ -291,7 +295,7 @@
 grammars because two different parse-trees for a string mean a program can
 be interpreted in two different ways. In such cases we have to somehow make sure
 the two different ways do not matter, or disambiguate the grammar in
-some way (for example making the $+$ left-associative). Unfortunately already 
+some other way (for example making the $+$ left-associative). Unfortunately already 
 the problem of deciding whether a grammar
 is ambiguous or not is in general undecidable. 
 
@@ -301,7 +305,7 @@
 describes the strings of natural numbers, such as the grammar $N$ shown above.
 For all such strings we want to generate the parse-trees or later on we actually 
 want to extract the meaning of these strings, that is the concrete integers ``behind'' 
-these strings. The parser combinators will be functions of type
+these strings. In Scala the parser combinators will be functions of type
 
 \begin{center}
 \texttt{I $\Rightarrow$ Set[(T, I)]}
@@ -312,9 +316,33 @@
 and return a set of pairs. The first component of these pairs corresponds to what the
 parser combinator was able to process from the input and the second is the unprocessed 
 part of the input. As we shall see shortly, a parser combinator might return more than one such pair,
-with the idea that there are potentially several ways how to interpret the input.
+with the idea that there are potentially several ways how to interpret the input. As a concrete
+example, consider the case where the input is of type string, say the string
+
+\begin{center}
+\tt\Grid{iffoo\VS testbar}
+\end{center}
+
+\noindent
+We might have a parser combinator which tries to interpret this string as a keyword (\texttt{if}) or
+an identifier (\texttt{iffoo}). Then the output will be the set
 
-The abstract class for parser combinators requires the implementation of the function
+\begin{center}
+$\left\{ \left(\texttt{\Grid{if}}\,,\, \texttt{\Grid{foo\VS testbar}}\right), 
+           \left(\texttt{\Grid{iffoo}}\,,\, \texttt{\Grid{\VS testbar}}\right) \right\}$
+\end{center}
+
+\noindent
+where the first pair means the parser could recognise \texttt{if} from the input and leaves 
+the rest as `unprocessed' as the second component of the pair; in the other case
+it could recognise \texttt{iffoo} and leaves \texttt{\VS testbar} as unprocessed. If the parser
+cannot recognise anything from the input then parser combinators just return the empty 
+set $\varnothing$. This will indicate something ``went wrong''.
+
+The main attraction is that we can easily build parser combinators out of smaller components
+following very closely the structure of a grammar. In order to implement this in an object
+oriented programming language, like Scala, we need to specify an abstract class for parser 
+combinators. This abstract class requires the implementation of the function
 \texttt{parse} taking an argument of type \texttt{I} and returns a set of type  
 \mbox{\texttt{Set[(T, I)]}}.
 
@@ -331,6 +359,11 @@
 \end{center}
 
 \noindent
+From the function \texttt{parse} we can then ``centrally'' derive the function \texttt{parse\_all},
+which just filters out all pairs whose second component is not empty (that is has still some
+unprocessed part). The reason is that at the end of parsing we are only interested in the
+results where all the input has been consumed and no unprocessed part is left.
+
 One of the simplest parser combinators recognises just a character, say $c$, 
 from the beginning of strings. Its behaviour is as follows:
 
@@ -354,6 +387,168 @@
 \end{lstlisting}
 \end{center}
 
+\noindent
+The \texttt{parse} function tests whether the first character of the 
+input string \texttt{sb} is equal to \texttt{c}. If yes, then it splits the
+string into the recognised part \texttt{c} and the unprocessed part
+\texttt{sb.tail}. In case \texttt{sb} does not start with \texttt{c} then
+the parser returns the empty set (in Scala \texttt{Set()}).
+
+More interesting are the parser combinators that build larger parsers
+out of smaller component parsers. For example the alternative 
+parser combinator is as follows.
+
+\begin{center}
+\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
+class AltParser[I, T]
+       (p: => Parser[I, T], 
+        q: => Parser[I, T]) extends Parser[I, T] {
+  def parse(sb: I) = p.parse(sb) ++ q.parse(sb)
+}
+\end{lstlisting}
+\end{center}
+
+\noindent
+The types of this parser combinator are polymorphic (we just have \texttt{I}
+for the input type, and \texttt{T} for the output type). The alternative parser
+builds a new parser out of two existing parser combinator \texttt{p} and \texttt{q}.
+Both need to be able to process input of type \texttt{I} and return the same
+output type \texttt{Set[(T, I)]}. (There is an interesting detail of Scala, namely the 
+\texttt{=>} in front of the types of \texttt{p} and \texttt{q}. They will prevent the
+evaluation of the arguments before they are used. This is often called 
+\emph{lazy evaluation} of the arguments.) The alternative parser should run
+the input with the first parser \texttt{p} (producing a set of outputs) and then
+run the same input with \texttt{q}. The result should be then just the union
+of both sets, which is the operation \texttt{++} in Scala.
+
+This parser combinator already allows us to construct a parser that either 
+a character \texttt{a} or \texttt{b}, as
+
+\begin{center}
+\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
+new AltParser(CharParser('a'), CharParser('b'))
+\end{lstlisting}
+\end{center}
+
+\noindent
+Scala allows us to introduce some more readable shorthand notation for this, like \texttt{'a' || 'b'}. 
+We can call this parser combinator with the strings
+
+\begin{center}
+\begin{tabular}{rcl}
+input string & & output\medskip\\
+\texttt{\Grid{ac}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{c}})\right\}$\\
+\texttt{\Grid{bc}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{c}})\right\}$\\
+\texttt{\Grid{cc}} & $\rightarrow$ & $\varnothing$
+\end{tabular}
+\end{center}
+
+\noindent
+We receive in the first two cases a successful output (that is a non-empty set).
+
+A bit more interesting is the \emph{sequence parser combinator} implemented in
+Scala as follows:
+
+\begin{center}
+\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
+class SeqParser[I, T, S]
+       (p: => Parser[I, T], 
+        q: => Parser[I, S]) extends Parser[I, (T, S)] {
+  def parse(sb: I) = 
+    for ((head1, tail1) <- p.parse(sb); 
+         (head2, tail2) <- q.parse(tail1)) 
+            yield ((head1, head2), tail2)
+}
+\end{lstlisting}
+\end{center}
+
+\noindent
+This parser takes as input two parsers, \texttt{p} and \texttt{q}. It implements \texttt{parse} 
+as follows: let first run the parser \texttt{p} on the input producing a set of pairs (\texttt{head1}, \texttt{tail1}).
+The \texttt{tail1} stands for the unprocessed parts left over by \texttt{p}. 
+Let \texttt{q} run on these unprocessed parts
+producing again a set of pairs. The output of the sequence parser combinator is then a set
+containing pairs where the first components are again pairs, namely what the first parser could parse
+together with what the second parser could parse; the second component is the unprocessed
+part left over after running the second parser \texttt{q}. Therefore the input type of
+the sequence parser combinator is as usual \texttt{I}, but the output type is
+
+\begin{center}
+\texttt{Set[((T, S), I)]}
+\end{center}
+
+Scala allows us to provide some
+shorthand notation for the sequence parser combinator. So we can write for 
+example \texttt{'a'  $\sim$ 'b'}, which is the
+parser combinator that first consumes the character \texttt{a} from a string and then \texttt{b}.
+Calling this parser combinator with the strings
+
+\begin{center}
+\begin{tabular}{rcl}
+input string & & output\medskip\\
+\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
+\texttt{\Grid{bac}} & $\rightarrow$ & $\varnothing$\\
+\texttt{\Grid{ccc}} & $\rightarrow$ & $\varnothing$
+\end{tabular}
+\end{center}
+
+\noindent
+A slightly more complicated parser is \texttt{('a'  || 'b') $\sim$ 'b'} which parses as first character either
+an \texttt{a} or \texttt{b} followed by a \texttt{b}. This parser produces the following results.
+
+\begin{center}
+\begin{tabular}{rcl}
+input string & & output\medskip\\
+\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
+\texttt{\Grid{bbc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
+\texttt{\Grid{aac}} & $\rightarrow$ & $\varnothing$
+\end{tabular}
+\end{center}
+
+Note carefully that constructing the parser \texttt{'a' || ('a' $\sim$ 'b')} will result in a tying error.
+The first parser has as output type a single character (recall the type of \texttt{CharParser}),
+but the second parser produces a pair of characters as output. The alternative parser is however
+required to have both component parsers to have the same type. We will see later how we can 
+build this parser without the typing error.
+
+The next parser combinator does not actually combine smaller parsers, but applies
+a function to the result of the parser. It is implemented in Scala as follows
+
+\begin{center}
+\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
+class FunParser[I, T, S]
+         (p: => Parser[I, T], 
+          f: T => S) extends Parser[I, S] {
+  def parse(sb: I) = 
+    for ((head, tail) <- p.parse(sb)) yield (f(head), tail)
+}
+\end{lstlisting}
+\end{center}
+
+
+\noindent
+This parser combinator takes a parser \texttt{p} with output type \texttt{T} as
+input as well as a function \texttt{f} with type \texttt{T => S}. The parser \texttt{p}
+produces sets of type \texttt{(T, I)}. The \texttt{FunParser} combinator then
+applies the function \texttt{f} to all the parer outputs. Since this function
+is of type \texttt{T => S}, we obtain a parser with output type \texttt{S}.
+Again Scala lets us introduce some shorthand notation for this parser combinator. 
+Therefore we will write \texttt{p ==> f} for it.
+
+%\bigskip
+%takes advantage of the full generality---have a look
+%what it produces if we call it with the string \texttt{abc}
+%
+%\begin{center}
+%\begin{tabular}{rcl}
+%input string & & output\medskip\\
+%\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
+%\texttt{\Grid{bbc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
+%\texttt{\Grid{aac}} & $\rightarrow$ & $\varnothing$
+%\end{tabular}
+%\end{center}
+
+
 
 \end{document}