handouts/ho04.tex
changeset 287 2c50b8b5886c
parent 286 19020b75d75e
child 288 39aeca14af8c
--- a/handouts/ho04.tex	Fri Oct 17 14:18:15 2014 +0100
+++ b/handouts/ho04.tex	Sat Oct 18 01:02:03 2014 +0100
@@ -12,8 +12,8 @@
 yes or no depending on whether a string was matched by regular
 expression or not. Often a more interesting question is to
 find out \emph{how} a regular expression matched a string?
-Answering this question will help us with the problem we are
-after, namely tokenising an input string. The algorithm we
+Answering this question will also help us with the problem we
+are after, namely tokenising an input string. The algorithm we
 will be looking at for this was designed by Sulzmann \& Lu in
 a rather recent paper. A link to it is provided on KEATS, in
 case you are interested.\footnote{In my humble opinion this is
@@ -73,7 +73,21 @@
 that was needed to match the string. This means we might also
 return the empty list $[]$, if no copy was needed.
 
-Graphically the algorithm by Sulzmann \& Lu can be represneted
+To emphasise the connection between regular expressions and
+values, I have in my implementation the convention that
+regular expressions are written entirely with upper-case
+letters, while values just start with a single upper-case
+character. My definition of values in Scala is below. I use 
+this in the REPL of Scala; when I use the Scala compiler
+I need to rename some constructors, because Scala on Macs
+does not like classes that are called \pcode{EMPTY} and
+\pcode{Empty}.
+ 
+{\small\lstinputlisting[language=Scala,numbers=none]
+{../progs/app02.scala}}
+
+
+Graphically the algorithm by Sulzmann \& Lu can be illustrated
 by the picture in Figure~\ref{Sulz} where the path from the
 left to the right involving $der/nullable$ is the first phase
 of the algorithm and $mkeps/inj$, the path from right to left,
@@ -132,17 +146,17 @@
 regular expression on the left-hand side. This will become
 important later on.
 
-The second phase of the algorithm is organised recursively
-such that it will calculate a value for how the derivative
-regular expression has matched a string whose first character
-has been chopped off. Now we need a function that reverses
-this ``chopping off'' for values. The corresponding function
-is called $inj$ for injection. This function takes three
-arguments: the first one is a regular expression for which we
-want to calculate the value, the second is the character we
-want to inject and the third argument is the value where we
-will inject the character. The result of this function is a
-new value. The definition of $inj$ is as follows: 
+The second phase of the algorithm is organised so that it will
+calculate a value for how the derivative regular expression
+has matched a string whose first character has been chopped
+off. Now we need a function that reverses this ``chopping
+off'' for values. The corresponding function is called $inj$
+for injection. This function takes three arguments: the first
+one is a regular expression for which we want to calculate the
+value, the second is the character we want to inject and the
+third argument is the value where we will inject the
+character. The result of this function is a new value. The
+definition of $inj$ is as follows: 
 
 \begin{center}
 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
@@ -161,12 +175,12 @@
 there are, for example, three cases for sequnece regular
 expressions. The last clause for the star regular expression
 returns a list where the first element is $inj\,r\,c\,v$ and
-the other elements are $vs$. That mean $\_\,::\,\_$ should be 
+the other elements are $vs$. That means $\_\,::\,\_$ should be 
 read as list cons.
 
 To understand what is going on, it might be best to do some
-example calculations and compare with Figure~\ref{Sulz}. For
-this note that we have not yet dealt with the need of
+example calculations and compare them with Figure~\ref{Sulz}.
+For this note that we have not yet dealt with the need of
 simplifying regular expressions (this will be a topic on its
 own later). Suppose the regular expression is $a \cdot (b
 \cdot c)$ and the input string is $abc$. The derivatives from
@@ -215,8 +229,9 @@
 $v_2:\;Seq(Empty, Seq(Char(b), Char(c)))$
 \end{center}
 
-\noindent Finally we need to inject back the letter $a$ into
-$v_2$ giving the final result
+\noindent which is again the correct result for how $r_2$
+matched the string $bc$. Finally we need to inject back the
+letter $a$ into $v_2$ giving the final result
 
 \begin{center}
 $v_1:\;Seq(Char(a), Seq(Char(b), Char(c)))$
@@ -226,7 +241,7 @@
 expression $a \cdot (b \cdot c)$ matched the string $abc$.
 
 There are a few auxiliary functions that are of interest
-in analysing this algorithm. One is called \emph{flatten},
+when analysing this algorithm. One is called \emph{flatten},
 written $|\_|$, which extracts the string ``underlying'' a 
 value. It is defined recursively as
 
@@ -254,35 +269,39 @@
 \end{tabular}
 \end{center}
 
-\noindent 
-This indicates that $inj$ indeed is injecting, or adding, back
-a character into the value.
+\noindent This indicates that $inj$ indeed is injecting, or
+adding, back a character into the value. Next we look at how
+simplification can be included into this algorithm.
 
 
 \subsubsection*{Simplification}
 
 Generally the matching algorithms based on derivatives do
 poorly unless the regular expressions are simplified after
-each derivatives step. But this is a bit more involved in 
-algorithm of Sulzmann \& Lu. Consider the last derivation
-step in our running example
+each derivative step. But this is a bit more involved in the
+algorithm of Sulzmann \& Lu. So what follows might require you
+to read several times before it makes sense and also might
+require that you do some example calculations. As a first
+example consider the last derivation step in our earlier
+example:
 
 \begin{center}
-$r_4 = der\,c\,r_3 = (\varnothing \cdot (b \cdot c)) + ((\varnothing \cdot c) + \epsilon)$
+$r_4 = der\,c\,r_3 = 
+(\varnothing \cdot (b \cdot c)) + ((\varnothing \cdot c) + \epsilon)$
 \end{center}
 
-\noindent Simplifying the result would just give us
-$\epsilon$. Running $mkeps$ on this regular expression would
-then provide us with $Empty$ instead of $Right(Right(Empty))$
-that was obtained without the simplification. The problem is
-we need to recreate this more complicated value, rather than 
-just $Empty$.
+\noindent Simplifying this regular expression would just give
+us $\epsilon$. Running $mkeps$ with this regular expression as
+input, however, would then provide us with $Empty$ instead of
+$Right(Right(Empty))$ that was obtained without the
+simplification. The problem is we need to recreate this more
+complicated value, rather than just $Empty$.
 
-This requires what I call \emph{rectification functions}. They
-need to be calculated whenever a regular expression gets
+This will require what I call \emph{rectification functions}.
+They need to be calculated whenever a regular expression gets
 simplified. Rectification functions take a value as argument
-and return a (rectified) value. Our simplification rules so
-far are
+and return a (rectified) value. Let us first take a look again
+at our simplification rules:
 
 \begin{center}
 \begin{tabular}{l@{\hspace{2mm}}c@{\hspace{2mm}}l}
@@ -297,38 +316,49 @@
 \end{center}
 
 \noindent Applying them to $r_4$ will require several nested
-simplifications in order end up with just $\epsilon$. 
+simplifications in order end up with just $\epsilon$. However,
+it is possible to apply them in a depth-first, or inside-out,
+manner in order to calculate this simplified regular
+expression.
 
-We can implement this by letting simp return not just a
-(simplified) regular expression, but also a rectification
-function. Let us consider the alternative case, say $r_1 +
-r_2$, first. We would first simplify the component regular
-expressions $r_1$ and $r_2.$ This will return simplified
-versions (if they can be simplified), say $r_{1s}$ and
-$r_{2s}$, but also two rectification functions $f_{1s}$ and
-$f_{2s}$. We need to assemble them in order to obtain a
-rectified value for $r_1 + r_2$. In case $r_{1s}$ simplified
-to $\varnothing$, we would continue the derivative calculation
-with $r_{2s}$. The Sulzmann \& Lu algorithm would return a
-corresponding value, say $v_{2s}$. But now this needs to
-be ``rectified'' to the value 
+The rectification we can implement this by letting simp return
+not just a (simplified) regular expression, but also a
+rectification function. Let us consider the alternative case,
+$r_1 + r_2$, first. By going depth-first, we first simplify
+the component regular expressions $r_1$ and $r_2.$ This will
+return simplified versions (if they can be simplified), say
+$r_{1s}$ and $r_{2s}$, but also two rectification functions
+$f_{1s}$ and $f_{2s}$. We need to assemble them in order to
+obtain a rectified value for $r_1 + r_2$. In case $r_{1s}$
+simplified to $\varnothing$, we continue the derivative
+calculation with $r_{2s}$. The Sulzmann \& Lu algorithm would
+return a corresponding value, say $v_{2s}$. But now this value
+needs to be ``rectified'' to the value 
 
 \begin{center}
 $Right(v_{2s})$
 \end{center}
 
-\noindent Unfortunately, this is not enough because there
-might be some simplifications that happened inside $r_2$ 
-and for which the simplification function retuned also
-a rectification function $f_{2s}$. So in fact we need to
-apply this one too which gives
+\noindent The reason is that we look for the value that tells
+us how $r_1 + r_2$ could have matched the string, not just
+$r_2$ or $r_{2s}$. Unfortunately, this is still not the right
+value in general because there might be some simplifications
+that happened inside $r_2$ and for which the simplification
+function retuned also a rectification function $f_{2s}$. So in
+fact we need to apply this one too which gives
 
 \begin{center}
 $Right(f_{2s}(v_{2s}))$
 \end{center}
 
-\noindent So if we want to return this as function,
-we would need to return
+\noindent This is now the correct, or rectified, value. Since
+the simplification will be done in the first phase of the
+algorithm, but the rectification needs to be done to the
+values in the second phase, it is advantageous to calculate
+the rectification as a function, remember this function and
+then apply the value to this function during the second phase.
+So if we want to implement the rectification as function, we 
+would need to return
 
 \begin{center}
 $\lambda v.\,Right(f_{2s}(v))$
@@ -339,8 +369,9 @@
 after the dot where $v$ is replaced by whatever value is 
 given.
 
-Let us package these ideas into a single function (still only
-considering the alternative case):
+Let us package this idea with rectification functions
+into a single function (still only considering the alternative
+case):
 
 \begin{center}
 \begin{tabular}{l}
@@ -352,45 +383,54 @@
        return $(r_{2s}, \lambda v. \,Right(f_{2s}(v)))$\\
 \qquad case $r_{2s} = \varnothing$: 
        return $(r_{1s}, \lambda v. \,Left(f_{1s}(v)))$\\
+\qquad case $r_{1s} = r_{2s}$:
+       return $(r_{1s}, \lambda v. \,Left(f_{1s}(v)))$\\
 \qquad otherwise: 
        return $(r_{1s} + r_{2s}, f_{alt}(f_{1s}, f_{2s}))$\\
 \end{tabular}
 \end{center}
 
-\noindent We first recursively call the simlification with $r_1$
-and $r_2$. This gives simplified regular expressions, $r_{1s}$
-and $r_{2s}$, as well as two rectification functions $f_{1s}$
-and $f_{2s}$. We next need to test whether the simplified 
-regular expressions are $\varnothing$ so as to make further 
-simplifications. In case $r_{1s}$ is $\varnothing$ then 
-we can return $r_{2s}$ (the other alternative). However
-we need to now build a rectification function, which as
-said above is
+\noindent We first recursively call the simplification with
+$r_1$ and $r_2$. This gives simplified regular expressions,
+$r_{1s}$ and $r_{2s}$, as well as two rectification functions
+$f_{1s}$ and $f_{2s}$. We next need to test whether the
+simplified regular expressions are $\varnothing$ so as to make
+further simplifications. In case $r_{1s}$ is $\varnothing$,
+then we can return $r_{2s}$ (the other alternative). However
+for this we need to build a corresponding rectification 
+function, which as said above is
 
 \begin{center}
 $\lambda v.\,Right(f_{2s}(v))$
 \end{center}
 
-\noindent The case where $r_{2s} = \varnothing$ is similar.
-We return $r_{1s}$ but now have to rectify such that we return
+\noindent The case where $r_{2s} = \varnothing$ is similar:
+We return $r_{1s}$ and rectify with $Left(\_)$ and the
+other calculated rectification function $f_{1s}$. This gives
 
 \begin{center}
 $\lambda v.\,Left(f_{1s}(v))$
 \end{center}
 
-\noindent Note that in this case we have to apply $f_{1s}$,
-not $f_{2s}$, which is responsible to rectify the inner parts
-of $v$. The otherwise-case is slightly interesting. In this
-case neither $r_{1s}$ nor $r_{2s}$ are $\varnothing$ and no
-further simplification can be applied. Accordingly, we return
-$r_{1s} + r_{2s}$ as the simplified regular expression. In
-principle we also do not have to do any rectification, because
-no simplification was done in this case. But this is actually
-not true: There might have been simplifications inside
-$r_{1s}$ and $r_2s$. We therefore need to take into account
-the calculated rectification functions $f_{1s}$ and $f_{2s}$.
-We can do this by defining a rectification function $f_{alt}$
-which takes two rectification functions as arguments
+\noindent The next case where $r_{1s} = r_{2s}$ can be treated
+like the one where $r_{2s} = \varnothing$. We return $r_{1s}$
+and rectify with $Left(\_)$ and so on.
+
+
+The otherwise-case is slightly more complicated. In this case
+neither $r_{1s}$ nor $r_{2s}$ are $\varnothing$ and also
+$r_{1s} \not= r_{2s}$, which means no further simplification
+can be applied. Accordingly, we return $r_{1s} + r_{2s}$ as
+the simplified regular expression. In principle we also do not
+have to do any rectification, because no simplification was
+done in this case. But this is actually not true: There might
+have been simplifications inside $r_{1s}$ and $r_{2s}$. We
+therefore need to take into account the calculated
+rectification functions $f_{1s}$ and $f_{2s}$. We can do this
+by defining a rectification function $f_{alt}$ which takes two
+rectification functions as arguments and applies them
+according to whether the value is of the form $Left(\_)$ or
+$Right(\_)$:
 
 \begin{center}
 \begin{tabular}{l@{\hspace{1mm}}l}
@@ -402,14 +442,9 @@
 \end{tabular}
 \end{center}
 
-\noindent In essence we need to apply in this case 
-the appropriate rectification function to the inner part
-of the value $v$, whereby $v$ can only be of the form 
-$Right(\_)$ or $Left(\_)$.
-
-The other interesting case with simplification is the 
-sequence case. Here the main simplification function 
-is as follows
+The other interesting case with simplification is the sequence
+case. In this case the main simplification function is as
+follows
 
 \begin{center}
 \begin{tabular}{l}
@@ -430,9 +465,9 @@
 \end{tabular}
 \end{center}
 
-\noindent whereby $f_{seq}$ is again pushing the two 
-rectification functions into the two components of
-the Seq-value:
+\noindent whereby in the last line $f_{seq}$ is again pushing
+the two rectification functions into the two components of the
+Seq-value:
 
 \begin{center}
 \begin{tabular}{l@{\hspace{1mm}}l}
@@ -442,16 +477,97 @@
 \end{tabular}
 \end{center}
 
-\noindent Note that in the case of $r_{1s}$ (similarly $r_{2s}$)
-we use the function $f_{error}$ for rectification. If you 
-think carefully, then you will see that this function will
-actually never been called. Because a sequence with 
-$\varnothing$ will never recognise any string and therefore
-the second phase of the algorithm would never been called.
-The simplification function still expects us to give a 
-function. So in my own implementation I just returned 
-a function which raises an error. 
+\noindent Note that in the case of $r_{1s} = \varnothing$
+(similarly $r_{2s}$) we use the function $f_{error}$ for
+rectification. If you think carefully, then you will realise
+that this function will actually never been called. This is
+because a sequence with $\varnothing$ will never recognise any
+string and therefore the second phase of the algorithm would
+never been called. The simplification function still expects
+us to give a function. So in my own implementation I just
+returned a function which raises an error. In the case
+where $r_{1s} = \epsilon$ (similarly $r_{2s}$) we have
+to create a sequence where the first component is a rectified
+version of $Empty$. Therefore we call $f_{1s}$ with $Empty$.
+
+Since we only simplify regular expressions of the form $r_1 +
+r_2$ and $r_1 \cdot r_2$ we do not have to do anything else
+in the remaining cases. The rectification function will
+be just the identity, which in lambda-calculus terms is
+just
+
+\begin{center}
+$\lambda v.\,v$
+\end{center}
+
+\noindent This completes the high-level version of the
+simplification function, which is also shown again in 
+Figure~\ref{simp}. This can now be used in a \emph{lexing
+function} as follows:
+
+\begin{figure}[t]
+\begin{center}
+\begin{tabular}{l}
+$simp(r)$:\\
+\quad case $r = r_1 + r_2$\\
+\qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\
+\qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\
+\qquad case $r_{1s} = \varnothing$: 
+       return $(r_{2s}, \lambda v. \,Right(f_{2s}(v)))$\\
+\qquad case $r_{2s} = \varnothing$: 
+       return $(r_{1s}, \lambda v. \,Left(f_{1s}(v)))$\\
+\qquad case $r_{1s} = r_{2s}$:
+       return $(r_{1s}, \lambda v. \,Left(f_{1s}(v)))$\\
+\qquad otherwise: 
+       return $(r_{1s} + r_{2s}, f_{alt}(f_{1s}, f_{2s}))$
+       \medskip\\
 
+\quad case $r = r_1 \cdot r_2$\\
+\qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\
+\qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\
+\qquad case $r_{1s} = \varnothing$: 
+       return $(\varnothing, f_{error})$\\
+\qquad case $r_{2s} = \varnothing$: 
+       return $(\varnothing, f_{error})$\\
+\qquad case $r_{1s} = \epsilon$: 
+return $(r_{2s}, \lambda v. \,Seq(f_{1s}(Empty), f_{2s}(v)))$\\
+\qquad case $r_{2s} = \epsilon$: 
+return $(r_{1s}, \lambda v. \,Seq(f_{1s}(v), f_{2s}(Empty)))$\\
+\qquad otherwise: 
+       return $(r_{1s} \cdot r_{2s}, f_{seq}(f_{1s}, f_{2s}))$
+       \medskip\\
+
+\quad otherwise:\\
+\qquad return $(r, \lambda v.\,v)$\\
+\end{tabular}
+\end{center}
+\caption{The simplification function that returns a simplified 
+regular expression and a rectification function.\label{simp}}
+\end{figure}
+
+\begin{center}
+\begin{tabular}{lcl}
+$lex\,r\,[]$ & $\dn$ & if $nullable(r)$ then $mkeps(r)$\\
+             &       & else $error$\\
+$lex\,r\,c\!::\!s$ & $\dn$ & let 
+   $(r_{simp}, f_{rect}) = simp(der(c, r))$\\
+& & $inj\,r\,c\,f_{rect}(lex\,r_{simp}\,s)$              
+\end{tabular}
+\end{center}
+
+\noindent This corresponds to the $matches$ function we
+have seen in earlier lectures. In the first clause we are
+given an empty string, $[]$, and need to test wether the
+regular expression is $nullable$. If yes we can proceed
+normally and just return the value calculated by $mkeps$.
+The second clause is for strings where the first character is
+$c$ and the rest of the string is $s$. We first build the
+derivative of $r$ with respect to $c$; simplify the resulting 
+regulare expression. We continue lexing with the simplified
+regular expression and the string $s$. Whatever will be
+returned as value, we sill rectify using the $f_{rect}$
+from the simplification and finally inject $c$ back into
+the (rectified) value.
 
 
 \subsubsection*{Records and Tokenisation}