handouts/ho04.tex
changeset 422 5deefcc8cffa
parent 417 e74c696821a2
child 444 3056a4c071b0
--- a/handouts/ho04.tex	Tue Aug 23 23:12:55 2016 +0200
+++ b/handouts/ho04.tex	Tue Sep 20 12:24:29 2016 +0100
@@ -10,20 +10,34 @@
 \section*{Handout 4 (Sulzmann \& Lu Algorithm)}
 
 So far our algorithm based on derivatives was only able to say
-yes or no depending on whether a string was matched by regular
+yes or no depending on whether a string was matched by a regular
 expression or not. Often a more interesting question is to
 find out \emph{how} a regular expression matched a string?
 Answering this question will also help us with the problem we
 are after, namely tokenising an input string. 
 
-The algorithm we will be looking at was designed by Sulzmann
-\& Lu in a rather recent paper (from 2014). A link to it is
+The algorithm we will be looking at in this lecture was designed by Sulzmann
+\& Lu in a rather recent research paper (from 2014). A link to it is
 provided on KEATS, in case you are interested.\footnote{In my
 humble opinion this is an interesting instance of the research
 literature: it contains a very neat idea, but its presentation
-is rather sloppy. In earlier versions of their paper, a King's
-student and I found several rather annoying typos in their
-examples and definitions.} In order to give an answer for
+is rather sloppy. In earlier versions of this paper, a King's
+undergraduate student and I found several rather annoying typos in the
+examples and definitions.} My PhD student Fahad Ausaf and I even more recently 
+wrote a paper where we build on their result: we provided a 
+mathematical proof that their algorithm is really correct---the proof 
+Sulzmann \& Lu had originally given contained major flaws. Such correctness
+proofs are important: Kuklewicz maintains a unit-test library
+for the kind of algorithma we are interested in here and he showed 
+that many implementations in the ``wild'' are buggy, that is not
+satisfy his unit tests:
+
+\begin{center}
+\url{http://www.haskell.org/haskellwiki/Regex_Posix}
+\end{center}
+
+
+In order to give an answer for
 \emph{how} a regular expression matches a string, Sulzmann and
 Lu use \emph{values}. A value will be the output of the
 algorithm whenever the regular expression matches the string.
@@ -54,7 +68,7 @@
       & $\mid$ & $Seq(v_1,v_2)$\\
       & $\mid$ & $\Left(v)$   \\
       & $\mid$ & $Right(v)$  \\
-      & $\mid$ & $Stars [v_1,\ldots\,v_n]$ \\
+      & $\mid$ & $Stars\,[v_1,\ldots\,v_n]$ \\
 \end{tabular}
 \end{tabular}
 \end{center}
@@ -71,7 +85,7 @@
 of $r^*$. For sequence, there is exactly one value, composed 
 of two component values ($v_1$ and $v_2$).
 
-My definition of regular expressions and values in Scala is
+My implementation of regular expressions and values in Scala is
 shown below. I have in my implementation the convention that
 regular expressions are written entirely with upper-case
 letters, while values just start with a single upper-case
@@ -87,14 +101,16 @@
 
 Graphically the algorithm by Sulzmann \& Lu can be illustrated
 by the picture in Figure~\ref{Sulz} where the path from the
-left to the right involving $der/nullable$ is the first phase
-of the algorithm and $mkeps/inj$, the path from right to left,
+left to the right involving $\textit{der}/\textit{nullable}$ is the first phase
+of the algorithm and $\textit{mkeps}/\textit{inj}$, the path from right to left,
 the second phase. This picture shows the steps required when a
 regular expression, say $r_1$, matches the string $abc$. We
 first build the three derivatives (according to $a$, $b$ and
-$c$). We then use $nullable$ to find out whether the resulting
+$c$). We then use $\textit{nullable}$ to find out whether the resulting
 regular expression can match the empty string. If yes, we call
-the function $mkeps$.
+the function $\textit{mkeps}$. The $\textit{mkeps}$ function calculates a value for how a regular
+expression has matched the empty string. Its definition
+is as follows:
 
 \begin{figure}[t]
 \begin{center}
@@ -102,43 +118,41 @@
                     every node/.style={minimum size=7mm}]
 \node (r1)  {$r_1$};
 \node (r2) [right=of r1]{$r_2$};
-\draw[->,line width=1mm](r1)--(r2) node[above,midway] {$der\,a$};
+\draw[->,line width=1mm](r1)--(r2) node[above,midway] {$\textit{der}\,a$};
 \node (r3) [right=of r2]{$r_3$};
-\draw[->,line width=1mm](r2)--(r3) node[above,midway] {$der\,b$};
+\draw[->,line width=1mm](r2)--(r3) node[above,midway] {$\textit{der}\,b$};
 \node (r4) [right=of r3]{$r_4$};
-\draw[->,line width=1mm](r3)--(r4) node[above,midway] {$der\,c$};
-\draw (r4) node[anchor=west] {\;\raisebox{3mm}{$nullable$}};
+\draw[->,line width=1mm](r3)--(r4) node[above,midway] {$\textit{der}\,c$};
+\draw (r4) node[anchor=west] {\;\raisebox{3mm}{$\textit{nullable}$}};
 \node (v4) [below=of r4]{$v_4$};
 \draw[->,line width=1mm](r4) -- (v4);
 \node (v3) [left=of v4] {$v_3$};
-\draw[->,line width=1mm](v4)--(v3) node[below,midway] {$inj\,c$};
+\draw[->,line width=1mm](v4)--(v3) node[below,midway] {$\textit{inj}\,c$};
 \node (v2) [left=of v3]{$v_2$};
-\draw[->,line width=1mm](v3)--(v2) node[below,midway] {$inj\,b$};
+\draw[->,line width=1mm](v3)--(v2) node[below,midway] {$\textit{inj}\,b$};
 \node (v1) [left=of v2] {$v_1$};
-\draw[->,line width=1mm](v2)--(v1) node[below,midway] {$inj\,a$};
-\draw (r4) node[anchor=north west] {\;\raisebox{-8mm}{$mkeps$}};
+\draw[->,line width=1mm](v2)--(v1) node[below,midway] {$\textit{inj}\,a$};
+\draw (r4) node[anchor=north west] {\;\raisebox{-8mm}{$\textit{mkeps}$}};
 \end{tikzpicture}
 \end{center}
 \caption{The two phases of the algorithm by Sulzmann \& Lu.
 \label{Sulz}}
 \end{figure}
 
-The $mkeps$ function calculates a value for how a regular
-expression has matched the empty string. Its definition
-is as follows:
+
 
 \begin{center}
 \begin{tabular}{lcl}
-  $mkeps(\ONE)$       & $\dn$ & $Empty$\\
-  $mkeps(r_1 + r_2)$      & $\dn$ & if $nullable(r_1)$  \\
-                          &       & then $\Left(mkeps(r_1))$\\
-                          &       & else $Right(mkeps(r_2))$\\
-  $mkeps(r_1 \cdot r_2)$  & $\dn$ & $Seq(mkeps(r_1),mkeps(r_2))$\\
-  $mkeps(r^*)$            & $\dn$ & $[]$  \\
+  $\textit{mkeps}(\ONE)$       & $\dn$ & $Empty$\\
+  $\textit{mkeps}(r_1 + r_2)$      & $\dn$ & if $\textit{nullable}(r_1)$  \\
+                          &       & then $\Left(\textit{mkeps}(r_1))$\\
+                          &       & else $Right(\textit{mkeps}(r_2))$\\
+  $\textit{mkeps}(r_1 \cdot r_2)$  & $\dn$ & $Seq(\textit{mkeps}(r_1),\textit{mkeps}(r_2))$\\
+  $\textit{mkeps}(r^*)$            & $\dn$ & $Stars\,[]$  \\
 \end{tabular}
 \end{center}
 
-\noindent There are no cases for $\ZERO$ and $c$, since
+\noindent There is are no cases for $\ZERO$ and $c$, since
 these regular expression cannot match the empty string. Note
 also that in case of alternatives we give preference to the
 regular expression on the left-hand side. This will become
@@ -149,31 +163,31 @@
 has matched a string. For this we need a function that
 reverses this ``chopping off'' for values which we did in the
 first phase for derivatives. The corresponding function is
-called $inj$ for injection. This function takes three
+called $\textit{inj}$ (short for injection). This function takes three
 arguments: the first one is a regular expression for which we
 want to calculate the value, the second is the character we
 want to inject and the third argument is the value where we
 will inject the character into. The result of this function is a
-new value. The definition of $inj$ is as follows: 
+new value. The definition of $\textit{inj}$ is as follows: 
 
 \begin{center}
 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
-  $inj\,(c)\,c\,Empty$            & $\dn$ & $Char\,c$\\
-  $inj\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(inj\,r_1\,c\,v)$\\
-  $inj\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(inj\,r_2\,c\,v)$\\
-  $inj\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$  & $Seq(inj\,r_1\,c\,v_1,v_2)$\\
-  $inj\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$  & $Seq(inj\,r_1\,c\,v_1,v_2)$\\
-  $inj\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(mkeps(r_1),inj\,r_2\,c\,v)$\\
-  $inj\,(r^*)\,c\,Seq(v,vs)$         & $\dn$  & $inj\,r\,c\,v\,::\,vs$\\
+  $\textit{inj}\,(c)\,c\,Empty$            & $\dn$ & $Char\,c$\\
+  $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
+  $\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\
+  $\textit{inj}\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
+  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
+  $\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
+  $\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$         & $\dn$  & $Stars(\textit{inj}\,r\,c\,v\,::\,vs)$\\
 \end{tabular}
 \end{center}
 
 \noindent This definition is by recursion on the regular
 expression and by analysing the shape of the values. Therefore
-there are, for example, three cases for sequence regular
-expressions (for all possible shapes of the value). The last
+there are three cases for sequence regular
+expressions (for all possible shapes of the value we can encounter). The last
 clause for the star regular expression returns a list where
-the first element is $inj\,r\,c\,v$ and the other elements are
+the first element is $\textit{inj}\,r\,c\,v$ and the other elements are
 $vs$. That means $\_\,::\,\_$ should be read as list cons.
 
 To understand what is going on, it might be best to do some
@@ -195,36 +209,36 @@
 
 \noindent According to the simple algorithm, we would test
 whether $r_4$ is nullable, which in this case it indeed is.
-This means we can use the function $mkeps$ to calculate a
+This means we can use the function $\textit{mkeps}$ to calculate a
 value for how $r_4$ was able to match the empty string.
 Remember that this function gives preference for alternatives
 on the left-hand side. However there is only $\ONE$ on the
-very right-hand side of $r_4$ that matches the empty string.
-Therefore $mkeps$ returns the value
+very right-hand side of $r_4$ (underlined)
+
+\begin{center}
+$r_4$:\;$(\ZERO \cdot (b \cdot c)) + 
+         ((\ZERO \cdot c) + \underline{\ONE})$\\
+\end{center}
+
+\noindent
+that matches the empty string.
+Therefore $\textit{mkeps}$ returns the value
 
 \begin{center}
 $v_4:\;Right(Right(Empty))$
 \end{center}
 
 \noindent If there had been a $\ONE$ on the left, then
-$mkeps$ would have returned something of the form
+$\textit{mkeps}$ would have returned something of the form
 $\Left(\ldots)$. The point is that from this value we can
 directly read off which part of $r_4$ matched the empty
 string: take the right-alternative first, and then the
-right-alternative again. Remember $r_4$ is of the form
-
-\begin{center}
-$r_4$:\;$(\ZERO \cdot (b \cdot c)) + 
-         ((\ZERO \cdot c) + \underline{\ONE})$\\
-\end{center}
-
-\noindent the value tells us that the underlined $\ONE$
-is responsible for matching the empty string.
+right-alternative again. 
 
 Next we have to ``inject'' the last character, that is $c$ in
 the running example, into this value $v_4$ in order to
 calculate how $r_3$ could have matched the string $c$.
-According to the definition of $inj$ we obtain
+According to the definition of $\textit{inj}$ we obtain
 
 \begin{center}
 $v_3:\;Right(Seq(Empty, Char(c)))$
@@ -263,12 +277,12 @@
   $|\Left(v)|$   & $\dn$ & $|v|$\\
   $|Right(v)|$  & $\dn$ & $|v|$\\
   $|Seq(v_1,v_2)|$& $\dn$ & $|v_1| \,@\, |v_2|$\\
-  $|[v_1,\ldots ,v_n]|$ & $\dn$ & $|v_1| \,@\ldots @\, |v_n|$\\
+  $|Stars\,[v_1,\ldots ,v_n]|$ & $\dn$ & $|v_1| \,@\ldots @\, |v_n|$\\
 \end{tabular}
 \end{center}
 
 \noindent Using flatten we can see what are the strings behind 
-the values calculated by $mkeps$ and $inj$. In our running 
+the values calculated by $\textit{mkeps}$ and $\textit{inj}$. In our running 
 example:
 
 \begin{center}
@@ -280,10 +294,8 @@
 \end{tabular}
 \end{center}
 
-\noindent This indicates that $inj$ indeed is injecting, or
-adding, back a character into the value. If we continue until
-all characters are injected back, we have a value that can 
-indeed say how the string $abc$ was matched.
+\noindent This indicates that $\textit{inj}$ indeed is injecting, or
+adding, back a character into the value. 
 
 There is a problem, however, with the described algorithm
 so far: it is very slow. We need to include the simplification 
@@ -301,23 +313,29 @@
 first example consider the last derivation step in our earlier
 example:
 
-\begin{center}
-$r_4 = der\,c\,r_3 = 
-(\ZERO \cdot (b \cdot c)) + ((\ZERO \cdot c) + \ONE)$
-\end{center}
+\begin{equation}\label{regexfour}
+r_4 = \textit{der}\,c\,r_3 = 
+(\ZERO \cdot (b \cdot c)) + ((\ZERO \cdot c) + \ONE)
+\end{equation}
 
 \noindent Simplifying this regular expression would just give
-us $\ONE$. Running $mkeps$ with this regular expression as
-input, however, would then provide us with $Empty$ instead of
-$Right(Right(Empty))$ that was obtained without the
-simplification. The problem is we need to recreate this more
-complicated value, rather than just return $Empty$.
+us $\ONE$. Running $\textit{mkeps}$ with this $\ONE$ as
+input, however, would give us with the value $Empty$ instead of
+
+\[Right(Right(Empty))\]
 
-This will require what I call \emph{rectification functions}.
-They need to be calculated whenever a regular expression gets
-simplified. Rectification functions take a value as argument
-and return a (rectified) value. Let us first take a look again
-at our simplification rules:
+\noindent
+that was obtained  without the simplification. The problem  is we need
+to  recreate this  more  complicated value,  rather  than just  return
+$Empty$. This will require what I call \emph{rectification functions}.
+They  need  to  be  calculated  whenever  a  regular  expression  gets
+simplified.
+
+Rectification functions take a value as argument and return a
+(rectified) value. In the example above the argument would be $Empty$
+and the output $Right(Right(Empty))$.  Before we define these
+rectification functions in general, let us first take a look again at
+our simplification rules:
 
 \begin{center}
 \begin{tabular}{l@{\hspace{2mm}}c@{\hspace{2mm}}l}
@@ -331,7 +349,7 @@
 \end{tabular}
 \end{center}
 
-\noindent Applying them to $r_4$ will require several nested
+\noindent Applying them to $r_4$ in \eqref{regexfour} will require several nested
 simplifications in order end up with just $\ONE$. However,
 it is possible to apply them in a depth-first, or inside-out,
 manner in order to calculate this simplified regular
@@ -391,28 +409,28 @@
 case):
 
 \begin{center}
-\begin{tabular}{l}
-$simp(r)$:\\
-\quad case $r = r_1 + r_2$\\
-\qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\
-\qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\
-\qquad case $r_{1s} = \ZERO$: 
+\begin{tabular}{ll}
+$_1$ & $simp(r)$:\\
+$_2$ & \quad case $r = r_1 + r_2$\\
+$_3$ & \qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\
+$_4$ & \qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\
+$_5$ & \qquad case $r_{1s} = \ZERO$: 
        return $(r_{2s}, \lambda v. \,Right(f_{2s}(v)))$\\
-\qquad case $r_{2s} = \ZERO$: 
+$_6$ & \qquad case $r_{2s} = \ZERO$: 
        return $(r_{1s}, \lambda v. \,\Left(f_{1s}(v)))$\\
-\qquad case $r_{1s} = r_{2s}$:
+$_7$ & \qquad case $r_{1s} = r_{2s}$:
        return $(r_{1s}, \lambda v. \,\Left(f_{1s}(v)))$\\
-\qquad otherwise: 
+$_8$ & \qquad otherwise: 
        return $(r_{1s} + r_{2s}, f_{alt}(f_{1s}, f_{2s}))$\\
 \end{tabular}
 \end{center}
 
 \noindent We first recursively call the simplification with
-$r_1$ and $r_2$. This gives simplified regular expressions,
+$r_1$ and $r_2$ (Lines 3 and 4). This gives simplified regular expressions,
 $r_{1s}$ and $r_{2s}$, as well as two rectification functions
 $f_{1s}$ and $f_{2s}$. We next need to test whether the
 simplified regular expressions are $\ZERO$ so as to make
-further simplifications. In case $r_{1s}$ is $\ZERO$,
+further simplifications. In case $r_{1s}$ is $\ZERO$ (Line 5),
 then we can return $r_{2s}$ (the other alternative). However
 for this we need to build a corresponding rectification 
 function, which as said above is
@@ -465,7 +483,7 @@
 
 \begin{center}
 \begin{tabular}{l}
-$simp(r)$:\qquad\qquad (continued)\\
+$simp(r)$:\hspace{5cm} (continued)\\
 \quad case $r = r_1 \cdot r_2$\\
 \qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\
 \qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\
@@ -517,10 +535,9 @@
 $\lambda v.\,v$
 \end{center}
 
-\noindent This completes the high-level version of the
-simplification function, which is also shown again in 
-Figure~\ref{simp}. This can now be used in a \emph{lexing
-function} as follows:
+\noindent This completes the high-level version of the simplification
+function, which is shown again in Figure~\ref{simp}. The Scala code
+for the simplifaction function is in Figure~\ref{simprect}.
 
 \begin{figure}[t]
 \begin{center}
@@ -562,13 +579,25 @@
 regular expression and a rectification function.\label{simp}}
 \end{figure}
 
+\begin{figure}[p]
+\lstinputlisting{../progs/app61.scala}
+
+\caption{The Scala code for the simplification function. The
+first part defines some auxillary functions for the rectification.
+The second part give the simplification function.
+\label{simprect}}
+\end{figure}
+
+
+We are now in the position to define a \emph{lexing function} as follows:
+
 \begin{center}
 \begin{tabular}{lcl}
-$lex\,r\,[]$ & $\dn$ & if $nullable(r)$ then $mkeps(r)$\\
+$lex\,r\,[]$ & $\dn$ & if $\textit{nullable}(r)$ then $\textit{mkeps}(r)$\\
              &       & else $error$\\
 $lex\,r\,c\!::\!s$ & $\dn$ & let 
-   $(r_{simp}, f_{rect}) = simp(der(c, r))$\\
-& & $inj\,r\,c\,f_{rect}(lex\,r_{simp}\,s)$              
+   $(r_{simp}, f_{rect}) = simp(\textit{der}(c, r))$\\
+& & $\textit{inj}\,r\,c\,f_{rect}(lex\,r_{simp}\,s)$              
 \end{tabular}
 \end{center}
 
@@ -576,7 +605,7 @@
 seen in earlier lectures. In the first clause we are given an
 empty string, $[]$, and need to test wether the regular
 expression is $nullable$. If yes, we can proceed normally and
-just return the value calculated by $mkeps$. The second clause
+just return the value calculated by $\textit{mkeps}$. The second clause
 is for strings where the first character is $c$, say, and the
 rest of the string is $s$. We first build the derivative of
 $r$ with respect to $c$; simplify the resulting regular
@@ -614,8 +643,8 @@
 {../progs/app03.scala}}
 
 \noindent Since we regard records as regular expressions we
-need to extend the functions $nullable$ and $der$. Similarly
-$mkeps$ and $inj$ need to be extended. This means we also need
+need to extend the functions $\textit{nullable}$ and $\textit{der}$. Similarly
+$\textit{mkeps}$ and $\textit{inj}$ need to be extended. This means we also need
 to extend the definition of values, which in Scala looks as
 follows:
 
@@ -646,7 +675,7 @@
   $env(\Left(v))$   & $\dn$ & $env(v)$\\
   $env(Right(v))$  & $\dn$ & $env(v)$\\
   $env(Seq(v_1,v_2))$& $\dn$ & $env(v_1) \,@\, env(v_2)$\\
-  $env([v_1,\ldots ,v_n])$ & $\dn$ & 
+  $env(Stars\,[v_1,\ldots ,v_n])$ & $\dn$ & 
      $env(v_1) \,@\ldots @\, env(v_n)$\\
   $env(Rec(x:v))$ & $\dn$ & $(x:|v|) :: env(v)$\\
 \end{tabular}