afl-material: comparison handouts/ho04.tex

equal deleted inserted replaced

-:e74c696821a2
+:5deefcc8cffa
 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016}
 \section*{Handout 4 (Sulzmann \& Lu Algorithm)}
 So far our algorithm based on derivatives was only able to say
-yes or no depending on whether a string was matched by regular
+yes or no depending on whether a string was matched by a regular
 expression or not. Often a more interesting question is to
 find out \emph{how} a regular expression matched a string?
 Answering this question will also help us with the problem we
 are after, namely tokenising an input string.
-The algorithm we will be looking at was designed by Sulzmann
+The algorithm we will be looking at in this lecture was designed by Sulzmann
-\& Lu in a rather recent paper (from 2014). A link to it is
+\& Lu in a rather recent research paper (from 2014). A link to it is
 provided on KEATS, in case you are interested.\footnote{In my
 humble opinion this is an interesting instance of the research
 literature: it contains a very neat idea, but its presentation
-is rather sloppy. In earlier versions of their paper, a King's
+is rather sloppy. In earlier versions of this paper, a King's
-student and I found several rather annoying typos in their
+undergraduate student and I found several rather annoying typos in the
-examples and definitions.} In order to give an answer for
+examples and definitions.} My PhD student Fahad Ausaf and I even more recently
+wrote a paper where we build on their result: we provided a
+mathematical proof that their algorithm is really correct---the proof
+Sulzmann \& Lu had originally given contained major flaws. Such correctness
+proofs are important: Kuklewicz maintains a unit-test library
+for the kind of algorithma we are interested in here and he showed
+that many implementations in the ``wild'' are buggy, that is not
+satisfy his unit tests:
+\begin{center}
+\url{http://www.haskell.org/haskellwiki/Regex_Posix}
+\end{center}
+In order to give an answer for
 \emph{how} a regular expression matches a string, Sulzmann and
 Lu use \emph{values}. A value will be the output of the
 algorithm whenever the regular expression matches the string.
 If the string does not match the string, an error will be
 raised.
 &        & $Empty$   \\
 & $\mid$ & $Char(c)$          \\
 & $\mid$ & $Seq(v_1,v_2)$\\
 & $\mid$ & $\Left(v)$   \\
 & $\mid$ & $Right(v)$  \\
-& $\mid$ & $Stars [v_1,\ldots\,v_n]$ \\
+& $\mid$ & $Stars\,[v_1,\ldots\,v_n]$ \\
 \end{tabular}
 \end{tabular}
 \end{center}
 \noindent There is no value for the
 that was needed to match the string. This means we might also
 return the empty list $Stars []$, if no copy was needed in case
 of $r^*$. For sequence, there is exactly one value, composed
 of two component values ($v_1$ and $v_2$).
-My definition of regular expressions and values in Scala is
+My implementation of regular expressions and values in Scala is
 shown below. I have in my implementation the convention that
 regular expressions are written entirely with upper-case
 letters, while values just start with a single upper-case
 character and the rest are lower-case letters.
 {../progs/app02.scala}}
 Graphically the algorithm by Sulzmann \& Lu can be illustrated
 by the picture in Figure~\ref{Sulz} where the path from the
-left to the right involving $der/nullable$ is the first phase
+left to the right involving $\textit{der}/\textit{nullable}$ is the first phase
-of the algorithm and $mkeps/inj$, the path from right to left,
+of the algorithm and $\textit{mkeps}/\textit{inj}$, the path from right to left,
 the second phase. This picture shows the steps required when a
 regular expression, say $r_1$, matches the string $abc$. We
 first build the three derivatives (according to $a$, $b$ and
-$c$). We then use $nullable$ to find out whether the resulting
+$c$). We then use $\textit{nullable}$ to find out whether the resulting
 regular expression can match the empty string. If yes, we call
-the function $mkeps$.
+the function $\textit{mkeps}$. The $\textit{mkeps}$ function calculates a value for how a regular
+expression has matched the empty string. Its definition
+is as follows:
 \begin{figure}[t]
 \begin{center}
 \begin{tikzpicture}[scale=2,node distance=1.2cm,
 every node/.style={minimum size=7mm}]
 \node (r1)  {$r_1$};
 \node (r2) [right=of r1]{$r_2$};
-\draw[->,line width=1mm](r1)--(r2) node[above,midway] {$der\,a$};
+\draw[->,line width=1mm](r1)--(r2) node[above,midway] {$\textit{der}\,a$};
 \node (r3) [right=of r2]{$r_3$};
-\draw[->,line width=1mm](r2)--(r3) node[above,midway] {$der\,b$};
+\draw[->,line width=1mm](r2)--(r3) node[above,midway] {$\textit{der}\,b$};
 \node (r4) [right=of r3]{$r_4$};
-\draw[->,line width=1mm](r3)--(r4) node[above,midway] {$der\,c$};
+\draw[->,line width=1mm](r3)--(r4) node[above,midway] {$\textit{der}\,c$};
-\draw (r4) node[anchor=west] {\;\raisebox{3mm}{$nullable$}};
+\draw (r4) node[anchor=west] {\;\raisebox{3mm}{$\textit{nullable}$}};
 \node (v4) [below=of r4]{$v_4$};
 \draw[->,line width=1mm](r4) -- (v4);
 \node (v3) [left=of v4] {$v_3$};
-\draw[->,line width=1mm](v4)--(v3) node[below,midway] {$inj\,c$};
+\draw[->,line width=1mm](v4)--(v3) node[below,midway] {$\textit{inj}\,c$};
 \node (v2) [left=of v3]{$v_2$};
-\draw[->,line width=1mm](v3)--(v2) node[below,midway] {$inj\,b$};
+\draw[->,line width=1mm](v3)--(v2) node[below,midway] {$\textit{inj}\,b$};
 \node (v1) [left=of v2] {$v_1$};
-\draw[->,line width=1mm](v2)--(v1) node[below,midway] {$inj\,a$};
+\draw[->,line width=1mm](v2)--(v1) node[below,midway] {$\textit{inj}\,a$};
-\draw (r4) node[anchor=north west] {\;\raisebox{-8mm}{$mkeps$}};
+\draw (r4) node[anchor=north west] {\;\raisebox{-8mm}{$\textit{mkeps}$}};
 \end{tikzpicture}
 \end{center}
 \caption{The two phases of the algorithm by Sulzmann \& Lu.
 \label{Sulz}}
 \end{figure}
-The $mkeps$ function calculates a value for how a regular
-expression has matched the empty string. Its definition
-is as follows:
 \begin{center}
 \begin{tabular}{lcl}
-$mkeps(\ONE)$       & $\dn$ & $Empty$\\
+$\textit{mkeps}(\ONE)$       & $\dn$ & $Empty$\\
-$mkeps(r_1 + r_2)$      & $\dn$ & if $nullable(r_1)$  \\
+$\textit{mkeps}(r_1 + r_2)$      & $\dn$ & if $\textit{nullable}(r_1)$  \\
-&       & then $\Left(mkeps(r_1))$\\
+&       & then $\Left(\textit{mkeps}(r_1))$\\
-&       & else $Right(mkeps(r_2))$\\
+&       & else $Right(\textit{mkeps}(r_2))$\\
-$mkeps(r_1 \cdot r_2)$  & $\dn$ & $Seq(mkeps(r_1),mkeps(r_2))$\\
+$\textit{mkeps}(r_1 \cdot r_2)$  & $\dn$ & $Seq(\textit{mkeps}(r_1),\textit{mkeps}(r_2))$\\
-$mkeps(r^*)$            & $\dn$ & $[]$  \\
+$\textit{mkeps}(r^*)$            & $\dn$ & $Stars\,[]$  \\
 \end{tabular}
 \end{center}
-\noindent There are no cases for $\ZERO$ and $c$, since
+\noindent There is are no cases for $\ZERO$ and $c$, since
 these regular expression cannot match the empty string. Note
 also that in case of alternatives we give preference to the
 regular expression on the left-hand side. This will become
 important later on.
 The second phase of the algorithm is organised so that it will
 calculate a value for how the derivative regular expression
 has matched a string. For this we need a function that
 reverses this ``chopping off'' for values which we did in the
 first phase for derivatives. The corresponding function is
-called $inj$ for injection. This function takes three
+called $\textit{inj}$ (short for injection). This function takes three
 arguments: the first one is a regular expression for which we
 want to calculate the value, the second is the character we
 want to inject and the third argument is the value where we
 will inject the character into. The result of this function is a
-new value. The definition of $inj$ is as follows:
+new value. The definition of $\textit{inj}$ is as follows:
 \begin{center}
 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
-$inj\,(c)\,c\,Empty$            & $\dn$ & $Char\,c$\\
+$\textit{inj}\,(c)\,c\,Empty$            & $\dn$ & $Char\,c$\\
-$inj\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(inj\,r_1\,c\,v)$\\
+$\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
-$inj\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(inj\,r_2\,c\,v)$\\
+$\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\
-$inj\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$  & $Seq(inj\,r_1\,c\,v_1,v_2)$\\
+$\textit{inj}\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
-$inj\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$  & $Seq(inj\,r_1\,c\,v_1,v_2)$\\
+$\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
-$inj\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(mkeps(r_1),inj\,r_2\,c\,v)$\\
+$\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
-$inj\,(r^*)\,c\,Seq(v,vs)$         & $\dn$  & $inj\,r\,c\,v\,::\,vs$\\
+$\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$         & $\dn$  & $Stars(\textit{inj}\,r\,c\,v\,::\,vs)$\\
 \end{tabular}
 \end{center}
 \noindent This definition is by recursion on the regular
 expression and by analysing the shape of the values. Therefore
-there are, for example, three cases for sequence regular
+there are three cases for sequence regular
-expressions (for all possible shapes of the value). The last
+expressions (for all possible shapes of the value we can encounter). The last
 clause for the star regular expression returns a list where
-the first element is $inj\,r\,c\,v$ and the other elements are
+the first element is $\textit{inj}\,r\,c\,v$ and the other elements are
 $vs$. That means $\_\,::\,\_$ should be read as list cons.
 To understand what is going on, it might be best to do some
 example calculations and compare them with Figure~\ref{Sulz}.
 For this note that we have not yet dealt with the need of
 \end{tabular}
 \end{center}
 \noindent According to the simple algorithm, we would test
 whether $r_4$ is nullable, which in this case it indeed is.
-This means we can use the function $mkeps$ to calculate a
+This means we can use the function $\textit{mkeps}$ to calculate a
 value for how $r_4$ was able to match the empty string.
 Remember that this function gives preference for alternatives
 on the left-hand side. However there is only $\ONE$ on the
-very right-hand side of $r_4$ that matches the empty string.
+very right-hand side of $r_4$ (underlined)
-Therefore $mkeps$ returns the value
+\begin{center}
+$r_4$:\;$(\ZERO \cdot (b \cdot c)) +
+((\ZERO \cdot c) + \underline{\ONE})$\\
+\end{center}
+\noindent
+that matches the empty string.
+Therefore $\textit{mkeps}$ returns the value
 \begin{center}
 $v_4:\;Right(Right(Empty))$
 \end{center}
 \noindent If there had been a $\ONE$ on the left, then
-$mkeps$ would have returned something of the form
+$\textit{mkeps}$ would have returned something of the form
 $\Left(\ldots)$. The point is that from this value we can
 directly read off which part of $r_4$ matched the empty
 string: take the right-alternative first, and then the
-right-alternative again. Remember $r_4$ is of the form
+right-alternative again.
-\begin{center}
-$r_4$:\;$(\ZERO \cdot (b \cdot c)) +
-((\ZERO \cdot c) + \underline{\ONE})$\\
-\end{center}
-\noindent the value tells us that the underlined $\ONE$
-is responsible for matching the empty string.
 Next we have to ``inject'' the last character, that is $c$ in
 the running example, into this value $v_4$ in order to
 calculate how $r_3$ could have matched the string $c$.
-According to the definition of $inj$ we obtain
+According to the definition of $\textit{inj}$ we obtain
 \begin{center}
 $v_3:\;Right(Seq(Empty, Char(c)))$
 \end{center}
 $|Empty|$     & $\dn$ & $[]$\\
 $|Char(c)|$   & $\dn$ & $[c]$\\
 $|\Left(v)|$   & $\dn$ & $|v|$\\
 $|Right(v)|$  & $\dn$ & $|v|$\\
 $|Seq(v_1,v_2)|$& $\dn$ & $|v_1| \,@\, |v_2|$\\
-$|[v_1,\ldots ,v_n]|$ & $\dn$ & $|v_1| \,@\ldots @\, |v_n|$\\
+$|Stars\,[v_1,\ldots ,v_n]|$ & $\dn$ & $|v_1| \,@\ldots @\, |v_n|$\\
 \end{tabular}
 \end{center}
 \noindent Using flatten we can see what are the strings behind
-the values calculated by $mkeps$ and $inj$. In our running
+the values calculated by $\textit{mkeps}$ and $\textit{inj}$. In our running
 example:
 \begin{center}
 \begin{tabular}{ll}
 $|v_4|$: & $[]$\\
 $|v_2|$: & $bc$\\
 $|v_1|$: & $abc$
 \end{tabular}
 \end{center}
-\noindent This indicates that $inj$ indeed is injecting, or
+\noindent This indicates that $\textit{inj}$ indeed is injecting, or
-adding, back a character into the value. If we continue until
+adding, back a character into the value.
-all characters are injected back, we have a value that can
-indeed say how the string $abc$ was matched.
 There is a problem, however, with the described algorithm
 so far: it is very slow. We need to include the simplification
 from Lecture 2. This is what we shall do next.
 to read several times before it makes sense and also might
 require that you do some example calculations yourself. As a
 first example consider the last derivation step in our earlier
 example:
-\begin{center}
+\begin{equation}\label{regexfour}
-$r_4 = der\,c\,r_3 =
+r_4 = \textit{der}\,c\,r_3 =
-(\ZERO \cdot (b \cdot c)) + ((\ZERO \cdot c) + \ONE)$
+(\ZERO \cdot (b \cdot c)) + ((\ZERO \cdot c) + \ONE)
-\end{center}
+\end{equation}
 \noindent Simplifying this regular expression would just give
-us $\ONE$. Running $mkeps$ with this regular expression as
+us $\ONE$. Running $\textit{mkeps}$ with this $\ONE$ as
-input, however, would then provide us with $Empty$ instead of
+input, however, would give us with the value $Empty$ instead of
-$Right(Right(Empty))$ that was obtained without the
-simplification. The problem is we need to recreate this more
+\[Right(Right(Empty))\]
-complicated value, rather than just return $Empty$.
+\noindent
-This will require what I call \emph{rectification functions}.
+that was obtained  without the simplification. The problem  is we need
-They need to be calculated whenever a regular expression gets
+to  recreate this  more  complicated value,  rather  than just  return
-simplified. Rectification functions take a value as argument
+$Empty$. This will require what I call \emph{rectification functions}.
-and return a (rectified) value. Let us first take a look again
+They  need  to  be  calculated  whenever  a  regular  expression  gets
-at our simplification rules:
+simplified.
+Rectification functions take a value as argument and return a
+(rectified) value. In the example above the argument would be $Empty$
+and the output $Right(Right(Empty))$.  Before we define these
+rectification functions in general, let us first take a look again at
+our simplification rules:
 \begin{center}
 \begin{tabular}{l@{\hspace{2mm}}c@{\hspace{2mm}}l}
 $r \cdot \ZERO$ & $\mapsto$ & $\ZERO$\\
 $\ZERO \cdot r$ & $\mapsto$ & $\ZERO$\\
 $\ZERO + r$ & $\mapsto$ & $r$\\
 $r + r$ & $\mapsto$ & $r$\\
 \end{tabular}
 \end{center}
-\noindent Applying them to $r_4$ will require several nested
+\noindent Applying them to $r_4$ in \eqref{regexfour} will require several nested
 simplifications in order end up with just $\ONE$. However,
 it is possible to apply them in a depth-first, or inside-out,
 manner in order to calculate this simplified regular
 expression.
 Let us package this idea with rectification functions
 into a single function (still only considering the alternative
 case):
 \begin{center}
-\begin{tabular}{l}
+\begin{tabular}{ll}
-$simp(r)$:\\
+$_1$ & $simp(r)$:\\
-\quad case $r = r_1 + r_2$\\
+$_2$ & \quad case $r = r_1 + r_2$\\
-\qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\
+$_3$ & \qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\
-\qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\
+$_4$ & \qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\
-\qquad case $r_{1s} = \ZERO$:
+$_5$ & \qquad case $r_{1s} = \ZERO$:
 return $(r_{2s}, \lambda v. \,Right(f_{2s}(v)))$\\
-\qquad case $r_{2s} = \ZERO$:
+$_6$ & \qquad case $r_{2s} = \ZERO$:
 return $(r_{1s}, \lambda v. \,\Left(f_{1s}(v)))$\\
-\qquad case $r_{1s} = r_{2s}$:
+$_7$ & \qquad case $r_{1s} = r_{2s}$:
 return $(r_{1s}, \lambda v. \,\Left(f_{1s}(v)))$\\
-\qquad otherwise:
+$_8$ & \qquad otherwise:
 return $(r_{1s} + r_{2s}, f_{alt}(f_{1s}, f_{2s}))$\\
 \end{tabular}
 \end{center}
 \noindent We first recursively call the simplification with
-$r_1$ and $r_2$. This gives simplified regular expressions,
+$r_1$ and $r_2$ (Lines 3 and 4). This gives simplified regular expressions,
 $r_{1s}$ and $r_{2s}$, as well as two rectification functions
 $f_{1s}$ and $f_{2s}$. We next need to test whether the
 simplified regular expressions are $\ZERO$ so as to make
-further simplifications. In case $r_{1s}$ is $\ZERO$,
+further simplifications. In case $r_{1s}$ is $\ZERO$ (Line 5),
 then we can return $r_{2s}$ (the other alternative). However
 for this we need to build a corresponding rectification
 function, which as said above is
 \begin{center}
 case. In this case the main simplification function is as
 follows
 \begin{center}
 \begin{tabular}{l}
-$simp(r)$:\qquad\qquad (continued)\\
+$simp(r)$:\hspace{5cm} (continued)\\
 \quad case $r = r_1 \cdot r_2$\\
 \qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\
 \qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\
 \qquad case $r_{1s} = \ZERO$:
 return $(\ZERO, f_{error})$\\
 \begin{center}
 $\lambda v.\,v$
 \end{center}
-\noindent This completes the high-level version of the
+\noindent This completes the high-level version of the simplification
-simplification function, which is also shown again in
+function, which is shown again in Figure~\ref{simp}. The Scala code
-Figure~\ref{simp}. This can now be used in a \emph{lexing
+for the simplifaction function is in Figure~\ref{simprect}.
-function} as follows:
 \begin{figure}[t]
 \begin{center}
 \begin{tabular}{l}
 $simp(r)$:\\
 \end{center}
 \caption{The simplification function that returns a simplified
 regular expression and a rectification function.\label{simp}}
 \end{figure}
+\begin{figure}[p]
+\lstinputlisting{../progs/app61.scala}
+\caption{The Scala code for the simplification function. The
+first part defines some auxillary functions for the rectification.
+The second part give the simplification function.
+\label{simprect}}
+\end{figure}
+We are now in the position to define a \emph{lexing function} as follows:
 \begin{center}
 \begin{tabular}{lcl}
-$lex\,r\,[]$ & $\dn$ & if $nullable(r)$ then $mkeps(r)$\\
+$lex\,r\,[]$ & $\dn$ & if $\textit{nullable}(r)$ then $\textit{mkeps}(r)$\\
 &       & else $error$\\
 $lex\,r\,c\!::\!s$ & $\dn$ & let
-$(r_{simp}, f_{rect}) = simp(der(c, r))$\\
+$(r_{simp}, f_{rect}) = simp(\textit{der}(c, r))$\\
-& & $inj\,r\,c\,f_{rect}(lex\,r_{simp}\,s)$
+& & $\textit{inj}\,r\,c\,f_{rect}(lex\,r_{simp}\,s)$
 \end{tabular}
 \end{center}
 \noindent This corresponds to the $matches$ function we have
 seen in earlier lectures. In the first clause we are given an
 empty string, $[]$, and need to test wether the regular
 expression is $nullable$. If yes, we can proceed normally and
-just return the value calculated by $mkeps$. The second clause
+just return the value calculated by $\textit{mkeps}$. The second clause
 is for strings where the first character is $c$, say, and the
 rest of the string is $s$. We first build the derivative of
 $r$ with respect to $c$; simplify the resulting regular
 expression. We continue lexing with the simplified regular
 expression and the string $s$. Whatever will be returned as
 {\small\lstinputlisting[language=Scala]
 {../progs/app03.scala}}
 \noindent Since we regard records as regular expressions we
-need to extend the functions $nullable$ and $der$. Similarly
+need to extend the functions $\textit{nullable}$ and $\textit{der}$. Similarly
-$mkeps$ and $inj$ need to be extended. This means we also need
+$\textit{mkeps}$ and $\textit{inj}$ need to be extended. This means we also need
 to extend the definition of values, which in Scala looks as
 follows:
 {\small\lstinputlisting[language=Scala]
 {../progs/app04.scala}}
 $env(Empty)$     & $\dn$ & $[]$\\
 $env(Char(c))$   & $\dn$ & $[]$\\
 $env(\Left(v))$   & $\dn$ & $env(v)$\\
 $env(Right(v))$  & $\dn$ & $env(v)$\\
 $env(Seq(v_1,v_2))$& $\dn$ & $env(v_1) \,@\, env(v_2)$\\
-$env([v_1,\ldots ,v_n])$ & $\dn$ &
+$env(Stars\,[v_1,\ldots ,v_n])$ & $\dn$ &
 $env(v_1) \,@\ldots @\, env(v_n)$\\
 $env(Rec(x:v))$ & $\dn$ & $(x:|v|) :: env(v)$\\
 \end{tabular}
 \end{center}

changeset 422	5deefcc8cffa
parent 417	e74c696821a2
child 444	3056a4c071b0