afl-material: comparison handouts/ho04.tex

equal deleted inserted replaced

-:9ce78065f68d
+:2f5a4d76756d
+% !TEX program = xelatex
 \documentclass{article}
 \usepackage{../style}
 \usepackage{../langs}
 \usepackage{../graphics}
 \begin{document}
-\fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017}
+\fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017, 2019}
 \section*{Handout 4 (Sulzmann \& Lu Algorithm)}
 So far our algorithm based on derivatives was only able to say
 yes or no depending on whether a string was matched by a regular
 provided on KEATS, in case you are interested.\footnote{In my
 humble opinion this is an interesting instance of the research
 literature: it contains a very neat idea, but its presentation
 is rather sloppy. In earlier versions of this paper, a King's
 undergraduate student and I found several rather annoying typos in the
-examples and definitions.} My PhD student Fahad Ausaf and I even more recently
+examples and definitions.} My former PhD student Fahad Ausaf and I even more recently
 wrote a paper where we build on their result: we provided a
 mathematical proof that their algorithm is really correct---the proof
 Sulzmann \& Lu had originally given contained major flaws. Such correctness
 proofs are important: Kuklewicz maintains a unit-test library
 for the kind of algorithms we are interested in here and he showed
 \begin{center}
 \url{http://www.haskell.org/haskellwiki/Regex_Posix}
 \end{center}
-In order to give an answer for
+Coming back to the algorithm by Sulzmann \& Lu, their idea is to
-\emph{how} a regular expression matches a string, Sulzmann and
+introduce \emph{values} for producing an answer for \emph{how} a regular
-Lu use \emph{values}. A value will be the output of the
+expression matches a string. So rather than a boolean like in the
-algorithm whenever the regular expression matches the string.
+Brzozowski algorithm, a value will be the output of the Sulzman \& Lu
-If the string does not match the string, an error will be
+algorithm whenever the regular expression matches the string. If the
-raised.
+string does not match the string, an error will be raised. The
-The definition for values is given below. They are shown
+definition for values is given below. They are shown together with the
-together with the regular expressions $r$ to which
+regular expressions $r$ to which they correspond:
-they correspond:
 \begin{center}
 \begin{tabular}{cc}
 \begin{tabular}{@{}rrl@{}}
 \multicolumn{3}{c}{regular expressions}\medskip\\
 each regular expression with the exception of $r_1 + r_2$
 where there are two values, namely $\Left(v)$ and $Right(v)$
 corresponding to the two alternatives. Note that $r^*$ is
 associated with a list of values, one for each copy of $r$
 that was needed to match the string. This means we might also
-return the empty list $Stars []$, if no copy was needed
+return the empty list $Stars\,[]$, if no copy was needed
 for $r^*$. For sequence, there is exactly one value, composed
 of two component values ($v_1$ and $v_2$).
 My implementation of regular expressions and values in Scala is
 shown below. I have in my implementation the convention that
 {../progs/app02.scala}}
 Graphically the algorithm by Sulzmann \& Lu can be illustrated
 by the picture in Figure~\ref{Sulz} where the path from the
-left to the right involving $\textit{der}/\textit{nullable}$ is the first phase
+left to the right involving $\der/\nullable$ is the first phase
-of the algorithm and $\textit{mkeps}/\textit{inj}$, the path from right to left,
+of the algorithm and $\textit{mkeps}/\inj$, the path from right to left,
 the second phase. This picture shows the steps required when a
 regular expression, say $r_1$, matches the string $abc$. We
 first build the three derivatives (according to $a$, $b$ and
-$c$). We then use $\textit{nullable}$ to find out whether the resulting
+$c$). We then use $\nullable$ to find out whether the resulting
 regular expression can match the empty string. If yes, we call
 the function $\textit{mkeps}$. The $\textit{mkeps}$ function calculates a value for how a regular
 expression has matched the empty string. Its definition
 is as follows:
 \begin{center}
 \begin{tikzpicture}[scale=2,node distance=1.2cm,
 every node/.style={minimum size=7mm}]
 \node (r1)  {$r_1$};
 \node (r2) [right=of r1]{$r_2$};
-\draw[->,line width=1mm](r1)--(r2) node[above,midway] {$\textit{der}\,a$};
+\draw[->,line width=1mm](r1)--(r2) node[above,midway] {$\der\,a$};
 \node (r3) [right=of r2]{$r_3$};
-\draw[->,line width=1mm](r2)--(r3) node[above,midway] {$\textit{der}\,b$};
+\draw[->,line width=1mm](r2)--(r3) node[above,midway] {$\der\,b$};
 \node (r4) [right=of r3]{$r_4$};
-\draw[->,line width=1mm](r3)--(r4) node[above,midway] {$\textit{der}\,c$};
+\draw[->,line width=1mm](r3)--(r4) node[above,midway] {$\der\,c$};
-\draw (r4) node[anchor=west] {\;\raisebox{3mm}{$\textit{nullable}$}};
+\draw (r4) node[anchor=west] {\;\raisebox{3mm}{$\nullable$}};
 \node (v4) [below=of r4]{$v_4$};
 \draw[->,line width=1mm](r4) -- (v4);
 \node (v3) [left=of v4] {$v_3$};
-\draw[->,line width=1mm](v4)--(v3) node[below,midway] {$\textit{inj}\,c$};
+\draw[->,line width=1mm](v4)--(v3) node[below,midway] {$\inj\,c$};
 \node (v2) [left=of v3]{$v_2$};
-\draw[->,line width=1mm](v3)--(v2) node[below,midway] {$\textit{inj}\,b$};
+\draw[->,line width=1mm](v3)--(v2) node[below,midway] {$\inj\,b$};
 \node (v1) [left=of v2] {$v_1$};
-\draw[->,line width=1mm](v2)--(v1) node[below,midway] {$\textit{inj}\,a$};
+\draw[->,line width=1mm](v2)--(v1) node[below,midway] {$\inj\,a$};
 \draw (r4) node[anchor=north west] {\;\raisebox{-8mm}{$\textit{mkeps}$}};
 \end{tikzpicture}
 \end{center}
 \caption{The two phases of the algorithm by Sulzmann \& Lu.
 \label{Sulz}}
 \begin{center}
 \begin{tabular}{lcl}
 $\textit{mkeps}(\ONE)$       & $\dn$ & $Empty$\\
-$\textit{mkeps}(r_1 + r_2)$      & $\dn$ & if $\textit{nullable}(r_1)$  \\
+$\textit{mkeps}(r_1 + r_2)$      & $\dn$ & if $\nullable(r_1)$  \\
 &       & then $\Left(\textit{mkeps}(r_1))$\\
 &       & else $Right(\textit{mkeps}(r_2))$\\
 $\textit{mkeps}(r_1 \cdot r_2)$  & $\dn$ & $Seq(\textit{mkeps}(r_1),\textit{mkeps}(r_2))$\\
 $\textit{mkeps}(r^*)$            & $\dn$ & $Stars\,[]$  \\
 \end{tabular}
 The second phase of the algorithm is organised so that it will
 calculate a value for how the derivative regular expression
 has matched a string. For this we need a function that
 reverses this ``chopping off'' for values which we did in the
 first phase for derivatives. The corresponding function is
-called $\textit{inj}$ (short for injection). This function takes three
+called $\inj$ (short for injection). This function takes three
 arguments: the first one is a regular expression for which we
 want to calculate the value, the second is the character we
 want to inject and the third argument is the value where we
 will inject the character into. The result of this function is a
-new value. The definition of $\textit{inj}$ is as follows:
+new value. The definition of $\inj$ is as follows:
 \begin{center}
 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
-$\textit{inj}\,(c)\,c\,Empty$            & $\dn$ & $Char\,c$\\
+$\inj\,(c)\,c\,Empty$            & $\dn$ & $Char\,c$\\
-$\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
+$\inj\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\inj\,r_1\,c\,v)$\\
-$\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\
+$\inj\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\inj\,r_2\,c\,v)$\\
-$\textit{inj}\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
+$\inj\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$  & $Seq(\inj\,r_1\,c\,v_1,v_2)$\\
-$\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
+$\inj\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$  & $Seq(\inj\,r_1\,c\,v_1,v_2)$\\
-$\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
+$\inj\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(\textit{mkeps}(r_1),\inj\,r_2\,c\,v)$\\
-$\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$         & $\dn$  & $Stars(\textit{inj}\,r\,c\,v\,::\,vs)$\\
+$\inj\,(r^*)\,c\,Seq(v,Stars\,vs)$         & $\dn$  & $Stars(\inj\,r\,c\,v\,::\,vs)$\\
 \end{tabular}
 \end{center}
 \noindent This definition is by recursion on the regular
 expression and by analysing the shape of the values. Therefore
 there are three cases for sequence regular
 expressions (for all possible shapes of the value we can encounter). The last
 clause for the star regular expression returns a list where
-the first element is $\textit{inj}\,r\,c\,v$ and the other elements are
+the first element is $\inj\,r\,c\,v$ and the other elements are
 $vs$. That means $\_\,::\,\_$ should be read as list cons.
 To understand what is going on, it might be best to do some
 example calculations and compare them with Figure~\ref{Sulz}.
 For this note that we have not yet dealt with the need of
 \noindent If there had been a $\ONE$ on the left, then
 $\textit{mkeps}$ would have returned something of the form
 $\Left(\ldots)$. The point is that from this value we can
 directly read off which part of $r_4$ matched the empty
 string: take the right-alternative first, and then the
-right-alternative again.
+right-alternative again, then you got to the $\ONE$.
 Next we have to ``inject'' the last character, that is $c$ in
 the running example, into this value $v_4$ in order to
 calculate how $r_3$ could have matched the string $c$.
-For this we call injection with $\textit{inj}(r_3, c, v_4)$.
+For this we call injection with $\inj(r_3, c, v_4)$.
-According to the definition of $\textit{inj}$ we obtain
+According to the definition of $\inj$ we obtain
 \begin{center}
 $v_3:\;Right(Seq(Empty, Char(c)))$
 \end{center}
 \noindent This is the correct result, because $r_3$ needs
 to use the right-hand alternative, and then $\ONE$ needs
 to match the empty string and $c$ needs to match $c$.
 Next we need to inject back the letter $b$ into $v_3$.
-For this we call injection with $\textit{inj}(r_2, b, v_3)$.
+For this we call injection with $\inj(r_2, b, v_3)$.
 This gives
 \begin{center}
 $v_2:\;Seq(Empty, Seq(Char(b), Char(c)))$
 \end{center}
 \noindent which is again the correct result for how $r_2$
 matched the string $bc$. Finally we need to inject back the
 letter $a$ into $v_2$ giving the final result.
-For this we call injection with $\textit{inj}(r_1, a, v_2)$
+For this we call injection with $\inj(r_1, a, v_2)$
 and obtain
 \begin{center}
 $v_1:\;Seq(Char(a), Seq(Char(b), Char(c)))$
 \end{center}
 \noindent This value corresponds to how the regular expression $r_1$,
 namely $a \cdot (b \cdot c)$, matched the string $abc$.
 There are a few auxiliary functions that are of interest
 when analysing this algorithm. One is called \emph{flatten},
 written $|\_|$, which extracts the string ``underlying'' a
 value. It is defined recursively as
 $|Stars\,[v_1,\ldots ,v_n]|$ & $\dn$ & $|v_1| \,@\ldots @\, |v_n|$\\
 \end{tabular}
 \end{center}
 \noindent Using flatten we can see what are the strings behind
-the values calculated by $\textit{mkeps}$ and $\textit{inj}$. In our running
+the values calculated by $\textit{mkeps}$ and $\inj$. In our running
 example:
 \begin{center}
 \begin{tabular}{ll}
 $|v_4|$: & $[]$\\
 $|v_2|$: & $bc$\\
 $|v_1|$: & $abc$
 \end{tabular}
 \end{center}
-\noindent This indicates that $\textit{inj}$ indeed is injecting, or
+\noindent This indicates that $\inj$ indeed is injecting, or
 adding, back a character into the value.
-There is a problem, however, with the described algorithm
+The definition of $\inj$ might still be very puzzling and each clause
-so far: it is very slow. We need to include the simplification
+might look arbitrary, but there is in fact a relatively simple idea
-from Lecture 2. This is what we shall do next.
+behind it. Ultimately the $\inj$-functions is determined by the
+derivative functions. For this consider one of the ``squares'' from
+Figure~\ref{Sulz}:
+\begin{center}
+\begin{tikzpicture}[scale=2,node distance=1.2cm,
+every node/.style={minimum size=7mm}]
+\node (r)  {$r$};
+\node (rd) [right=of r]{$r_{der}$};
+\draw[->,line width=1mm](r)--(rd) node[above,midway] {$\der\,c$};
+\node (vd) [below=of r2]{$v_{der}$};
+\draw[->,line width=1mm](rd) -- (vd);
+\node (v) [left=of vd] {$v$};
+\draw[->,line width=1mm](vd)--(v) node[below,midway] {$\inj\,c$};
+\draw[->,line width=0.5mm,dotted](r) -- (v) node[left,midway,red] {\bf ?};
+\end{tikzpicture}
+\end{center}
+\noindent
+The input to the $\inj$-function is $r$ (on the top left), $c$ (the
+character to be injected) and $v_{der}$ (bottom right). The output is
+the value $v$ for how the regular expression $r$ matched the
+corresponding string. How does $\inj$ calculate this value? Well, it has
+to analyse the value $v_{der}$ and transform it into the value $v$ for
+the regular expression $r$. The value $v_{der}$ corresponds to the
+$r_{der}$-regular expression which is the derivative of $r$. Remember
+that $v_{der}$ is the value for how $r_{der}$ matches the corresponding string
+where $c$ has been chopped off.
+Let $r$ be $r_1 + r_2$. Then $r_{der}$
+is of the form $(\der\,c\,r_1) + (\der\,c\,r_2)$. What are the possible
+values corresponding to $r_{der}$? Well, they can be only of the form
+$\Left(\ldots)$ and $\Right(\ldots)$. Therefore you have two
+cases in the $\inj$ function -- one for $\Left$ and one for $\Right$.
+\begin{center}
+\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{2mm}}l}
+$\inj\,(r_1 + r_2)\,c\,\,\Left(v)$ & $\dn$ & $\ldots$\\
+$\inj\,(r_1 + r_2)\,c\,\,\Right(v)$ & $\dn$ & $\ldots$\\
+\end{tabular}
+\end{center}
+\noindent
+Let's look next at the sequence case where $r = r_1 \cdot r_2$. What does the
+derivative of $r$ look like? According to the definition it is:
+\begin{center}
+\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
+$\der\, c\, (r_1 \cdot r_2)$  & $\dn$  &  if $\nullable (r_1)$\\
+& & then $(\der\,c\,r_1) \cdot r_2 + \der\, c\, r_2$\\
+& & else $(\der\, c\, r_1) \cdot r_2$\\
+\end{tabular}
+\end{center}
+\noindent As you can see there is a derivative in the if-branch and another
+in the else-branch. In the if-branch we have an alternative of the form
+$\_ + \_$. We already know what the possible values are for such a regular
+expression, namely $\Left$ or $\Right$. Therefore we have in $\inj$ the
+two cases:
+\begin{center}
+\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{2mm}}l}
+$\inj\,(r_1 \cdot r_2)\,c\,\,\Left(Seq(v_1,v_2))$ & $\dn$  & $\ldots$\\
+$\inj\,(r_1 \cdot r_2)\,c\,\,Right(v)$ & $\dn$  & $\ldots$\\
+\end{tabular}
+\end{center}
+\noindent
+In the first case we even know that it is not just a $\Left$-value, but
+$\Left(Seq(\ldots))$, because the corresponding regular expression
+in the derivation is \mbox{$(\der\,c\,r_1) \cdot r_2$}. Hence we know
+it must be a $Seq$-value enclosed inside $\Left$.
+The third clause for $r_1\cdot r_2$ in the $\inj$-function
+\begin{center}
+\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
+$\inj\,(r_1 \cdot r_2)\,c\,\,Seq(v_1,v_2)$ & $\dn$  & $\ldots$\\
+\end{tabular}
+\end{center}
+\noindent corresponds to the else-branch in the derivative. In this
+case we know the derivative is of the form $(\der\,c\,r_1) \cdot r_2$ and
+therefore the value must be of the form $Seq(\ldots)$.
+Hopefully the $inj$-function makes now more sense. I let you explain
+the $r^*$ case. What do the derivative of $r^*$  and
+the corresponding value look like? Does this explain the shape of
+the clause?
+\begin{center}
+\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
+$\inj\,(r^*)\,c\,\,Seq(v,Stars\,vs)$ & $\dn$  & $Stars(\inj\,r\,c\,v\,::\,vs)$\\
+\end{tabular}
+\end{center}
+\noindent If yes, you made sense of the left-hand sides of
+the $\inj$-definition.
+How about the right-hand sides? Well, in the
+$r^*$ case for example we have according to the square in the picture
+above a value $v_{der}$ which says how the derivative $r_{der}$
+matched the string. Since the derivative is of the form
+$(\der\,c\,r) \cdot (r^*)$ the corresponding value is of the
+form $Seq(v, Stars\,vs)$. But for $r^*$ we are looking for a value
+for the original (top left) regular expression --- so it cannot
+be a value of the shape $Seq(\ldots, Stars\ldots)$ because that is the
+shape for sequence regular expressions. What we need is a value
+of the form $Stars \ldots$ (remember the correspondence between
+values and regular expressions). This suggests to define the right hand
+side as
+\begin{center}
+$\ldots \dn Stars(\inj\,r\,c\,v\,::\,vs)$
+\end{center}
+\noindent It is a value of the right shape, namely $Stars$. It injected
+$c$ into the first-value, which is in fact the value where we need to
+undo the derivative. Remember again the shape of the derivative of $r^*$.
+In place where we chopped off the $c$, we now need to do the $\inj$ of $c$.
+Therefore $\inj\,r\,c\,v$ in the definition above. That is the same with
+the other clauses in $\inj$.
+Phew\ldotsSweat\ldots Unfortunately, there is one more problem with the
+described algorithm so far: it is very slow. We need to include in all
+this the simplification from Lecture 2. This is what we shall do next.
 \subsubsection*{Simplification}
 Generally the matching algorithms based on derivatives do
 require that you do some example calculations yourself. As a
 first example consider the last derivation step in our earlier
 example:
 \begin{equation}\label{regexfour}
-r_4 = \textit{der}\,c\,r_3 =
+r_4 = \der\,c\,r_3 =
 (\ZERO \cdot (b \cdot c)) + ((\ZERO \cdot c) + \ONE)
 \end{equation}
 \noindent Simplifying this regular expression would just give
 us $\ONE$. Running $\textit{mkeps}$ with this $\ONE$ as
 We are now in the position to define a \emph{lexing function} as follows:
 \begin{center}
 \begin{tabular}{lcl}
-$lex\,r\,[]$ & $\dn$ & if $\textit{nullable}(r)$ then $\textit{mkeps}(r)$\\
+$lex\,r\,[]$ & $\dn$ & if $\nullable(r)$ then $\textit{mkeps}(r)$\\
 &       & else $error$\\
 $lex\,r\,c\!::\!s$ & $\dn$ & let
-$(r_{simp}, f_{rect}) = simp(\textit{der}(c, r))$\\
+$(r_{simp}, f_{rect}) = simp(\der(c, r))$\\
-& & $\textit{inj}\,r\,c\,f_{rect}(lex\,r_{simp}\,s)$
+& & $\inj\,r\,c\,f_{rect}(lex\,r_{simp}\,s)$
 \end{tabular}
 \end{center}
 \noindent This corresponds to the $matches$ function we have
 seen in earlier lectures. In the first clause we are given an
 {\small\lstinputlisting[language=Scala, numbers=none,linebackgroundcolor=
 {\ifodd\value{lstnumber}\color{capri!3}\fi}]
 {../progs/app03.scala}}
 \noindent Since we regard records as regular expressions we
-need to extend the functions $\textit{nullable}$ and $\textit{der}$. Similarly
+need to extend the functions $\nullable$ and $\der$. Similarly
-$\textit{mkeps}$ and $\textit{inj}$ need to be extended. This means we also need
+$\textit{mkeps}$ and $\inj$ need to be extended. This means we also need
 to extend the definition of values, which in Scala looks as
 follows:
 {\small\lstinputlisting[language=Scala, numbers=none,linebackgroundcolor=
 {\ifodd\value{lstnumber}\color{capri!3}\fi}]

changeset 669	2f5a4d76756d
parent 643	08375ca3874e
child 671	83e38043ed78