--- a/handouts/ho04.tex Fri Oct 25 17:23:23 2019 +0100
+++ b/handouts/ho04.tex Sun Oct 27 11:16:09 2019 +0000
@@ -1,3 +1,4 @@
+% !TEX program = xelatex
\documentclass{article}
\usepackage{../style}
\usepackage{../langs}
@@ -5,7 +6,7 @@
\begin{document}
-\fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017}
+\fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017, 2019}
\section*{Handout 4 (Sulzmann \& Lu Algorithm)}
@@ -23,7 +24,7 @@
literature: it contains a very neat idea, but its presentation
is rather sloppy. In earlier versions of this paper, a King's
undergraduate student and I found several rather annoying typos in the
-examples and definitions.} My PhD student Fahad Ausaf and I even more recently
+examples and definitions.} My former PhD student Fahad Ausaf and I even more recently
wrote a paper where we build on their result: we provided a
mathematical proof that their algorithm is really correct---the proof
Sulzmann \& Lu had originally given contained major flaws. Such correctness
@@ -37,15 +38,14 @@
\end{center}
-In order to give an answer for
-\emph{how} a regular expression matches a string, Sulzmann and
-Lu use \emph{values}. A value will be the output of the
-algorithm whenever the regular expression matches the string.
-If the string does not match the string, an error will be
-raised.
-The definition for values is given below. They are shown
-together with the regular expressions $r$ to which
-they correspond:
+Coming back to the algorithm by Sulzmann \& Lu, their idea is to
+introduce \emph{values} for producing an answer for \emph{how} a regular
+expression matches a string. So rather than a boolean like in the
+Brzozowski algorithm, a value will be the output of the Sulzman \& Lu
+algorithm whenever the regular expression matches the string. If the
+string does not match the string, an error will be raised. The
+definition for values is given below. They are shown together with the
+regular expressions $r$ to which they correspond:
\begin{center}
\begin{tabular}{cc}
@@ -81,7 +81,7 @@
corresponding to the two alternatives. Note that $r^*$ is
associated with a list of values, one for each copy of $r$
that was needed to match the string. This means we might also
-return the empty list $Stars []$, if no copy was needed
+return the empty list $Stars\,[]$, if no copy was needed
for $r^*$. For sequence, there is exactly one value, composed
of two component values ($v_1$ and $v_2$).
@@ -103,12 +103,12 @@
Graphically the algorithm by Sulzmann \& Lu can be illustrated
by the picture in Figure~\ref{Sulz} where the path from the
-left to the right involving $\textit{der}/\textit{nullable}$ is the first phase
-of the algorithm and $\textit{mkeps}/\textit{inj}$, the path from right to left,
+left to the right involving $\der/\nullable$ is the first phase
+of the algorithm and $\textit{mkeps}/\inj$, the path from right to left,
the second phase. This picture shows the steps required when a
regular expression, say $r_1$, matches the string $abc$. We
first build the three derivatives (according to $a$, $b$ and
-$c$). We then use $\textit{nullable}$ to find out whether the resulting
+$c$). We then use $\nullable$ to find out whether the resulting
regular expression can match the empty string. If yes, we call
the function $\textit{mkeps}$. The $\textit{mkeps}$ function calculates a value for how a regular
expression has matched the empty string. Its definition
@@ -120,20 +120,20 @@
every node/.style={minimum size=7mm}]
\node (r1) {$r_1$};
\node (r2) [right=of r1]{$r_2$};
-\draw[->,line width=1mm](r1)--(r2) node[above,midway] {$\textit{der}\,a$};
+\draw[->,line width=1mm](r1)--(r2) node[above,midway] {$\der\,a$};
\node (r3) [right=of r2]{$r_3$};
-\draw[->,line width=1mm](r2)--(r3) node[above,midway] {$\textit{der}\,b$};
+\draw[->,line width=1mm](r2)--(r3) node[above,midway] {$\der\,b$};
\node (r4) [right=of r3]{$r_4$};
-\draw[->,line width=1mm](r3)--(r4) node[above,midway] {$\textit{der}\,c$};
-\draw (r4) node[anchor=west] {\;\raisebox{3mm}{$\textit{nullable}$}};
+\draw[->,line width=1mm](r3)--(r4) node[above,midway] {$\der\,c$};
+\draw (r4) node[anchor=west] {\;\raisebox{3mm}{$\nullable$}};
\node (v4) [below=of r4]{$v_4$};
\draw[->,line width=1mm](r4) -- (v4);
\node (v3) [left=of v4] {$v_3$};
-\draw[->,line width=1mm](v4)--(v3) node[below,midway] {$\textit{inj}\,c$};
+\draw[->,line width=1mm](v4)--(v3) node[below,midway] {$\inj\,c$};
\node (v2) [left=of v3]{$v_2$};
-\draw[->,line width=1mm](v3)--(v2) node[below,midway] {$\textit{inj}\,b$};
+\draw[->,line width=1mm](v3)--(v2) node[below,midway] {$\inj\,b$};
\node (v1) [left=of v2] {$v_1$};
-\draw[->,line width=1mm](v2)--(v1) node[below,midway] {$\textit{inj}\,a$};
+\draw[->,line width=1mm](v2)--(v1) node[below,midway] {$\inj\,a$};
\draw (r4) node[anchor=north west] {\;\raisebox{-8mm}{$\textit{mkeps}$}};
\end{tikzpicture}
\end{center}
@@ -146,7 +146,7 @@
\begin{center}
\begin{tabular}{lcl}
$\textit{mkeps}(\ONE)$ & $\dn$ & $Empty$\\
- $\textit{mkeps}(r_1 + r_2)$ & $\dn$ & if $\textit{nullable}(r_1)$ \\
+ $\textit{mkeps}(r_1 + r_2)$ & $\dn$ & if $\nullable(r_1)$ \\
& & then $\Left(\textit{mkeps}(r_1))$\\
& & else $Right(\textit{mkeps}(r_2))$\\
$\textit{mkeps}(r_1 \cdot r_2)$ & $\dn$ & $Seq(\textit{mkeps}(r_1),\textit{mkeps}(r_2))$\\
@@ -165,22 +165,22 @@
has matched a string. For this we need a function that
reverses this ``chopping off'' for values which we did in the
first phase for derivatives. The corresponding function is
-called $\textit{inj}$ (short for injection). This function takes three
+called $\inj$ (short for injection). This function takes three
arguments: the first one is a regular expression for which we
want to calculate the value, the second is the character we
want to inject and the third argument is the value where we
will inject the character into. The result of this function is a
-new value. The definition of $\textit{inj}$ is as follows:
+new value. The definition of $\inj$ is as follows:
\begin{center}
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
- $\textit{inj}\,(c)\,c\,Empty$ & $\dn$ & $Char\,c$\\
- $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
- $\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\
- $\textit{inj}\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$ & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
- $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$ & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
- $\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$ & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
- $\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$ & $\dn$ & $Stars(\textit{inj}\,r\,c\,v\,::\,vs)$\\
+ $\inj\,(c)\,c\,Empty$ & $\dn$ & $Char\,c$\\
+ $\inj\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\inj\,r_1\,c\,v)$\\
+ $\inj\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\inj\,r_2\,c\,v)$\\
+ $\inj\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$ & $Seq(\inj\,r_1\,c\,v_1,v_2)$\\
+ $\inj\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$ & $Seq(\inj\,r_1\,c\,v_1,v_2)$\\
+ $\inj\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$ & $Seq(\textit{mkeps}(r_1),\inj\,r_2\,c\,v)$\\
+ $\inj\,(r^*)\,c\,Seq(v,Stars\,vs)$ & $\dn$ & $Stars(\inj\,r\,c\,v\,::\,vs)$\\
\end{tabular}
\end{center}
@@ -189,7 +189,7 @@
there are three cases for sequence regular
expressions (for all possible shapes of the value we can encounter). The last
clause for the star regular expression returns a list where
-the first element is $\textit{inj}\,r\,c\,v$ and the other elements are
+the first element is $\inj\,r\,c\,v$ and the other elements are
$vs$. That means $\_\,::\,\_$ should be read as list cons.
To understand what is going on, it might be best to do some
@@ -235,13 +235,13 @@
$\Left(\ldots)$. The point is that from this value we can
directly read off which part of $r_4$ matched the empty
string: take the right-alternative first, and then the
-right-alternative again.
+right-alternative again, then you got to the $\ONE$.
Next we have to ``inject'' the last character, that is $c$ in
the running example, into this value $v_4$ in order to
calculate how $r_3$ could have matched the string $c$.
-For this we call injection with $\textit{inj}(r_3, c, v_4)$.
-According to the definition of $\textit{inj}$ we obtain
+For this we call injection with $\inj(r_3, c, v_4)$.
+According to the definition of $\inj$ we obtain
\begin{center}
$v_3:\;Right(Seq(Empty, Char(c)))$
@@ -251,7 +251,7 @@
to use the right-hand alternative, and then $\ONE$ needs
to match the empty string and $c$ needs to match $c$.
Next we need to inject back the letter $b$ into $v_3$.
-For this we call injection with $\textit{inj}(r_2, b, v_3)$.
+For this we call injection with $\inj(r_2, b, v_3)$.
This gives
\begin{center}
@@ -261,7 +261,7 @@
\noindent which is again the correct result for how $r_2$
matched the string $bc$. Finally we need to inject back the
letter $a$ into $v_2$ giving the final result.
-For this we call injection with $\textit{inj}(r_1, a, v_2)$
+For this we call injection with $\inj(r_1, a, v_2)$
and obtain
\begin{center}
@@ -271,6 +271,7 @@
\noindent This value corresponds to how the regular expression $r_1$,
namely $a \cdot (b \cdot c)$, matched the string $abc$.
+
There are a few auxiliary functions that are of interest
when analysing this algorithm. One is called \emph{flatten},
written $|\_|$, which extracts the string ``underlying'' a
@@ -288,7 +289,7 @@
\end{center}
\noindent Using flatten we can see what are the strings behind
-the values calculated by $\textit{mkeps}$ and $\textit{inj}$. In our running
+the values calculated by $\textit{mkeps}$ and $\inj$. In our running
example:
\begin{center}
@@ -300,12 +301,138 @@
\end{tabular}
\end{center}
-\noindent This indicates that $\textit{inj}$ indeed is injecting, or
+\noindent This indicates that $\inj$ indeed is injecting, or
adding, back a character into the value.
-There is a problem, however, with the described algorithm
-so far: it is very slow. We need to include the simplification
-from Lecture 2. This is what we shall do next.
+The definition of $\inj$ might still be very puzzling and each clause
+might look arbitrary, but there is in fact a relatively simple idea
+behind it. Ultimately the $\inj$-functions is determined by the
+derivative functions. For this consider one of the ``squares'' from
+Figure~\ref{Sulz}:
+
+
+ \begin{center}
+ \begin{tikzpicture}[scale=2,node distance=1.2cm,
+ every node/.style={minimum size=7mm}]
+ \node (r) {$r$};
+ \node (rd) [right=of r]{$r_{der}$};
+ \draw[->,line width=1mm](r)--(rd) node[above,midway] {$\der\,c$};
+ \node (vd) [below=of r2]{$v_{der}$};
+ \draw[->,line width=1mm](rd) -- (vd);
+ \node (v) [left=of vd] {$v$};
+ \draw[->,line width=1mm](vd)--(v) node[below,midway] {$\inj\,c$};
+ \draw[->,line width=0.5mm,dotted](r) -- (v) node[left,midway,red] {\bf ?};
+ \end{tikzpicture}
+ \end{center}
+
+\noindent
+The input to the $\inj$-function is $r$ (on the top left), $c$ (the
+character to be injected) and $v_{der}$ (bottom right). The output is
+the value $v$ for how the regular expression $r$ matched the
+corresponding string. How does $\inj$ calculate this value? Well, it has
+to analyse the value $v_{der}$ and transform it into the value $v$ for
+the regular expression $r$. The value $v_{der}$ corresponds to the
+$r_{der}$-regular expression which is the derivative of $r$. Remember
+that $v_{der}$ is the value for how $r_{der}$ matches the corresponding string
+where $c$ has been chopped off.
+
+Let $r$ be $r_1 + r_2$. Then $r_{der}$
+is of the form $(\der\,c\,r_1) + (\der\,c\,r_2)$. What are the possible
+values corresponding to $r_{der}$? Well, they can be only of the form
+$\Left(\ldots)$ and $\Right(\ldots)$. Therefore you have two
+cases in the $\inj$ function -- one for $\Left$ and one for $\Right$.
+
+\begin{center}
+\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{2mm}}l}
+ $\inj\,(r_1 + r_2)\,c\,\,\Left(v)$ & $\dn$ & $\ldots$\\
+ $\inj\,(r_1 + r_2)\,c\,\,\Right(v)$ & $\dn$ & $\ldots$\\
+\end{tabular}
+\end{center}
+
+\noindent
+Let's look next at the sequence case where $r = r_1 \cdot r_2$. What does the
+derivative of $r$ look like? According to the definition it is:
+
+\begin{center}
+ \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
+ $\der\, c\, (r_1 \cdot r_2)$ & $\dn$ & if $\nullable (r_1)$\\
+ & & then $(\der\,c\,r_1) \cdot r_2 + \der\, c\, r_2$\\
+ & & else $(\der\, c\, r_1) \cdot r_2$\\
+ \end{tabular}
+\end{center}
+
+\noindent As you can see there is a derivative in the if-branch and another
+in the else-branch. In the if-branch we have an alternative of the form
+$\_ + \_$. We already know what the possible values are for such a regular
+expression, namely $\Left$ or $\Right$. Therefore we have in $\inj$ the
+two cases:
+
+\begin{center}
+ \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{2mm}}l}
+ $\inj\,(r_1 \cdot r_2)\,c\,\,\Left(Seq(v_1,v_2))$ & $\dn$ & $\ldots$\\
+ $\inj\,(r_1 \cdot r_2)\,c\,\,Right(v)$ & $\dn$ & $\ldots$\\
+\end{tabular}
+\end{center}
+
+\noindent
+In the first case we even know that it is not just a $\Left$-value, but
+$\Left(Seq(\ldots))$, because the corresponding regular expression
+in the derivation is \mbox{$(\der\,c\,r_1) \cdot r_2$}. Hence we know
+it must be a $Seq$-value enclosed inside $\Left$.
+The third clause for $r_1\cdot r_2$ in the $\inj$-function
+
+\begin{center}
+ \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
+ $\inj\,(r_1 \cdot r_2)\,c\,\,Seq(v_1,v_2)$ & $\dn$ & $\ldots$\\
+ \end{tabular}
+ \end{center}
+
+\noindent corresponds to the else-branch in the derivative. In this
+case we know the derivative is of the form $(\der\,c\,r_1) \cdot r_2$ and
+therefore the value must be of the form $Seq(\ldots)$.
+
+Hopefully the $inj$-function makes now more sense. I let you explain
+the $r^*$ case. What do the derivative of $r^*$ and
+the corresponding value look like? Does this explain the shape of
+the clause?
+
+\begin{center}
+ \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
+ $\inj\,(r^*)\,c\,\,Seq(v,Stars\,vs)$ & $\dn$ & $Stars(\inj\,r\,c\,v\,::\,vs)$\\
+ \end{tabular}
+\end{center}
+
+\noindent If yes, you made sense of the left-hand sides of
+the $\inj$-definition.
+
+How about the right-hand sides? Well, in the
+$r^*$ case for example we have according to the square in the picture
+above a value $v_{der}$ which says how the derivative $r_{der}$
+matched the string. Since the derivative is of the form
+$(\der\,c\,r) \cdot (r^*)$ the corresponding value is of the
+form $Seq(v, Stars\,vs)$. But for $r^*$ we are looking for a value
+for the original (top left) regular expression --- so it cannot
+be a value of the shape $Seq(\ldots, Stars\ldots)$ because that is the
+shape for sequence regular expressions. What we need is a value
+of the form $Stars \ldots$ (remember the correspondence between
+values and regular expressions). This suggests to define the right hand
+side as
+
+\begin{center}
+$\ldots \dn Stars(\inj\,r\,c\,v\,::\,vs)$
+\end{center}
+
+\noindent It is a value of the right shape, namely $Stars$. It injected
+$c$ into the first-value, which is in fact the value where we need to
+undo the derivative. Remember again the shape of the derivative of $r^*$.
+In place where we chopped off the $c$, we now need to do the $\inj$ of $c$.
+Therefore $\inj\,r\,c\,v$ in the definition above. That is the same with
+the other clauses in $\inj$.
+
+
+Phew\ldotsSweat\ldots Unfortunately, there is one more problem with the
+described algorithm so far: it is very slow. We need to include in all
+this the simplification from Lecture 2. This is what we shall do next.
\subsubsection*{Simplification}
@@ -320,7 +447,7 @@
example:
\begin{equation}\label{regexfour}
-r_4 = \textit{der}\,c\,r_3 =
+r_4 = \der\,c\,r_3 =
(\ZERO \cdot (b \cdot c)) + ((\ZERO \cdot c) + \ONE)
\end{equation}
@@ -601,11 +728,11 @@
\begin{center}
\begin{tabular}{lcl}
-$lex\,r\,[]$ & $\dn$ & if $\textit{nullable}(r)$ then $\textit{mkeps}(r)$\\
+$lex\,r\,[]$ & $\dn$ & if $\nullable(r)$ then $\textit{mkeps}(r)$\\
& & else $error$\\
$lex\,r\,c\!::\!s$ & $\dn$ & let
- $(r_{simp}, f_{rect}) = simp(\textit{der}(c, r))$\\
-& & $\textit{inj}\,r\,c\,f_{rect}(lex\,r_{simp}\,s)$
+ $(r_{simp}, f_{rect}) = simp(\der(c, r))$\\
+& & $\inj\,r\,c\,f_{rect}(lex\,r_{simp}\,s)$
\end{tabular}
\end{center}
@@ -652,8 +779,8 @@
{../progs/app03.scala}}
\noindent Since we regard records as regular expressions we
-need to extend the functions $\textit{nullable}$ and $\textit{der}$. Similarly
-$\textit{mkeps}$ and $\textit{inj}$ need to be extended. This means we also need
+need to extend the functions $\nullable$ and $\der$. Similarly
+$\textit{mkeps}$ and $\inj$ need to be extended. This means we also need
to extend the definition of values, which in Scala looks as
follows: