diff -r e74c696821a2 -r 5deefcc8cffa handouts/ho04.tex --- a/handouts/ho04.tex Tue Aug 23 23:12:55 2016 +0200 +++ b/handouts/ho04.tex Tue Sep 20 12:24:29 2016 +0100 @@ -10,20 +10,34 @@ \section*{Handout 4 (Sulzmann \& Lu Algorithm)} So far our algorithm based on derivatives was only able to say -yes or no depending on whether a string was matched by regular +yes or no depending on whether a string was matched by a regular expression or not. Often a more interesting question is to find out \emph{how} a regular expression matched a string? Answering this question will also help us with the problem we are after, namely tokenising an input string. -The algorithm we will be looking at was designed by Sulzmann -\& Lu in a rather recent paper (from 2014). A link to it is +The algorithm we will be looking at in this lecture was designed by Sulzmann +\& Lu in a rather recent research paper (from 2014). A link to it is provided on KEATS, in case you are interested.\footnote{In my humble opinion this is an interesting instance of the research literature: it contains a very neat idea, but its presentation -is rather sloppy. In earlier versions of their paper, a King's -student and I found several rather annoying typos in their -examples and definitions.} In order to give an answer for +is rather sloppy. In earlier versions of this paper, a King's +undergraduate student and I found several rather annoying typos in the +examples and definitions.} My PhD student Fahad Ausaf and I even more recently +wrote a paper where we build on their result: we provided a +mathematical proof that their algorithm is really correct---the proof +Sulzmann \& Lu had originally given contained major flaws. Such correctness +proofs are important: Kuklewicz maintains a unit-test library +for the kind of algorithma we are interested in here and he showed +that many implementations in the ``wild'' are buggy, that is not +satisfy his unit tests: + +\begin{center} +\url{http://www.haskell.org/haskellwiki/Regex_Posix} +\end{center} + + +In order to give an answer for \emph{how} a regular expression matches a string, Sulzmann and Lu use \emph{values}. A value will be the output of the algorithm whenever the regular expression matches the string. @@ -54,7 +68,7 @@ & $\mid$ & $Seq(v_1,v_2)$\\ & $\mid$ & $\Left(v)$ \\ & $\mid$ & $Right(v)$ \\ - & $\mid$ & $Stars [v_1,\ldots\,v_n]$ \\ + & $\mid$ & $Stars\,[v_1,\ldots\,v_n]$ \\ \end{tabular} \end{tabular} \end{center} @@ -71,7 +85,7 @@ of $r^*$. For sequence, there is exactly one value, composed of two component values ($v_1$ and $v_2$). -My definition of regular expressions and values in Scala is +My implementation of regular expressions and values in Scala is shown below. I have in my implementation the convention that regular expressions are written entirely with upper-case letters, while values just start with a single upper-case @@ -87,14 +101,16 @@ Graphically the algorithm by Sulzmann \& Lu can be illustrated by the picture in Figure~\ref{Sulz} where the path from the -left to the right involving $der/nullable$ is the first phase -of the algorithm and $mkeps/inj$, the path from right to left, +left to the right involving $\textit{der}/\textit{nullable}$ is the first phase +of the algorithm and $\textit{mkeps}/\textit{inj}$, the path from right to left, the second phase. This picture shows the steps required when a regular expression, say $r_1$, matches the string $abc$. We first build the three derivatives (according to $a$, $b$ and -$c$). We then use $nullable$ to find out whether the resulting +$c$). We then use $\textit{nullable}$ to find out whether the resulting regular expression can match the empty string. If yes, we call -the function $mkeps$. +the function $\textit{mkeps}$. The $\textit{mkeps}$ function calculates a value for how a regular +expression has matched the empty string. Its definition +is as follows: \begin{figure}[t] \begin{center} @@ -102,43 +118,41 @@ every node/.style={minimum size=7mm}] \node (r1) {$r_1$}; \node (r2) [right=of r1]{$r_2$}; -\draw[->,line width=1mm](r1)--(r2) node[above,midway] {$der\,a$}; +\draw[->,line width=1mm](r1)--(r2) node[above,midway] {$\textit{der}\,a$}; \node (r3) [right=of r2]{$r_3$}; -\draw[->,line width=1mm](r2)--(r3) node[above,midway] {$der\,b$}; +\draw[->,line width=1mm](r2)--(r3) node[above,midway] {$\textit{der}\,b$}; \node (r4) [right=of r3]{$r_4$}; -\draw[->,line width=1mm](r3)--(r4) node[above,midway] {$der\,c$}; -\draw (r4) node[anchor=west] {\;\raisebox{3mm}{$nullable$}}; +\draw[->,line width=1mm](r3)--(r4) node[above,midway] {$\textit{der}\,c$}; +\draw (r4) node[anchor=west] {\;\raisebox{3mm}{$\textit{nullable}$}}; \node (v4) [below=of r4]{$v_4$}; \draw[->,line width=1mm](r4) -- (v4); \node (v3) [left=of v4] {$v_3$}; -\draw[->,line width=1mm](v4)--(v3) node[below,midway] {$inj\,c$}; +\draw[->,line width=1mm](v4)--(v3) node[below,midway] {$\textit{inj}\,c$}; \node (v2) [left=of v3]{$v_2$}; -\draw[->,line width=1mm](v3)--(v2) node[below,midway] {$inj\,b$}; +\draw[->,line width=1mm](v3)--(v2) node[below,midway] {$\textit{inj}\,b$}; \node (v1) [left=of v2] {$v_1$}; -\draw[->,line width=1mm](v2)--(v1) node[below,midway] {$inj\,a$}; -\draw (r4) node[anchor=north west] {\;\raisebox{-8mm}{$mkeps$}}; +\draw[->,line width=1mm](v2)--(v1) node[below,midway] {$\textit{inj}\,a$}; +\draw (r4) node[anchor=north west] {\;\raisebox{-8mm}{$\textit{mkeps}$}}; \end{tikzpicture} \end{center} \caption{The two phases of the algorithm by Sulzmann \& Lu. \label{Sulz}} \end{figure} -The $mkeps$ function calculates a value for how a regular -expression has matched the empty string. Its definition -is as follows: + \begin{center} \begin{tabular}{lcl} - $mkeps(\ONE)$ & $\dn$ & $Empty$\\ - $mkeps(r_1 + r_2)$ & $\dn$ & if $nullable(r_1)$ \\ - & & then $\Left(mkeps(r_1))$\\ - & & else $Right(mkeps(r_2))$\\ - $mkeps(r_1 \cdot r_2)$ & $\dn$ & $Seq(mkeps(r_1),mkeps(r_2))$\\ - $mkeps(r^*)$ & $\dn$ & $[]$ \\ + $\textit{mkeps}(\ONE)$ & $\dn$ & $Empty$\\ + $\textit{mkeps}(r_1 + r_2)$ & $\dn$ & if $\textit{nullable}(r_1)$ \\ + & & then $\Left(\textit{mkeps}(r_1))$\\ + & & else $Right(\textit{mkeps}(r_2))$\\ + $\textit{mkeps}(r_1 \cdot r_2)$ & $\dn$ & $Seq(\textit{mkeps}(r_1),\textit{mkeps}(r_2))$\\ + $\textit{mkeps}(r^*)$ & $\dn$ & $Stars\,[]$ \\ \end{tabular} \end{center} -\noindent There are no cases for $\ZERO$ and $c$, since +\noindent There is are no cases for $\ZERO$ and $c$, since these regular expression cannot match the empty string. Note also that in case of alternatives we give preference to the regular expression on the left-hand side. This will become @@ -149,31 +163,31 @@ has matched a string. For this we need a function that reverses this ``chopping off'' for values which we did in the first phase for derivatives. The corresponding function is -called $inj$ for injection. This function takes three +called $\textit{inj}$ (short for injection). This function takes three arguments: the first one is a regular expression for which we want to calculate the value, the second is the character we want to inject and the third argument is the value where we will inject the character into. The result of this function is a -new value. The definition of $inj$ is as follows: +new value. The definition of $\textit{inj}$ is as follows: \begin{center} \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l} - $inj\,(c)\,c\,Empty$ & $\dn$ & $Char\,c$\\ - $inj\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(inj\,r_1\,c\,v)$\\ - $inj\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(inj\,r_2\,c\,v)$\\ - $inj\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$ & $Seq(inj\,r_1\,c\,v_1,v_2)$\\ - $inj\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$ & $Seq(inj\,r_1\,c\,v_1,v_2)$\\ - $inj\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$ & $Seq(mkeps(r_1),inj\,r_2\,c\,v)$\\ - $inj\,(r^*)\,c\,Seq(v,vs)$ & $\dn$ & $inj\,r\,c\,v\,::\,vs$\\ + $\textit{inj}\,(c)\,c\,Empty$ & $\dn$ & $Char\,c$\\ + $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\ + $\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\ + $\textit{inj}\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$ & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\ + $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$ & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\ + $\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$ & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\ + $\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$ & $\dn$ & $Stars(\textit{inj}\,r\,c\,v\,::\,vs)$\\ \end{tabular} \end{center} \noindent This definition is by recursion on the regular expression and by analysing the shape of the values. Therefore -there are, for example, three cases for sequence regular -expressions (for all possible shapes of the value). The last +there are three cases for sequence regular +expressions (for all possible shapes of the value we can encounter). The last clause for the star regular expression returns a list where -the first element is $inj\,r\,c\,v$ and the other elements are +the first element is $\textit{inj}\,r\,c\,v$ and the other elements are $vs$. That means $\_\,::\,\_$ should be read as list cons. To understand what is going on, it might be best to do some @@ -195,36 +209,36 @@ \noindent According to the simple algorithm, we would test whether $r_4$ is nullable, which in this case it indeed is. -This means we can use the function $mkeps$ to calculate a +This means we can use the function $\textit{mkeps}$ to calculate a value for how $r_4$ was able to match the empty string. Remember that this function gives preference for alternatives on the left-hand side. However there is only $\ONE$ on the -very right-hand side of $r_4$ that matches the empty string. -Therefore $mkeps$ returns the value +very right-hand side of $r_4$ (underlined) + +\begin{center} +$r_4$:\;$(\ZERO \cdot (b \cdot c)) + + ((\ZERO \cdot c) + \underline{\ONE})$\\ +\end{center} + +\noindent +that matches the empty string. +Therefore $\textit{mkeps}$ returns the value \begin{center} $v_4:\;Right(Right(Empty))$ \end{center} \noindent If there had been a $\ONE$ on the left, then -$mkeps$ would have returned something of the form +$\textit{mkeps}$ would have returned something of the form $\Left(\ldots)$. The point is that from this value we can directly read off which part of $r_4$ matched the empty string: take the right-alternative first, and then the -right-alternative again. Remember $r_4$ is of the form - -\begin{center} -$r_4$:\;$(\ZERO \cdot (b \cdot c)) + - ((\ZERO \cdot c) + \underline{\ONE})$\\ -\end{center} - -\noindent the value tells us that the underlined $\ONE$ -is responsible for matching the empty string. +right-alternative again. Next we have to ``inject'' the last character, that is $c$ in the running example, into this value $v_4$ in order to calculate how $r_3$ could have matched the string $c$. -According to the definition of $inj$ we obtain +According to the definition of $\textit{inj}$ we obtain \begin{center} $v_3:\;Right(Seq(Empty, Char(c)))$ @@ -263,12 +277,12 @@ $|\Left(v)|$ & $\dn$ & $|v|$\\ $|Right(v)|$ & $\dn$ & $|v|$\\ $|Seq(v_1,v_2)|$& $\dn$ & $|v_1| \,@\, |v_2|$\\ - $|[v_1,\ldots ,v_n]|$ & $\dn$ & $|v_1| \,@\ldots @\, |v_n|$\\ + $|Stars\,[v_1,\ldots ,v_n]|$ & $\dn$ & $|v_1| \,@\ldots @\, |v_n|$\\ \end{tabular} \end{center} \noindent Using flatten we can see what are the strings behind -the values calculated by $mkeps$ and $inj$. In our running +the values calculated by $\textit{mkeps}$ and $\textit{inj}$. In our running example: \begin{center} @@ -280,10 +294,8 @@ \end{tabular} \end{center} -\noindent This indicates that $inj$ indeed is injecting, or -adding, back a character into the value. If we continue until -all characters are injected back, we have a value that can -indeed say how the string $abc$ was matched. +\noindent This indicates that $\textit{inj}$ indeed is injecting, or +adding, back a character into the value. There is a problem, however, with the described algorithm so far: it is very slow. We need to include the simplification @@ -301,23 +313,29 @@ first example consider the last derivation step in our earlier example: -\begin{center} -$r_4 = der\,c\,r_3 = -(\ZERO \cdot (b \cdot c)) + ((\ZERO \cdot c) + \ONE)$ -\end{center} +\begin{equation}\label{regexfour} +r_4 = \textit{der}\,c\,r_3 = +(\ZERO \cdot (b \cdot c)) + ((\ZERO \cdot c) + \ONE) +\end{equation} \noindent Simplifying this regular expression would just give -us $\ONE$. Running $mkeps$ with this regular expression as -input, however, would then provide us with $Empty$ instead of -$Right(Right(Empty))$ that was obtained without the -simplification. The problem is we need to recreate this more -complicated value, rather than just return $Empty$. +us $\ONE$. Running $\textit{mkeps}$ with this $\ONE$ as +input, however, would give us with the value $Empty$ instead of + +\[Right(Right(Empty))\] -This will require what I call \emph{rectification functions}. -They need to be calculated whenever a regular expression gets -simplified. Rectification functions take a value as argument -and return a (rectified) value. Let us first take a look again -at our simplification rules: +\noindent +that was obtained without the simplification. The problem is we need +to recreate this more complicated value, rather than just return +$Empty$. This will require what I call \emph{rectification functions}. +They need to be calculated whenever a regular expression gets +simplified. + +Rectification functions take a value as argument and return a +(rectified) value. In the example above the argument would be $Empty$ +and the output $Right(Right(Empty))$. Before we define these +rectification functions in general, let us first take a look again at +our simplification rules: \begin{center} \begin{tabular}{l@{\hspace{2mm}}c@{\hspace{2mm}}l} @@ -331,7 +349,7 @@ \end{tabular} \end{center} -\noindent Applying them to $r_4$ will require several nested +\noindent Applying them to $r_4$ in \eqref{regexfour} will require several nested simplifications in order end up with just $\ONE$. However, it is possible to apply them in a depth-first, or inside-out, manner in order to calculate this simplified regular @@ -391,28 +409,28 @@ case): \begin{center} -\begin{tabular}{l} -$simp(r)$:\\ -\quad case $r = r_1 + r_2$\\ -\qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\ -\qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\ -\qquad case $r_{1s} = \ZERO$: +\begin{tabular}{ll} +$_1$ & $simp(r)$:\\ +$_2$ & \quad case $r = r_1 + r_2$\\ +$_3$ & \qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\ +$_4$ & \qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\ +$_5$ & \qquad case $r_{1s} = \ZERO$: return $(r_{2s}, \lambda v. \,Right(f_{2s}(v)))$\\ -\qquad case $r_{2s} = \ZERO$: +$_6$ & \qquad case $r_{2s} = \ZERO$: return $(r_{1s}, \lambda v. \,\Left(f_{1s}(v)))$\\ -\qquad case $r_{1s} = r_{2s}$: +$_7$ & \qquad case $r_{1s} = r_{2s}$: return $(r_{1s}, \lambda v. \,\Left(f_{1s}(v)))$\\ -\qquad otherwise: +$_8$ & \qquad otherwise: return $(r_{1s} + r_{2s}, f_{alt}(f_{1s}, f_{2s}))$\\ \end{tabular} \end{center} \noindent We first recursively call the simplification with -$r_1$ and $r_2$. This gives simplified regular expressions, +$r_1$ and $r_2$ (Lines 3 and 4). This gives simplified regular expressions, $r_{1s}$ and $r_{2s}$, as well as two rectification functions $f_{1s}$ and $f_{2s}$. We next need to test whether the simplified regular expressions are $\ZERO$ so as to make -further simplifications. In case $r_{1s}$ is $\ZERO$, +further simplifications. In case $r_{1s}$ is $\ZERO$ (Line 5), then we can return $r_{2s}$ (the other alternative). However for this we need to build a corresponding rectification function, which as said above is @@ -465,7 +483,7 @@ \begin{center} \begin{tabular}{l} -$simp(r)$:\qquad\qquad (continued)\\ +$simp(r)$:\hspace{5cm} (continued)\\ \quad case $r = r_1 \cdot r_2$\\ \qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\ \qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\ @@ -517,10 +535,9 @@ $\lambda v.\,v$ \end{center} -\noindent This completes the high-level version of the -simplification function, which is also shown again in -Figure~\ref{simp}. This can now be used in a \emph{lexing -function} as follows: +\noindent This completes the high-level version of the simplification +function, which is shown again in Figure~\ref{simp}. The Scala code +for the simplifaction function is in Figure~\ref{simprect}. \begin{figure}[t] \begin{center} @@ -562,13 +579,25 @@ regular expression and a rectification function.\label{simp}} \end{figure} +\begin{figure}[p] +\lstinputlisting{../progs/app61.scala} + +\caption{The Scala code for the simplification function. The +first part defines some auxillary functions for the rectification. +The second part give the simplification function. +\label{simprect}} +\end{figure} + + +We are now in the position to define a \emph{lexing function} as follows: + \begin{center} \begin{tabular}{lcl} -$lex\,r\,[]$ & $\dn$ & if $nullable(r)$ then $mkeps(r)$\\ +$lex\,r\,[]$ & $\dn$ & if $\textit{nullable}(r)$ then $\textit{mkeps}(r)$\\ & & else $error$\\ $lex\,r\,c\!::\!s$ & $\dn$ & let - $(r_{simp}, f_{rect}) = simp(der(c, r))$\\ -& & $inj\,r\,c\,f_{rect}(lex\,r_{simp}\,s)$ + $(r_{simp}, f_{rect}) = simp(\textit{der}(c, r))$\\ +& & $\textit{inj}\,r\,c\,f_{rect}(lex\,r_{simp}\,s)$ \end{tabular} \end{center} @@ -576,7 +605,7 @@ seen in earlier lectures. In the first clause we are given an empty string, $[]$, and need to test wether the regular expression is $nullable$. If yes, we can proceed normally and -just return the value calculated by $mkeps$. The second clause +just return the value calculated by $\textit{mkeps}$. The second clause is for strings where the first character is $c$, say, and the rest of the string is $s$. We first build the derivative of $r$ with respect to $c$; simplify the resulting regular @@ -614,8 +643,8 @@ {../progs/app03.scala}} \noindent Since we regard records as regular expressions we -need to extend the functions $nullable$ and $der$. Similarly -$mkeps$ and $inj$ need to be extended. This means we also need +need to extend the functions $\textit{nullable}$ and $\textit{der}$. Similarly +$\textit{mkeps}$ and $\textit{inj}$ need to be extended. This means we also need to extend the definition of values, which in Scala looks as follows: @@ -646,7 +675,7 @@ $env(\Left(v))$ & $\dn$ & $env(v)$\\ $env(Right(v))$ & $\dn$ & $env(v)$\\ $env(Seq(v_1,v_2))$& $\dn$ & $env(v_1) \,@\, env(v_2)$\\ - $env([v_1,\ldots ,v_n])$ & $\dn$ & + $env(Stars\,[v_1,\ldots ,v_n])$ & $\dn$ & $env(v_1) \,@\ldots @\, env(v_n)$\\ $env(Rec(x:v))$ & $\dn$ & $(x:|v|) :: env(v)$\\ \end{tabular}