\documentclass{article}\usepackage{../style}\usepackage{../langs}\usepackage{../graphics}\begin{document}\section*{Handout 4 (Sulzmann \& Lu Algorithm)}So far our algorithm based on derivatives was only able to sayyes or no depending on whether a string was matched by regularexpression or not. Often a more interesting question is tofind out \emph{how} a regular expression matched a string?Answering this question will help us with the problem we areafter, namely tokenising an input string, that is splitting itup into its ``word'' components. The algorithm we will belooking at was designed by Sulzmann \& Lu in a rather recentpaper. A link to it is provided on KEATS, in case you areinterested.\footnote{In my humble opinion this is aninteresting instance of the research literature: it contains avery neat idea, but its presentation is rather sloppy. Inearlier versions of their paper, students and I found severalrather annoying typos in their examples and definitions.}In order to give an answer for how a regular expression matcheda string, Sulzmann and Lu introduce \emph{values}. A value will be the output of the algorithm whenever the regular expression matches the string. If not, an error will be raised.Since the first phase of the algorithm is identical to thederivative based matcher from the first coursework, the function $nullable$ will be used to decide whether as stringis matched by a regular expression. If $nullable$ saysyes, then values are constructed that reflect how the regularexpression matched the string. The definitions for regular expressions $r$ and values $v$ is shown below:\begin{center}\begin{tabular}{cc}\begin{tabular}{@{}rrl@{}} $r$ & $::=$ & $\varnothing$\\ & $\mid$ & $\epsilon$ \\ & $\mid$ & $c$ \\ & $\mid$ & $r_1 \cdot r_2$\\ & $\mid$ & $r_1 + r_2$ \\ \\ & $\mid$ & $r^*$ \\\end{tabular}&\begin{tabular}{@{\hspace{0mm}}rrl@{}} $v$ & $::=$ & \\ & & $Empty$ \\ & $\mid$ & $Char(c)$ \\ & $\mid$ & $Seq(v_1,v_2)$\\ & $\mid$ & $Left(v)$ \\ & $\mid$ & $Right(v)$ \\ & $\mid$ & $[v_1,\ldots\,v_n]$ \\\end{tabular}\end{tabular}\end{center}\noindent As you can see there is a very strong correspondencebetween regular expressions and values. There is no value forthe $\varnothing$ regular expression (since it does not matchany string). Then there is exactly one value corresponding to each regular expression. with the exception of $r_1 + r_2$ where there are two values $Left(v)$ and $Right(v)$ corresponding to the two alternatives. Note that $r^*$is associated with a list of values, one for each copy of $r$ that was needed to match the string. This mean we might also return the empty list $[]$, if no copy was needed.Graphically the algorithm can be represneted by the picture in Figure~\ref{Sulz} where the path involving $der/nullable$ is the firstphase of the algorithm and $mkeps/inj$ the second phase.This picture shows the steps required when a regularexpression, say $r_1$, matches the string $abc$. We firstbuild the three derivatives (according to $a$, $b$ and $c$).We then use $nullable$ to find out whether the resultingregular expression can match the empty string. If yeswe call the function $mkeps$.\begin{figure}[t]\begin{center}\begin{tikzpicture}[scale=2,node distance=1.2cm, every node/.style={minimum size=7mm}]\node (r1) {$r_1$};\node (r2) [right=of r1]{$r_2$};\draw[->,line width=1mm](r1)--(r2) node[above,midway] {$der\,a$};\node (r3) [right=of r2]{$r_3$};\draw[->,line width=1mm](r2)--(r3) node[above,midway] {$der\,b$};\node (r4) [right=of r3]{$r_4$};\draw[->,line width=1mm](r3)--(r4) node[above,midway] {$der\,c$};\draw (r4) node[anchor=west] {\;\raisebox{3mm}{$nullable$}};\node (v4) [below=of r4]{$v_4$};\draw[->,line width=1mm](r4) -- (v4);\node (v3) [left=of v4] {$v_3$};\draw[->,line width=1mm](v4)--(v3) node[below,midway] {$inj\,c$};\node (v2) [left=of v3]{$v_2$};\draw[->,line width=1mm](v3)--(v2) node[below,midway] {$inj\,b$};\node (v1) [left=of v2] {$v_1$};\draw[->,line width=1mm](v2)--(v1) node[below,midway] {$inj\,a$};\draw (r4) node[anchor=north west] {\;\raisebox{-8mm}{$mkeps$}};\end{tikzpicture}\end{center}\caption{The two phases of the algorithm by Sulzmann \& Lu.\label{Sulz}}\end{figure}The $mkeps$ function calculates a value for how a regularexpression could have matched the empty string. Its definitionis as follows:\begin{center}\begin{tabular}{lcl} $mkeps(\epsilon)$ & $\dn$ & $Empty$\\ $mkeps(r_1 + r_2)$ & $\dn$ & if $nullable(r_1)$ \\ & & then $Left(mkeps(r_1))$\\ & & else $Right(mkeps(r_2))$\\ $mkeps(r_1 \cdot r_2)$ & $\dn$ & $Seq(mkeps(r_1),mkeps(r_2))$\\ $mkeps(r^*)$ & $\dn$ & $[]$ \\\end{tabular}\end{center}\noindent There are no cases for $\epsilon$ and $c$, sincethese regular expression cannot match the empty string. Note that in case of alternatives we give preference to theregular expression on the left-hand side. This will becomeimportant later on.The algorithm is organised recursively such that it will calculate a value for how the derivative regular expressionhas matched a string where the first character has beenchopped off. Now we need a function that reverses this``chopping off'' for values. The corresponding functionis called $inj$ for injection. This function takesthree arguments: the first one is a regular expressionfor which we want to calculate the value, the second is the character we want to inject and the third argument is thevalue where we will inject the character. The resultof this function is a new value. The definition of $inj$ is as follows: \begin{center}\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l} $inj\,(c)\,c\,Empty$ & $\dn$ & $Char\,c$\\ $inj\,(r_1 + r_2)\,c\,Left(v)$ & $\dn$ & $Left(inj\,r_1\,c\,v)$\\ $inj\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(inj\,r_2\,c\,v)$\\ $inj\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$ & $Seq(inj\,r_1\,c\,v_1,v_2)$\\ $inj\,(r_1 \cdot r_2)\,c\,Left(Seq(v_1,v_2))$ & $\dn$ & $Seq(inj\,r_1\,c\,v_1,v_2)$\\ $inj\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$ & $Seq(mkeps(r_1),inj\,r_2\,c\,v)$\\ $inj\,(r^*)\,c\,Seq(v,vs)$ & $\dn$ & $inj\,r\,c\,v\,::\,vs$\\\end{tabular}\end{center}\noindent This definition is by recursion on theregular expression and by analysing the shape of the values. Therefore there are, for example, three cases for sequneceregular expressions. The last clause returns a list where the first element is $inj\,r\,c\,v$ and the other elements are $vs$.To understand what is going on, it might be best to do someexample calculations and compare with Figure~\ref{Sulz}. Forthis note that we have not yet dealt with the need ofsimplifying regular expreesions (this will be a topic on itsown later). Suppose the regular expression is $a \cdot (b\cdot c)$ and the input string is $abc$. The derivatives areas follows:\begin{center}\begin{tabular}{ll}$r_1$: & $a \cdot (b \cdot c)$\\$r_2$: & $\epsilon \cdot (b \cdot c)$\\$r_3$: & $(\varnothing \cdot (b \cdot c)) + (\epsilon \cdot c)$\\$r_4$: & $(\varnothing \cdot (b \cdot c)) + ((\varnothing \cdot c) + \epsilon)$\\\end{tabular}\end{center}\noindent According to the simple algorithm. we would testwhether $r_4$ is nullable, which it is. This means we canuse the function $mkeps$ to calculate a value for how $r_4$was able to match the empty string. Remember that this function gives preference for alternatives on the left-handside. However there is only $\epsilon$ on the very right-handside of $r_4$ that matches the empty string. Therefore$mkeps$ returns the value\begin{center}$v_4:\;Right(Right(Empty))$\end{center}\noindent The point is that from this value we can directly read off which part of $r_4$ matched the empty string. Next we have to ``inject'' the last character, that is $c$ in therunning example, into this value in order to calculate how$r_3$ could have matched the string $c$. According to the definition of $inj$ we obtain\begin{center}$v_3:\;Right(Seq(Empty, Char(c)))$\end{center}\noindent This is the correct result, because $r_3$ needsto use the right-hand alternative, and then $\epsilon$ needsto match the empty string and $c$ needs to match $c$.Next we need to inject back the letter $b$ into $v_3$. Thisgives\begin{center}$v_2:\;Seq(Empty, Seq(Char(b), Char(c)))$\end{center}\noindent Finally we need to inject back the letter $a$ into$v_2$ giving the final result\begin{center}$v_1:\;Seq(Char(a), Seq(Char(b), Char(c)))$\end{center}\noindent This now corresponds to how the regularexpression $a \cdot (b \cdot c)$ matched the string $abc$.This is the expected result. So at least in this case thealgorithm seems to calculate what it is supposed to.There are a few auxiliary function that are of interestin analysing this algorithm. One is called \emph{flatten},written $|\_|$, which extracts the string ``underlying'' a value. It is defined as\begin{center}\begin{tabular}{lcl} $|Empty|$ & $\dn$ & $[]$\\ $|Char(c)|$ & $\dn$ & $[c]$\\ $|Left(v)|$ & $\dn$ & $|v|$\\ $|Right(v)|$ & $\dn$ & $|v|$\\ $|Seq(v_1,v_2)|$& $\dn$ & $|v_1| \,@\, |v_2|$\\ $|[v_1,\ldots ,v_n]|$ & $\dn$ & $|v_1| \,@\ldots @\, |v_n|$\\\end{tabular}\end{center}\noindent Using flatten we can see what is the string behind the values calculated by $mkeps$ and $inj$ in our running example:\begin{center}\begin{tabular}{ll}$v_4$: & $[]$\\$v_3$: & $c$\\$v_2$: & $bc$\\$v_1$: & $abc$\end{tabular}\end{center}\noindent This indicates that $inj$ indeed is injecting, or adding, backa character into the value.\subsubsection*{Simplification}Generally the matching algorithms based on derivatives dopoorly unless the regular expressions are simplified aftereach derivatives step.\newpageAlgorithm by Sulzmann, Lexing\end{document}%%% Local Variables: %%% mode: latex%%% TeX-master: t%%% End: