diff -r 3e3b927a85cf -r c14e5ebf0c3b handouts/ho04.tex --- a/handouts/ho04.tex Thu Oct 16 11:59:39 2014 +0100 +++ b/handouts/ho04.tex Thu Oct 16 17:30:05 2014 +0100 @@ -16,25 +16,254 @@ after, namely tokenising an input string, that is splitting it up into its ``word'' components. The algorithm we will be looking at was designed by Sulzmann \& Lu in a rather recent -paper. A link to it is provided at KEATS, in case you are +paper. A link to it is provided on KEATS, in case you are interested.\footnote{In my humble opinion this is an interesting instance of the research literature: it contains a -very neat idea, but its presentation is rather sloppy. -Students and I found several rather annoying typos in their -examples and definitions.} +very neat idea, but its presentation is rather sloppy. In +earlier versions of their paper, students and I found several +rather annoying typos in their examples and definitions.} In order to give an answer for how a regular expression matched a string, Sulzmann and Lu introduce \emph{values}. A value will be the output of the algorithm whenever the regular -expression matches the string. If not an error will be raised. +expression matches the string. If not, an error will be raised. Since the first phase of the algorithm is identical to the derivative based matcher from the first coursework, the function $nullable$ will be used to decide whether as string -is matched by a regular expression. +is matched by a regular expression. If $nullable$ says +yes, then values are constructed that reflect how the regular +expression matched the string. The definitions for regular +expressions $r$ and values $v$ is shown below: + +\begin{center} +\begin{tabular}{cc} +\begin{tabular}{@{}rrl@{}} + $r$ & $::=$ & $\varnothing$\\ + & $\mid$ & $\epsilon$ \\ + & $\mid$ & $c$ \\ + & $\mid$ & $r_1 \cdot r_2$\\ + & $\mid$ & $r_1 + r_2$ \\ + \\ + & $\mid$ & $r^*$ \\ +\end{tabular} +& +\begin{tabular}{@{\hspace{0mm}}rrl@{}} + $v$ & $::=$ & \\ + & & $Empty$ \\ + & $\mid$ & $Char(c)$ \\ + & $\mid$ & $Seq(v_1,v_2)$\\ + & $\mid$ & $Left(v)$ \\ + & $\mid$ & $Right(v)$ \\ + & $\mid$ & $[v_1,\ldots\,v_n]$ \\ +\end{tabular} +\end{tabular} +\end{center} + +\noindent As you can see there is a very strong correspondence +between regular expressions and values. There is no value for +the $\varnothing$ regular expression (since it does not match +any string). Then there is exactly one value corresponding to +each regular expression. with the exception of $r_1 + r_2$ +where there are two values $Left(v)$ and $Right(v)$ +corresponding to the two alternatives. Note that $r^*$ +is associated with a list of values, one for each copy of $r$ +that was needed to match the string. This mean we might also +return the empty list $[]$, if no copy was needed. + +Graphically the algorithm can be represneted by the +picture in Figure~\ref{Sulz} where the path involving $der/nullable$ is the first +phase of the algorithm and $mkeps/inj$ the second phase. +This picture shows the steps required when a regular +expression, say $r_1$, matches the string $abc$. We first +build the three derivatives (according to $a$, $b$ and $c$). +We then use $nullable$ to find out whether the resulting +regular expression can match the empty string. If yes +we call the function $mkeps$. + +\begin{figure}[t] +\begin{center} +\begin{tikzpicture}[scale=2,node distance=1.2cm, + every node/.style={minimum size=7mm}] +\node (r1) {$r_1$}; +\node (r2) [right=of r1]{$r_2$}; +\draw[->,line width=1mm](r1)--(r2) node[above,midway] {$der\,a$}; +\node (r3) [right=of r2]{$r_3$}; +\draw[->,line width=1mm](r2)--(r3) node[above,midway] {$der\,b$}; +\node (r4) [right=of r3]{$r_4$}; +\draw[->,line width=1mm](r3)--(r4) node[above,midway] {$der\,c$}; +\draw (r4) node[anchor=west] {\;\raisebox{3mm}{$nullable$}}; +\node (v4) [below=of r4]{$v_4$}; +\draw[->,line width=1mm](r4) -- (v4); +\node (v3) [left=of v4] {$v_3$}; +\draw[->,line width=1mm](v4)--(v3) node[below,midway] {$inj\,c$}; +\node (v2) [left=of v3]{$v_2$}; +\draw[->,line width=1mm](v3)--(v2) node[below,midway] {$inj\,b$}; +\node (v1) [left=of v2] {$v_1$}; +\draw[->,line width=1mm](v2)--(v1) node[below,midway] {$inj\,a$}; +\draw (r4) node[anchor=north west] {\;\raisebox{-8mm}{$mkeps$}}; +\end{tikzpicture} +\end{center} +\caption{The two phases of the algorithm by Sulzmann \& Lu. +\label{Sulz}} +\end{figure} + +The $mkeps$ function calculates a value for how a regular +expression could have matched the empty string. Its definition +is as follows: + +\begin{center} +\begin{tabular}{lcl} + $mkeps(\epsilon)$ & $\dn$ & $Empty$\\ + $mkeps(r_1 + r_2)$ & $\dn$ & if $nullable(r_1)$ \\ + & & then $Left(mkeps(r_1))$\\ + & & else $Right(mkeps(r_2))$\\ + $mkeps(r_1 \cdot r_2)$ & $\dn$ & $Seq(mkeps(r_1),mkeps(r_2))$\\ + $mkeps(r^*)$ & $\dn$ & $[]$ \\ +\end{tabular} +\end{center} + +\noindent There are no cases for $\epsilon$ and $c$, since +these regular expression cannot match the empty string. +Note that in case of alternatives we give preference to the +regular expression on the left-hand side. This will become +important later on. + +The algorithm is organised recursively such that it will +calculate a value for how the derivative regular expression +has matched a string where the first character has been +chopped off. Now we need a function that reverses this +``chopping off'' for values. The corresponding function +is called $inj$ for injection. This function takes +three arguments: the first one is a regular expression +for which we want to calculate the value, the second is the +character we want to inject and the third argument is the +value where we will inject the character. The result +of this function is a new value. The definition of $inj$ +is as follows: + +\begin{center} +\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l} + $inj\,(c)\,c\,Empty$ & $\dn$ & $Char\,c$\\ + $inj\,(r_1 + r_2)\,c\,Left(v)$ & $\dn$ & $Left(inj\,r_1\,c\,v)$\\ + $inj\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(inj\,r_2\,c\,v)$\\ + $inj\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$ & $Seq(inj\,r_1\,c\,v_1,v_2)$\\ + $inj\,(r_1 \cdot r_2)\,c\,Left(Seq(v_1,v_2))$ & $\dn$ & $Seq(inj\,r_1\,c\,v_1,v_2)$\\ + $inj\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$ & $Seq(mkeps(r_1),inj\,r_2\,c\,v)$\\ + $inj\,(r^*)\,c\,Seq(v,vs)$ & $\dn$ & $inj\,r\,c\,v\,::\,vs$\\ +\end{tabular} +\end{center} + +\noindent This definition is by recursion on the +regular expression and by analysing the shape of the values. +Therefore there are, for example, three cases for sequnece +regular expressions. The last clause returns a list where +the first element is $inj\,r\,c\,v$ and the other elements +are $vs$. + +To understand what is going on, it might be best to do some +example calculations and compare with Figure~\ref{Sulz}. For +this note that we have not yet dealt with the need of +simplifying regular expreesions (this will be a topic on its +own later). Suppose the regular expression is $a \cdot (b +\cdot c)$ and the input string is $abc$. The derivatives are +as follows: + +\begin{center} +\begin{tabular}{ll} +$r_1$: & $a \cdot (b \cdot c)$\\ +$r_2$: & $\epsilon \cdot (b \cdot c)$\\ +$r_3$: & $(\varnothing \cdot (b \cdot c)) + (\epsilon \cdot c)$\\ +$r_4$: & $(\varnothing \cdot (b \cdot c)) + ((\varnothing \cdot c) + \epsilon)$\\ +\end{tabular} +\end{center} + +\noindent According to the simple algorithm. we would test +whether $r_4$ is nullable, which it is. This means we can +use the function $mkeps$ to calculate a value for how $r_4$ +was able to match the empty string. Remember that this +function gives preference for alternatives on the left-hand +side. However there is only $\epsilon$ on the very right-hand +side of $r_4$ that matches the empty string. Therefore +$mkeps$ returns the value + +\begin{center} +$v_4:\;Right(Right(Empty))$ +\end{center} + +\noindent The point is that from this value we can directly +read off which part of $r_4$ matched the empty string. Next we +have to ``inject'' the last character, that is $c$ in the +running example, into this value in order to calculate how +$r_3$ could have matched the string $c$. According to the +definition of $inj$ we obtain + +\begin{center} +$v_3:\;Right(Seq(Empty, Char(c)))$ +\end{center} + +\noindent This is the correct result, because $r_3$ needs +to use the right-hand alternative, and then $\epsilon$ needs +to match the empty string and $c$ needs to match $c$. +Next we need to inject back the letter $b$ into $v_3$. This +gives + +\begin{center} +$v_2:\;Seq(Empty, Seq(Char(b), Char(c)))$ +\end{center} + +\noindent Finally we need to inject back the letter $a$ into +$v_2$ giving the final result + +\begin{center} +$v_1:\;Seq(Char(a), Seq(Char(b), Char(c)))$ +\end{center} + +\noindent This now corresponds to how the regular +expression $a \cdot (b \cdot c)$ matched the string $abc$. +This is the expected result. So at least in this case the +algorithm seems to calculate what it is supposed to. + +There are a few auxiliary function that are of interest +in analysing this algorithm. One is called \emph{flatten}, +written $|\_|$, which extracts the string ``underlying'' a +value. It is defined as + +\begin{center} +\begin{tabular}{lcl} + $|Empty|$ & $\dn$ & $[]$\\ + $|Char(c)|$ & $\dn$ & $[c]$\\ + $|Left(v)|$ & $\dn$ & $|v|$\\ + $|Right(v)|$ & $\dn$ & $|v|$\\ + $|Seq(v_1,v_2)|$& $\dn$ & $|v_1| \,@\, |v_2|$\\ + $|[v_1,\ldots ,v_n]|$ & $\dn$ & $|v_1| \,@\ldots @\, |v_n|$\\ +\end{tabular} +\end{center} + +\noindent Using flatten we can see what is the string behind +the values calculated by $mkeps$ and $inj$ in our running +example: + +\begin{center} +\begin{tabular}{ll} +$v_4$: & $[]$\\ +$v_3$: & $c$\\ +$v_2$: & $bc$\\ +$v_1$: & $abc$ +\end{tabular} +\end{center} + +\noindent +This indicates that $inj$ indeed is injecting, or adding, back +a character into the value. +\subsubsection*{Simplification} +Generally the matching algorithms based on derivatives do +poorly unless the regular expressions are simplified after +each derivatives step. +\newpage Algorithm by Sulzmann, Lexing \end{document}