% !TEX program = xelatex\documentclass{article}\usepackage{../style}\usepackage{../langs}\usepackage{../graphics}\usepackage{skull}\begin{document}\fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017, 2019}\section*{Handout 4 (Sulzmann \& Lu Algorithm)}So far our algorithm based on derivatives is only able to sayyes or no depending on whether a string is matched by a regularexpression or not. Often a more interesting question is tofind out \emph{how} a regular expression matched a string?Answering this question will also help us with the problem weare after, namely tokenising an input string. The algorithm we will be looking at in this lecture was designed by Sulzmann\& Lu in a rather recent research paper (from 2014). A link to it isprovided on KEATS, in case you are interested.\footnote{In myhumble opinion this is an interesting instance of the researchliterature: it contains a very neat idea, but its presentationis rather sloppy. In earlier versions of this paper, a King'sundergraduate student and I found several rather annoying typos in theexamples and definitions.} My former PhD student Fahad Ausaf and I even more recently wrote a paper where we build on their result: we provided a mathematical proof that their algorithm is really correct---the proof Sulzmann \& Lu had originally given contained major flaws. Such correctnessproofs are important: Kuklewicz maintains a unit-test libraryfor the kind of algorithms we are interested in here and he showed that many implementations in the ``wild'' are buggy, that is notsatisfy his unit tests:\begin{center}\url{http://www.haskell.org/haskellwiki/Regex_Posix}\end{center}Coming back to the algorithm by Sulzmann \& Lu, their idea is tointroduce \emph{values} for producing an answer for \emph{how} a regularexpression matches a string. So rather than a boolean like in theBrzozowski algorithm, a value will be the output of the Sulzman \& Lualgorithm whenever the regular expression matches the string. If thestring does not match the string, an error will be raised. Thedefinition for values is given below. They are shown together with theregular expressions $r$ to which they correspond:\begin{center}\begin{tabular}{cc}\begin{tabular}{@{}rrl@{}}\multicolumn{3}{c}{regular expressions}\medskip\\ $r$ & $::=$ & $\ZERO$\\ & $\mid$ & $\ONE$ \\ & $\mid$ & $c$ \\ & $\mid$ & $r_1 \cdot r_2$\\ & $\mid$ & $r_1 + r_2$ \\ \\ & $\mid$ & $r^*$ \\\end{tabular}&\begin{tabular}{@{\hspace{0mm}}rrl@{}}\multicolumn{3}{c}{values}\medskip\\ $v$ & $::=$ & \\ & & $Empty$ \\ & $\mid$ & $Char(c)$ \\ & $\mid$ & $Seq(v_1,v_2)$\\ & $\mid$ & $\Left(v)$ \\ & $\mid$ & $Right(v)$ \\ & $\mid$ & $Stars\,[v_1,\ldots\,v_n]$ \\\end{tabular}\end{tabular}\end{center}\noindent There is no value for the$\ZERO$ regular expression, since it does not match anystring. Otherwise there is exactly one value corresponding toeach regular expression with the exception of $r_1 + r_2$where there are two values, namely $\Left(v)$ and $Right(v)$corresponding to the two alternatives. Note that $r^*$ isassociated with a list of values, one for each copy of $r$that was needed to match the string. This means we might alsoreturn the empty list $Stars\,[]$, if no copy was neededfor $r^*$. For sequence, there is exactly one value, composed of two component values ($v_1$ and $v_2$).My implementation of regular expressions and values in Scala isshown below. I use the convention thatregular expressions are written entirely with upper-caseletters, whereas values start with a single upper-casecharacter and the rest are lower-case letters.{\small\lstinputlisting[language=Scala,numbers=none,linebackgroundcolor= {\ifodd\value{lstnumber}\color{capri!3}\fi}]{../progs/app01.scala}}{\small\lstinputlisting[language=Scala,numbers=none,linebackgroundcolor= {\ifodd\value{lstnumber}\color{capri!3}\fi}]{../progs/app02.scala}}Graphically the algorithm by Sulzmann \& Lu can be illustratedby the picture in Figure~\ref{Sulz} where the path from theleft to the right involving $\der/\nullable$ is the first phaseof the algorithm and $\textit{mkeps}/\inj$, the path from right to left,the second phase. This picture shows the steps required when aregular expression, say $r_1$, matches the string $abc$. Wefirst build the three derivatives (according to $a$, $b$ and$c$). We then use $\nullable$ to find out whether the resultingregular expression can match the empty string. If yes, we callthe function $\textit{mkeps}$. The $\textit{mkeps}$ function calculates a value for how a regularexpression has matched the empty string. Its definitionis as follows:\begin{figure}[t]\begin{center}\begin{tikzpicture}[scale=2,node distance=1.2cm, every node/.style={minimum size=7mm}]\node (r1) {$r_1$};\node (r2) [right=of r1]{$r_2$};\draw[->,line width=1mm](r1)--(r2) node[above,midway] {$\der\,a$};\node (r3) [right=of r2]{$r_3$};\draw[->,line width=1mm](r2)--(r3) node[above,midway] {$\der\,b$};\node (r4) [right=of r3]{$r_4$};\draw[->,line width=1mm](r3)--(r4) node[above,midway] {$\der\,c$};\draw (r4) node[anchor=west] {\;\raisebox{3mm}{$\nullable$}};\node (v4) [below=of r4]{$v_4$};\draw[->,line width=1mm](r4) -- (v4);\node (v3) [left=of v4] {$v_3$};\draw[->,line width=1mm](v4)--(v3) node[below,midway] {$\inj\,c$};\node (v2) [left=of v3]{$v_2$};\draw[->,line width=1mm](v3)--(v2) node[below,midway] {$\inj\,b$};\node (v1) [left=of v2] {$v_1$};\draw[->,line width=1mm](v2)--(v1) node[below,midway] {$\inj\,a$};\draw (r4) node[anchor=north west] {\;\raisebox{-8mm}{$\textit{mkeps}$}};\end{tikzpicture}\end{center}\caption{The two phases of the algorithm by Sulzmann \& Lu.\label{Sulz}}\end{figure}\begin{center}\begin{tabular}{lcl} $\textit{mkeps}(\ONE)$ & $\dn$ & $Empty$\\ $\textit{mkeps}(r_1 + r_2)$ & $\dn$ & if $\nullable(r_1)$ \\ & & then $\Left(\textit{mkeps}(r_1))$\\ & & else $Right(\textit{mkeps}(r_2))$\\ $\textit{mkeps}(r_1 \cdot r_2)$ & $\dn$ & $Seq(\textit{mkeps}(r_1),\textit{mkeps}(r_2))$\\ $\textit{mkeps}(r^*)$ & $\dn$ & $Stars\,[]$ \\\end{tabular}\end{center}\noindent There are no cases for $\ZERO$ and $c$, sincethese regular expression cannot match the empty string. Notealso that in case of alternatives we give preference to theregular expression on the left-hand side. This will becomeimportant later on.The second phase of the algorithm is organised so that it willcalculate a value for how the derivative regular expressionhas matched a string. For this we need a function thatreverses this ``chopping off'' for values which we did in thefirst phase for derivatives. The corresponding function iscalled $\inj$ (short for injection). This function takes threearguments: the first one is a regular expression for which wewant to calculate the value, the second is the character wewant to inject and the third argument is the value where wewill inject the character into. The result of this function is anew value. The definition of $\inj$ is as follows: \begin{center}\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l} $\inj\,(c)\,c\,Empty$ & $\dn$ & $Char\,c$\\ $\inj\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\inj\,r_1\,c\,v)$\\ $\inj\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\inj\,r_2\,c\,v)$\\ $\inj\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$ & $Seq(\inj\,r_1\,c\,v_1,v_2)$\\ $\inj\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$ & $Seq(\inj\,r_1\,c\,v_1,v_2)$\\ $\inj\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$ & $Seq(\textit{mkeps}(r_1),\inj\,r_2\,c\,v)$\\ $\inj\,(r^*)\,c\,Seq(v,Stars\,vs)$ & $\dn$ & $Stars(\inj\,r\,c\,v\,::\,vs)$\\\end{tabular}\end{center}\noindent This definition is by recursion on the regularexpression and by analysing the shape of the values. Thereforethere are three cases for sequence regularexpressions (for all possible shapes of the value we can encounter). The lastclause for the star regular expression returns a list wherethe first element is $\inj\,r\,c\,v$ and the other elements are$vs$. That means $\_\,::\,\_$ should be read as list cons.To understand what is going on, it might be best to do someexample calculations and compare them with Figure~\ref{Sulz}.For this note that we have not yet dealt with the need ofsimplifying regular expressions (this will be a topic on itsown later). Suppose the regular expression is $a \cdot (b\cdot c)$ and the input string is $abc$. The derivatives fromthe first phase are as follows:\begin{center}\begin{tabular}{ll}$r_1$: & $a \cdot (b \cdot c)$\\$r_2$: & $\ONE \cdot (b \cdot c)$\\$r_3$: & $(\ZERO \cdot (b \cdot c)) + (\ONE \cdot c)$\\$r_4$: & $(\ZERO \cdot (b \cdot c)) + ((\ZERO \cdot c) + \ONE)$\\\end{tabular}\end{center}\noindent According to the simple algorithm, we would testwhether $r_4$ is nullable, which in this case it indeed is.This means we can use the function $\textit{mkeps}$ to calculate avalue for how $r_4$ was able to match the empty string.Remember that this function gives preference for alternativeson the left-hand side. However there is only $\ONE$ on thevery right-hand side of $r_4$ (underlined)\begin{center}$r_4$:\;$(\ZERO \cdot (b \cdot c)) + ((\ZERO \cdot c) + \underline{\ONE})$\\\end{center}\noindentthat matches the empty string.Therefore $\textit{mkeps}$ returns the value\begin{center}$v_4:\;Right(Right(Empty))$\end{center}\noindent If there had been a $\ONE$ on the left, then$\textit{mkeps}$ would have returned something of the form$\Left(\ldots)$. The point is that from this value we candirectly read off which part of $r_4$ matched the emptystring: take the right-alternative first, and then theright-alternative again, then you got to the $\ONE$.Next we have to ``inject'' the last character, that is $c$ inthe running example, into this value $v_4$ in order tocalculate how $r_3$ could have matched the string $c$.For this we call injection with $\inj(r_3, c, v_4)$.According to the definition of $\inj$ we obtain\begin{center}$v_3:\;Right(Seq(Empty, Char(c)))$\end{center}\noindent This is the correct result, because $r_3$ needsto use the right-hand alternative, and then $\ONE$ needsto match the empty string and $c$ needs to match $c$.Next we need to inject back the letter $b$ into $v_3$.For this we call injection with $\inj(r_2, b, v_3)$.This gives\begin{center}$v_2:\;Seq(Empty, Seq(Char(b), Char(c)))$\end{center}\noindent which is again the correct result for how $r_2$matched the string $bc$. Finally we need to inject back theletter $a$ into $v_2$ giving the final result.For this we call injection with $\inj(r_1, a, v_2)$and obtain\begin{center}$v_1:\;Seq(Char(a), Seq(Char(b), Char(c)))$\end{center}\noindent This value corresponds to how the regular expression $r_1$,namely $a \cdot (b \cdot c)$, matched the string $abc$.There are a few auxiliary functions that are of interestwhen analysing this algorithm. One is called \emph{flatten},written $|\_|$, which extracts the string ``underlying'' a value. It is defined recursively as\begin{center}\begin{tabular}{lcl} $|Empty|$ & $\dn$ & $[]$\\ $|Char(c)|$ & $\dn$ & $[c]$\\ $|\Left(v)|$ & $\dn$ & $|v|$\\ $|Right(v)|$ & $\dn$ & $|v|$\\ $|Seq(v_1,v_2)|$& $\dn$ & $|v_1| \,@\, |v_2|$\\ $|Stars\,[v_1,\ldots ,v_n]|$ & $\dn$ & $|v_1| \,@\ldots @\, |v_n|$\\\end{tabular}\end{center}\noindent Using flatten we can see what are the strings behind the values calculated by $\textit{mkeps}$ and $\inj$. In our running example:\begin{center}\begin{tabular}{ll}$|v_4|$: & $[]$\\$|v_3|$: & $c$\\$|v_2|$: & $bc$\\$|v_1|$: & $abc$\end{tabular}\end{center}\noindent This indicates that $\inj$ indeed is injecting, oradding, back a character into the value. The definition of $\inj$ might still be very puzzling and each clausemight look arbitrary, but there is in fact a relatively simple ideabehind it. Ultimately the $\inj$-functions is determined by thederivative functions. For this consider one of the ``squares'' fromFigure~\ref{Sulz}: \begin{center} \begin{tikzpicture}[scale=2,node distance=1.2cm, every node/.style={minimum size=7mm}] \node (r) {$r$}; \node (rd) [right=of r]{$r_{der}$}; \draw[->,line width=1mm](r)--(rd) node[above,midway] {$\der\,c$}; \node (vd) [below=of r2]{$v_{der}$}; \draw[->,line width=1mm](rd) -- (vd); \node (v) [left=of vd] {$v$}; \draw[->,line width=1mm](vd)--(v) node[below,midway] {$\inj\,c$}; \draw[->,line width=0.5mm,dotted](r) -- (v) node[left,midway,red] {\bf ?}; \end{tikzpicture} \end{center}\noindentThe input to the $\inj$-function is $r$ (on the top left), $c$ (thecharacter to be injected) and $v_{der}$ (bottom right). The output isthe value $v$ for how the regular expression $r$ matched thecorresponding string. How does $\inj$ calculate this value? Well, it hasto analyse the value $v_{der}$ and transform it into the value $v$ forthe regular expression $r$. The value $v_{der}$ corresponds to the$r_{der}$-regular expression which is the derivative of $r$. Rememberthat $v_{der}$ is the value for how $r_{der}$ matches the corresponding stringwhere $c$ has been chopped off.Let $r$ be $r_1 + r_2$. Then $r_{der}$is of the form $(\der\,c\,r_1) + (\der\,c\,r_2)$. What are the possiblevalues corresponding to $r_{der}$? Well, they can be only of the form$\Left(\ldots)$ and $\Right(\ldots)$. Therefore you have twocases in the $\inj$ function -- one for $\Left$ and one for $\Right$.\begin{center}\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{2mm}}l} $\inj\,(r_1 + r_2)\,c\,\,\Left(v)$ & $\dn$ & $\ldots$\\ $\inj\,(r_1 + r_2)\,c\,\,\Right(v)$ & $\dn$ & $\ldots$\\\end{tabular}\end{center}\noindentLet's look next at the sequence case where $r = r_1 \cdot r_2$. What does thederivative of $r$ look like? According to the definition it is: \begin{center} \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l} $\der\, c\, (r_1 \cdot r_2)$ & $\dn$ & if $\nullable (r_1)$\\ & & then $(\der\,c\,r_1) \cdot r_2 + \der\, c\, r_2$\\ & & else $(\der\, c\, r_1) \cdot r_2$\\ \end{tabular} \end{center} \noindent As you can see there is a derivative in the if-branch and anotherin the else-branch. In the if-branch we have an alternative of the form $\_ + \_$. We already know what the possible values are for such a regular expression, namely $\Left$ or $\Right$. Therefore we have in $\inj$ thetwo cases:\begin{center} \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{2mm}}l} $\inj\,(r_1 \cdot r_2)\,c\,\,\Left(Seq(v_1,v_2))$ & $\dn$ & $\ldots$\\ $\inj\,(r_1 \cdot r_2)\,c\,\,Right(v)$ & $\dn$ & $\ldots$\\\end{tabular}\end{center}\noindentIn the first case we even know that it is not just a $\Left$-value, but $\Left(Seq(\ldots))$, because the corresponding regular expression in the derivation is \mbox{$(\der\,c\,r_1) \cdot r_2$}. Hence we know it must be a $Seq$-value enclosed inside $\Left$. The third clause for $r_1\cdot r_2$ in the $\inj$-function\begin{center} \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l} $\inj\,(r_1 \cdot r_2)\,c\,\,Seq(v_1,v_2)$ & $\dn$ & $\ldots$\\ \end{tabular} \end{center}\noindent corresponds to the else-branch in the derivative. In thiscase we know the derivative is of the form $(\der\,c\,r_1) \cdot r_2$ andtherefore the value must be of the form $Seq(\ldots)$.Hopefully the $inj$-function makes now more sense. I let you explainthe $r^*$ case. What do the derivative of $r^*$ andthe corresponding value look like? Does this explain the shape of the clause?\begin{center} \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l} $\inj\,(r^*)\,c\,\,Seq(v,Stars\,vs)$ & $\dn$ & $Stars(\inj\,r\,c\,v\,::\,vs)$\\ \end{tabular}\end{center}\noindent If yes, you made sense of the left-hand sides ofthe $\inj$-definition. How about the right-hand sides? Well, in the$r^*$ case for example we have according to the square in the picture above a value $v_{der}$ which says how the derivative $r_{der}$matched the string. Since the derivative is of the form$(\der\,c\,r) \cdot (r^*)$ the corresponding value is of theform $Seq(v, Stars\,vs)$. But for $r^*$ we are looking for a valuefor the original (top left) regular expression --- so it cannotbe a value of the shape $Seq(\ldots, Stars\ldots)$ because that is theshape for sequence regular expressions. What we need is a valueof the form $Stars \ldots$ (remember the correspondence between values and regular expressions). This suggests to define the right hand side as\begin{center}$\ldots \dn Stars(\inj\,r\,c\,v\,::\,vs)$\end{center} \noindent It is a value of the right shape, namely $Stars$. It injected $c$ into the first-value, which is in fact the value where we need to undo the derivative. Remember again the shape of the derivative of $r^*$.In place where we chopped off the $c$, we now need to do the $\inj$ of $c$.Therefore $\inj\,r\,c\,v$ in the definition above. That is the same withthe other clauses in $\inj$. Phew\ldots{}Sweat\ldots!\#@$\skull$\%\ldots Unfortunately, there isa gigantic problem with the described algorithm so far: it is veryslow. We need to include in all this the simplification from Lecture2. And what rotten luck: simplification messes things up and we needto rectify the mess. This is what we shall do next.\subsubsection*{Simplification}Generally the matching algorithms based on derivatives dopoorly unless the regular expressions are simplified aftereach derivative step. But this is a bit more involved in thealgorithm of Sulzmann \& Lu. So what follows might require youto read several times before it makes sense and also mightrequire that you do some example calculations yourself. As afirst example consider the last derivation step in our earlierexample:\begin{equation}\label{regexfour}r_4 = \der\,c\,r_3 = (\ZERO \cdot (b \cdot c)) + ((\ZERO \cdot c) + \ONE)\end{equation}\noindent Simplifying this regular expression would just giveus $\ONE$. Running $\textit{mkeps}$ with this $\ONE$ asinput, however, would give us with the value $Empty$ instead of\[Right(Right(Empty))\]\noindentthat was obtained without the simplification. The problem is we needto recreate this more complicated value, rather than just return$Empty$. This will require what I call \emph{rectification functions}.They need to be calculated whenever a regular expression getssimplified.Rectification functions take a value as argument and return a(rectified) value. In the example above the argument would be $Empty$and the output $Right(Right(Empty))$. Before we define theserectification functions in general, let us first take a look again atour simplification rules:\begin{center}\begin{tabular}{l@{\hspace{2mm}}c@{\hspace{2mm}}l}$r \cdot \ZERO$ & $\mapsto$ & $\ZERO$\\ $\ZERO \cdot r$ & $\mapsto$ & $\ZERO$\\ $r \cdot \ONE$ & $\mapsto$ & $r$\\ $\ONE \cdot r$ & $\mapsto$ & $r$\\ $r + \ZERO$ & $\mapsto$ & $r$\\ $\ZERO + r$ & $\mapsto$ & $r$\\ $r + r$ & $\mapsto$ & $r$\\ \end{tabular}\end{center}\noindent Applying them to $r_4$ in \eqref{regexfour} will require several nestedsimplifications in order end up with just $\ONE$. However,it is possible to apply them in a depth-first, or inside-out,manner in order to calculate this simplified regularexpression.The rectification we can implement by letting simp return notjust a (simplified) regular expression, but also arectification function. Let us consider the alternative case,$r_1 + r_2$, first. By going depth-first, we first simplifythe component regular expressions $r_1$ and $r_2.$ This willreturn simplified versions, say $r_{1s}$ and $r_{2s}$, of thecomponent regular expressions (if they can be simplified) butalso two rectification functions $f_{1s}$ and $f_{2s}$. Weneed to assemble them in order to obtain a rectified value for$r_1 + r_2$. In case $r_{1s}$ simplified to $\ZERO$, wecontinue the derivative calculation with $r_{2s}$. TheSulzmann \& Lu algorithm would return a corresponding value,say $v_{2s}$. But now this value needs to be ``rectified'' tothe value \begin{center}$Right(v_{2s})$\end{center}\noindent The reason is that we look for the value that tellsus how $r_1 + r_2$ could have matched the string, not just$r_2$ or $r_{2s}$. Unfortunately, this is still not the rightvalue in general because there might be some simplificationsthat happened inside $r_2$ and for which the simplificationfunction returned also a rectification function $f_{2s}$. So infact we need to apply this one too which gives\begin{center}$Right(f_{2s}(v_{2s}))$\end{center}\noindent This is now the correct, or rectified, value. Sincethe simplification will be done in the first phase of thealgorithm, but the rectification needs to be done to thevalues in the second phase, it is advantageous to calculatethe rectification as a function, remember this function andthen apply the value to this function during the second phase.So if we want to implement the rectification as function, we would need to return\begin{center}$\lambda v.\,Right(f_{2s}(v))$\end{center}\noindent which is the lambda-calculus notation fora function that expects a value $v$ and returns everythingafter the dot where $v$ is replaced by whatever value is given.Let us package this idea with rectification functionsinto a single function (still only considering the alternativecase):\begin{center}\begin{tabular}{ll}$_1$ & $simp(r)$:\\$_2$ & \quad case $r = r_1 + r_2$\\$_3$ & \qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\$_4$ & \qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\$_5$ & \qquad case $r_{1s} = \ZERO$: return $(r_{2s}, \lambda v. \,Right(f_{2s}(v)))$\\$_6$ & \qquad case $r_{2s} = \ZERO$: return $(r_{1s}, \lambda v. \,\Left(f_{1s}(v)))$\\$_7$ & \qquad case $r_{1s} = r_{2s}$: return $(r_{1s}, \lambda v. \,\Left(f_{1s}(v)))$\\$_8$ & \qquad otherwise: return $(r_{1s} + r_{2s}, f_{alt}(f_{1s}, f_{2s}))$\\\end{tabular}\end{center}\noindent We first recursively call the simplification with$r_1$ and $r_2$ (Lines 3 and 4). This gives simplified regular expressions,$r_{1s}$ and $r_{2s}$, as well as two rectification functions$f_{1s}$ and $f_{2s}$. We next need to test whether thesimplified regular expressions are $\ZERO$ so as to makefurther simplifications. In case $r_{1s}$ is $\ZERO$ (Line 5),then we can return $r_{2s}$ (the other alternative). Howeverfor this we need to build a corresponding rectification function, which as said above is\begin{center}$\lambda v.\,Right(f_{2s}(v))$\end{center}\noindent The case where $r_{2s} = \ZERO$ is similar:We return $r_{1s}$ and rectify with $\Left(\_)$ and theother calculated rectification function $f_{1s}$. This gives\begin{center}$\lambda v.\,\Left(f_{1s}(v))$\end{center}\noindent The next case where $r_{1s} = r_{2s}$ can be treatedlike the one where $r_{2s} = \ZERO$. We return $r_{1s}$and rectify with $\Left(\_)$ and so on.The otherwise-case is slightly more complicated. In this caseneither $r_{1s}$ nor $r_{2s}$ are $\ZERO$ and also$r_{1s} \not= r_{2s}$, which means no further simplificationcan be applied. Accordingly, we return $r_{1s} + r_{2s}$ asthe simplified regular expression. In principle we also do nothave to do any rectification, because no simplification wasdone in this case. But this is actually not true: There mighthave been simplifications inside $r_{1s}$ and $r_{2s}$. Wetherefore need to take into account the calculatedrectification functions $f_{1s}$ and $f_{2s}$. We can do thisby defining a rectification function $f_{alt}$ which takes tworectification functions as arguments and applies themaccording to whether the value is of the form $\Left(\_)$ or$Right(\_)$:\begin{center}\begin{tabular}{l@{\hspace{1mm}}l}$f_{alt}(f_1, f_2) \dn$\\\qquad $\lambda v.\,$ case $v = \Left(v')$: & return $\Left(f_1(v'))$\\\qquad \phantom{$\lambda v.\,$} case $v = Right(v')$: & return $Right(f_2(v'))$\\ \end{tabular}\end{center}The other interesting case with simplification is the sequencecase. In this case the main simplification function is asfollows\begin{center}\begin{tabular}{l}$simp(r)$:\hspace{5cm} (continued)\\\quad case $r = r_1 \cdot r_2$\\\qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\\qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\\qquad case $r_{1s} = \ZERO$: return $(\ZERO, f_{error})$\\\qquad case $r_{2s} = \ZERO$: return $(\ZERO, f_{error})$\\\qquad case $r_{1s} = \ONE$: return $(r_{2s}, \lambda v. \,Seq(f_{1s}(Empty), f_{2s}(v)))$\\\qquad case $r_{2s} = \ONE$: return $(r_{1s}, \lambda v. \,Seq(f_{1s}(v), f_{2s}(Empty)))$\\\qquad otherwise: return $(r_{1s} \cdot r_{2s}, f_{seq}(f_{1s}, f_{2s}))$\\\end{tabular}\end{center}\noindent whereby in the last line $f_{seq}$ is again pushingthe two rectification functions into the two components of theSeq-value:\begin{center}\begin{tabular}{l@{\hspace{1mm}}l}$f_{seq}(f_1, f_2) \dn$\\\qquad $\lambda v.\,$ case $v = Seq(v_1, v_2)$: & return $Seq(f_1(v_1), f_2(v_2))$\\\end{tabular}\end{center}\noindent Note that in the case of $r_{1s} = \ZERO$(similarly $r_{2s}$) we use the function $f_{error}$ forrectification. If you think carefully, then you will realisethat this function will actually never been called. This isbecause a sequence with $\ZERO$ will never recognise anystring and therefore the second phase of the algorithm wouldnever been called. The simplification function still expectsus to give a function. So in my own implementation I justreturned a function that raises an error. In the casewhere $r_{1s} = \ONE$ (similarly $r_{2s}$) we haveto create a sequence where the first component is a rectifiedversion of $Empty$. Therefore we call $f_{1s}$ with $Empty$.Since we only simplify regular expressions of the form $r_1 +r_2$ and $r_1 \cdot r_2$ we do not have to do anything elsein the remaining cases. The rectification function willbe just the identity, which in lambda-calculus terms isjust\begin{center}$\lambda v.\,v$\end{center}\noindent This completes the high-level version of the simplificationfunction, which is shown again in Figure~\ref{simp}. The Scala codefor the simplification function is in Figure~\ref{simprect}.\begin{figure}[t]\begin{center}\begin{tabular}{l}$simp(r)$:\\\quad case $r = r_1 + r_2$\\\qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\\qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\\qquad case $r_{1s} = \ZERO$: return $(r_{2s}, \lambda v. \,Right(f_{2s}(v)))$\\\qquad case $r_{2s} = \ZERO$: return $(r_{1s}, \lambda v. \,\Left(f_{1s}(v)))$\\\qquad case $r_{1s} = r_{2s}$: return $(r_{1s}, \lambda v. \,\Left(f_{1s}(v)))$\\\qquad otherwise: return $(r_{1s} + r_{2s}, f_{alt}(f_{1s}, f_{2s}))$ \medskip\\\quad case $r = r_1 \cdot r_2$\\\qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\\qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\\qquad case $r_{1s} = \ZERO$: return $(\ZERO, f_{error})$\\\qquad case $r_{2s} = \ZERO$: return $(\ZERO, f_{error})$\\\qquad case $r_{1s} = \ONE$: return $(r_{2s}, \lambda v. \,Seq(f_{1s}(Empty), f_{2s}(v)))$\\\qquad case $r_{2s} = \ONE$: return $(r_{1s}, \lambda v. \,Seq(f_{1s}(v), f_{2s}(Empty)))$\\\qquad otherwise: return $(r_{1s} \cdot r_{2s}, f_{seq}(f_{1s}, f_{2s}))$ \medskip\\\quad otherwise:\\\qquad return $(r, \lambda v.\,v)$\\\end{tabular}\end{center}\caption{The simplification function that returns a simplified regular expression and a rectification function.\label{simp}}\end{figure}\begin{figure}[p]\small \lstinputlisting[numbers=left,linebackgroundcolor= {\ifodd\value{lstnumber}\color{capri!3}\fi}]{../progs/app61.scala}\caption{The Scala code for the simplification function. Thefirst part defines some auxiliary functions for the rectification.The second part give the simplification function.\label{simprect}}\end{figure}We are now in the position to define a \emph{lexing function} as follows:\begin{center}\begin{tabular}{lcl}$lex\,r\,[]$ & $\dn$ & if $\nullable(r)$ then $\textit{mkeps}(r)$\\ & & else $error$\\$lex\,r\,c\!::\!s$ & $\dn$ & let $(r_{simp}, f_{rect}) = simp(\der(c, r))$\\& & $\inj\,r\,c\,f_{rect}(lex\,r_{simp}\,s)$ \end{tabular}\end{center}\noindent This corresponds to the $matches$ function we haveseen in earlier lectures. In the first clause we are given anempty string, $[]$, and need to test whether the regularexpression is $nullable$. If yes, we can proceed normally andjust return the value calculated by $\textit{mkeps}$. The second clauseis for strings where the first character is $c$, say, and therest of the string is $s$. We first build the derivative of$r$ with respect to $c$; simplify the resulting regularexpression. We continue lexing with the simplified regularexpression and the string $s$. Whatever will be returned asvalue, we sill need to rectify using the $f_{rect}$ from thesimplification and finally inject $c$ back into the(rectified) value.\subsubsection*{Records}Remember we wanted to tokenize input strings, that meanssplitting strings into their ``word'' components. Furthermorewe want to classify each token as being a keyword oridentifier and so on. For this one more feature will berequired, which I call a \emph{record} regular expression.While values encode how a regular expression matches a string,records can be used to ``focus'' on some particular parts ofthe regular expression and ``forget'' about others. Let us look at an example. Suppose you have the regularexpression $a\cdot b + a\cdot c$. Clearly this regular expression can onlyrecognise two strings. But suppose you are not interestedwhether it can recognise $ab$ or $ac$, but rather if itmatched, then what was the last character of the matchedstring\ldots either $b$ or $c$. You can do this by annotatingthe regular expression with a record, written in general$(x:r)$, where $x$ is just an identifier (in my implementationa plain string) and $r$ is a regular expression. A record willbe regarded as a regular expression. The extended definitionin Scala therefore looks as follows:{\small\lstinputlisting[language=Scala, numbers=none,linebackgroundcolor= {\ifodd\value{lstnumber}\color{capri!3}\fi}]{../progs/app03.scala}}\noindent Since we regard records as regular expressions weneed to extend the functions $\nullable$ and $\der$. Similarly$\textit{mkeps}$ and $\inj$ need to be extended. This means we also needto extend the definition of values, which in Scala looks asfollows:{\small\lstinputlisting[language=Scala, numbers=none,linebackgroundcolor= {\ifodd\value{lstnumber}\color{capri!3}\fi}]{../progs/app04.scala}}\noindent Let us now look at the purpose of records moreclosely and let us return to our question whether the stringterminated in a $b$ or $c$. We can do this as follows: weannotate the regular expression $ab + ac$ with a recordas follows\begin{center}$a\cdot (x:b) + a\cdot (x:c)$\end{center}\noindent This regular expression can still only recognisethe strings $ab$ and $ac$, but we can now use a functionthat takes a value and returns all records. I call thisfunction \emph{env} for environment\ldots it builds a listof identifiers associated with a string. This functioncan be defined as follows:\begin{center}\begin{tabular}{lcl} $env(Empty)$ & $\dn$ & $[]$\\ $env(Char(c))$ & $\dn$ & $[]$\\ $env(\Left(v))$ & $\dn$ & $env(v)$\\ $env(Right(v))$ & $\dn$ & $env(v)$\\ $env(Seq(v_1,v_2))$& $\dn$ & $env(v_1) \,@\, env(v_2)$\\ $env(Stars\,[v_1,\ldots ,v_n])$ & $\dn$ & $env(v_1) \,@\ldots @\, env(v_n)$\\ $env(Rec(x:v))$ & $\dn$ & $(x:|v|) :: env(v)$\\\end{tabular}\end{center}\noindent where in the last clause we use the flatten functiondefined earlier. As can be seen, the function $env$ ``picks''out all underlying strings where a record is given. Sincethere can be more than one, the environment will potentiallycontain many ``records''. If we now postprocess the valuecalculated by $lex$ extracting all records using $env$, we cananswer the question whether the last element in the string wasan $b$ or a $c$. Lets see this in action: if we use $a\cdot b+ a\cdot c$ and $ac$ the calculated value will be\begin{center}$Right(Seq(Char(a), Char(c)))$\end{center}\noindent If we use instead $a\cdot (x:b) + a\cdot (x:c)$ anduse the $env$ function to extract the recording for $x$ we obtain\begin{center}$[(x:c)]$\end{center}\noindent If we had given the string $ab$ instead, then therecord would have been $[(x:b)]$. The fun starts if we iterate this. Consider the regular expression \begin{center}$(a\cdot (x:b) + a\cdot (y:c))^*$\end{center}\noindent and the string $ababacabacab$. This string is clearly matched by the regular expression, but we are onlyinterested in the sequence of $b$s and $c$s. Using $env$we obtain\begin{center}$[(x:b), (x:b), (y:c), (x:b), (y:c), (x:b)]$\end{center}\noindent While this feature might look silly, it is in factquite useful. For example if we want to match the name ofan email we might use the regular expression\[(name: [a\mbox{-}z0\mbox{-}9\_\!\_\,.-]^+)\cdot @\cdot (domain: [a\mbox{-}z0\mbox{-}9\,.-]^+)\cdot .\cdot (top\_level: [a\mbox{-}z\,.]^{\{2,6\}})\]\noindent Then if we match the email address\[\texttt{christian.urban@kcl.ac.uk}\]\noindent we can use the $env$ function and find outwhat the name, domain and top-level part of the emailaddress are:\begin{center}$[(name:\texttt{christian.urban}), (domain:\texttt{kcl}), (top\_level:\texttt{ac.uk})]$\end{center}Recall that we want to lex a little programming language,called the \emph{While}-language. A simple program in thislanguage is shown in Figure~\ref{while}. The main keywords inthis language are \pcode{while}, \pcode{if}, \pcode{then} and\pcode{else}. As usual we have syntactic categories foridentifiers, operators, numbers and so on. For this we wouldneed to design the corresponding regular expressions torecognise these syntactic categories. I let you do this designtask. Having these regular expressions at our disposal, we canform the regular expression\begin{figure}[t]\begin{center}\mbox{\lstinputlisting[language=while]{../progs/fib.while}}\end{center}\caption{The Fibonacci program in the While-language.\label{while}}\end{figure}\begin{center}$\textit{WhileRegs} \;\dn\; \left(\begin{array}{l} (k, KeyWords)\; +\\ (i, Ids)\;+\\ (o, Ops)\;+ \\ (n, Nums)\;+ \\ (s, Semis)\;+ \\ (p, (LParens + RParens))\;+\\ (b, (Begin + End))\;+ \\ (w, WhiteSpaces) \end{array}\right)^{\mbox{\LARGE{}*}}$\end{center}\noindent and ask the algorithm by Sulzmann \& Lu to lex, saythe following string\begin{center}\large\code{"if true then then 42 else +"}\end{center}\noindentBy using the records and extracting the environment, the result is the following list:\begin{center}\tt\begin{tabular}{l}(k, if),\\ (w, " "),\\ (i, true),\\ (w, " "),\\ (k, then),\\ (w, " "),\\ (k, then),\\ (w, " "),\\ (n, 42),\\ (w, " "),\\ (k, else),\\ (w, " "),\\ (o, +)\end{tabular}\end{center}\noindent Typically we are not interested in the whitespaces and comments and would filter them out: this gives\begin{center}\tt\begin{tabular}{l} (k, if),\\ (i, true),\\ (k, then),\\ (k, then),\\ (n, 42),\\ (k, else),\\ (o, +)\end{tabular}\end{center}\noindentwhich will be the input for the next phase of our compiler.\end{document}%%% Local Variables: %%% mode: latex%%% TeX-master: t%%% End: