\documentclass{article}\usepackage{../style}\usepackage{../langs}\usepackage{../graphics}\begin{document}\fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016}\section*{Handout 4 (Sulzmann \& Lu Algorithm)}So far our algorithm based on derivatives was only able to sayyes or no depending on whether a string was matched by a regularexpression or not. Often a more interesting question is tofind out \emph{how} a regular expression matched a string?Answering this question will also help us with the problem weare after, namely tokenising an input string. The algorithm we will be looking at in this lecture was designed by Sulzmann\& Lu in a rather recent research paper (from 2014). A link to it isprovided on KEATS, in case you are interested.\footnote{In myhumble opinion this is an interesting instance of the researchliterature: it contains a very neat idea, but its presentationis rather sloppy. In earlier versions of this paper, a King'sundergraduate student and I found several rather annoying typos in theexamples and definitions.} My PhD student Fahad Ausaf and I even more recently wrote a paper where we build on their result: we provided a mathematical proof that their algorithm is really correct---the proof Sulzmann \& Lu had originally given contained major flaws. Such correctnessproofs are important: Kuklewicz maintains a unit-test libraryfor the kind of algorithma we are interested in here and he showed that many implementations in the ``wild'' are buggy, that is notsatisfy his unit tests:\begin{center}\url{http://www.haskell.org/haskellwiki/Regex_Posix}\end{center}In order to give an answer for\emph{how} a regular expression matches a string, Sulzmann andLu use \emph{values}. A value will be the output of thealgorithm whenever the regular expression matches the string.If the string does not match the string, an error will beraised. The definition for values is given below. They are shown together with the regular expressions $r$ to whichthey correspond:\begin{center}\begin{tabular}{cc}\begin{tabular}{@{}rrl@{}}\multicolumn{3}{c}{regular expressions}\medskip\\ $r$ & $::=$ & $\ZERO$\\ & $\mid$ & $\ONE$ \\ & $\mid$ & $c$ \\ & $\mid$ & $r_1 \cdot r_2$\\ & $\mid$ & $r_1 + r_2$ \\ \\ & $\mid$ & $r^*$ \\\end{tabular}&\begin{tabular}{@{\hspace{0mm}}rrl@{}}\multicolumn{3}{c}{values}\medskip\\ $v$ & $::=$ & \\ & & $Empty$ \\ & $\mid$ & $Char(c)$ \\ & $\mid$ & $Seq(v_1,v_2)$\\ & $\mid$ & $\Left(v)$ \\ & $\mid$ & $Right(v)$ \\ & $\mid$ & $Stars\,[v_1,\ldots\,v_n]$ \\\end{tabular}\end{tabular}\end{center}\noindent There is no value for the$\ZERO$ regular expression, since it does not match anystring. Otherwise there is exactly one value corresponding toeach regular expression with the exception of $r_1 + r_2$where there are two values, namely $\Left(v)$ and $Right(v)$corresponding to the two alternatives. Note that $r^*$ isassociated with a list of values, one for each copy of $r$that was needed to match the string. This means we might alsoreturn the empty list $Stars []$, if no copy was needed in caseof $r^*$. For sequence, there is exactly one value, composed of two component values ($v_1$ and $v_2$).My implementation of regular expressions and values in Scala isshown below. I have in my implementation the convention thatregular expressions are written entirely with upper-caseletters, while values just start with a single upper-casecharacter and the rest are lower-case letters.{\small\lstinputlisting[language=Scala,numbers=none,linebackgroundcolor= {\ifodd\value{lstnumber}\color{capri!3}\fi}]{../progs/app01.scala}}{\small\lstinputlisting[language=Scala,numbers=none,linebackgroundcolor= {\ifodd\value{lstnumber}\color{capri!3}\fi}]{../progs/app02.scala}}Graphically the algorithm by Sulzmann \& Lu can be illustratedby the picture in Figure~\ref{Sulz} where the path from theleft to the right involving $\textit{der}/\textit{nullable}$ is the first phaseof the algorithm and $\textit{mkeps}/\textit{inj}$, the path from right to left,the second phase. This picture shows the steps required when aregular expression, say $r_1$, matches the string $abc$. Wefirst build the three derivatives (according to $a$, $b$ and$c$). We then use $\textit{nullable}$ to find out whether the resultingregular expression can match the empty string. If yes, we callthe function $\textit{mkeps}$. The $\textit{mkeps}$ function calculates a value for how a regularexpression has matched the empty string. Its definitionis as follows:\begin{figure}[t]\begin{center}\begin{tikzpicture}[scale=2,node distance=1.2cm, every node/.style={minimum size=7mm}]\node (r1) {$r_1$};\node (r2) [right=of r1]{$r_2$};\draw[->,line width=1mm](r1)--(r2) node[above,midway] {$\textit{der}\,a$};\node (r3) [right=of r2]{$r_3$};\draw[->,line width=1mm](r2)--(r3) node[above,midway] {$\textit{der}\,b$};\node (r4) [right=of r3]{$r_4$};\draw[->,line width=1mm](r3)--(r4) node[above,midway] {$\textit{der}\,c$};\draw (r4) node[anchor=west] {\;\raisebox{3mm}{$\textit{nullable}$}};\node (v4) [below=of r4]{$v_4$};\draw[->,line width=1mm](r4) -- (v4);\node (v3) [left=of v4] {$v_3$};\draw[->,line width=1mm](v4)--(v3) node[below,midway] {$\textit{inj}\,c$};\node (v2) [left=of v3]{$v_2$};\draw[->,line width=1mm](v3)--(v2) node[below,midway] {$\textit{inj}\,b$};\node (v1) [left=of v2] {$v_1$};\draw[->,line width=1mm](v2)--(v1) node[below,midway] {$\textit{inj}\,a$};\draw (r4) node[anchor=north west] {\;\raisebox{-8mm}{$\textit{mkeps}$}};\end{tikzpicture}\end{center}\caption{The two phases of the algorithm by Sulzmann \& Lu.\label{Sulz}}\end{figure}\begin{center}\begin{tabular}{lcl} $\textit{mkeps}(\ONE)$ & $\dn$ & $Empty$\\ $\textit{mkeps}(r_1 + r_2)$ & $\dn$ & if $\textit{nullable}(r_1)$ \\ & & then $\Left(\textit{mkeps}(r_1))$\\ & & else $Right(\textit{mkeps}(r_2))$\\ $\textit{mkeps}(r_1 \cdot r_2)$ & $\dn$ & $Seq(\textit{mkeps}(r_1),\textit{mkeps}(r_2))$\\ $\textit{mkeps}(r^*)$ & $\dn$ & $Stars\,[]$ \\\end{tabular}\end{center}\noindent There is are no cases for $\ZERO$ and $c$, sincethese regular expression cannot match the empty string. Notealso that in case of alternatives we give preference to theregular expression on the left-hand side. This will becomeimportant later on.The second phase of the algorithm is organised so that it willcalculate a value for how the derivative regular expressionhas matched a string. For this we need a function thatreverses this ``chopping off'' for values which we did in thefirst phase for derivatives. The corresponding function iscalled $\textit{inj}$ (short for injection). This function takes threearguments: the first one is a regular expression for which wewant to calculate the value, the second is the character wewant to inject and the third argument is the value where wewill inject the character into. The result of this function is anew value. The definition of $\textit{inj}$ is as follows: \begin{center}\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l} $\textit{inj}\,(c)\,c\,Empty$ & $\dn$ & $Char\,c$\\ $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\ $\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\ $\textit{inj}\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$ & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\ $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$ & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\ $\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$ & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\ $\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$ & $\dn$ & $Stars(\textit{inj}\,r\,c\,v\,::\,vs)$\\\end{tabular}\end{center}\noindent This definition is by recursion on the regularexpression and by analysing the shape of the values. Thereforethere are three cases for sequence regularexpressions (for all possible shapes of the value we can encounter). The lastclause for the star regular expression returns a list wherethe first element is $\textit{inj}\,r\,c\,v$ and the other elements are$vs$. That means $\_\,::\,\_$ should be read as list cons.To understand what is going on, it might be best to do someexample calculations and compare them with Figure~\ref{Sulz}.For this note that we have not yet dealt with the need ofsimplifying regular expressions (this will be a topic on itsown later). Suppose the regular expression is $a \cdot (b\cdot c)$ and the input string is $abc$. The derivatives fromthe first phase are as follows:\begin{center}\begin{tabular}{ll}$r_1$: & $a \cdot (b \cdot c)$\\$r_2$: & $\ONE \cdot (b \cdot c)$\\$r_3$: & $(\ZERO \cdot (b \cdot c)) + (\ONE \cdot c)$\\$r_4$: & $(\ZERO \cdot (b \cdot c)) + ((\ZERO \cdot c) + \ONE)$\\\end{tabular}\end{center}\noindent According to the simple algorithm, we would testwhether $r_4$ is nullable, which in this case it indeed is.This means we can use the function $\textit{mkeps}$ to calculate avalue for how $r_4$ was able to match the empty string.Remember that this function gives preference for alternativeson the left-hand side. However there is only $\ONE$ on thevery right-hand side of $r_4$ (underlined)\begin{center}$r_4$:\;$(\ZERO \cdot (b \cdot c)) + ((\ZERO \cdot c) + \underline{\ONE})$\\\end{center}\noindentthat matches the empty string.Therefore $\textit{mkeps}$ returns the value\begin{center}$v_4:\;Right(Right(Empty))$\end{center}\noindent If there had been a $\ONE$ on the left, then$\textit{mkeps}$ would have returned something of the form$\Left(\ldots)$. The point is that from this value we candirectly read off which part of $r_4$ matched the emptystring: take the right-alternative first, and then theright-alternative again. Next we have to ``inject'' the last character, that is $c$ inthe running example, into this value $v_4$ in order tocalculate how $r_3$ could have matched the string $c$.According to the definition of $\textit{inj}$ we obtain\begin{center}$v_3:\;Right(Seq(Empty, Char(c)))$\end{center}\noindent This is the correct result, because $r_3$ needsto use the right-hand alternative, and then $\ONE$ needsto match the empty string and $c$ needs to match $c$.Next we need to inject back the letter $b$ into $v_3$. Thisgives\begin{center}$v_2:\;Seq(Empty, Seq(Char(b), Char(c)))$\end{center}\noindent which is again the correct result for how $r_2$matched the string $bc$. Finally we need to inject back theletter $a$ into $v_2$ giving the final result\begin{center}$v_1:\;Seq(Char(a), Seq(Char(b), Char(c)))$\end{center}\noindent This now corresponds to how the regularexpression $a \cdot (b \cdot c)$ matched the string $abc$.There are a few auxiliary functions that are of interestwhen analysing this algorithm. One is called \emph{flatten},written $|\_|$, which extracts the string ``underlying'' a value. It is defined recursively as\begin{center}\begin{tabular}{lcl} $|Empty|$ & $\dn$ & $[]$\\ $|Char(c)|$ & $\dn$ & $[c]$\\ $|\Left(v)|$ & $\dn$ & $|v|$\\ $|Right(v)|$ & $\dn$ & $|v|$\\ $|Seq(v_1,v_2)|$& $\dn$ & $|v_1| \,@\, |v_2|$\\ $|Stars\,[v_1,\ldots ,v_n]|$ & $\dn$ & $|v_1| \,@\ldots @\, |v_n|$\\\end{tabular}\end{center}\noindent Using flatten we can see what are the strings behind the values calculated by $\textit{mkeps}$ and $\textit{inj}$. In our running example:\begin{center}\begin{tabular}{ll}$|v_4|$: & $[]$\\$|v_3|$: & $c$\\$|v_2|$: & $bc$\\$|v_1|$: & $abc$\end{tabular}\end{center}\noindent This indicates that $\textit{inj}$ indeed is injecting, oradding, back a character into the value. There is a problem, however, with the described algorithmso far: it is very slow. We need to include the simplification from Lecture 2. This is what we shall do next.\subsubsection*{Simplification}Generally the matching algorithms based on derivatives dopoorly unless the regular expressions are simplified aftereach derivative step. But this is a bit more involved in thealgorithm of Sulzmann \& Lu. So what follows might require youto read several times before it makes sense and also mightrequire that you do some example calculations yourself. As afirst example consider the last derivation step in our earlierexample:\begin{equation}\label{regexfour}r_4 = \textit{der}\,c\,r_3 = (\ZERO \cdot (b \cdot c)) + ((\ZERO \cdot c) + \ONE)\end{equation}\noindent Simplifying this regular expression would just giveus $\ONE$. Running $\textit{mkeps}$ with this $\ONE$ asinput, however, would give us with the value $Empty$ instead of\[Right(Right(Empty))\]\noindentthat was obtained without the simplification. The problem is we needto recreate this more complicated value, rather than just return$Empty$. This will require what I call \emph{rectification functions}.They need to be calculated whenever a regular expression getssimplified.Rectification functions take a value as argument and return a(rectified) value. In the example above the argument would be $Empty$and the output $Right(Right(Empty))$. Before we define theserectification functions in general, let us first take a look again atour simplification rules:\begin{center}\begin{tabular}{l@{\hspace{2mm}}c@{\hspace{2mm}}l}$r \cdot \ZERO$ & $\mapsto$ & $\ZERO$\\ $\ZERO \cdot r$ & $\mapsto$ & $\ZERO$\\ $r \cdot \ONE$ & $\mapsto$ & $r$\\ $\ONE \cdot r$ & $\mapsto$ & $r$\\ $r + \ZERO$ & $\mapsto$ & $r$\\ $\ZERO + r$ & $\mapsto$ & $r$\\ $r + r$ & $\mapsto$ & $r$\\ \end{tabular}\end{center}\noindent Applying them to $r_4$ in \eqref{regexfour} will require several nestedsimplifications in order end up with just $\ONE$. However,it is possible to apply them in a depth-first, or inside-out,manner in order to calculate this simplified regularexpression.The rectification we can implement by letting simp return notjust a (simplified) regular expression, but also arectification function. Let us consider the alternative case,$r_1 + r_2$, first. By going depth-first, we first simplifythe component regular expressions $r_1$ and $r_2.$ This willreturn simplified versions, say $r_{1s}$ and $r_{2s}$, of thecomponent regular expressions (if they can be simplified) butalso two rectification functions $f_{1s}$ and $f_{2s}$. Weneed to assemble them in order to obtain a rectified value for$r_1 + r_2$. In case $r_{1s}$ simplified to $\ZERO$, wecontinue the derivative calculation with $r_{2s}$. TheSulzmann \& Lu algorithm would return a corresponding value,say $v_{2s}$. But now this value needs to be ``rectified'' tothe value \begin{center}$Right(v_{2s})$\end{center}\noindent The reason is that we look for the value that tellsus how $r_1 + r_2$ could have matched the string, not just$r_2$ or $r_{2s}$. Unfortunately, this is still not the rightvalue in general because there might be some simplificationsthat happened inside $r_2$ and for which the simplificationfunction retuned also a rectification function $f_{2s}$. So infact we need to apply this one too which gives\begin{center}$Right(f_{2s}(v_{2s}))$\end{center}\noindent This is now the correct, or rectified, value. Sincethe simplification will be done in the first phase of thealgorithm, but the rectification needs to be done to thevalues in the second phase, it is advantageous to calculatethe rectification as a function, remember this function andthen apply the value to this function during the second phase.So if we want to implement the rectification as function, we would need to return\begin{center}$\lambda v.\,Right(f_{2s}(v))$\end{center}\noindent which is the lambda-calculus notation fora function that expects a value $v$ and returns everythingafter the dot where $v$ is replaced by whatever value is given.Let us package this idea with rectification functionsinto a single function (still only considering the alternativecase):\begin{center}\begin{tabular}{ll}$_1$ & $simp(r)$:\\$_2$ & \quad case $r = r_1 + r_2$\\$_3$ & \qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\$_4$ & \qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\$_5$ & \qquad case $r_{1s} = \ZERO$: return $(r_{2s}, \lambda v. \,Right(f_{2s}(v)))$\\$_6$ & \qquad case $r_{2s} = \ZERO$: return $(r_{1s}, \lambda v. \,\Left(f_{1s}(v)))$\\$_7$ & \qquad case $r_{1s} = r_{2s}$: return $(r_{1s}, \lambda v. \,\Left(f_{1s}(v)))$\\$_8$ & \qquad otherwise: return $(r_{1s} + r_{2s}, f_{alt}(f_{1s}, f_{2s}))$\\\end{tabular}\end{center}\noindent We first recursively call the simplification with$r_1$ and $r_2$ (Lines 3 and 4). This gives simplified regular expressions,$r_{1s}$ and $r_{2s}$, as well as two rectification functions$f_{1s}$ and $f_{2s}$. We next need to test whether thesimplified regular expressions are $\ZERO$ so as to makefurther simplifications. In case $r_{1s}$ is $\ZERO$ (Line 5),then we can return $r_{2s}$ (the other alternative). Howeverfor this we need to build a corresponding rectification function, which as said above is\begin{center}$\lambda v.\,Right(f_{2s}(v))$\end{center}\noindent The case where $r_{2s} = \ZERO$ is similar:We return $r_{1s}$ and rectify with $\Left(\_)$ and theother calculated rectification function $f_{1s}$. This gives\begin{center}$\lambda v.\,\Left(f_{1s}(v))$\end{center}\noindent The next case where $r_{1s} = r_{2s}$ can be treatedlike the one where $r_{2s} = \ZERO$. We return $r_{1s}$and rectify with $\Left(\_)$ and so on.The otherwise-case is slightly more complicated. In this caseneither $r_{1s}$ nor $r_{2s}$ are $\ZERO$ and also$r_{1s} \not= r_{2s}$, which means no further simplificationcan be applied. Accordingly, we return $r_{1s} + r_{2s}$ asthe simplified regular expression. In principle we also do nothave to do any rectification, because no simplification wasdone in this case. But this is actually not true: There mighthave been simplifications inside $r_{1s}$ and $r_{2s}$. Wetherefore need to take into account the calculatedrectification functions $f_{1s}$ and $f_{2s}$. We can do thisby defining a rectification function $f_{alt}$ which takes tworectification functions as arguments and applies themaccording to whether the value is of the form $\Left(\_)$ or$Right(\_)$:\begin{center}\begin{tabular}{l@{\hspace{1mm}}l}$f_{alt}(f_1, f_2) \dn$\\\qquad $\lambda v.\,$ case $v = \Left(v')$: & return $\Left(f_1(v'))$\\\qquad \phantom{$\lambda v.\,$} case $v = Right(v')$: & return $Right(f_2(v'))$\\ \end{tabular}\end{center}The other interesting case with simplification is the sequencecase. In this case the main simplification function is asfollows\begin{center}\begin{tabular}{l}$simp(r)$:\hspace{5cm} (continued)\\\quad case $r = r_1 \cdot r_2$\\\qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\\qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\\qquad case $r_{1s} = \ZERO$: return $(\ZERO, f_{error})$\\\qquad case $r_{2s} = \ZERO$: return $(\ZERO, f_{error})$\\\qquad case $r_{1s} = \ONE$: return $(r_{2s}, \lambda v. \,Seq(f_{1s}(Empty), f_{2s}(v)))$\\\qquad case $r_{2s} = \ONE$: return $(r_{1s}, \lambda v. \,Seq(f_{1s}(v), f_{2s}(Empty)))$\\\qquad otherwise: return $(r_{1s} \cdot r_{2s}, f_{seq}(f_{1s}, f_{2s}))$\\\end{tabular}\end{center}\noindent whereby in the last line $f_{seq}$ is again pushingthe two rectification functions into the two components of theSeq-value:\begin{center}\begin{tabular}{l@{\hspace{1mm}}l}$f_{seq}(f_1, f_2) \dn$\\\qquad $\lambda v.\,$ case $v = Seq(v_1, v_2)$: & return $Seq(f_1(v_1), f_2(v_2))$\\\end{tabular}\end{center}\noindent Note that in the case of $r_{1s} = \ZERO$(similarly $r_{2s}$) we use the function $f_{error}$ forrectification. If you think carefully, then you will realisethat this function will actually never been called. This isbecause a sequence with $\ZERO$ will never recognise anystring and therefore the second phase of the algorithm wouldnever been called. The simplification function still expectsus to give a function. So in my own implementation I justreturned a function that raises an error. In the casewhere $r_{1s} = \ONE$ (similarly $r_{2s}$) we haveto create a sequence where the first component is a rectifiedversion of $Empty$. Therefore we call $f_{1s}$ with $Empty$.Since we only simplify regular expressions of the form $r_1 +r_2$ and $r_1 \cdot r_2$ we do not have to do anything elsein the remaining cases. The rectification function willbe just the identity, which in lambda-calculus terms isjust\begin{center}$\lambda v.\,v$\end{center}\noindent This completes the high-level version of the simplificationfunction, which is shown again in Figure~\ref{simp}. The Scala codefor the simplification function is in Figure~\ref{simprect}.\begin{figure}[t]\begin{center}\begin{tabular}{l}$simp(r)$:\\\quad case $r = r_1 + r_2$\\\qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\\qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\\qquad case $r_{1s} = \ZERO$: return $(r_{2s}, \lambda v. \,Right(f_{2s}(v)))$\\\qquad case $r_{2s} = \ZERO$: return $(r_{1s}, \lambda v. \,\Left(f_{1s}(v)))$\\\qquad case $r_{1s} = r_{2s}$: return $(r_{1s}, \lambda v. \,\Left(f_{1s}(v)))$\\\qquad otherwise: return $(r_{1s} + r_{2s}, f_{alt}(f_{1s}, f_{2s}))$ \medskip\\\quad case $r = r_1 \cdot r_2$\\\qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\\qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\\qquad case $r_{1s} = \ZERO$: return $(\ZERO, f_{error})$\\\qquad case $r_{2s} = \ZERO$: return $(\ZERO, f_{error})$\\\qquad case $r_{1s} = \ONE$: return $(r_{2s}, \lambda v. \,Seq(f_{1s}(Empty), f_{2s}(v)))$\\\qquad case $r_{2s} = \ONE$: return $(r_{1s}, \lambda v. \,Seq(f_{1s}(v), f_{2s}(Empty)))$\\\qquad otherwise: return $(r_{1s} \cdot r_{2s}, f_{seq}(f_{1s}, f_{2s}))$ \medskip\\\quad otherwise:\\\qquad return $(r, \lambda v.\,v)$\\\end{tabular}\end{center}\caption{The simplification function that returns a simplified regular expression and a rectification function.\label{simp}}\end{figure}\begin{figure}[p]\small \lstinputlisting[numbers=left,linebackgroundcolor= {\ifodd\value{lstnumber}\color{capri!3}\fi}]{../progs/app61.scala}\caption{The Scala code for the simplification function. Thefirst part defines some auxillary functions for the rectification.The second part give the simplification function.\label{simprect}}\end{figure}We are now in the position to define a \emph{lexing function} as follows:\begin{center}\begin{tabular}{lcl}$lex\,r\,[]$ & $\dn$ & if $\textit{nullable}(r)$ then $\textit{mkeps}(r)$\\ & & else $error$\\$lex\,r\,c\!::\!s$ & $\dn$ & let $(r_{simp}, f_{rect}) = simp(\textit{der}(c, r))$\\& & $\textit{inj}\,r\,c\,f_{rect}(lex\,r_{simp}\,s)$ \end{tabular}\end{center}\noindent This corresponds to the $matches$ function we haveseen in earlier lectures. In the first clause we are given anempty string, $[]$, and need to test wether the regularexpression is $nullable$. If yes, we can proceed normally andjust return the value calculated by $\textit{mkeps}$. The second clauseis for strings where the first character is $c$, say, and therest of the string is $s$. We first build the derivative of$r$ with respect to $c$; simplify the resulting regularexpression. We continue lexing with the simplified regularexpression and the string $s$. Whatever will be returned asvalue, we sill need to rectify using the $f_{rect}$ from thesimplification and finally inject $c$ back into the(rectified) value.\subsubsection*{Records}Remember we wanted to tokenize input strings, that meanssplitting strings into their ``word'' components. Furthermorewe want to classify each token as being a keyword oridentifier and so on. For this one more feature will berequired, which I call a \emph{record} regular expression.While values encode how a regular expression matches a string,records can be used to ``focus'' on some particular parts ofthe regular expression and ``forget'' about others. Let us look at an example. Suppose you have the regularexpression $a\cdot b + a\cdot c$. Clearly this regular expression can onlyrecognise two strings. But suppose you are not interestedwhether it can recognise $ab$ or $ac$, but rather if itmatched, then what was the last character of the matchedstring\ldots either $b$ or $c$. You can do this by annotatingthe regular expression with a record, written in general$(x:r)$, where $x$ is just an identifier (in my implementationa plain string) and $r$ is a regular expression. A record willbe regarded as a regular expression. The extended definitionin Scala therefore looks as follows:{\small\lstinputlisting[language=Scala, numbers=none,linebackgroundcolor= {\ifodd\value{lstnumber}\color{capri!3}\fi}]{../progs/app03.scala}}\noindent Since we regard records as regular expressions weneed to extend the functions $\textit{nullable}$ and $\textit{der}$. Similarly$\textit{mkeps}$ and $\textit{inj}$ need to be extended. This means we also needto extend the definition of values, which in Scala looks asfollows:{\small\lstinputlisting[language=Scala, numbers=none,linebackgroundcolor= {\ifodd\value{lstnumber}\color{capri!3}\fi}]{../progs/app04.scala}}\noindent Let us now look at the purpose of records moreclosely and let us return to our question whether the stringterminated in a $b$ or $c$. We can do this as follows: weannotate the regular expression $ab + ac$ with a recordas follows\begin{center}$a\cdot (x:b) + a\cdot (x:c)$\end{center}\noindent This regular expression can still only recognisethe strings $ab$ and $ac$, but we can now use a functionthat takes a value and returns all records. I call thisfunction \emph{env} for environment\ldots it builds a listof identifiers associated with a string. This functioncan be defined as follows:\begin{center}\begin{tabular}{lcl} $env(Empty)$ & $\dn$ & $[]$\\ $env(Char(c))$ & $\dn$ & $[]$\\ $env(\Left(v))$ & $\dn$ & $env(v)$\\ $env(Right(v))$ & $\dn$ & $env(v)$\\ $env(Seq(v_1,v_2))$& $\dn$ & $env(v_1) \,@\, env(v_2)$\\ $env(Stars\,[v_1,\ldots ,v_n])$ & $\dn$ & $env(v_1) \,@\ldots @\, env(v_n)$\\ $env(Rec(x:v))$ & $\dn$ & $(x:|v|) :: env(v)$\\\end{tabular}\end{center}\noindent where in the last clause we use the flatten functiondefined earlier. As can be seen, the function $env$ ``picks''out all underlying strings where a record is given. Sincethere can be more than one, the environment will potentiallycontain many ``records''. If we now postprocess the valuecalculated by $lex$ extracting all records using $env$, we cananswer the question whether the last element in the string wasan $b$ or a $c$. Lets see this in action: if we use $a\cdot b+ a\cdot c$ and $ac$ the calculated value will be\begin{center}$Right(Seq(Char(a), Char(c)))$\end{center}\noindent If we use instead $a\cdot (x:b) + a\cdot (x:c)$ anduse the $env$ function to extract the recording for $x$ we obtain\begin{center}$[(x:c)]$\end{center}\noindent If we had given the string $ab$ instead, then therecord would have been $[(x:b)]$. The fun starts if we iterate this. Consider the regular expression \begin{center}$(a\cdot (x:b) + a\cdot (y:c))^*$\end{center}\noindent and the string $ababacabacab$. This string is clearly matched by the regular expression, but we are onlyinterested in the sequence of $b$s and $c$s. Using $env$we obtain\begin{center}$[(x:b), (x:b), (y:c), (x:b), (y:c), (x:b)]$\end{center}\noindent While this feature might look silly, it is in factquite useful. For example if we want to match the name ofan email we might use the regular expression\[(name: [a\mbox{-}z0\mbox{-}9\_\!\_\,.-]^+)\cdot @\cdot (domain: [a\mbox{-}z0\mbox{-}9\,.-]^+)\cdot .\cdot (top\_level: [a\mbox{-}z\,.]^{\{2,6\}})\]\noindent Then if we match the email address\[\texttt{christian.urban@kcl.ac.uk}\]\noindent we can use the $env$ function and find outwhat the name, domain and top-level part of the emailaddress are:\begin{center}$[(name:\texttt{christian.urban}), (domain:\texttt{kcl}), (top\_level:\texttt{ac.uk})]$\end{center}Recall that we want to lex a little programming language,called the \emph{While}-language. A simple program in thislanguage is shown in Figure~\ref{while}. The main keywords inthis language are \pcode{while}, \pcode{if}, \pcode{then} and\pcode{else}. As usual we have syntactic categories foridentifiers, operators, numbers and so on. For this we wouldneed to design the corresponding regular expressions torecognise these syntactic categories. I let you do this designtask. Having these regular expressions at our disposal, we canform the regular expression\begin{figure}[t]\begin{center}\mbox{\lstinputlisting[language=while]{../progs/fib.while}}\end{center}\caption{The Fibonacci program in the While-language.\label{while}}\end{figure}\begin{center}$\textit{WhileRegs} \;\dn\; \left(\begin{array}{l} (k, KeyWords)\; +\\ (i, Ids)\;+\\ (o, Ops)\;+ \\ (n, Nums)\;+ \\ (s, Semis)\;+ \\ (p, (LParens + RParens))\;+\\ (b, (Begin + End))\;+ \\ (w, WhiteSpacess) \end{array}\right)^{\mbox{\LARGE{}*}}$\end{center}\noindent and ask the algorithm by Sulzmann \& Lu to lex, saythe following string\begin{center}\large\code{"if true then then 42 else +"}\end{center}\noindentBy using the records and extracting the environment, the result is the following list:\begin{center}\tt\begin{tabular}{l}KEYWORD(if),\\ WHITESPACE,\\ IDENT(true),\\ WHITESPACE,\\ KEYWORD(then),\\ WHITESPACE,\\ KEYWORD(then),\\ WHITESPACE,\\ NUM(42),\\ WHITESPACE,\\ KEYWORD(else),\\ WHITESPACE,\\ OP(+)\end{tabular}\end{center}\noindent Typically we are not interested in the whitespaces and comments and would filter them out: this gives\begin{center}\tt\begin{tabular}{l}KEYWORD(if),\\ IDENT(true),\\ KEYWORD(then),\\ KEYWORD(then),\\ NUM(42),\\ KEYWORD(else),\\ OP(+)\end{tabular}\end{center}\noindentwhich will be the input for the next phase of our compiler.\end{document}%%% Local Variables: %%% mode: latex%%% TeX-master: t%%% End: