\documentclass{article}\usepackage{../style}\usepackage{../langs}\usepackage{../graphics}\begin{document}\section*{Handout 4 (Sulzmann \& Lu Algorithm)}So far our algorithm based on derivatives was only able to sayyes or no depending on whether a string was matched by regularexpression or not. Often a more interesting question is tofind out \emph{how} a regular expression matched a string?Answering this question will also help us with the problem weare after, namely tokenising an input string. The algorithm wewill be looking at for this was designed by Sulzmann \& Lu ina rather recent paper. A link to it is provided on KEATS, incase you are interested.\footnote{In my humble opinion this isan interesting instance of the research literature: itcontains a very neat idea, but its presentation is rathersloppy. In earlier versions of their paper, students and Ifound several rather annoying typos in their examples anddefinitions.}In order to give an answer for how a regular expressionmatched a string, Sulzmann and Lu introduce \emph{values}. Avalue will be the output of the algorithm whenever the regularexpression matches the string. If not, an error will beraised. Since the first phase of the algorithm by Sulzmann \&Lu is identical to the derivative based matcher from the firstcoursework, the function $nullable$ will be used to decidewhether as string is matched by a regular expression. If$nullable$ says yes, then values are constructed that reflecthow the regular expression matched the string. The definitionsfor regular expressions $r$ and values $v$ is shown next toeach other below:\begin{center}\begin{tabular}{cc}\begin{tabular}{@{}rrl@{}}\multicolumn{3}{c}{regular expressions}\\ $r$ & $::=$ & $\varnothing$\\ & $\mid$ & $\epsilon$ \\ & $\mid$ & $c$ \\ & $\mid$ & $r_1 \cdot r_2$\\ & $\mid$ & $r_1 + r_2$ \\ \\ & $\mid$ & $r^*$ \\\end{tabular}&\begin{tabular}{@{\hspace{0mm}}rrl@{}}\multicolumn{3}{c}{values}\\ $v$ & $::=$ & \\ & & $Empty$ \\ & $\mid$ & $Char(c)$ \\ & $\mid$ & $Seq(v_1,v_2)$\\ & $\mid$ & $Left(v)$ \\ & $\mid$ & $Right(v)$ \\ & $\mid$ & $[v_1,\ldots\,v_n]$ \\\end{tabular}\end{tabular}\end{center}\noindent The point is that there is a very strongcorrespondence between them. There is no value for the$\varnothing$ regular expression, since it does not match anystring. Otherwise there is exactly one value corresponding toeach regular expression with the exception of $r_1 + r_2$where there are two values, namely $Left(v)$ and $Right(v)$corresponding to the two alternatives. Note that $r^*$ isassociated with a list of values, one for each copy of $r$that was needed to match the string. This means we might alsoreturn the empty list $[]$, if no copy was needed.To emphasise the connection between regular expressions andvalues, I have in my implementation the convention thatregular expressions are written entirely with upper-caseletters, while values just start with a single upper-casecharacter. My definition of values in Scala is below. I use this in the REPL of Scala; when I use the Scala compilerI need to rename some constructors, because Scala on Macsdoes not like classes that are called \pcode{EMPTY} and\pcode{Empty}.{\small\lstinputlisting[language=Scala,numbers=none]{../progs/app02.scala}}Graphically the algorithm by Sulzmann \& Lu can be illustratedby the picture in Figure~\ref{Sulz} where the path from theleft to the right involving $der/nullable$ is the first phaseof the algorithm and $mkeps/inj$, the path from right to left,the second phase. This picture shows the steps required when aregular expression, say $r_1$, matches the string $abc$. Wefirst build the three derivatives (according to $a$, $b$ and$c$). We then use $nullable$ to find out whether the resultingregular expression can match the empty string. If yes, we callthe function $mkeps$.\begin{figure}[t]\begin{center}\begin{tikzpicture}[scale=2,node distance=1.2cm, every node/.style={minimum size=7mm}]\node (r1) {$r_1$};\node (r2) [right=of r1]{$r_2$};\draw[->,line width=1mm](r1)--(r2) node[above,midway] {$der\,a$};\node (r3) [right=of r2]{$r_3$};\draw[->,line width=1mm](r2)--(r3) node[above,midway] {$der\,b$};\node (r4) [right=of r3]{$r_4$};\draw[->,line width=1mm](r3)--(r4) node[above,midway] {$der\,c$};\draw (r4) node[anchor=west] {\;\raisebox{3mm}{$nullable$}};\node (v4) [below=of r4]{$v_4$};\draw[->,line width=1mm](r4) -- (v4);\node (v3) [left=of v4] {$v_3$};\draw[->,line width=1mm](v4)--(v3) node[below,midway] {$inj\,c$};\node (v2) [left=of v3]{$v_2$};\draw[->,line width=1mm](v3)--(v2) node[below,midway] {$inj\,b$};\node (v1) [left=of v2] {$v_1$};\draw[->,line width=1mm](v2)--(v1) node[below,midway] {$inj\,a$};\draw (r4) node[anchor=north west] {\;\raisebox{-8mm}{$mkeps$}};\end{tikzpicture}\end{center}\caption{The two phases of the algorithm by Sulzmann \& Lu.\label{Sulz}}\end{figure}The $mkeps$ function calculates a value for how a regularexpression has matched the empty string. Its definitionis as follows:\begin{center}\begin{tabular}{lcl} $mkeps(\epsilon)$ & $\dn$ & $Empty$\\ $mkeps(r_1 + r_2)$ & $\dn$ & if $nullable(r_1)$ \\ & & then $Left(mkeps(r_1))$\\ & & else $Right(mkeps(r_2))$\\ $mkeps(r_1 \cdot r_2)$ & $\dn$ & $Seq(mkeps(r_1),mkeps(r_2))$\\ $mkeps(r^*)$ & $\dn$ & $[]$ \\\end{tabular}\end{center}\noindent There are no cases for $\varnothing$ and $c$, sincethese regular expression cannot match the empty string. Notealso that in case of alternatives we give preference to theregular expression on the left-hand side. This will becomeimportant later on.The second phase of the algorithm is organised so that it willcalculate a value for how the derivative regular expressionhas matched a string whose first character has been choppedoff. Now we need a function that reverses this ``choppingoff'' for values. The corresponding function is called $inj$for injection. This function takes three arguments: the firstone is a regular expression for which we want to calculate thevalue, the second is the character we want to inject and thethird argument is the value where we will inject thecharacter. The result of this function is a new value. Thedefinition of $inj$ is as follows: \begin{center}\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l} $inj\,(c)\,c\,Empty$ & $\dn$ & $Char\,c$\\ $inj\,(r_1 + r_2)\,c\,Left(v)$ & $\dn$ & $Left(inj\,r_1\,c\,v)$\\ $inj\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(inj\,r_2\,c\,v)$\\ $inj\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$ & $Seq(inj\,r_1\,c\,v_1,v_2)$\\ $inj\,(r_1 \cdot r_2)\,c\,Left(Seq(v_1,v_2))$ & $\dn$ & $Seq(inj\,r_1\,c\,v_1,v_2)$\\ $inj\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$ & $Seq(mkeps(r_1),inj\,r_2\,c\,v)$\\ $inj\,(r^*)\,c\,Seq(v,vs)$ & $\dn$ & $inj\,r\,c\,v\,::\,vs$\\\end{tabular}\end{center}\noindent This definition is by recursion on the regularexpression and by analysing the shape of the values. Thereforethere are, for example, three cases for sequence regularexpressions. The last clause for the star regular expressionreturns a list where the first element is $inj\,r\,c\,v$ andthe other elements are $vs$. That means $\_\,::\,\_$ should be read as list cons.To understand what is going on, it might be best to do someexample calculations and compare them with Figure~\ref{Sulz}.For this note that we have not yet dealt with the need ofsimplifying regular expressions (this will be a topic on itsown later). Suppose the regular expression is $a \cdot (b\cdot c)$ and the input string is $abc$. The derivatives fromthe first phase are as follows:\begin{center}\begin{tabular}{ll}$r_1$: & $a \cdot (b \cdot c)$\\$r_2$: & $\epsilon \cdot (b \cdot c)$\\$r_3$: & $(\varnothing \cdot (b \cdot c)) + (\epsilon \cdot c)$\\$r_4$: & $(\varnothing \cdot (b \cdot c)) + ((\varnothing \cdot c) + \epsilon)$\\\end{tabular}\end{center}\noindent According to the simple algorithm, we would testwhether $r_4$ is nullable, which in this case it is. Thismeans we can use the function $mkeps$ to calculate a value forhow $r_4$ was able to match the empty string. Remember thatthis function gives preference for alternatives on theleft-hand side. However there is only $\epsilon$ on the veryright-hand side of $r_4$ that matches the empty string.Therefore $mkeps$ returns the value\begin{center}$v_4:\;Right(Right(Empty))$\end{center}\noindent The point is that from this value we can directlyread off which part of $r_4$ matched the empty string. Next wehave to ``inject'' the last character, that is $c$ in therunning example, into this value $v_4$ in order to calculatehow $r_3$ could have matched the string $c$. According to thedefinition of $inj$ we obtain\begin{center}$v_3:\;Right(Seq(Empty, Char(c)))$\end{center}\noindent This is the correct result, because $r_3$ needsto use the right-hand alternative, and then $\epsilon$ needsto match the empty string and $c$ needs to match $c$.Next we need to inject back the letter $b$ into $v_3$. Thisgives\begin{center}$v_2:\;Seq(Empty, Seq(Char(b), Char(c)))$\end{center}\noindent which is again the correct result for how $r_2$matched the string $bc$. Finally we need to inject back theletter $a$ into $v_2$ giving the final result\begin{center}$v_1:\;Seq(Char(a), Seq(Char(b), Char(c)))$\end{center}\noindent This now corresponds to how the regularexpression $a \cdot (b \cdot c)$ matched the string $abc$.There are a few auxiliary functions that are of interestwhen analysing this algorithm. One is called \emph{flatten},written $|\_|$, which extracts the string ``underlying'' a value. It is defined recursively as\begin{center}\begin{tabular}{lcl} $|Empty|$ & $\dn$ & $[]$\\ $|Char(c)|$ & $\dn$ & $[c]$\\ $|Left(v)|$ & $\dn$ & $|v|$\\ $|Right(v)|$ & $\dn$ & $|v|$\\ $|Seq(v_1,v_2)|$& $\dn$ & $|v_1| \,@\, |v_2|$\\ $|[v_1,\ldots ,v_n]|$ & $\dn$ & $|v_1| \,@\ldots @\, |v_n|$\\\end{tabular}\end{center}\noindent Using flatten we can see what is the string behind the values calculated by $mkeps$ and $inj$ in our running example are:\begin{center}\begin{tabular}{ll}$|v_4|$: & $[]$\\$|v_3|$: & $c$\\$|v_2|$: & $bc$\\$|v_1|$: & $abc$\end{tabular}\end{center}\noindent This indicates that $inj$ indeed is injecting, oradding, back a character into the value. Next we look at howsimplification can be included into this algorithm.\subsubsection*{Simplification}Generally the matching algorithms based on derivatives dopoorly unless the regular expressions are simplified aftereach derivative step. But this is a bit more involved in thealgorithm of Sulzmann \& Lu. So what follows might require youto read several times before it makes sense and also mightrequire that you do some example calculations. As a firstexample consider the last derivation step in our earlierexample:\begin{center}$r_4 = der\,c\,r_3 = (\varnothing \cdot (b \cdot c)) + ((\varnothing \cdot c) + \epsilon)$\end{center}\noindent Simplifying this regular expression would just giveus $\epsilon$. Running $mkeps$ with this regular expression asinput, however, would then provide us with $Empty$ instead of$Right(Right(Empty))$ that was obtained without thesimplification. The problem is we need to recreate this morecomplicated value, rather than just $Empty$.This will require what I call \emph{rectification functions}.They need to be calculated whenever a regular expression getssimplified. Rectification functions take a value as argumentand return a (rectified) value. Let us first take a look againat our simplification rules:\begin{center}\begin{tabular}{l@{\hspace{2mm}}c@{\hspace{2mm}}l}$r \cdot \varnothing$ & $\mapsto$ & $\varnothing$\\ $\varnothing \cdot r$ & $\mapsto$ & $\varnothing$\\ $r \cdot \epsilon$ & $\mapsto$ & $r$\\ $\epsilon \cdot r$ & $\mapsto$ & $r$\\ $r + \varnothing$ & $\mapsto$ & $r$\\ $\varnothing + r$ & $\mapsto$ & $r$\\ $r + r$ & $\mapsto$ & $r$\\ \end{tabular}\end{center}\noindent Applying them to $r_4$ will require several nestedsimplifications in order end up with just $\epsilon$. However,it is possible to apply them in a depth-first, or inside-out,manner in order to calculate this simplified regularexpression.The rectification we can implement this by letting simp returnnot just a (simplified) regular expression, but also arectification function. Let us consider the alternative case,$r_1 + r_2$, first. By going depth-first, we first simplifythe component regular expressions $r_1$ and $r_2.$ This willreturn simplified versions (if they can be simplified), say$r_{1s}$ and $r_{2s}$, but also two rectification functions$f_{1s}$ and $f_{2s}$. We need to assemble them in order toobtain a rectified value for $r_1 + r_2$. In case $r_{1s}$simplified to $\varnothing$, we continue the derivativecalculation with $r_{2s}$. The Sulzmann \& Lu algorithm wouldreturn a corresponding value, say $v_{2s}$. But now this valueneeds to be ``rectified'' to the value \begin{center}$Right(v_{2s})$\end{center}\noindent The reason is that we look for the value that tellsus how $r_1 + r_2$ could have matched the string, not just$r_2$ or $r_{2s}$. Unfortunately, this is still not the rightvalue in general because there might be some simplificationsthat happened inside $r_2$ and for which the simplificationfunction retuned also a rectification function $f_{2s}$. So infact we need to apply this one too which gives\begin{center}$Right(f_{2s}(v_{2s}))$\end{center}\noindent This is now the correct, or rectified, value. Sincethe simplification will be done in the first phase of thealgorithm, but the rectification needs to be done to thevalues in the second phase, it is advantageous to calculatethe rectification as a function, remember this function andthen apply the value to this function during the second phase.So if we want to implement the rectification as function, we would need to return\begin{center}$\lambda v.\,Right(f_{2s}(v))$\end{center}\noindent which is the lambda-calculus notation fora function that expects a value $v$ and returns everythingafter the dot where $v$ is replaced by whatever value is given.Let us package this idea with rectification functionsinto a single function (still only considering the alternativecase):\begin{center}\begin{tabular}{l}$simp(r)$:\\\quad case $r = r_1 + r_2$\\\qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\\qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\\qquad case $r_{1s} = \varnothing$: return $(r_{2s}, \lambda v. \,Right(f_{2s}(v)))$\\\qquad case $r_{2s} = \varnothing$: return $(r_{1s}, \lambda v. \,Left(f_{1s}(v)))$\\\qquad case $r_{1s} = r_{2s}$: return $(r_{1s}, \lambda v. \,Left(f_{1s}(v)))$\\\qquad otherwise: return $(r_{1s} + r_{2s}, f_{alt}(f_{1s}, f_{2s}))$\\\end{tabular}\end{center}\noindent We first recursively call the simplification with$r_1$ and $r_2$. This gives simplified regular expressions,$r_{1s}$ and $r_{2s}$, as well as two rectification functions$f_{1s}$ and $f_{2s}$. We next need to test whether thesimplified regular expressions are $\varnothing$ so as to makefurther simplifications. In case $r_{1s}$ is $\varnothing$,then we can return $r_{2s}$ (the other alternative). Howeverfor this we need to build a corresponding rectification function, which as said above is\begin{center}$\lambda v.\,Right(f_{2s}(v))$\end{center}\noindent The case where $r_{2s} = \varnothing$ is similar:We return $r_{1s}$ and rectify with $Left(\_)$ and theother calculated rectification function $f_{1s}$. This gives\begin{center}$\lambda v.\,Left(f_{1s}(v))$\end{center}\noindent The next case where $r_{1s} = r_{2s}$ can be treatedlike the one where $r_{2s} = \varnothing$. We return $r_{1s}$and rectify with $Left(\_)$ and so on.The otherwise-case is slightly more complicated. In this caseneither $r_{1s}$ nor $r_{2s}$ are $\varnothing$ and also$r_{1s} \not= r_{2s}$, which means no further simplificationcan be applied. Accordingly, we return $r_{1s} + r_{2s}$ asthe simplified regular expression. In principle we also do nothave to do any rectification, because no simplification wasdone in this case. But this is actually not true: There mighthave been simplifications inside $r_{1s}$ and $r_{2s}$. Wetherefore need to take into account the calculatedrectification functions $f_{1s}$ and $f_{2s}$. We can do thisby defining a rectification function $f_{alt}$ which takes tworectification functions as arguments and applies themaccording to whether the value is of the form $Left(\_)$ or$Right(\_)$:\begin{center}\begin{tabular}{l@{\hspace{1mm}}l}$f_{alt}(f_1, f_2) \dn$\\\qquad $\lambda v.\,$ case $v = Left(v')$: & return $Left(f_1(v'))$\\\qquad \phantom{$\lambda v.\,$} case $v = Right(v')$: & return $Right(f_2(v'))$\\ \end{tabular}\end{center}The other interesting case with simplification is the sequencecase. In this case the main simplification function is asfollows\begin{center}\begin{tabular}{l}$simp(r)$:\qquad\qquad (continued)\\\quad case $r = r_1 \cdot r_2$\\\qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\\qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\\qquad case $r_{1s} = \varnothing$: return $(\varnothing, f_{error})$\\\qquad case $r_{2s} = \varnothing$: return $(\varnothing, f_{error})$\\\qquad case $r_{1s} = \epsilon$: return $(r_{2s}, \lambda v. \,Seq(f_{1s}(Empty), f_{2s}(v)))$\\\qquad case $r_{2s} = \epsilon$: return $(r_{1s}, \lambda v. \,Seq(f_{1s}(v), f_{2s}(Empty)))$\\\qquad otherwise: return $(r_{1s} \cdot r_{2s}, f_{seq}(f_{1s}, f_{2s}))$\\\end{tabular}\end{center}\noindent whereby in the last line $f_{seq}$ is again pushingthe two rectification functions into the two components of theSeq-value:\begin{center}\begin{tabular}{l@{\hspace{1mm}}l}$f_{seq}(f_1, f_2) \dn$\\\qquad $\lambda v.\,$ case $v = Seq(v_1, v_2)$: & return $Seq(f_1(v_1), f_2(v_2))$\\\end{tabular}\end{center}\noindent Note that in the case of $r_{1s} = \varnothing$(similarly $r_{2s}$) we use the function $f_{error}$ forrectification. If you think carefully, then you will realisethat this function will actually never been called. This isbecause a sequence with $\varnothing$ will never recognise anystring and therefore the second phase of the algorithm wouldnever been called. The simplification function still expectsus to give a function. So in my own implementation I justreturned a function which raises an error. In the casewhere $r_{1s} = \epsilon$ (similarly $r_{2s}$) we haveto create a sequence where the first component is a rectifiedversion of $Empty$. Therefore we call $f_{1s}$ with $Empty$.Since we only simplify regular expressions of the form $r_1 +r_2$ and $r_1 \cdot r_2$ we do not have to do anything elsein the remaining cases. The rectification function willbe just the identity, which in lambda-calculus terms isjust\begin{center}$\lambda v.\,v$\end{center}\noindent This completes the high-level version of thesimplification function, which is also shown again in Figure~\ref{simp}. This can now be used in a \emph{lexingfunction} as follows:\begin{figure}[t]\begin{center}\begin{tabular}{l}$simp(r)$:\\\quad case $r = r_1 + r_2$\\\qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\\qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\\qquad case $r_{1s} = \varnothing$: return $(r_{2s}, \lambda v. \,Right(f_{2s}(v)))$\\\qquad case $r_{2s} = \varnothing$: return $(r_{1s}, \lambda v. \,Left(f_{1s}(v)))$\\\qquad case $r_{1s} = r_{2s}$: return $(r_{1s}, \lambda v. \,Left(f_{1s}(v)))$\\\qquad otherwise: return $(r_{1s} + r_{2s}, f_{alt}(f_{1s}, f_{2s}))$ \medskip\\\quad case $r = r_1 \cdot r_2$\\\qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\\qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\\qquad case $r_{1s} = \varnothing$: return $(\varnothing, f_{error})$\\\qquad case $r_{2s} = \varnothing$: return $(\varnothing, f_{error})$\\\qquad case $r_{1s} = \epsilon$: return $(r_{2s}, \lambda v. \,Seq(f_{1s}(Empty), f_{2s}(v)))$\\\qquad case $r_{2s} = \epsilon$: return $(r_{1s}, \lambda v. \,Seq(f_{1s}(v), f_{2s}(Empty)))$\\\qquad otherwise: return $(r_{1s} \cdot r_{2s}, f_{seq}(f_{1s}, f_{2s}))$ \medskip\\\quad otherwise:\\\qquad return $(r, \lambda v.\,v)$\\\end{tabular}\end{center}\caption{The simplification function that returns a simplified regular expression and a rectification function.\label{simp}}\end{figure}\begin{center}\begin{tabular}{lcl}$lex\,r\,[]$ & $\dn$ & if $nullable(r)$ then $mkeps(r)$\\ & & else $error$\\$lex\,r\,c\!::\!s$ & $\dn$ & let $(r_{simp}, f_{rect}) = simp(der(c, r))$\\& & $inj\,r\,c\,f_{rect}(lex\,r_{simp}\,s)$ \end{tabular}\end{center}\noindent This corresponds to the $matches$ function wehave seen in earlier lectures. In the first clause we aregiven an empty string, $[]$, and need to test wether theregular expression is $nullable$. If yes we can proceednormally and just return the value calculated by $mkeps$.The second clause is for strings where the first character is$c$ and the rest of the string is $s$. We first build thederivative of $r$ with respect to $c$; simplify the resulting regulare expression. We continue lexing with the simplifiedregular expression and the string $s$. Whatever will bereturned as value, we sill rectify using the $f_{rect}$from the simplification and finally inject $c$ back intothe (rectified) value.\subsubsection*{Records}Remember we want to tokenize input strings, that meanssplitting strings into their ``word'' components. Butfurthermore we want to classify each token as being a keywordor identifier and so on. For this one more feature will berequired, which I call \emph{record}. While values recordprecisely how a regular expression matches a string, records can be used to focus on some particularparts of the regular expression and forget about others.Let us look at an example. Suppose you have the regular expression $ab + ac$. Clearlythis regular expression can only recognise two strings. Butsuppose you are not interested whether it can recognise $ab$or $ac$, but rather if it matched, then what was the lastcharacter of the matched string\ldots either $b$ or $c$.You can do this by annotating the regular expression witha record, written $(x:r)$, where $x$ is just an identifier(in my implementation a plain string) and $r$ is a regularexpression. A record will be regarded as a regular expression.The extended definition in Scala looks as follows:{\small\lstinputlisting[language=Scala]{../progs/app03.scala}}\noindent Since we regard records as regular expressionswe need to extend the functions $nullable$ and $der$. Similarly $mkeps$ and $inj$ need to be extended and they sometimes can return a particular value for records. This means we also need to extend the definition of values.The extended definition in Scala looks as follows:{\small\lstinputlisting[language=Scala]{../progs/app04.scala}}\noindent Let us now look at the purpose of records moreclosely and lets return to our question whether the stringterminated in a $b$ or $c$. We can do this as follows: weannotate the regular expression $ab + ac$ with a recordas follows\begin{center}$a(x:b) + a(x:c)$\end{center}\noindent This regular expression can still only recognisethe strings $ab$ and $ac$, but we can now use a functionthat takes a value and returns all records. I call thisfunction \emph{env} for environment\ldots it builds a listof identifiers associated with their string. This functioncan be defined as follows:\begin{center}\begin{tabular}{lcl} $env(Empty)$ & $\dn$ & $[]$\\ $env(Char(c))$ & $\dn$ & $[]$\\ $env(Left(v))$ & $\dn$ & $env(v)$\\ $env(Right(v))$ & $\dn$ & $env(v)$\\ $env(Seq(v_1,v_2))$& $\dn$ & $env(v_1) \,@\, env(v_2)$\\ $env([v_1,\ldots ,v_n])$ & $\dn$ & $env(v_1) \,@\ldots @\, env(v_n)$\\ $env(Rec(x:v))$ & $\dn$ & $(x:|v|) :: env(v)$\\\end{tabular}\end{center}\noindent where in the last clause we use the flatten function defined earlier. The function $env$ ``picks'' out all underlying strings where a record is given. Since there can be more than one, the environment will potentially containmany ``recordings''. If we now postprocess the value calculated by $lex$ extracting all recordings using $env$, we can answer the question whether the last element in thestring was an $b$ or a $c$. Lets see this in action: ifwe use $ab + ac$ and $ac$ the calculated value will be\begin{center}$Right(Seq(Char(a), Char(c)))$\end{center}\noindent If we use instead $a(x:b) + a(x:c)$ anduse the $env$ function to extract the recording for $x$ we obtain\begin{center}$[(x:c)]$\end{center}\noindent If we had given the string $ab$ instead, then therecord would have been $[(x:b)]$. The fun starts if we iterate this. Consider the regular expression \begin{center}$(a(x:b) + a(y:c))^*$\end{center}\noindent and the string $ababacabacab$. This string is clearly matched by the regular expression, but we are onlyinterested in the sequence of $b$s and $c$s. Using $env$we obtain\begin{center}$[(x:b), (x:b), (y:c), (x:b), (y:c), (x:b)]$\end{center}\noindent While this feature might look silly, it is in factquite useful. For example if we want to match the name ofan email we might use the regular expression\[(name: [a\mbox{-}z0\mbox{-}9\_\!\_\,.-]^+)\cdot @\cdot (domain: [a\mbox{-}z0\mbox{-}9\,.-]^+)\cdot .\cdot (top\_level: [a\mbox{-}z\,.]^{\{2,6\}})\]\noindent Then if we match the email address\[\texttt{christian.urban@kcl.ac.uk}\]\noindent we can use the $env$ function and find outwhat the name, domain and top-level part of the emailaddress are:\begin{center}$[(name:\texttt{christian.urban}), (domain:\texttt{kcl}), (top\_level:\texttt{ac.uk})]$\end{center}\noindent As you will see in the next lecture, this is now allwe need to tokenise an input string and classify each token.\end{document}%%% Local Variables: %%% mode: latex%%% TeX-master: t%%% End: