% !TEX program = xelatex+ −
\documentclass{article}+ −
\usepackage{../style}+ −
\usepackage{../langs}+ −
\usepackage{../graphics}+ −
\usepackage{skull}+ −
+ −
\begin{document}+ −
\fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017, 2019}+ −
+ −
\section*{Handout 4 (Sulzmann \& Lu Algorithm)}+ −
+ −
So far our algorithm based on derivatives is only able to say+ −
yes or no depending on whether a string is matched by a regular+ −
expression or not. Often a more interesting question is to+ −
find out \emph{how} a regular expression matched a string?+ −
Answering this question will also help us with the problem we+ −
are after, namely tokenising an input string. + −
+ −
The algorithm we will be looking at in this lecture was designed by Sulzmann+ −
\& Lu in a rather recent research paper (from 2014). A link to it is+ −
provided on KEATS, in case you are interested.\footnote{In my+ −
humble opinion this is an interesting instance of the research+ −
literature: it contains a very neat idea, but its presentation+ −
is rather sloppy. In earlier versions of this paper, a King's+ −
undergraduate student and I found several rather annoying typos in the+ −
examples and definitions.} My former PhD student Fahad Ausaf and I even more recently + −
wrote a paper where we build on their result: we provided a + −
mathematical proof that their algorithm is really correct---the proof + −
Sulzmann \& Lu had originally given contained major flaws. Such correctness+ −
proofs are important: Kuklewicz maintains a unit-test library+ −
for the kind of algorithms we are interested in here and he showed + −
that many implementations in the ``wild'' are buggy, that is not+ −
satisfy his unit tests:+ −
+ −
\begin{center}+ −
\url{http://www.haskell.org/haskellwiki/Regex_Posix}+ −
\end{center}+ −
+ −
+ −
Coming back to the algorithm by Sulzmann \& Lu, their idea is to+ −
introduce \emph{values} for producing an answer for \emph{how} a regular+ −
expression matches a string. So rather than a boolean like in the+ −
Brzozowski algorithm, a value will be the output of the Sulzman \& Lu+ −
algorithm whenever the regular expression matches the string. If the+ −
string does not match the string, an error will be raised. The+ −
definition for values is given below. They are shown together with the+ −
regular expressions $r$ to which they correspond:+ −
+ −
\begin{center}+ −
\begin{tabular}{cc}+ −
\begin{tabular}{@{}rrl@{}}+ −
\multicolumn{3}{c}{regular expressions}\medskip\\+ −
$r$ & $::=$ & $\ZERO$\\+ −
& $\mid$ & $\ONE$ \\+ −
& $\mid$ & $c$ \\+ −
& $\mid$ & $r_1 \cdot r_2$\\+ −
& $\mid$ & $r_1 + r_2$ \\+ −
\\+ −
& $\mid$ & $r^*$ \\+ −
\end{tabular}+ −
&+ −
\begin{tabular}{@{\hspace{0mm}}rrl@{}}+ −
\multicolumn{3}{c}{values}\medskip\\+ −
$v$ & $::=$ & \\+ −
& & $Empty$ \\+ −
& $\mid$ & $Char(c)$ \\+ −
& $\mid$ & $Seq(v_1,v_2)$\\+ −
& $\mid$ & $\Left(v)$ \\+ −
& $\mid$ & $Right(v)$ \\+ −
& $\mid$ & $Stars\,[v_1,\ldots\,v_n]$ \\+ −
\end{tabular}+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent There is no value for the+ −
$\ZERO$ regular expression, since it does not match any+ −
string. Otherwise there is exactly one value corresponding to+ −
each regular expression with the exception of $r_1 + r_2$+ −
where there are two values, namely $\Left(v)$ and $Right(v)$+ −
corresponding to the two alternatives. Note that $r^*$ is+ −
associated with a list of values, one for each copy of $r$+ −
that was needed to match the string. This means we might also+ −
return the empty list $Stars\,[]$, if no copy was needed+ −
for $r^*$. For sequence, there is exactly one value, composed + −
of two component values ($v_1$ and $v_2$).+ −
+ −
My implementation of regular expressions and values in Scala is+ −
shown below. I use the convention that+ −
regular expressions are written entirely with upper-case+ −
letters, whereas values start with a single upper-case+ −
character and the rest are lower-case letters.+ −
+ −
{\small\lstinputlisting[language=Scala,numbers=none,linebackgroundcolor=+ −
{\ifodd\value{lstnumber}\color{capri!3}\fi}]+ −
{../progs/app01.scala}}+ −
+ −
+ −
{\small\lstinputlisting[language=Scala,numbers=none,linebackgroundcolor=+ −
{\ifodd\value{lstnumber}\color{capri!3}\fi}]+ −
{../progs/app02.scala}}+ −
+ −
+ −
Graphically the algorithm by Sulzmann \& Lu can be illustrated+ −
by the picture in Figure~\ref{Sulz} where the path from the+ −
left to the right involving $\der/\nullable$ is the first phase+ −
of the algorithm and $\textit{mkeps}/\inj$, the path from right to left,+ −
the second phase. This picture shows the steps required when a+ −
regular expression, say $r_1$, matches the string $abc$. We+ −
first build the three derivatives (according to $a$, $b$ and+ −
$c$). We then use $\nullable$ to find out whether the resulting+ −
regular expression can match the empty string. If yes, we call+ −
the function $\textit{mkeps}$. The $\textit{mkeps}$ function calculates a value for how a regular+ −
expression has matched the empty string. Its definition+ −
is as follows:+ −
+ −
\begin{figure}[t]+ −
\begin{center}+ −
\begin{tikzpicture}[scale=2,node distance=1.2cm,+ −
every node/.style={minimum size=7mm}]+ −
\node (r1) {$r_1$};+ −
\node (r2) [right=of r1]{$r_2$};+ −
\draw[->,line width=1mm](r1)--(r2) node[above,midway] {$\der\,a$};+ −
\node (r3) [right=of r2]{$r_3$};+ −
\draw[->,line width=1mm](r2)--(r3) node[above,midway] {$\der\,b$};+ −
\node (r4) [right=of r3]{$r_4$};+ −
\draw[->,line width=1mm](r3)--(r4) node[above,midway] {$\der\,c$};+ −
\draw (r4) node[anchor=west] {\;\raisebox{3mm}{$\nullable$}};+ −
\node (v4) [below=of r4]{$v_4$};+ −
\draw[->,line width=1mm](r4) -- (v4);+ −
\node (v3) [left=of v4] {$v_3$};+ −
\draw[->,line width=1mm](v4)--(v3) node[below,midway] {$\inj\,c$};+ −
\node (v2) [left=of v3]{$v_2$};+ −
\draw[->,line width=1mm](v3)--(v2) node[below,midway] {$\inj\,b$};+ −
\node (v1) [left=of v2] {$v_1$};+ −
\draw[->,line width=1mm](v2)--(v1) node[below,midway] {$\inj\,a$};+ −
\draw (r4) node[anchor=north west] {\;\raisebox{-8mm}{$\textit{mkeps}$}};+ −
\end{tikzpicture}+ −
\end{center}+ −
\caption{The two phases of the algorithm by Sulzmann \& Lu.+ −
\label{Sulz}}+ −
\end{figure}+ −
+ −
+ −
+ −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$\textit{mkeps}(\ONE)$ & $\dn$ & $Empty$\\+ −
$\textit{mkeps}(r_1 + r_2)$ & $\dn$ & if $\nullable(r_1)$ \\+ −
& & then $\Left(\textit{mkeps}(r_1))$\\+ −
& & else $Right(\textit{mkeps}(r_2))$\\+ −
$\textit{mkeps}(r_1 \cdot r_2)$ & $\dn$ & $Seq(\textit{mkeps}(r_1),\textit{mkeps}(r_2))$\\+ −
$\textit{mkeps}(r^*)$ & $\dn$ & $Stars\,[]$ \\+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent There are no cases for $\ZERO$ and $c$, since+ −
these regular expression cannot match the empty string. Note+ −
also that in case of alternatives we give preference to the+ −
regular expression on the left-hand side. This will become+ −
important later on.+ −
+ −
The second phase of the algorithm is organised so that it will+ −
calculate a value for how the derivative regular expression+ −
has matched a string. For this we need a function that+ −
reverses this ``chopping off'' for values which we did in the+ −
first phase for derivatives. The corresponding function is+ −
called $\inj$ (short for injection). This function takes three+ −
arguments: the first one is a regular expression for which we+ −
want to calculate the value, the second is the character we+ −
want to inject and the third argument is the value where we+ −
will inject the character into. The result of this function is a+ −
new value. The definition of $\inj$ is as follows: + −
+ −
\begin{center}+ −
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}+ −
$\inj\,(c)\,c\,Empty$ & $\dn$ & $Char\,c$\\+ −
$\inj\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\inj\,r_1\,c\,v)$\\+ −
$\inj\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\inj\,r_2\,c\,v)$\\+ −
$\inj\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$ & $Seq(\inj\,r_1\,c\,v_1,v_2)$\\+ −
$\inj\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$ & $Seq(\inj\,r_1\,c\,v_1,v_2)$\\+ −
$\inj\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$ & $Seq(\textit{mkeps}(r_1),\inj\,r_2\,c\,v)$\\+ −
$\inj\,(r^*)\,c\,Seq(v,Stars\,vs)$ & $\dn$ & $Stars(\inj\,r\,c\,v\,::\,vs)$\\+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent This definition is by recursion on the regular+ −
expression and by analysing the shape of the values. Therefore+ −
there are three cases for sequence regular+ −
expressions (for all possible shapes of the value we can encounter). The last+ −
clause for the star regular expression returns a list where+ −
the first element is $\inj\,r\,c\,v$ and the other elements are+ −
$vs$. That means $\_\,::\,\_$ should be read as list cons.+ −
+ −
To understand what is going on, it might be best to do some+ −
example calculations and compare them with Figure~\ref{Sulz}.+ −
For this note that we have not yet dealt with the need of+ −
simplifying regular expressions (this will be a topic on its+ −
own later). Suppose the regular expression is $a \cdot (b+ −
\cdot c)$ and the input string is $abc$. The derivatives from+ −
the first phase are as follows:+ −
+ −
\begin{center}+ −
\begin{tabular}{ll}+ −
$r_1$: & $a \cdot (b \cdot c)$\\+ −
$r_2$: & $\ONE \cdot (b \cdot c)$\\+ −
$r_3$: & $(\ZERO \cdot (b \cdot c)) + (\ONE \cdot c)$\\+ −
$r_4$: & $(\ZERO \cdot (b \cdot c)) + ((\ZERO \cdot c) + \ONE)$\\+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent According to the simple algorithm, we would test+ −
whether $r_4$ is nullable, which in this case it indeed is.+ −
This means we can use the function $\textit{mkeps}$ to calculate a+ −
value for how $r_4$ was able to match the empty string.+ −
Remember that this function gives preference for alternatives+ −
on the left-hand side. However there is only $\ONE$ on the+ −
very right-hand side of $r_4$ (underlined)+ −
+ −
\begin{center}+ −
$r_4$:\;$(\ZERO \cdot (b \cdot c)) + + −
((\ZERO \cdot c) + \underline{\ONE})$\\+ −
\end{center}+ −
+ −
\noindent+ −
that matches the empty string.+ −
Therefore $\textit{mkeps}$ returns the value+ −
+ −
\begin{center}+ −
$v_4:\;Right(Right(Empty))$+ −
\end{center}+ −
+ −
\noindent If there had been a $\ONE$ on the left, then+ −
$\textit{mkeps}$ would have returned something of the form+ −
$\Left(\ldots)$. The point is that from this value we can+ −
directly read off which part of $r_4$ matched the empty+ −
string: take the right-alternative first, and then the+ −
right-alternative again, then you got to the $\ONE$.+ −
+ −
Next we have to ``inject'' the last character, that is $c$ in+ −
the running example, into this value $v_4$ in order to+ −
calculate how $r_3$ could have matched the string $c$.+ −
For this we call injection with $\inj(r_3, c, v_4)$.+ −
According to the definition of $\inj$ we obtain+ −
+ −
\begin{center}+ −
$v_3:\;Right(Seq(Empty, Char(c)))$+ −
\end{center}+ −
+ −
\noindent This is the correct result, because $r_3$ needs+ −
to use the right-hand alternative, and then $\ONE$ needs+ −
to match the empty string and $c$ needs to match $c$.+ −
Next we need to inject back the letter $b$ into $v_3$.+ −
For this we call injection with $\inj(r_2, b, v_3)$.+ −
This gives+ −
+ −
\begin{center}+ −
$v_2:\;Seq(Empty, Seq(Char(b), Char(c)))$+ −
\end{center}+ −
+ −
\noindent which is again the correct result for how $r_2$+ −
matched the string $bc$. Finally we need to inject back the+ −
letter $a$ into $v_2$ giving the final result.+ −
For this we call injection with $\inj(r_1, a, v_2)$+ −
and obtain+ −
+ −
\begin{center}+ −
$v_1:\;Seq(Char(a), Seq(Char(b), Char(c)))$+ −
\end{center}+ −
+ −
\noindent This value corresponds to how the regular expression $r_1$,+ −
namely $a \cdot (b \cdot c)$, matched the string $abc$.+ −
+ −
+ −
There are a few auxiliary functions that are of interest+ −
when analysing this algorithm. One is called \emph{flatten},+ −
written $|\_|$, which extracts the string ``underlying'' a + −
value. It is defined recursively as+ −
+ −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$|Empty|$ & $\dn$ & $[]$\\+ −
$|Char(c)|$ & $\dn$ & $[c]$\\+ −
$|\Left(v)|$ & $\dn$ & $|v|$\\+ −
$|Right(v)|$ & $\dn$ & $|v|$\\+ −
$|Seq(v_1,v_2)|$& $\dn$ & $|v_1| \,@\, |v_2|$\\+ −
$|Stars\,[v_1,\ldots ,v_n]|$ & $\dn$ & $|v_1| \,@\ldots @\, |v_n|$\\+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent Using flatten we can see what are the strings behind + −
the values calculated by $\textit{mkeps}$ and $\inj$. In our running + −
example:+ −
+ −
\begin{center}+ −
\begin{tabular}{ll}+ −
$|v_4|$: & $[]$\\+ −
$|v_3|$: & $c$\\+ −
$|v_2|$: & $bc$\\+ −
$|v_1|$: & $abc$+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent This indicates that $\inj$ indeed is injecting, or+ −
adding, back a character into the value. + −
+ −
The definition of $\inj$ might still be very puzzling and each clause+ −
might look arbitrary, but there is in fact a relatively simple idea+ −
behind it. Ultimately the $\inj$-functions is determined by the+ −
derivative functions. For this consider one of the ``squares'' from+ −
Figure~\ref{Sulz}:+ −
+ −
\begin{center}+ −
\begin{tikzpicture}[scale=2,node distance=1.2cm,+ −
every node/.style={minimum size=7mm}]+ −
\node (r) {$r$};+ −
\node (rd) [right=of r]{$r_{der}$};+ −
\draw[->,line width=1mm](r)--(rd) node[above,midway] {$\der\,c$};+ −
\node (vd) [below=of r2]{$v_{der}$};+ −
\draw[->,line width=1mm](rd) -- (vd);+ −
\node (v) [left=of vd] {$v$};+ −
\draw[->,line width=1mm](vd)--(v) node[below,midway] {$\inj\,c$};+ −
\draw[->,line width=0.5mm,dotted](r) -- (v) node[left,midway,red] {\bf ?};+ −
\end{tikzpicture}+ −
\end{center}+ −
+ −
\noindent+ −
The input to the $\inj$-function is $r$ (on the top left), $c$ (the+ −
character to be injected) and $v_{der}$ (bottom right). The output is+ −
the value $v$ for how the regular expression $r$ matched the+ −
corresponding string. How does $\inj$ calculate this value? Well, it has+ −
to analyse the value $v_{der}$ and transform it into the value $v$ for+ −
the regular expression $r$. The value $v_{der}$ corresponds to the+ −
$r_{der}$-regular expression which is the derivative of $r$. Remember+ −
that $v_{der}$ is the value for how $r_{der}$ matches the corresponding string+ −
where $c$ has been chopped off.+ −
+ −
For a concrete example, let $r$ be $r_1 + r_2$. Then $r_{der}$+ −
is of the form $(\der\,c\,r_1) + (\der\,c\,r_2)$. What are the possible+ −
values corresponding to $r_{der}$? Well, they can be only of the form+ −
$\Left(\ldots)$ and $\Right(\ldots)$. Therefore you have two+ −
cases in the $\inj$ function -- one for $\Left$ and one for $\Right$.+ −
+ −
\begin{center}+ −
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{2mm}}l}+ −
$\inj\,(r_1 + r_2)\,c\,\,\Left(v)$ & $\dn$ & $\ldots$\\+ −
$\inj\,(r_1 + r_2)\,c\,\,\Right(v)$ & $\dn$ & $\ldots$\\+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent+ −
Let's look next at the sequence case where $r = r_1 \cdot r_2$. What does the+ −
derivative of $r$ look like? According to the definition it is: + −
+ −
\begin{center}+ −
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}+ −
$\der\, c\, (r_1 \cdot r_2)$ & $\dn$ & if $\nullable (r_1)$\\+ −
& & then $(\der\,c\,r_1) \cdot r_2 + \der\, c\, r_2$\\ + −
& & else $(\der\, c\, r_1) \cdot r_2$\\+ −
\end{tabular} + −
\end{center} + −
+ −
\noindent As you can see there is a derivative in the if-branch and another+ −
in the else-branch. In the if-branch we have an alternative of the form + −
$\_ + \_$. We already know what the possible values are for such a regular + −
expression, namely $\Left$ or $\Right$. Therefore we have in $\inj$ the+ −
two cases:+ −
+ −
\begin{center}+ −
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{2mm}}l}+ −
$\inj\,(r_1 \cdot r_2)\,c\,\,\Left(Seq(v_1,v_2))$ & $\dn$ & $\ldots$\\+ −
$\inj\,(r_1 \cdot r_2)\,c\,\,Right(v)$ & $\dn$ & $\ldots$\\+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent+ −
In the first case we even know that it is not just a $\Left$-value, but + −
$\Left(Seq(\ldots))$, because the corresponding regular expression + −
in the derivation is \mbox{$(\der\,c\,r_1) \cdot r_2$}. Hence we know + −
it must be a $Seq$-value enclosed inside $\Left$. + −
The third clause for $r_1\cdot r_2$ in the $\inj$-function+ −
+ −
\begin{center}+ −
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}+ −
$\inj\,(r_1 \cdot r_2)\,c\,\,Seq(v_1,v_2)$ & $\dn$ & $\ldots$\\+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent corresponds to the else-branch in the derivative. In this+ −
case we know the derivative is of the form $(\der\,c\,r_1) \cdot r_2$ and+ −
therefore the value must be of the form $Seq(\ldots)$.+ −
+ −
Hopefully the $inj$-function makes now more sense. I let you explain+ −
the $r^*$ case. What do the derivative of $r^*$ and+ −
the corresponding value look like? Does this explain the shape of + −
the clause?+ −
+ −
\begin{center}+ −
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}+ −
$\inj\,(r^*)\,c\,\,Seq(v,Stars\,vs)$ & $\dn$ & $Stars(\inj\,r\,c\,v\,::\,vs)$\\+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent If yes, you made sense of the left-hand sides of+ −
the $\inj$-definition. + −
+ −
How about the right-hand sides? Well, in the+ −
$r^*$ case for example we have according to the square in the picture + −
above a value $v_{der}$ which says how the derivative $r_{der}$+ −
matched the string. Since the derivative is of the form+ −
$(\der\,c\,r) \cdot (r^*)$ the corresponding value is of the+ −
form $Seq(v, Stars\,vs)$. But for $r^*$ we are looking for a value+ −
for the original (top left) regular expression --- so it cannot+ −
be a value of the shape $Seq(\ldots, Stars\ldots)$ because that is the+ −
shape for sequence regular expressions. What we need is a value+ −
of the form $Stars \ldots$ (remember the correspondence between + −
values and regular expressions). This suggests to define the right hand + −
side as+ −
+ −
\begin{center}+ −
$\ldots \dn Stars(\inj\,r\,c\,v\,::\,vs)$+ −
\end{center} + −
+ −
\noindent It is a value of the right shape, namely $Stars$. It injected + −
$c$ into the first-value, which is in fact the value where we need to + −
undo the derivative. Remember again the shape of the derivative of $r^*$.+ −
In place where we chopped off the $c$, we now need to do the $\inj$ of $c$.+ −
Therefore $\inj\,r\,c\,v$ in the definition above. That is the same with+ −
the other clauses in $\inj$. + −
+ −
+ −
Phew\ldots{}Sweat\ldots!\#@$\skull$\%\ldots Unfortunately, there is+ −
a gigantic problem with the described algorithm so far: it is very+ −
slow. To make it faster, we have to include in all this the simplification + −
from Lecture 2\ldots{}and what rotten luck: simplification messes things + −
up and we need to rectify the mess. This is what we shall do next.+ −
+ −
+ −
\subsubsection*{Simplification}+ −
+ −
Generally the matching algorithms based on derivatives do+ −
poorly unless the regular expressions are simplified after+ −
each derivative step. But this is a bit more involved in the+ −
algorithm of Sulzmann \& Lu. So what follows might require you+ −
to read several times before it makes sense and also might+ −
require that you do some example calculations yourself. As a+ −
first example consider the last derivation step in our earlier+ −
example:+ −
+ −
\begin{equation}\label{regexfour}+ −
r_4 = \der\,c\,r_3 = + −
(\ZERO \cdot (b \cdot c)) + ((\ZERO \cdot c) + \ONE)+ −
\end{equation}+ −
+ −
\noindent Simplifying this regular expression would just give+ −
us $\ONE$. Running $\textit{mkeps}$ with this $\ONE$ as+ −
input, however, would give us with the value $Empty$ instead of+ −
+ −
\[Right(Right(Empty))\]+ −
+ −
\noindent+ −
that was obtained without the simplification. The problem is we need+ −
to recreate this more complicated value, rather than just return+ −
$Empty$. This will require what I call \emph{rectification functions}.+ −
They need to be calculated whenever a regular expression gets+ −
simplified.+ −
+ −
Rectification functions take a value as argument and return a+ −
(rectified) value. In the example above the argument would be $Empty$+ −
and the output $Right(Right(Empty))$. Before we define these+ −
rectification functions in general, let us first take a look again at+ −
our simplification rules:+ −
+ −
\begin{center}+ −
\begin{tabular}{l@{\hspace{2mm}}c@{\hspace{2mm}}l}+ −
$r \cdot \ZERO$ & $\mapsto$ & $\ZERO$\\ + −
$\ZERO \cdot r$ & $\mapsto$ & $\ZERO$\\ + −
$r \cdot \ONE$ & $\mapsto$ & $r$\\ + −
$\ONE \cdot r$ & $\mapsto$ & $r$\\ + −
$r + \ZERO$ & $\mapsto$ & $r$\\ + −
$\ZERO + r$ & $\mapsto$ & $r$\\ + −
$r + r$ & $\mapsto$ & $r$\\ + −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent Applying them to $r_4$ in \eqref{regexfour} will require several nested+ −
simplifications in order end up with just $\ONE$. However,+ −
it is possible to apply them in a depth-first, or inside-out,+ −
manner in order to calculate this simplified regular+ −
expression.+ −
+ −
The rectification we can implement by letting simp return not+ −
just a (simplified) regular expression, but also a+ −
rectification function. Let us consider the alternative case,+ −
$r_1 + r_2$, first. By going depth-first, we first simplify+ −
the component regular expressions $r_1$ and $r_2.$ This will+ −
return simplified versions, say $r_{1s}$ and $r_{2s}$, of the+ −
component regular expressions (if they can be simplified) but+ −
also two rectification functions $f_{1s}$ and $f_{2s}$. We+ −
need to assemble them in order to obtain a rectified value for+ −
$r_1 + r_2$. In case $r_{1s}$ simplified to $\ZERO$, we+ −
continue the derivative calculation with $r_{2s}$. The+ −
Sulzmann \& Lu algorithm would return a corresponding value,+ −
say $v_{2s}$. But now this value needs to be ``rectified'' to+ −
the value + −
+ −
\begin{center}+ −
$Right(v_{2s})$+ −
\end{center}+ −
+ −
\noindent The reason is that we look for the value that tells+ −
us how $r_1 + r_2$ could have matched the string, not just+ −
$r_2$ or $r_{2s}$. Unfortunately, this is still not the right+ −
value in general because there might be some simplifications+ −
that happened inside $r_2$ and for which the simplification+ −
function returned also a rectification function $f_{2s}$. So in+ −
fact we need to apply this one too which gives+ −
+ −
\begin{center}+ −
$Right(f_{2s}(v_{2s}))$+ −
\end{center}+ −
+ −
\noindent This is now the correct, or rectified, value. Since+ −
the simplification will be done in the first phase of the+ −
algorithm, but the rectification needs to be done to the+ −
values in the second phase, it is advantageous to calculate+ −
the rectification as a function, remember this function and+ −
then apply the value to this function during the second phase.+ −
So if we want to implement the rectification as function, we + −
would need to return+ −
+ −
\begin{center}+ −
$\lambda v.\,Right(f_{2s}(v))$+ −
\end{center}+ −
+ −
\noindent which is the lambda-calculus notation for+ −
a function that expects a value $v$ and returns everything+ −
after the dot where $v$ is replaced by whatever value is + −
given.+ −
+ −
Let us package this idea with rectification functions+ −
into a single function (still only considering the alternative+ −
case):+ −
+ −
\begin{center}+ −
\begin{tabular}{ll}+ −
$_1$ & $simp(r)$:\\+ −
$_2$ & \quad case $r = r_1 + r_2$\\+ −
$_3$ & \qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\+ −
$_4$ & \qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\+ −
$_5$ & \qquad case $r_{1s} = \ZERO$: + −
return $(r_{2s}, \lambda v. \,Right(f_{2s}(v)))$\\+ −
$_6$ & \qquad case $r_{2s} = \ZERO$: + −
return $(r_{1s}, \lambda v. \,\Left(f_{1s}(v)))$\\+ −
$_7$ & \qquad case $r_{1s} = r_{2s}$:+ −
return $(r_{1s}, \lambda v. \,\Left(f_{1s}(v)))$\\+ −
$_8$ & \qquad otherwise: + −
return $(r_{1s} + r_{2s}, f_{alt}(f_{1s}, f_{2s}))$\\+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent We first recursively call the simplification with+ −
$r_1$ and $r_2$ (Lines 3 and 4). This gives simplified regular expressions,+ −
$r_{1s}$ and $r_{2s}$, as well as two rectification functions+ −
$f_{1s}$ and $f_{2s}$. We next need to test whether the+ −
simplified regular expressions are $\ZERO$ so as to make+ −
further simplifications. In case $r_{1s}$ is $\ZERO$ (Line 5),+ −
then we can return $r_{2s}$ (the other alternative). However+ −
for this we need to build a corresponding rectification + −
function, which as said above is+ −
+ −
\begin{center}+ −
$\lambda v.\,Right(f_{2s}(v))$+ −
\end{center}+ −
+ −
\noindent The case where $r_{2s} = \ZERO$ is similar:+ −
We return $r_{1s}$ and rectify with $\Left(\_)$ and the+ −
other calculated rectification function $f_{1s}$. This gives+ −
+ −
\begin{center}+ −
$\lambda v.\,\Left(f_{1s}(v))$+ −
\end{center}+ −
+ −
\noindent The next case where $r_{1s} = r_{2s}$ can be treated+ −
like the one where $r_{2s} = \ZERO$. We return $r_{1s}$+ −
and rectify with $\Left(\_)$ and so on.+ −
+ −
+ −
The otherwise-case is slightly more complicated. In this case+ −
neither $r_{1s}$ nor $r_{2s}$ are $\ZERO$ and also+ −
$r_{1s} \not= r_{2s}$, which means no further simplification+ −
can be applied. Accordingly, we return $r_{1s} + r_{2s}$ as+ −
the simplified regular expression. In principle we also do not+ −
have to do any rectification, because no simplification was+ −
done in this case. But this is actually not true: There might+ −
have been simplifications inside $r_{1s}$ and $r_{2s}$. We+ −
therefore need to take into account the calculated+ −
rectification functions $f_{1s}$ and $f_{2s}$. We can do this+ −
by defining a rectification function $f_{alt}$ which takes two+ −
rectification functions as arguments and applies them+ −
according to whether the value is of the form $\Left(\_)$ or+ −
$Right(\_)$:+ −
+ −
\begin{center}+ −
\begin{tabular}{l@{\hspace{1mm}}l}+ −
$f_{alt}(f_1, f_2) \dn$\\+ −
\qquad $\lambda v.\,$ case $v = \Left(v')$: + −
& return $\Left(f_1(v'))$\\+ −
\qquad \phantom{$\lambda v.\,$} case $v = Right(v')$: + −
& return $Right(f_2(v'))$\\ + −
\end{tabular}+ −
\end{center}+ −
+ −
The other interesting case with simplification is the sequence+ −
case. In this case the main simplification function is as+ −
follows+ −
+ −
\begin{center}+ −
\begin{tabular}{l}+ −
$simp(r)$:\hspace{5cm} (continued)\\+ −
\quad case $r = r_1 \cdot r_2$\\+ −
\qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\+ −
\qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\+ −
\qquad case $r_{1s} = \ZERO$: + −
return $(\ZERO, f_{error})$\\+ −
\qquad case $r_{2s} = \ZERO$: + −
return $(\ZERO, f_{error})$\\+ −
\qquad case $r_{1s} = \ONE$: + −
return $(r_{2s}, \lambda v. \,Seq(f_{1s}(Empty), f_{2s}(v)))$\\+ −
\qquad case $r_{2s} = \ONE$: + −
return $(r_{1s}, \lambda v. \,Seq(f_{1s}(v), f_{2s}(Empty)))$\\+ −
\qquad otherwise: + −
return $(r_{1s} \cdot r_{2s}, f_{seq}(f_{1s}, f_{2s}))$\\+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent whereby in the last line $f_{seq}$ is again pushing+ −
the two rectification functions into the two components of the+ −
Seq-value:+ −
+ −
\begin{center}+ −
\begin{tabular}{l@{\hspace{1mm}}l}+ −
$f_{seq}(f_1, f_2) \dn$\\+ −
\qquad $\lambda v.\,$ case $v = Seq(v_1, v_2)$: + −
& return $Seq(f_1(v_1), f_2(v_2))$\\+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent Note that in the case of $r_{1s} = \ZERO$+ −
(similarly $r_{2s}$) we use the function $f_{error}$ for+ −
rectification. If you think carefully, then you will realise+ −
that this function will actually never been called. This is+ −
because a sequence with $\ZERO$ will never recognise any+ −
string and therefore the second phase of the algorithm would+ −
never been called. The simplification function still expects+ −
us to give a function. So in my own implementation I just+ −
returned a function that raises an error. In the case+ −
where $r_{1s} = \ONE$ (similarly $r_{2s}$) we have+ −
to create a sequence where the first component is a rectified+ −
version of $Empty$. Therefore we call $f_{1s}$ with $Empty$.+ −
+ −
Since we only simplify regular expressions of the form $r_1 ++ −
r_2$ and $r_1 \cdot r_2$ we do not have to do anything else+ −
in the remaining cases. The rectification function will+ −
be just the identity, which in lambda-calculus terms is+ −
just+ −
+ −
\begin{center}+ −
$\lambda v.\,v$+ −
\end{center}+ −
+ −
\noindent This completes the high-level version of the simplification+ −
function, which is shown again in Figure~\ref{simp}. The Scala code+ −
for the simplification function is in Figure~\ref{simprect}.+ −
+ −
\begin{figure}[t]+ −
\begin{center}+ −
\begin{tabular}{l}+ −
$simp(r)$:\\+ −
\quad case $r = r_1 + r_2$\\+ −
\qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\+ −
\qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\+ −
\qquad case $r_{1s} = \ZERO$: + −
return $(r_{2s}, \lambda v. \,Right(f_{2s}(v)))$\\+ −
\qquad case $r_{2s} = \ZERO$: + −
return $(r_{1s}, \lambda v. \,\Left(f_{1s}(v)))$\\+ −
\qquad case $r_{1s} = r_{2s}$:+ −
return $(r_{1s}, \lambda v. \,\Left(f_{1s}(v)))$\\+ −
\qquad otherwise: + −
return $(r_{1s} + r_{2s}, f_{alt}(f_{1s}, f_{2s}))$+ −
\medskip\\+ −
+ −
\quad case $r = r_1 \cdot r_2$\\+ −
\qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\+ −
\qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\+ −
\qquad case $r_{1s} = \ZERO$: + −
return $(\ZERO, f_{error})$\\+ −
\qquad case $r_{2s} = \ZERO$: + −
return $(\ZERO, f_{error})$\\+ −
\qquad case $r_{1s} = \ONE$: + −
return $(r_{2s}, \lambda v. \,Seq(f_{1s}(Empty), f_{2s}(v)))$\\+ −
\qquad case $r_{2s} = \ONE$: + −
return $(r_{1s}, \lambda v. \,Seq(f_{1s}(v), f_{2s}(Empty)))$\\+ −
\qquad otherwise: + −
return $(r_{1s} \cdot r_{2s}, f_{seq}(f_{1s}, f_{2s}))$+ −
\medskip\\+ −
+ −
\quad otherwise:\\+ −
\qquad return $(r, \lambda v.\,v)$\\+ −
\end{tabular}+ −
\end{center}+ −
\caption{The simplification function that returns a simplified + −
regular expression and a rectification function.\label{simp}}+ −
\end{figure}+ −
+ −
\begin{figure}[p]+ −
\small + −
\lstinputlisting[numbers=left,linebackgroundcolor=+ −
{\ifodd\value{lstnumber}\color{capri!3}\fi}]{../progs/app61.scala}+ −
+ −
\caption{The Scala code for the simplification function. The+ −
first part defines some auxiliary functions for the rectification.+ −
The second part give the simplification function.+ −
\label{simprect}}+ −
\end{figure}+ −
+ −
+ −
We are now in the position to define a \emph{lexing function} as follows:+ −
+ −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$lex\,r\,[]$ & $\dn$ & if $\nullable(r)$ then $\textit{mkeps}(r)$\\+ −
& & else $error$\\+ −
$lex\,r\,c\!::\!s$ & $\dn$ & let + −
$(r_{simp}, f_{rect}) = simp(\der(c, r))$\\+ −
& & $\inj\,r\,c\,f_{rect}(lex\,r_{simp}\,s)$ + −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent This corresponds to the $matches$ function we have+ −
seen in earlier lectures. In the first clause we are given an+ −
empty string, $[]$, and need to test whether the regular+ −
expression is $nullable$. If yes, we can proceed normally and+ −
just return the value calculated by $\textit{mkeps}$. The second clause+ −
is for strings where the first character is $c$, say, and the+ −
rest of the string is $s$. We first build the derivative of+ −
$r$ with respect to $c$; simplify the resulting regular+ −
expression. We continue lexing with the simplified regular+ −
expression and the string $s$. Whatever will be returned as+ −
value, we sill need to rectify using the $f_{rect}$ from the+ −
simplification and finally inject $c$ back into the+ −
(rectified) value.+ −
+ −
+ −
\subsubsection*{Records}+ −
+ −
Remember we wanted to tokenize input strings, that means+ −
splitting strings into their ``word'' components. Furthermore+ −
we want to classify each token as being a keyword or+ −
identifier and so on. For this one more feature will be+ −
required, which I call a \emph{record} regular expression.+ −
While values encode how a regular expression matches a string,+ −
records can be used to ``focus'' on some particular parts of+ −
the regular expression and ``forget'' about others. + −
+ −
Let us look at an example. Suppose you have the regular+ −
expression $a\cdot b + a\cdot c$. Clearly this regular expression can only+ −
recognise two strings. But suppose you are not interested+ −
whether it can recognise $ab$ or $ac$, but rather if it+ −
matched, then what was the last character of the matched+ −
string\ldots either $b$ or $c$. You can do this by annotating+ −
the regular expression with a record, written in general+ −
$(x:r)$, where $x$ is just an identifier (in my implementation+ −
a plain string) and $r$ is a regular expression. A record will+ −
be regarded as a regular expression. The extended definition+ −
in Scala therefore looks as follows:+ −
+ −
{\small\lstinputlisting[language=Scala, numbers=none,linebackgroundcolor=+ −
{\ifodd\value{lstnumber}\color{capri!3}\fi}]+ −
{../progs/app03.scala}}+ −
+ −
\noindent Since we regard records as regular expressions we+ −
need to extend the functions $\nullable$ and $\der$. Similarly+ −
$\textit{mkeps}$ and $\inj$ need to be extended. This means we also need+ −
to extend the definition of values, which in Scala looks as+ −
follows:+ −
+ −
{\small\lstinputlisting[language=Scala, numbers=none,linebackgroundcolor=+ −
{\ifodd\value{lstnumber}\color{capri!3}\fi}]+ −
{../progs/app04.scala}}+ −
+ −
\noindent Let us now look at the purpose of records more+ −
closely and let us return to our question whether the string+ −
terminated in a $b$ or $c$. We can do this as follows: we+ −
annotate the regular expression $ab + ac$ with a record+ −
as follows+ −
+ −
\begin{center}+ −
$a\cdot (x:b) + a\cdot (x:c)$+ −
\end{center}+ −
+ −
\noindent This regular expression can still only recognise+ −
the strings $ab$ and $ac$, but we can now use a function+ −
that takes a value and returns all records. I call this+ −
function \emph{env} for environment\ldots it builds a list+ −
of identifiers associated with a string. This function+ −
can be defined as follows:+ −
+ −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$env(Empty)$ & $\dn$ & $[]$\\+ −
$env(Char(c))$ & $\dn$ & $[]$\\+ −
$env(\Left(v))$ & $\dn$ & $env(v)$\\+ −
$env(Right(v))$ & $\dn$ & $env(v)$\\+ −
$env(Seq(v_1,v_2))$& $\dn$ & $env(v_1) \,@\, env(v_2)$\\+ −
$env(Stars\,[v_1,\ldots ,v_n])$ & $\dn$ & + −
$env(v_1) \,@\ldots @\, env(v_n)$\\+ −
$env(Rec(x:v))$ & $\dn$ & $(x:|v|) :: env(v)$\\+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent where in the last clause we use the flatten function+ −
defined earlier. As can be seen, the function $env$ ``picks''+ −
out all underlying strings where a record is given. Since+ −
there can be more than one, the environment will potentially+ −
contain many ``records''. If we now postprocess the value+ −
calculated by $lex$ extracting all records using $env$, we can+ −
answer the question whether the last element in the string was+ −
an $b$ or a $c$. Lets see this in action: if we use $a\cdot b+ −
+ a\cdot c$ and $ac$ the calculated value will be+ −
+ −
\begin{center}+ −
$Right(Seq(Char(a), Char(c)))$+ −
\end{center}+ −
+ −
\noindent If we use instead $a\cdot (x:b) + a\cdot (x:c)$ and+ −
use the $env$ function to extract the recording for + −
$x$ we obtain+ −
+ −
\begin{center}+ −
$[(x:c)]$+ −
\end{center}+ −
+ −
\noindent If we had given the string $ab$ instead, then the+ −
record would have been $[(x:b)]$. The fun starts if we + −
iterate this. Consider the regular expression + −
+ −
\begin{center}+ −
$(a\cdot (x:b) + a\cdot (y:c))^*$+ −
\end{center}+ −
+ −
\noindent and the string $ababacabacab$. This string is + −
clearly matched by the regular expression, but we are only+ −
interested in the sequence of $b$s and $c$s. Using $env$+ −
we obtain+ −
+ −
\begin{center}+ −
$[(x:b), (x:b), (y:c), (x:b), (y:c), (x:b)]$+ −
\end{center}+ −
+ −
\noindent While this feature might look silly, it is in fact+ −
quite useful. For example if we want to match the name of+ −
an email we might use the regular expression+ −
+ −
\[+ −
(name: [a\mbox{-}z0\mbox{-}9\_\!\_\,.-]^+)\cdot @\cdot + −
(domain: [a\mbox{-}z0\mbox{-}9\,-]^+)\cdot .\cdot + −
(top\_level: [a\mbox{-}z\,.]^{\{2,6\}})+ −
\]+ −
+ −
\noindent Then if we match the email address+ −
+ −
\[+ −
\texttt{christian.urban@kcl.ac.uk}+ −
\]+ −
+ −
\noindent we can use the $env$ function and find out+ −
what the name, domain and top-level part of the email+ −
address are:+ −
+ −
\begin{center}+ −
$[(name:\texttt{christian.urban}), + −
(domain:\texttt{kcl}), + −
(top\_level:\texttt{ac.uk})]$+ −
\end{center}+ −
+ −
Recall that we want to lex a little programming language, called the+ −
\emph{While}-language. A simple program in this language is shown in+ −
Figure~\ref{while}. The main keywords in this language are+ −
\pcode{while}, \pcode{if}, \pcode{then} and+ −
\pcode{else}.\footnote{Contrast this with the COBOL programming+ −
language, which was developed around 1960 and thought to be dead for+ −
many decades---even your friendly lecturer is not old enough to have+ −
been taught this language. Anyway, this language had over 600+ −
keywords (or what they called \emph{reserved words}). Interestingly+ −
though this language is still used in 2020: during the height of+ −
Corona crisis the State of New Jewrsey in the US was looking for+ −
COBOL programers who could fix the state's national insurance+ −
webpage. You were probably paid in gold and diamonds, if you were+ −
able to program in COBOL. If you fixed their webpage, surely you+ −
were allowed to marry the governer's son/daughter. } As usual we+ −
have syntactic categories for identifiers, operators, numbers and so+ −
on. For this we would need to design the corresponding regular+ −
expressions to recognise these syntactic categories. I let you do this+ −
design task. Having these regular expressions at our disposal, we can+ −
form the regular expression+ −
+ −
\begin{figure}[t]+ −
\begin{center}+ −
\mbox{\lstinputlisting[language=while]{../progs/while-tests/fib.while}}+ −
\end{center}+ −
\caption{The Fibonacci program in the While-language.\label{while}}+ −
\end{figure}+ −
+ −
\begin{center}+ −
$\textit{WhileRegs} \;\dn\; + −
\left(\begin{array}{l}+ −
(k, KeyWords)\; +\\+ −
(i, Ids)\;+\\+ −
(o, Ops)\;+ \\+ −
(n, Nums)\;+ \\+ −
(s, Semis)\;+ \\+ −
(p, (LParens + RParens))\;+\\ + −
(b, (Begin + End))\;+ \\+ −
(w, WhiteSpaces)+ −
\end{array}\right)^{\mbox{\LARGE{}*}}$+ −
\end{center}+ −
+ −
\noindent and ask the algorithm by Sulzmann \& Lu to lex, say+ −
the following string+ −
+ −
\begin{center}\large+ −
\code{"if true then then 42 else +"}+ −
\end{center}+ −
+ −
\noindent+ −
By using the records and extracting the environment, the + −
result is the following list:+ −
+ −
\begin{center}\tt+ −
\begin{tabular}{l}+ −
(k, if),\\ + −
(w, " "),\\ + −
(i, true),\\ + −
(w, " "),\\ + −
(k, then),\\ + −
(w, " "),\\ + −
(k, then),\\ + −
(w, " "),\\ + −
(n, 42),\\ + −
(w, " "),\\ + −
(k, else),\\ + −
(w, " "),\\ + −
(o, +)+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent Typically we are not interested in the whitespaces + −
and comments and would filter them out: this gives+ −
+ −
\begin{center}\tt+ −
\begin{tabular}{l}+ −
(k, if),\\ + −
(i, true),\\ + −
(k, then),\\ + −
(k, then),\\ + −
(n, 42),\\ + −
(k, else),\\ + −
(o, +)+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent+ −
which will be the input for the next phase of our compiler.+ −
+ −
\end{document}+ −
+ −
%%% Local Variables: + −
%%% mode: latex+ −
%%% TeX-master: t+ −
%%% End: + −