\documentclass{article}+ −
\usepackage{../style}+ −
\usepackage{../langs}+ −
\usepackage{../graphics}+ −
+ −
+ −
\begin{document}+ −
\fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016}+ −
+ −
\section*{Handout 4 (Sulzmann \& Lu Algorithm)}+ −
+ −
So far our algorithm based on derivatives was only able to say+ −
yes or no depending on whether a string was matched by a regular+ −
expression or not. Often a more interesting question is to+ −
find out \emph{how} a regular expression matched a string?+ −
Answering this question will also help us with the problem we+ −
are after, namely tokenising an input string. + −
+ −
The algorithm we will be looking at in this lecture was designed by Sulzmann+ −
\& Lu in a rather recent research paper (from 2014). A link to it is+ −
provided on KEATS, in case you are interested.\footnote{In my+ −
humble opinion this is an interesting instance of the research+ −
literature: it contains a very neat idea, but its presentation+ −
is rather sloppy. In earlier versions of this paper, a King's+ −
undergraduate student and I found several rather annoying typos in the+ −
examples and definitions.} My PhD student Fahad Ausaf and I even more recently + −
wrote a paper where we build on their result: we provided a + −
mathematical proof that their algorithm is really correct---the proof + −
Sulzmann \& Lu had originally given contained major flaws. Such correctness+ −
proofs are important: Kuklewicz maintains a unit-test library+ −
for the kind of algorithma we are interested in here and he showed + −
that many implementations in the ``wild'' are buggy, that is not+ −
satisfy his unit tests:+ −
+ −
\begin{center}+ −
\url{http://www.haskell.org/haskellwiki/Regex_Posix}+ −
\end{center}+ −
+ −
+ −
In order to give an answer for+ −
\emph{how} a regular expression matches a string, Sulzmann and+ −
Lu use \emph{values}. A value will be the output of the+ −
algorithm whenever the regular expression matches the string.+ −
If the string does not match the string, an error will be+ −
raised. + −
The definition for values is given below. They are shown + −
together with the regular expressions $r$ to which+ −
they correspond:+ −
+ −
\begin{center}+ −
\begin{tabular}{cc}+ −
\begin{tabular}{@{}rrl@{}}+ −
\multicolumn{3}{c}{regular expressions}\medskip\\+ −
$r$ & $::=$ & $\ZERO$\\+ −
& $\mid$ & $\ONE$ \\+ −
& $\mid$ & $c$ \\+ −
& $\mid$ & $r_1 \cdot r_2$\\+ −
& $\mid$ & $r_1 + r_2$ \\+ −
\\+ −
& $\mid$ & $r^*$ \\+ −
\end{tabular}+ −
&+ −
\begin{tabular}{@{\hspace{0mm}}rrl@{}}+ −
\multicolumn{3}{c}{values}\medskip\\+ −
$v$ & $::=$ & \\+ −
& & $Empty$ \\+ −
& $\mid$ & $Char(c)$ \\+ −
& $\mid$ & $Seq(v_1,v_2)$\\+ −
& $\mid$ & $\Left(v)$ \\+ −
& $\mid$ & $Right(v)$ \\+ −
& $\mid$ & $Stars\,[v_1,\ldots\,v_n]$ \\+ −
\end{tabular}+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent There is no value for the+ −
$\ZERO$ regular expression, since it does not match any+ −
string. Otherwise there is exactly one value corresponding to+ −
each regular expression with the exception of $r_1 + r_2$+ −
where there are two values, namely $\Left(v)$ and $Right(v)$+ −
corresponding to the two alternatives. Note that $r^*$ is+ −
associated with a list of values, one for each copy of $r$+ −
that was needed to match the string. This means we might also+ −
return the empty list $Stars []$, if no copy was needed in case+ −
of $r^*$. For sequence, there is exactly one value, composed + −
of two component values ($v_1$ and $v_2$).+ −
+ −
My implementation of regular expressions and values in Scala is+ −
shown below. I have in my implementation the convention that+ −
regular expressions are written entirely with upper-case+ −
letters, while values just start with a single upper-case+ −
character and the rest are lower-case letters.+ −
+ −
{\small\lstinputlisting[language=Scala,numbers=none,linebackgroundcolor=+ −
{\ifodd\value{lstnumber}\color{capri!3}\fi}]+ −
{../progs/app01.scala}}+ −
+ −
+ −
{\small\lstinputlisting[language=Scala,numbers=none,linebackgroundcolor=+ −
{\ifodd\value{lstnumber}\color{capri!3}\fi}]+ −
{../progs/app02.scala}}+ −
+ −
+ −
Graphically the algorithm by Sulzmann \& Lu can be illustrated+ −
by the picture in Figure~\ref{Sulz} where the path from the+ −
left to the right involving $\textit{der}/\textit{nullable}$ is the first phase+ −
of the algorithm and $\textit{mkeps}/\textit{inj}$, the path from right to left,+ −
the second phase. This picture shows the steps required when a+ −
regular expression, say $r_1$, matches the string $abc$. We+ −
first build the three derivatives (according to $a$, $b$ and+ −
$c$). We then use $\textit{nullable}$ to find out whether the resulting+ −
regular expression can match the empty string. If yes, we call+ −
the function $\textit{mkeps}$. The $\textit{mkeps}$ function calculates a value for how a regular+ −
expression has matched the empty string. Its definition+ −
is as follows:+ −
+ −
\begin{figure}[t]+ −
\begin{center}+ −
\begin{tikzpicture}[scale=2,node distance=1.2cm,+ −
every node/.style={minimum size=7mm}]+ −
\node (r1) {$r_1$};+ −
\node (r2) [right=of r1]{$r_2$};+ −
\draw[->,line width=1mm](r1)--(r2) node[above,midway] {$\textit{der}\,a$};+ −
\node (r3) [right=of r2]{$r_3$};+ −
\draw[->,line width=1mm](r2)--(r3) node[above,midway] {$\textit{der}\,b$};+ −
\node (r4) [right=of r3]{$r_4$};+ −
\draw[->,line width=1mm](r3)--(r4) node[above,midway] {$\textit{der}\,c$};+ −
\draw (r4) node[anchor=west] {\;\raisebox{3mm}{$\textit{nullable}$}};+ −
\node (v4) [below=of r4]{$v_4$};+ −
\draw[->,line width=1mm](r4) -- (v4);+ −
\node (v3) [left=of v4] {$v_3$};+ −
\draw[->,line width=1mm](v4)--(v3) node[below,midway] {$\textit{inj}\,c$};+ −
\node (v2) [left=of v3]{$v_2$};+ −
\draw[->,line width=1mm](v3)--(v2) node[below,midway] {$\textit{inj}\,b$};+ −
\node (v1) [left=of v2] {$v_1$};+ −
\draw[->,line width=1mm](v2)--(v1) node[below,midway] {$\textit{inj}\,a$};+ −
\draw (r4) node[anchor=north west] {\;\raisebox{-8mm}{$\textit{mkeps}$}};+ −
\end{tikzpicture}+ −
\end{center}+ −
\caption{The two phases of the algorithm by Sulzmann \& Lu.+ −
\label{Sulz}}+ −
\end{figure}+ −
+ −
+ −
+ −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$\textit{mkeps}(\ONE)$ & $\dn$ & $Empty$\\+ −
$\textit{mkeps}(r_1 + r_2)$ & $\dn$ & if $\textit{nullable}(r_1)$ \\+ −
& & then $\Left(\textit{mkeps}(r_1))$\\+ −
& & else $Right(\textit{mkeps}(r_2))$\\+ −
$\textit{mkeps}(r_1 \cdot r_2)$ & $\dn$ & $Seq(\textit{mkeps}(r_1),\textit{mkeps}(r_2))$\\+ −
$\textit{mkeps}(r^*)$ & $\dn$ & $Stars\,[]$ \\+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent There is are no cases for $\ZERO$ and $c$, since+ −
these regular expression cannot match the empty string. Note+ −
also that in case of alternatives we give preference to the+ −
regular expression on the left-hand side. This will become+ −
important later on.+ −
+ −
The second phase of the algorithm is organised so that it will+ −
calculate a value for how the derivative regular expression+ −
has matched a string. For this we need a function that+ −
reverses this ``chopping off'' for values which we did in the+ −
first phase for derivatives. The corresponding function is+ −
called $\textit{inj}$ (short for injection). This function takes three+ −
arguments: the first one is a regular expression for which we+ −
want to calculate the value, the second is the character we+ −
want to inject and the third argument is the value where we+ −
will inject the character into. The result of this function is a+ −
new value. The definition of $\textit{inj}$ is as follows: + −
+ −
\begin{center}+ −
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}+ −
$\textit{inj}\,(c)\,c\,Empty$ & $\dn$ & $Char\,c$\\+ −
$\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\+ −
$\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\+ −
$\textit{inj}\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$ & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\+ −
$\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$ & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\+ −
$\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$ & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\+ −
$\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$ & $\dn$ & $Stars(\textit{inj}\,r\,c\,v\,::\,vs)$\\+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent This definition is by recursion on the regular+ −
expression and by analysing the shape of the values. Therefore+ −
there are three cases for sequence regular+ −
expressions (for all possible shapes of the value we can encounter). The last+ −
clause for the star regular expression returns a list where+ −
the first element is $\textit{inj}\,r\,c\,v$ and the other elements are+ −
$vs$. That means $\_\,::\,\_$ should be read as list cons.+ −
+ −
To understand what is going on, it might be best to do some+ −
example calculations and compare them with Figure~\ref{Sulz}.+ −
For this note that we have not yet dealt with the need of+ −
simplifying regular expressions (this will be a topic on its+ −
own later). Suppose the regular expression is $a \cdot (b+ −
\cdot c)$ and the input string is $abc$. The derivatives from+ −
the first phase are as follows:+ −
+ −
\begin{center}+ −
\begin{tabular}{ll}+ −
$r_1$: & $a \cdot (b \cdot c)$\\+ −
$r_2$: & $\ONE \cdot (b \cdot c)$\\+ −
$r_3$: & $(\ZERO \cdot (b \cdot c)) + (\ONE \cdot c)$\\+ −
$r_4$: & $(\ZERO \cdot (b \cdot c)) + ((\ZERO \cdot c) + \ONE)$\\+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent According to the simple algorithm, we would test+ −
whether $r_4$ is nullable, which in this case it indeed is.+ −
This means we can use the function $\textit{mkeps}$ to calculate a+ −
value for how $r_4$ was able to match the empty string.+ −
Remember that this function gives preference for alternatives+ −
on the left-hand side. However there is only $\ONE$ on the+ −
very right-hand side of $r_4$ (underlined)+ −
+ −
\begin{center}+ −
$r_4$:\;$(\ZERO \cdot (b \cdot c)) + + −
((\ZERO \cdot c) + \underline{\ONE})$\\+ −
\end{center}+ −
+ −
\noindent+ −
that matches the empty string.+ −
Therefore $\textit{mkeps}$ returns the value+ −
+ −
\begin{center}+ −
$v_4:\;Right(Right(Empty))$+ −
\end{center}+ −
+ −
\noindent If there had been a $\ONE$ on the left, then+ −
$\textit{mkeps}$ would have returned something of the form+ −
$\Left(\ldots)$. The point is that from this value we can+ −
directly read off which part of $r_4$ matched the empty+ −
string: take the right-alternative first, and then the+ −
right-alternative again. + −
+ −
Next we have to ``inject'' the last character, that is $c$ in+ −
the running example, into this value $v_4$ in order to+ −
calculate how $r_3$ could have matched the string $c$.+ −
According to the definition of $\textit{inj}$ we obtain+ −
+ −
\begin{center}+ −
$v_3:\;Right(Seq(Empty, Char(c)))$+ −
\end{center}+ −
+ −
\noindent This is the correct result, because $r_3$ needs+ −
to use the right-hand alternative, and then $\ONE$ needs+ −
to match the empty string and $c$ needs to match $c$.+ −
Next we need to inject back the letter $b$ into $v_3$. This+ −
gives+ −
+ −
\begin{center}+ −
$v_2:\;Seq(Empty, Seq(Char(b), Char(c)))$+ −
\end{center}+ −
+ −
\noindent which is again the correct result for how $r_2$+ −
matched the string $bc$. Finally we need to inject back the+ −
letter $a$ into $v_2$ giving the final result+ −
+ −
\begin{center}+ −
$v_1:\;Seq(Char(a), Seq(Char(b), Char(c)))$+ −
\end{center}+ −
+ −
\noindent This now corresponds to how the regular+ −
expression $a \cdot (b \cdot c)$ matched the string $abc$.+ −
+ −
There are a few auxiliary functions that are of interest+ −
when analysing this algorithm. One is called \emph{flatten},+ −
written $|\_|$, which extracts the string ``underlying'' a + −
value. It is defined recursively as+ −
+ −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$|Empty|$ & $\dn$ & $[]$\\+ −
$|Char(c)|$ & $\dn$ & $[c]$\\+ −
$|\Left(v)|$ & $\dn$ & $|v|$\\+ −
$|Right(v)|$ & $\dn$ & $|v|$\\+ −
$|Seq(v_1,v_2)|$& $\dn$ & $|v_1| \,@\, |v_2|$\\+ −
$|Stars\,[v_1,\ldots ,v_n]|$ & $\dn$ & $|v_1| \,@\ldots @\, |v_n|$\\+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent Using flatten we can see what are the strings behind + −
the values calculated by $\textit{mkeps}$ and $\textit{inj}$. In our running + −
example:+ −
+ −
\begin{center}+ −
\begin{tabular}{ll}+ −
$|v_4|$: & $[]$\\+ −
$|v_3|$: & $c$\\+ −
$|v_2|$: & $bc$\\+ −
$|v_1|$: & $abc$+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent This indicates that $\textit{inj}$ indeed is injecting, or+ −
adding, back a character into the value. + −
+ −
There is a problem, however, with the described algorithm+ −
so far: it is very slow. We need to include the simplification + −
from Lecture 2. This is what we shall do next.+ −
+ −
+ −
\subsubsection*{Simplification}+ −
+ −
Generally the matching algorithms based on derivatives do+ −
poorly unless the regular expressions are simplified after+ −
each derivative step. But this is a bit more involved in the+ −
algorithm of Sulzmann \& Lu. So what follows might require you+ −
to read several times before it makes sense and also might+ −
require that you do some example calculations yourself. As a+ −
first example consider the last derivation step in our earlier+ −
example:+ −
+ −
\begin{equation}\label{regexfour}+ −
r_4 = \textit{der}\,c\,r_3 = + −
(\ZERO \cdot (b \cdot c)) + ((\ZERO \cdot c) + \ONE)+ −
\end{equation}+ −
+ −
\noindent Simplifying this regular expression would just give+ −
us $\ONE$. Running $\textit{mkeps}$ with this $\ONE$ as+ −
input, however, would give us with the value $Empty$ instead of+ −
+ −
\[Right(Right(Empty))\]+ −
+ −
\noindent+ −
that was obtained without the simplification. The problem is we need+ −
to recreate this more complicated value, rather than just return+ −
$Empty$. This will require what I call \emph{rectification functions}.+ −
They need to be calculated whenever a regular expression gets+ −
simplified.+ −
+ −
Rectification functions take a value as argument and return a+ −
(rectified) value. In the example above the argument would be $Empty$+ −
and the output $Right(Right(Empty))$. Before we define these+ −
rectification functions in general, let us first take a look again at+ −
our simplification rules:+ −
+ −
\begin{center}+ −
\begin{tabular}{l@{\hspace{2mm}}c@{\hspace{2mm}}l}+ −
$r \cdot \ZERO$ & $\mapsto$ & $\ZERO$\\ + −
$\ZERO \cdot r$ & $\mapsto$ & $\ZERO$\\ + −
$r \cdot \ONE$ & $\mapsto$ & $r$\\ + −
$\ONE \cdot r$ & $\mapsto$ & $r$\\ + −
$r + \ZERO$ & $\mapsto$ & $r$\\ + −
$\ZERO + r$ & $\mapsto$ & $r$\\ + −
$r + r$ & $\mapsto$ & $r$\\ + −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent Applying them to $r_4$ in \eqref{regexfour} will require several nested+ −
simplifications in order end up with just $\ONE$. However,+ −
it is possible to apply them in a depth-first, or inside-out,+ −
manner in order to calculate this simplified regular+ −
expression.+ −
+ −
The rectification we can implement by letting simp return not+ −
just a (simplified) regular expression, but also a+ −
rectification function. Let us consider the alternative case,+ −
$r_1 + r_2$, first. By going depth-first, we first simplify+ −
the component regular expressions $r_1$ and $r_2.$ This will+ −
return simplified versions, say $r_{1s}$ and $r_{2s}$, of the+ −
component regular expressions (if they can be simplified) but+ −
also two rectification functions $f_{1s}$ and $f_{2s}$. We+ −
need to assemble them in order to obtain a rectified value for+ −
$r_1 + r_2$. In case $r_{1s}$ simplified to $\ZERO$, we+ −
continue the derivative calculation with $r_{2s}$. The+ −
Sulzmann \& Lu algorithm would return a corresponding value,+ −
say $v_{2s}$. But now this value needs to be ``rectified'' to+ −
the value + −
+ −
\begin{center}+ −
$Right(v_{2s})$+ −
\end{center}+ −
+ −
\noindent The reason is that we look for the value that tells+ −
us how $r_1 + r_2$ could have matched the string, not just+ −
$r_2$ or $r_{2s}$. Unfortunately, this is still not the right+ −
value in general because there might be some simplifications+ −
that happened inside $r_2$ and for which the simplification+ −
function retuned also a rectification function $f_{2s}$. So in+ −
fact we need to apply this one too which gives+ −
+ −
\begin{center}+ −
$Right(f_{2s}(v_{2s}))$+ −
\end{center}+ −
+ −
\noindent This is now the correct, or rectified, value. Since+ −
the simplification will be done in the first phase of the+ −
algorithm, but the rectification needs to be done to the+ −
values in the second phase, it is advantageous to calculate+ −
the rectification as a function, remember this function and+ −
then apply the value to this function during the second phase.+ −
So if we want to implement the rectification as function, we + −
would need to return+ −
+ −
\begin{center}+ −
$\lambda v.\,Right(f_{2s}(v))$+ −
\end{center}+ −
+ −
\noindent which is the lambda-calculus notation for+ −
a function that expects a value $v$ and returns everything+ −
after the dot where $v$ is replaced by whatever value is + −
given.+ −
+ −
Let us package this idea with rectification functions+ −
into a single function (still only considering the alternative+ −
case):+ −
+ −
\begin{center}+ −
\begin{tabular}{ll}+ −
$_1$ & $simp(r)$:\\+ −
$_2$ & \quad case $r = r_1 + r_2$\\+ −
$_3$ & \qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\+ −
$_4$ & \qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\+ −
$_5$ & \qquad case $r_{1s} = \ZERO$: + −
return $(r_{2s}, \lambda v. \,Right(f_{2s}(v)))$\\+ −
$_6$ & \qquad case $r_{2s} = \ZERO$: + −
return $(r_{1s}, \lambda v. \,\Left(f_{1s}(v)))$\\+ −
$_7$ & \qquad case $r_{1s} = r_{2s}$:+ −
return $(r_{1s}, \lambda v. \,\Left(f_{1s}(v)))$\\+ −
$_8$ & \qquad otherwise: + −
return $(r_{1s} + r_{2s}, f_{alt}(f_{1s}, f_{2s}))$\\+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent We first recursively call the simplification with+ −
$r_1$ and $r_2$ (Lines 3 and 4). This gives simplified regular expressions,+ −
$r_{1s}$ and $r_{2s}$, as well as two rectification functions+ −
$f_{1s}$ and $f_{2s}$. We next need to test whether the+ −
simplified regular expressions are $\ZERO$ so as to make+ −
further simplifications. In case $r_{1s}$ is $\ZERO$ (Line 5),+ −
then we can return $r_{2s}$ (the other alternative). However+ −
for this we need to build a corresponding rectification + −
function, which as said above is+ −
+ −
\begin{center}+ −
$\lambda v.\,Right(f_{2s}(v))$+ −
\end{center}+ −
+ −
\noindent The case where $r_{2s} = \ZERO$ is similar:+ −
We return $r_{1s}$ and rectify with $\Left(\_)$ and the+ −
other calculated rectification function $f_{1s}$. This gives+ −
+ −
\begin{center}+ −
$\lambda v.\,\Left(f_{1s}(v))$+ −
\end{center}+ −
+ −
\noindent The next case where $r_{1s} = r_{2s}$ can be treated+ −
like the one where $r_{2s} = \ZERO$. We return $r_{1s}$+ −
and rectify with $\Left(\_)$ and so on.+ −
+ −
+ −
The otherwise-case is slightly more complicated. In this case+ −
neither $r_{1s}$ nor $r_{2s}$ are $\ZERO$ and also+ −
$r_{1s} \not= r_{2s}$, which means no further simplification+ −
can be applied. Accordingly, we return $r_{1s} + r_{2s}$ as+ −
the simplified regular expression. In principle we also do not+ −
have to do any rectification, because no simplification was+ −
done in this case. But this is actually not true: There might+ −
have been simplifications inside $r_{1s}$ and $r_{2s}$. We+ −
therefore need to take into account the calculated+ −
rectification functions $f_{1s}$ and $f_{2s}$. We can do this+ −
by defining a rectification function $f_{alt}$ which takes two+ −
rectification functions as arguments and applies them+ −
according to whether the value is of the form $\Left(\_)$ or+ −
$Right(\_)$:+ −
+ −
\begin{center}+ −
\begin{tabular}{l@{\hspace{1mm}}l}+ −
$f_{alt}(f_1, f_2) \dn$\\+ −
\qquad $\lambda v.\,$ case $v = \Left(v')$: + −
& return $\Left(f_1(v'))$\\+ −
\qquad \phantom{$\lambda v.\,$} case $v = Right(v')$: + −
& return $Right(f_2(v'))$\\ + −
\end{tabular}+ −
\end{center}+ −
+ −
The other interesting case with simplification is the sequence+ −
case. In this case the main simplification function is as+ −
follows+ −
+ −
\begin{center}+ −
\begin{tabular}{l}+ −
$simp(r)$:\hspace{5cm} (continued)\\+ −
\quad case $r = r_1 \cdot r_2$\\+ −
\qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\+ −
\qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\+ −
\qquad case $r_{1s} = \ZERO$: + −
return $(\ZERO, f_{error})$\\+ −
\qquad case $r_{2s} = \ZERO$: + −
return $(\ZERO, f_{error})$\\+ −
\qquad case $r_{1s} = \ONE$: + −
return $(r_{2s}, \lambda v. \,Seq(f_{1s}(Empty), f_{2s}(v)))$\\+ −
\qquad case $r_{2s} = \ONE$: + −
return $(r_{1s}, \lambda v. \,Seq(f_{1s}(v), f_{2s}(Empty)))$\\+ −
\qquad otherwise: + −
return $(r_{1s} \cdot r_{2s}, f_{seq}(f_{1s}, f_{2s}))$\\+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent whereby in the last line $f_{seq}$ is again pushing+ −
the two rectification functions into the two components of the+ −
Seq-value:+ −
+ −
\begin{center}+ −
\begin{tabular}{l@{\hspace{1mm}}l}+ −
$f_{seq}(f_1, f_2) \dn$\\+ −
\qquad $\lambda v.\,$ case $v = Seq(v_1, v_2)$: + −
& return $Seq(f_1(v_1), f_2(v_2))$\\+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent Note that in the case of $r_{1s} = \ZERO$+ −
(similarly $r_{2s}$) we use the function $f_{error}$ for+ −
rectification. If you think carefully, then you will realise+ −
that this function will actually never been called. This is+ −
because a sequence with $\ZERO$ will never recognise any+ −
string and therefore the second phase of the algorithm would+ −
never been called. The simplification function still expects+ −
us to give a function. So in my own implementation I just+ −
returned a function that raises an error. In the case+ −
where $r_{1s} = \ONE$ (similarly $r_{2s}$) we have+ −
to create a sequence where the first component is a rectified+ −
version of $Empty$. Therefore we call $f_{1s}$ with $Empty$.+ −
+ −
Since we only simplify regular expressions of the form $r_1 ++ −
r_2$ and $r_1 \cdot r_2$ we do not have to do anything else+ −
in the remaining cases. The rectification function will+ −
be just the identity, which in lambda-calculus terms is+ −
just+ −
+ −
\begin{center}+ −
$\lambda v.\,v$+ −
\end{center}+ −
+ −
\noindent This completes the high-level version of the simplification+ −
function, which is shown again in Figure~\ref{simp}. The Scala code+ −
for the simplification function is in Figure~\ref{simprect}.+ −
+ −
\begin{figure}[t]+ −
\begin{center}+ −
\begin{tabular}{l}+ −
$simp(r)$:\\+ −
\quad case $r = r_1 + r_2$\\+ −
\qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\+ −
\qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\+ −
\qquad case $r_{1s} = \ZERO$: + −
return $(r_{2s}, \lambda v. \,Right(f_{2s}(v)))$\\+ −
\qquad case $r_{2s} = \ZERO$: + −
return $(r_{1s}, \lambda v. \,\Left(f_{1s}(v)))$\\+ −
\qquad case $r_{1s} = r_{2s}$:+ −
return $(r_{1s}, \lambda v. \,\Left(f_{1s}(v)))$\\+ −
\qquad otherwise: + −
return $(r_{1s} + r_{2s}, f_{alt}(f_{1s}, f_{2s}))$+ −
\medskip\\+ −
+ −
\quad case $r = r_1 \cdot r_2$\\+ −
\qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\+ −
\qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\+ −
\qquad case $r_{1s} = \ZERO$: + −
return $(\ZERO, f_{error})$\\+ −
\qquad case $r_{2s} = \ZERO$: + −
return $(\ZERO, f_{error})$\\+ −
\qquad case $r_{1s} = \ONE$: + −
return $(r_{2s}, \lambda v. \,Seq(f_{1s}(Empty), f_{2s}(v)))$\\+ −
\qquad case $r_{2s} = \ONE$: + −
return $(r_{1s}, \lambda v. \,Seq(f_{1s}(v), f_{2s}(Empty)))$\\+ −
\qquad otherwise: + −
return $(r_{1s} \cdot r_{2s}, f_{seq}(f_{1s}, f_{2s}))$+ −
\medskip\\+ −
+ −
\quad otherwise:\\+ −
\qquad return $(r, \lambda v.\,v)$\\+ −
\end{tabular}+ −
\end{center}+ −
\caption{The simplification function that returns a simplified + −
regular expression and a rectification function.\label{simp}}+ −
\end{figure}+ −
+ −
\begin{figure}[p]+ −
\small + −
\lstinputlisting[numbers=left,linebackgroundcolor=+ −
{\ifodd\value{lstnumber}\color{capri!3}\fi}]{../progs/app61.scala}+ −
+ −
\caption{The Scala code for the simplification function. The+ −
first part defines some auxillary functions for the rectification.+ −
The second part give the simplification function.+ −
\label{simprect}}+ −
\end{figure}+ −
+ −
+ −
We are now in the position to define a \emph{lexing function} as follows:+ −
+ −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$lex\,r\,[]$ & $\dn$ & if $\textit{nullable}(r)$ then $\textit{mkeps}(r)$\\+ −
& & else $error$\\+ −
$lex\,r\,c\!::\!s$ & $\dn$ & let + −
$(r_{simp}, f_{rect}) = simp(\textit{der}(c, r))$\\+ −
& & $\textit{inj}\,r\,c\,f_{rect}(lex\,r_{simp}\,s)$ + −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent This corresponds to the $matches$ function we have+ −
seen in earlier lectures. In the first clause we are given an+ −
empty string, $[]$, and need to test wether the regular+ −
expression is $nullable$. If yes, we can proceed normally and+ −
just return the value calculated by $\textit{mkeps}$. The second clause+ −
is for strings where the first character is $c$, say, and the+ −
rest of the string is $s$. We first build the derivative of+ −
$r$ with respect to $c$; simplify the resulting regular+ −
expression. We continue lexing with the simplified regular+ −
expression and the string $s$. Whatever will be returned as+ −
value, we sill need to rectify using the $f_{rect}$ from the+ −
simplification and finally inject $c$ back into the+ −
(rectified) value.+ −
+ −
+ −
\subsubsection*{Records}+ −
+ −
Remember we wanted to tokenize input strings, that means+ −
splitting strings into their ``word'' components. Furthermore+ −
we want to classify each token as being a keyword or+ −
identifier and so on. For this one more feature will be+ −
required, which I call a \emph{record} regular expression.+ −
While values encode how a regular expression matches a string,+ −
records can be used to ``focus'' on some particular parts of+ −
the regular expression and ``forget'' about others. + −
+ −
Let us look at an example. Suppose you have the regular+ −
expression $a\cdot b + a\cdot c$. Clearly this regular expression can only+ −
recognise two strings. But suppose you are not interested+ −
whether it can recognise $ab$ or $ac$, but rather if it+ −
matched, then what was the last character of the matched+ −
string\ldots either $b$ or $c$. You can do this by annotating+ −
the regular expression with a record, written in general+ −
$(x:r)$, where $x$ is just an identifier (in my implementation+ −
a plain string) and $r$ is a regular expression. A record will+ −
be regarded as a regular expression. The extended definition+ −
in Scala therefore looks as follows:+ −
+ −
{\small\lstinputlisting[language=Scala, numbers=none,linebackgroundcolor=+ −
{\ifodd\value{lstnumber}\color{capri!3}\fi}]+ −
{../progs/app03.scala}}+ −
+ −
\noindent Since we regard records as regular expressions we+ −
need to extend the functions $\textit{nullable}$ and $\textit{der}$. Similarly+ −
$\textit{mkeps}$ and $\textit{inj}$ need to be extended. This means we also need+ −
to extend the definition of values, which in Scala looks as+ −
follows:+ −
+ −
{\small\lstinputlisting[language=Scala, numbers=none,linebackgroundcolor=+ −
{\ifodd\value{lstnumber}\color{capri!3}\fi}]+ −
{../progs/app04.scala}}+ −
+ −
\noindent Let us now look at the purpose of records more+ −
closely and let us return to our question whether the string+ −
terminated in a $b$ or $c$. We can do this as follows: we+ −
annotate the regular expression $ab + ac$ with a record+ −
as follows+ −
+ −
\begin{center}+ −
$a\cdot (x:b) + a\cdot (x:c)$+ −
\end{center}+ −
+ −
\noindent This regular expression can still only recognise+ −
the strings $ab$ and $ac$, but we can now use a function+ −
that takes a value and returns all records. I call this+ −
function \emph{env} for environment\ldots it builds a list+ −
of identifiers associated with a string. This function+ −
can be defined as follows:+ −
+ −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$env(Empty)$ & $\dn$ & $[]$\\+ −
$env(Char(c))$ & $\dn$ & $[]$\\+ −
$env(\Left(v))$ & $\dn$ & $env(v)$\\+ −
$env(Right(v))$ & $\dn$ & $env(v)$\\+ −
$env(Seq(v_1,v_2))$& $\dn$ & $env(v_1) \,@\, env(v_2)$\\+ −
$env(Stars\,[v_1,\ldots ,v_n])$ & $\dn$ & + −
$env(v_1) \,@\ldots @\, env(v_n)$\\+ −
$env(Rec(x:v))$ & $\dn$ & $(x:|v|) :: env(v)$\\+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent where in the last clause we use the flatten function+ −
defined earlier. As can be seen, the function $env$ ``picks''+ −
out all underlying strings where a record is given. Since+ −
there can be more than one, the environment will potentially+ −
contain many ``records''. If we now postprocess the value+ −
calculated by $lex$ extracting all records using $env$, we can+ −
answer the question whether the last element in the string was+ −
an $b$ or a $c$. Lets see this in action: if we use $a\cdot b+ −
+ a\cdot c$ and $ac$ the calculated value will be+ −
+ −
\begin{center}+ −
$Right(Seq(Char(a), Char(c)))$+ −
\end{center}+ −
+ −
\noindent If we use instead $a\cdot (x:b) + a\cdot (x:c)$ and+ −
use the $env$ function to extract the recording for + −
$x$ we obtain+ −
+ −
\begin{center}+ −
$[(x:c)]$+ −
\end{center}+ −
+ −
\noindent If we had given the string $ab$ instead, then the+ −
record would have been $[(x:b)]$. The fun starts if we + −
iterate this. Consider the regular expression + −
+ −
\begin{center}+ −
$(a\cdot (x:b) + a\cdot (y:c))^*$+ −
\end{center}+ −
+ −
\noindent and the string $ababacabacab$. This string is + −
clearly matched by the regular expression, but we are only+ −
interested in the sequence of $b$s and $c$s. Using $env$+ −
we obtain+ −
+ −
\begin{center}+ −
$[(x:b), (x:b), (y:c), (x:b), (y:c), (x:b)]$+ −
\end{center}+ −
+ −
\noindent While this feature might look silly, it is in fact+ −
quite useful. For example if we want to match the name of+ −
an email we might use the regular expression+ −
+ −
\[+ −
(name: [a\mbox{-}z0\mbox{-}9\_\!\_\,.-]^+)\cdot @\cdot + −
(domain: [a\mbox{-}z0\mbox{-}9\,.-]^+)\cdot .\cdot + −
(top\_level: [a\mbox{-}z\,.]^{\{2,6\}})+ −
\]+ −
+ −
\noindent Then if we match the email address+ −
+ −
\[+ −
\texttt{christian.urban@kcl.ac.uk}+ −
\]+ −
+ −
\noindent we can use the $env$ function and find out+ −
what the name, domain and top-level part of the email+ −
address are:+ −
+ −
\begin{center}+ −
$[(name:\texttt{christian.urban}), + −
(domain:\texttt{kcl}), + −
(top\_level:\texttt{ac.uk})]$+ −
\end{center}+ −
+ −
Recall that we want to lex a little programming language,+ −
called the \emph{While}-language. A simple program in this+ −
language is shown in Figure~\ref{while}. The main keywords in+ −
this language are \pcode{while}, \pcode{if}, \pcode{then} and+ −
\pcode{else}. As usual we have syntactic categories for+ −
identifiers, operators, numbers and so on. For this we would+ −
need to design the corresponding regular expressions to+ −
recognise these syntactic categories. I let you do this design+ −
task. Having these regular expressions at our disposal, we can+ −
form the regular expression+ −
+ −
\begin{figure}[t]+ −
\begin{center}+ −
\mbox{\lstinputlisting[language=while]{../progs/fib.while}}+ −
\end{center}+ −
\caption{The Fibonacci program in the While-language.\label{while}}+ −
\end{figure}+ −
+ −
\begin{center}+ −
$\textit{WhileRegs} \;\dn\; + −
\left(\begin{array}{l}+ −
(k, KeyWords)\; +\\+ −
(i, Ids)\;+\\+ −
(o, Ops)\;+ \\+ −
(n, Nums)\;+ \\+ −
(s, Semis)\;+ \\+ −
(p, (LParens + RParens))\;+\\ + −
(b, (Begin + End))\;+ \\+ −
(w, WhiteSpacess)+ −
\end{array}\right)^{\mbox{\LARGE{}*}}$+ −
\end{center}+ −
+ −
\noindent and ask the algorithm by Sulzmann \& Lu to lex, say+ −
the following string+ −
+ −
\begin{center}\large+ −
\code{"if true then then 42 else +"}+ −
\end{center}+ −
+ −
\noindent+ −
By using the records and extracting the environment, the + −
result is the following list:+ −
+ −
\begin{center}\tt+ −
\begin{tabular}{l}+ −
KEYWORD(if),\\ + −
WHITESPACE,\\ + −
IDENT(true),\\ + −
WHITESPACE,\\ + −
KEYWORD(then),\\ + −
WHITESPACE,\\ + −
KEYWORD(then),\\ + −
WHITESPACE,\\ + −
NUM(42),\\ + −
WHITESPACE,\\ + −
KEYWORD(else),\\ + −
WHITESPACE,\\ + −
OP(+)+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent Typically we are not interested in the whitespaces + −
and comments and would filter them out: this gives+ −
+ −
\begin{center}\tt+ −
\begin{tabular}{l}+ −
KEYWORD(if),\\ + −
IDENT(true),\\ + −
KEYWORD(then),\\ + −
KEYWORD(then),\\ + −
NUM(42),\\ + −
KEYWORD(else),\\ + −
OP(+)+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent+ −
which will be the input for the next phase of our compiler.+ −
+ −
\end{document}+ −
+ −
%%% Local Variables: + −
%%% mode: latex+ −
%%% TeX-master: t+ −
%%% End: + −