handouts/ho04.tex
author Christian Urban <christian dot urban at kcl dot ac dot uk>
Fri, 01 May 2015 21:30:13 +0100
changeset 326 94700593a2d5
parent 319 e7b110f93697
child 350 c4e7caa06c74
permissions -rw-r--r--
updated

\documentclass{article}
\usepackage{../style}
\usepackage{../langs}
\usepackage{../graphics}


\begin{document}

\section*{Handout 4 (Sulzmann \& Lu Algorithm)}

So far our algorithm based on derivatives was only able to say
yes or no depending on whether a string was matched by regular
expression or not. Often a more interesting question is to
find out \emph{how} a regular expression matched a string?
Answering this question will also help us with the problem we
are after, namely tokenising an input string. 

The algorithm we will be looking at was designed by Sulzmann \& Lu in
a rather recent paper. A link to it is provided on KEATS, in case you
are interested.\footnote{In my humble opinion this is an interesting
  instance of the research literature: it contains a very neat idea,
  but its presentation is rather sloppy. In earlier versions of their
  paper, students and I found several rather annoying typos in their
  examples and definitions.}  In order to give an answer for how a
regular expression matched a string, Sulzmann and Lu introduce
\emph{values}. A value will be the output of the algorithm whenever
the regular expression matches the string. If the string does not
match the string, an error will be raised. Since the first phase of
the algorithm by Sulzmann \& Lu is identical to the derivative based
matcher from the first coursework, the function $nullable$ will be
used to decide whether as string is matched by a regular
expression. If $nullable$ says yes, then values are constructed that
reflect how the regular expression matched the string. The definitions
for regular expressions $r$ and values $v$ is shown next to each other
below:

\begin{center}
\begin{tabular}{cc}
\begin{tabular}{@{}rrl@{}}
\multicolumn{3}{c}{regular expressions}\medskip\\
  $r$ & $::=$  & $\varnothing$\\
      & $\mid$ & $\epsilon$   \\
      & $\mid$ & $c$          \\
      & $\mid$ & $r_1 \cdot r_2$\\
      & $\mid$ & $r_1 + r_2$   \\
  \\
      & $\mid$ & $r^*$         \\
\end{tabular}
&
\begin{tabular}{@{\hspace{0mm}}rrl@{}}
\multicolumn{3}{c}{values}\medskip\\
   $v$ & $::=$  & \\
      &        & $Empty$   \\
      & $\mid$ & $Char(c)$          \\
      & $\mid$ & $Seq(v_1,v_2)$\\
      & $\mid$ & $Left(v)$   \\
      & $\mid$ & $Right(v)$  \\
      & $\mid$ & $[v_1,\ldots\,v_n]$ \\
\end{tabular}
\end{tabular}
\end{center}

\noindent The point is that there is a very strong
correspondence between them. There is no value for the
$\varnothing$ regular expression, since it does not match any
string. Otherwise there is exactly one value corresponding to
each regular expression with the exception of $r_1 + r_2$
where there are two values, namely $Left(v)$ and $Right(v)$
corresponding to the two alternatives. Note that $r^*$ is
associated with a list of values, one for each copy of $r$
that was needed to match the string. This means we might also
return the empty list $[]$, if no copy was needed.

To emphasise the connection between regular expressions and
values, I have in my implementation the convention that
regular expressions are written entirely with upper-case
letters, while values just start with a single upper-case
character. My definition of values in Scala is below. I use 
this in the REPL of Scala; when I use the Scala compiler
I need to rename some constructors, because Scala on Macs
does not like classes that are called \pcode{EMPTY} and
\pcode{Empty}.
 
{\small\lstinputlisting[language=Scala,numbers=none]
{../progs/app02.scala}}


Graphically the algorithm by Sulzmann \& Lu can be illustrated
by the picture in Figure~\ref{Sulz} where the path from the
left to the right involving $der/nullable$ is the first phase
of the algorithm and $mkeps/inj$, the path from right to left,
the second phase. This picture shows the steps required when a
regular expression, say $r_1$, matches the string $abc$. We
first build the three derivatives (according to $a$, $b$ and
$c$). We then use $nullable$ to find out whether the resulting
regular expression can match the empty string. If yes, we call
the function $mkeps$.

\begin{figure}[t]
\begin{center}
\begin{tikzpicture}[scale=2,node distance=1.2cm,
                    every node/.style={minimum size=7mm}]
\node (r1)  {$r_1$};
\node (r2) [right=of r1]{$r_2$};
\draw[->,line width=1mm](r1)--(r2) node[above,midway] {$der\,a$};
\node (r3) [right=of r2]{$r_3$};
\draw[->,line width=1mm](r2)--(r3) node[above,midway] {$der\,b$};
\node (r4) [right=of r3]{$r_4$};
\draw[->,line width=1mm](r3)--(r4) node[above,midway] {$der\,c$};
\draw (r4) node[anchor=west] {\;\raisebox{3mm}{$nullable$}};
\node (v4) [below=of r4]{$v_4$};
\draw[->,line width=1mm](r4) -- (v4);
\node (v3) [left=of v4] {$v_3$};
\draw[->,line width=1mm](v4)--(v3) node[below,midway] {$inj\,c$};
\node (v2) [left=of v3]{$v_2$};
\draw[->,line width=1mm](v3)--(v2) node[below,midway] {$inj\,b$};
\node (v1) [left=of v2] {$v_1$};
\draw[->,line width=1mm](v2)--(v1) node[below,midway] {$inj\,a$};
\draw (r4) node[anchor=north west] {\;\raisebox{-8mm}{$mkeps$}};
\end{tikzpicture}
\end{center}
\caption{The two phases of the algorithm by Sulzmann \& Lu.
\label{Sulz}}
\end{figure}

The $mkeps$ function calculates a value for how a regular
expression has matched the empty string. Its definition
is as follows:

\begin{center}
\begin{tabular}{lcl}
  $mkeps(\epsilon)$       & $\dn$ & $Empty$\\
  $mkeps(r_1 + r_2)$      & $\dn$ & if $nullable(r_1)$  \\
                          &       & then $Left(mkeps(r_1))$\\
                          &       & else $Right(mkeps(r_2))$\\
  $mkeps(r_1 \cdot r_2)$  & $\dn$ & $Seq(mkeps(r_1),mkeps(r_2))$\\
  $mkeps(r^*)$            & $\dn$ & $[]$  \\
\end{tabular}
\end{center}

\noindent There are no cases for $\varnothing$ and $c$, since
these regular expression cannot match the empty string. Note
also that in case of alternatives we give preference to the
regular expression on the left-hand side. This will become
important later on.

The second phase of the algorithm is organised so that it will
calculate a value for how the derivative regular expression
has matched a string whose first character has been chopped
off. Now we need a function that reverses this ``chopping
off'' for values. The corresponding function is called $inj$
for injection. This function takes three arguments: the first
one is a regular expression for which we want to calculate the
value, the second is the character we want to inject and the
third argument is the value where we will inject the
character. The result of this function is a new value. The
definition of $inj$ is as follows: 

\begin{center}
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
  $inj\,(c)\,c\,Empty$            & $\dn$  & $Char\,c$\\
  $inj\,(r_1 + r_2)\,c\,Left(v)$  & $\dn$  & $Left(inj\,r_1\,c\,v)$\\
  $inj\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$  & $Right(inj\,r_2\,c\,v)$\\
  $inj\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$  & $Seq(inj\,r_1\,c\,v_1,v_2)$\\
  $inj\,(r_1 \cdot r_2)\,c\,Left(Seq(v_1,v_2))$ & $\dn$  & $Seq(inj\,r_1\,c\,v_1,v_2)$\\
  $inj\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(mkeps(r_1),inj\,r_2\,c\,v)$\\
  $inj\,(r^*)\,c\,Seq(v,vs)$         & $\dn$  & $inj\,r\,c\,v\,::\,vs$\\
\end{tabular}
\end{center}

\noindent This definition is by recursion on the regular
expression and by analysing the shape of the values. Therefore
there are, for example, three cases for sequence regular
expressions. The last clause for the star regular expression
returns a list where the first element is $inj\,r\,c\,v$ and
the other elements are $vs$. That means $\_\,::\,\_$ should be 
read as list cons.

To understand what is going on, it might be best to do some
example calculations and compare them with Figure~\ref{Sulz}.
For this note that we have not yet dealt with the need of
simplifying regular expressions (this will be a topic on its
own later). Suppose the regular expression is $a \cdot (b
\cdot c)$ and the input string is $abc$. The derivatives from
the first phase are as follows:

\begin{center}
\begin{tabular}{ll}
$r_1$: & $a \cdot (b \cdot c)$\\
$r_2$: & $\epsilon \cdot (b \cdot c)$\\
$r_3$: & $(\varnothing \cdot (b \cdot c)) + (\epsilon \cdot c)$\\
$r_4$: & $(\varnothing \cdot (b \cdot c)) + ((\varnothing \cdot c) + \epsilon)$\\
\end{tabular}
\end{center}

\noindent According to the simple algorithm, we would test
whether $r_4$ is nullable, which in this case it is. This
means we can use the function $mkeps$ to calculate a value for
how $r_4$ was able to match the empty string. Remember that
this function gives preference for alternatives on the
left-hand side. However there is only $\epsilon$ on the very
right-hand side of $r_4$ that matches the empty string.
Therefore $mkeps$ returns the value

\begin{center}
$v_4:\;Right(Right(Empty))$
\end{center}

\noindent The point is that from this value we can directly
read off which part of $r_4$ matched the empty string: take
the right-alternative first, and then the right-alternative
again. 

Next we have to ``inject'' the last character, that is $c$ in
the running example, into this value $v_4$ in order to
calculate how $r_3$ could have matched the string $c$.
According to the definition of $inj$ we obtain

\begin{center}
$v_3:\;Right(Seq(Empty, Char(c)))$
\end{center}

\noindent This is the correct result, because $r_3$ needs
to use the right-hand alternative, and then $\epsilon$ needs
to match the empty string and $c$ needs to match $c$.
Next we need to inject back the letter $b$ into $v_3$. This
gives

\begin{center}
$v_2:\;Seq(Empty, Seq(Char(b), Char(c)))$
\end{center}

\noindent which is again the correct result for how $r_2$
matched the string $bc$. Finally we need to inject back the
letter $a$ into $v_2$ giving the final result

\begin{center}
$v_1:\;Seq(Char(a), Seq(Char(b), Char(c)))$
\end{center}

\noindent This now corresponds to how the regular
expression $a \cdot (b \cdot c)$ matched the string $abc$.

There are a few auxiliary functions that are of interest
when analysing this algorithm. One is called \emph{flatten},
written $|\_|$, which extracts the string ``underlying'' a 
value. It is defined recursively as

\begin{center}
\begin{tabular}{lcl}
  $|Empty|$     & $\dn$ & $[]$\\
  $|Char(c)|$   & $\dn$ & $[c]$\\
  $|Left(v)|$   & $\dn$ & $|v|$\\
  $|Right(v)|$  & $\dn$ & $|v|$\\
  $|Seq(v_1,v_2)|$& $\dn$ & $|v_1| \,@\, |v_2|$\\
  $|[v_1,\ldots ,v_n]|$ & $\dn$ & $|v_1| \,@\ldots @\, |v_n|$\\
\end{tabular}
\end{center}

\noindent Using flatten we can see what is the string behind 
the values calculated by $mkeps$ and $inj$ in our running 
example are:

\begin{center}
\begin{tabular}{ll}
$|v_4|$: & $[]$\\
$|v_3|$: & $c$\\
$|v_2|$: & $bc$\\
$|v_1|$: & $abc$
\end{tabular}
\end{center}

\noindent This indicates that $inj$ indeed is injecting, or
adding, back a character into the value. Next we look at how
simplification can be included into this algorithm.


\subsubsection*{Simplification}

Generally the matching algorithms based on derivatives do
poorly unless the regular expressions are simplified after
each derivative step. But this is a bit more involved in the
algorithm of Sulzmann \& Lu. So what follows might require you
to read several times before it makes sense and also might
require that you do some example calculations yourself. As a
first example consider the last derivation step in our earlier
example:

\begin{center}
$r_4 = der\,c\,r_3 = 
(\varnothing \cdot (b \cdot c)) + ((\varnothing \cdot c) + \epsilon)$
\end{center}

\noindent Simplifying this regular expression would just give
us $\epsilon$. Running $mkeps$ with this regular expression as
input, however, would then provide us with $Empty$ instead of
$Right(Right(Empty))$ that was obtained without the
simplification. The problem is we need to recreate this more
complicated value, rather than just return $Empty$.

This will require what I call \emph{rectification functions}.
They need to be calculated whenever a regular expression gets
simplified. Rectification functions take a value as argument
and return a (rectified) value. Let us first take a look again
at our simplification rules:

\begin{center}
\begin{tabular}{l@{\hspace{2mm}}c@{\hspace{2mm}}l}
$r \cdot \varnothing$ & $\mapsto$ & $\varnothing$\\ 
$\varnothing \cdot r$ & $\mapsto$ & $\varnothing$\\ 
$r \cdot \epsilon$ & $\mapsto$ & $r$\\ 
$\epsilon \cdot r$ & $\mapsto$ & $r$\\ 
$r + \varnothing$ & $\mapsto$ & $r$\\ 
$\varnothing + r$ & $\mapsto$ & $r$\\ 
$r + r$ & $\mapsto$ & $r$\\ 
\end{tabular}
\end{center}

\noindent Applying them to $r_4$ will require several nested
simplifications in order end up with just $\epsilon$. However,
it is possible to apply them in a depth-first, or inside-out,
manner in order to calculate this simplified regular
expression.

The rectification we can implement by letting simp return
not just a (simplified) regular expression, but also a
rectification function. Let us consider the alternative case,
$r_1 + r_2$, first. By going depth-first, we first simplify
the component regular expressions $r_1$ and $r_2.$ This will
return simplified versions (if they can be simplified), say
$r_{1s}$ and $r_{2s}$, but also two rectification functions
$f_{1s}$ and $f_{2s}$. We need to assemble them in order to
obtain a rectified value for $r_1 + r_2$. In case $r_{1s}$
simplified to $\varnothing$, we continue the derivative
calculation with $r_{2s}$. The Sulzmann \& Lu algorithm would
return a corresponding value, say $v_{2s}$. But now this value
needs to be ``rectified'' to the value 

\begin{center}
$Right(v_{2s})$
\end{center}

\noindent The reason is that we look for the value that tells
us how $r_1 + r_2$ could have matched the string, not just
$r_2$ or $r_{2s}$. Unfortunately, this is still not the right
value in general because there might be some simplifications
that happened inside $r_2$ and for which the simplification
function retuned also a rectification function $f_{2s}$. So in
fact we need to apply this one too which gives

\begin{center}
$Right(f_{2s}(v_{2s}))$
\end{center}

\noindent This is now the correct, or rectified, value. Since
the simplification will be done in the first phase of the
algorithm, but the rectification needs to be done to the
values in the second phase, it is advantageous to calculate
the rectification as a function, remember this function and
then apply the value to this function during the second phase.
So if we want to implement the rectification as function, we 
would need to return

\begin{center}
$\lambda v.\,Right(f_{2s}(v))$
\end{center}

\noindent which is the lambda-calculus notation for
a function that expects a value $v$ and returns everything
after the dot where $v$ is replaced by whatever value is 
given.

Let us package this idea with rectification functions
into a single function (still only considering the alternative
case):

\begin{center}
\begin{tabular}{l}
$simp(r)$:\\
\quad case $r = r_1 + r_2$\\
\qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\
\qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\
\qquad case $r_{1s} = \varnothing$: 
       return $(r_{2s}, \lambda v. \,Right(f_{2s}(v)))$\\
\qquad case $r_{2s} = \varnothing$: 
       return $(r_{1s}, \lambda v. \,Left(f_{1s}(v)))$\\
\qquad case $r_{1s} = r_{2s}$:
       return $(r_{1s}, \lambda v. \,Left(f_{1s}(v)))$\\
\qquad otherwise: 
       return $(r_{1s} + r_{2s}, f_{alt}(f_{1s}, f_{2s}))$\\
\end{tabular}
\end{center}

\noindent We first recursively call the simplification with
$r_1$ and $r_2$. This gives simplified regular expressions,
$r_{1s}$ and $r_{2s}$, as well as two rectification functions
$f_{1s}$ and $f_{2s}$. We next need to test whether the
simplified regular expressions are $\varnothing$ so as to make
further simplifications. In case $r_{1s}$ is $\varnothing$,
then we can return $r_{2s}$ (the other alternative). However
for this we need to build a corresponding rectification 
function, which as said above is

\begin{center}
$\lambda v.\,Right(f_{2s}(v))$
\end{center}

\noindent The case where $r_{2s} = \varnothing$ is similar:
We return $r_{1s}$ and rectify with $Left(\_)$ and the
other calculated rectification function $f_{1s}$. This gives

\begin{center}
$\lambda v.\,Left(f_{1s}(v))$
\end{center}

\noindent The next case where $r_{1s} = r_{2s}$ can be treated
like the one where $r_{2s} = \varnothing$. We return $r_{1s}$
and rectify with $Left(\_)$ and so on.


The otherwise-case is slightly more complicated. In this case
neither $r_{1s}$ nor $r_{2s}$ are $\varnothing$ and also
$r_{1s} \not= r_{2s}$, which means no further simplification
can be applied. Accordingly, we return $r_{1s} + r_{2s}$ as
the simplified regular expression. In principle we also do not
have to do any rectification, because no simplification was
done in this case. But this is actually not true: There might
have been simplifications inside $r_{1s}$ and $r_{2s}$. We
therefore need to take into account the calculated
rectification functions $f_{1s}$ and $f_{2s}$. We can do this
by defining a rectification function $f_{alt}$ which takes two
rectification functions as arguments and applies them
according to whether the value is of the form $Left(\_)$ or
$Right(\_)$:

\begin{center}
\begin{tabular}{l@{\hspace{1mm}}l}
$f_{alt}(f_1, f_2) \dn$\\
\qquad $\lambda v.\,$ case $v = Left(v')$: 
      & return $Left(f_1(v'))$\\
\qquad \phantom{$\lambda v.\,$} case $v = Right(v')$: 
      & return $Right(f_2(v'))$\\      
\end{tabular}
\end{center}

The other interesting case with simplification is the sequence
case. In this case the main simplification function is as
follows

\begin{center}
\begin{tabular}{l}
$simp(r)$:\qquad\qquad (continued)\\
\quad case $r = r_1 \cdot r_2$\\
\qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\
\qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\
\qquad case $r_{1s} = \varnothing$: 
       return $(\varnothing, f_{error})$\\
\qquad case $r_{2s} = \varnothing$: 
       return $(\varnothing, f_{error})$\\
\qquad case $r_{1s} = \epsilon$: 
return $(r_{2s}, \lambda v. \,Seq(f_{1s}(Empty), f_{2s}(v)))$\\
\qquad case $r_{2s} = \epsilon$: 
return $(r_{1s}, \lambda v. \,Seq(f_{1s}(v), f_{2s}(Empty)))$\\
\qquad otherwise: 
       return $(r_{1s} \cdot r_{2s}, f_{seq}(f_{1s}, f_{2s}))$\\
\end{tabular}
\end{center}

\noindent whereby in the last line $f_{seq}$ is again pushing
the two rectification functions into the two components of the
Seq-value:

\begin{center}
\begin{tabular}{l@{\hspace{1mm}}l}
$f_{seq}(f_1, f_2) \dn$\\
\qquad $\lambda v.\,$ case $v = Seq(v_1, v_2)$: 
      & return $Seq(f_1(v_1), f_2(v_2))$\\
\end{tabular}
\end{center}

\noindent Note that in the case of $r_{1s} = \varnothing$
(similarly $r_{2s}$) we use the function $f_{error}$ for
rectification. If you think carefully, then you will realise
that this function will actually never been called. This is
because a sequence with $\varnothing$ will never recognise any
string and therefore the second phase of the algorithm would
never been called. The simplification function still expects
us to give a function. So in my own implementation I just
returned a function that raises an error. In the case
where $r_{1s} = \epsilon$ (similarly $r_{2s}$) we have
to create a sequence where the first component is a rectified
version of $Empty$. Therefore we call $f_{1s}$ with $Empty$.

Since we only simplify regular expressions of the form $r_1 +
r_2$ and $r_1 \cdot r_2$ we do not have to do anything else
in the remaining cases. The rectification function will
be just the identity, which in lambda-calculus terms is
just

\begin{center}
$\lambda v.\,v$
\end{center}

\noindent This completes the high-level version of the
simplification function, which is also shown again in 
Figure~\ref{simp}. This can now be used in a \emph{lexing
function} as follows:

\begin{figure}[t]
\begin{center}
\begin{tabular}{l}
$simp(r)$:\\
\quad case $r = r_1 + r_2$\\
\qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\
\qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\
\qquad case $r_{1s} = \varnothing$: 
       return $(r_{2s}, \lambda v. \,Right(f_{2s}(v)))$\\
\qquad case $r_{2s} = \varnothing$: 
       return $(r_{1s}, \lambda v. \,Left(f_{1s}(v)))$\\
\qquad case $r_{1s} = r_{2s}$:
       return $(r_{1s}, \lambda v. \,Left(f_{1s}(v)))$\\
\qquad otherwise: 
       return $(r_{1s} + r_{2s}, f_{alt}(f_{1s}, f_{2s}))$
       \medskip\\

\quad case $r = r_1 \cdot r_2$\\
\qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\
\qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\
\qquad case $r_{1s} = \varnothing$: 
       return $(\varnothing, f_{error})$\\
\qquad case $r_{2s} = \varnothing$: 
       return $(\varnothing, f_{error})$\\
\qquad case $r_{1s} = \epsilon$: 
return $(r_{2s}, \lambda v. \,Seq(f_{1s}(Empty), f_{2s}(v)))$\\
\qquad case $r_{2s} = \epsilon$: 
return $(r_{1s}, \lambda v. \,Seq(f_{1s}(v), f_{2s}(Empty)))$\\
\qquad otherwise: 
       return $(r_{1s} \cdot r_{2s}, f_{seq}(f_{1s}, f_{2s}))$
       \medskip\\

\quad otherwise:\\
\qquad return $(r, \lambda v.\,v)$\\
\end{tabular}
\end{center}
\caption{The simplification function that returns a simplified 
regular expression and a rectification function.\label{simp}}
\end{figure}

\begin{center}
\begin{tabular}{lcl}
$lex\,r\,[]$ & $\dn$ & if $nullable(r)$ then $mkeps(r)$\\
             &       & else $error$\\
$lex\,r\,c\!::\!s$ & $\dn$ & let 
   $(r_{simp}, f_{rect}) = simp(der(c, r))$\\
& & $inj\,r\,c\,f_{rect}(lex\,r_{simp}\,s)$              
\end{tabular}
\end{center}

\noindent This corresponds to the $matches$ function we have
seen in earlier lectures. In the first clause we are given an
empty string, $[]$, and need to test wether the regular
expression is $nullable$. If yes, we can proceed normally and
just return the value calculated by $mkeps$. The second clause
is for strings where the first character is $c$, say, and the
rest of the string is $s$. We first build the derivative of
$r$ with respect to $c$; simplify the resulting regular
expression. We continue lexing with the simplified regular
expression and the string $s$. Whatever will be returned as
value, we sill need to rectify using the $f_{rect}$ from the
simplification and finally inject $c$ back into the
(rectified) value.


\subsubsection*{Records}

Remember we wanted to tokenize input strings, that means
splitting strings into their ``word'' components. Furthermore
we want to classify each token as being a keyword or
identifier and so on. For this one more feature will be
required, which I call a \emph{record} regular expression.
While values encode how a regular expression matches a string,
records can be used to ``focus'' on some particular parts of
the regular expression and ``forget'' about others. 

Let us look at an example. Suppose you have the regular
expression $a\cdot b + a\cdot c$. Clearly this regular expression can only
recognise two strings. But suppose you are not interested
whether it can recognise $ab$ or $ac$, but rather if it
matched, then what was the last character of the matched
string\ldots either $b$ or $c$. You can do this by annotating
the regular expression with a record, written in general
$(x:r)$, where $x$ is just an identifier (in my implementation
a plain string) and $r$ is a regular expression. A record will
be regarded as a regular expression. The extended definition
in Scala therefore looks as follows:

{\small\lstinputlisting[language=Scala]
{../progs/app03.scala}}

\noindent Since we regard records as regular expressions we
need to extend the functions $nullable$ and $der$. Similarly
$mkeps$ and $inj$ need to be extended. This means we also need
to extend the definition of values, which in Scala looks as
follows:

{\small\lstinputlisting[language=Scala]
{../progs/app04.scala}}

\noindent Let us now look at the purpose of records more
closely and let us return to our question whether the string
terminated in a $b$ or $c$. We can do this as follows: we
annotate the regular expression $ab + ac$ with a record
as follows

\begin{center}
$a\cdot (x:b) + a\cdot (x:c)$
\end{center}

\noindent This regular expression can still only recognise
the strings $ab$ and $ac$, but we can now use a function
that takes a value and returns all records. I call this
function \emph{env} for environment\ldots it builds a list
of identifiers associated with a string. This function
can be defined as follows:

\begin{center}
\begin{tabular}{lcl}
  $env(Empty)$     & $\dn$ & $[]$\\
  $env(Char(c))$   & $\dn$ & $[]$\\
  $env(Left(v))$   & $\dn$ & $env(v)$\\
  $env(Right(v))$  & $\dn$ & $env(v)$\\
  $env(Seq(v_1,v_2))$& $\dn$ & $env(v_1) \,@\, env(v_2)$\\
  $env([v_1,\ldots ,v_n])$ & $\dn$ & 
     $env(v_1) \,@\ldots @\, env(v_n)$\\
  $env(Rec(x:v))$ & $\dn$ & $(x:|v|) :: env(v)$\\
\end{tabular}
\end{center}

\noindent where in the last clause we use the flatten function
defined earlier. As can be seen, the function $env$ ``picks''
out all underlying strings where a record is given. Since
there can be more than one, the environment will potentially
contain many ``records''. If we now postprocess the value
calculated by $lex$ extracting all records using $env$, we can
answer the question whether the last element in the string was
an $b$ or a $c$. Lets see this in action: if we use $a\cdot b
+ a\cdot c$ and $ac$ the calculated value will be

\begin{center}
$Right(Seq(Char(a), Char(c)))$
\end{center}

\noindent If we use instead $a\cdot (x:b) + a\cdot (x:c)$ and
use the $env$ function to extract the recording for 
$x$ we obtain

\begin{center}
$[(x:c)]$
\end{center}

\noindent If we had given the string $ab$ instead, then the
record would have been $[(x:b)]$. The fun starts if we 
iterate this. Consider the regular expression 

\begin{center}
$(a\cdot (x:b) + a\cdot (y:c))^*$
\end{center}

\noindent and the string $ababacabacab$. This string is 
clearly matched by the regular expression, but we are only
interested in the sequence of $b$s and $c$s. Using $env$
we obtain

\begin{center}
$[(x:b), (x:b), (y:c), (x:b), (y:c), (x:b)]$
\end{center}

\noindent While this feature might look silly, it is in fact
quite useful. For example if we want to match the name of
an email we might use the regular expression

\[
(name: [a\mbox{-}z0\mbox{-}9\_\!\_\,.-]^+)\cdot @\cdot 
(domain: [a\mbox{-}z0\mbox{-}9\,.-]^+)\cdot .\cdot 
(top\_level: [a\mbox{-}z\,.]^{\{2,6\}})
\]

\noindent Then if we match the email address

\[
\texttt{christian.urban@kcl.ac.uk}
\]

\noindent we can use the $env$ function and find out
what the name, domain and top-level part of the email
address are:

\begin{center}
$[(name:\texttt{christian.urban}), 
  (domain:\texttt{kcl}), 
  (top\_level:\texttt{ac.uk})]$
\end{center}

\noindent As you will see in the next lecture, this is now all
we need to tokenise an input string and classify each token.
\end{document}

%%% Local Variables: 
%%% mode: latex
%%% TeX-master: t
%%% End: