afl-material: handouts/ho04.tex@39b7ff2cf1bc

\documentclass{article}
\usepackage{../style}
\usepackage{../langs}
\usepackage{../graphics}


\begin{document}
\fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016}

\section*{Handout 4 (Sulzmann \& Lu Algorithm)}

So far our algorithm based on derivatives was only able to say
yes or no depending on whether a string was matched by a regular
expression or not. Often a more interesting question is to
find out \emph{how} a regular expression matched a string?
Answering this question will also help us with the problem we
are after, namely tokenising an input string. 

The algorithm we will be looking at in this lecture was designed by Sulzmann
\& Lu in a rather recent research paper (from 2014). A link to it is
provided on KEATS, in case you are interested.\footnote{In my
humble opinion this is an interesting instance of the research
literature: it contains a very neat idea, but its presentation
is rather sloppy. In earlier versions of this paper, a King's
undergraduate student and I found several rather annoying typos in the
examples and definitions.} My PhD student Fahad Ausaf and I even more recently 
wrote a paper where we build on their result: we provided a 
mathematical proof that their algorithm is really correct---the proof 
Sulzmann \& Lu had originally given contained major flaws. Such correctness
proofs are important: Kuklewicz maintains a unit-test library
for the kind of algorithma we are interested in here and he showed 
that many implementations in the ``wild'' are buggy, that is not
satisfy his unit tests:

\begin{center}
\url{http://www.haskell.org/haskellwiki/Regex_Posix}
\end{center}


In order to give an answer for
\emph{how} a regular expression matches a string, Sulzmann and
Lu use \emph{values}. A value will be the output of the
algorithm whenever the regular expression matches the string.
If the string does not match the string, an error will be
raised. 
The definition for values is given below. They are shown 
together with the regular expressions $r$ to which
they correspond:

\begin{center}
\begin{tabular}{cc}
\begin{tabular}{@{}rrl@{}}
\multicolumn{3}{c}{regular expressions}\medskip\\
  $r$ & $::=$  & $\ZERO$\\
      & $\mid$ & $\ONE$   \\
      & $\mid$ & $c$          \\
      & $\mid$ & $r_1 \cdot r_2$\\
      & $\mid$ & $r_1 + r_2$   \\
  \\
      & $\mid$ & $r^*$         \\
\end{tabular}
&
\begin{tabular}{@{\hspace{0mm}}rrl@{}}
\multicolumn{3}{c}{values}\medskip\\
   $v$ & $::=$  & \\
      &        & $Empty$   \\
      & $\mid$ & $Char(c)$          \\
      & $\mid$ & $Seq(v_1,v_2)$\\
      & $\mid$ & $\Left(v)$   \\
      & $\mid$ & $Right(v)$  \\
      & $\mid$ & $Stars\,[v_1,\ldots\,v_n]$ \\
\end{tabular}
\end{tabular}
\end{center}

\noindent There is no value for the
$\ZERO$ regular expression, since it does not match any
string. Otherwise there is exactly one value corresponding to
each regular expression with the exception of $r_1 + r_2$
where there are two values, namely $\Left(v)$ and $Right(v)$
corresponding to the two alternatives. Note that $r^*$ is
associated with a list of values, one for each copy of $r$
that was needed to match the string. This means we might also
return the empty list $Stars []$, if no copy was needed in case
of $r^*$. For sequence, there is exactly one value, composed 
of two component values ($v_1$ and $v_2$).

My implementation of regular expressions and values in Scala is
shown below. I have in my implementation the convention that
regular expressions are written entirely with upper-case
letters, while values just start with a single upper-case
character and the rest are lower-case letters.
 
{\small\lstinputlisting[language=Scala,numbers=none,linebackgroundcolor=
                  {\ifodd\value{lstnumber}\color{capri!3}\fi}]
{../progs/app01.scala}}

 
{\small\lstinputlisting[language=Scala,numbers=none,linebackgroundcolor=
                  {\ifodd\value{lstnumber}\color{capri!3}\fi}]
{../progs/app02.scala}}


Graphically the algorithm by Sulzmann \& Lu can be illustrated
by the picture in Figure~\ref{Sulz} where the path from the
left to the right involving $\textit{der}/\textit{nullable}$ is the first phase
of the algorithm and $\textit{mkeps}/\textit{inj}$, the path from right to left,
the second phase. This picture shows the steps required when a
regular expression, say $r_1$, matches the string $abc$. We
first build the three derivatives (according to $a$, $b$ and
$c$). We then use $\textit{nullable}$ to find out whether the resulting
regular expression can match the empty string. If yes, we call
the function $\textit{mkeps}$. The $\textit{mkeps}$ function calculates a value for how a regular
expression has matched the empty string. Its definition
is as follows:

\begin{figure}[t]
\begin{center}
\begin{tikzpicture}[scale=2,node distance=1.2cm,
                    every node/.style={minimum size=7mm}]
\node (r1)  {$r_1$};
\node (r2) [right=of r1]{$r_2$};
\draw[->,line width=1mm](r1)--(r2) node[above,midway] {$\textit{der}\,a$};
\node (r3) [right=of r2]{$r_3$};
\draw[->,line width=1mm](r2)--(r3) node[above,midway] {$\textit{der}\,b$};
\node (r4) [right=of r3]{$r_4$};
\draw[->,line width=1mm](r3)--(r4) node[above,midway] {$\textit{der}\,c$};
\draw (r4) node[anchor=west] {\;\raisebox{3mm}{$\textit{nullable}$}};
\node (v4) [below=of r4]{$v_4$};
\draw[->,line width=1mm](r4) -- (v4);
\node (v3) [left=of v4] {$v_3$};
\draw[->,line width=1mm](v4)--(v3) node[below,midway] {$\textit{inj}\,c$};
\node (v2) [left=of v3]{$v_2$};
\draw[->,line width=1mm](v3)--(v2) node[below,midway] {$\textit{inj}\,b$};
\node (v1) [left=of v2] {$v_1$};
\draw[->,line width=1mm](v2)--(v1) node[below,midway] {$\textit{inj}\,a$};
\draw (r4) node[anchor=north west] {\;\raisebox{-8mm}{$\textit{mkeps}$}};
\end{tikzpicture}
\end{center}
\caption{The two phases of the algorithm by Sulzmann \& Lu.
\label{Sulz}}
\end{figure}



\begin{center}
\begin{tabular}{lcl}
  $\textit{mkeps}(\ONE)$       & $\dn$ & $Empty$\\
  $\textit{mkeps}(r_1 + r_2)$      & $\dn$ & if $\textit{nullable}(r_1)$  \\
                          &       & then $\Left(\textit{mkeps}(r_1))$\\
                          &       & else $Right(\textit{mkeps}(r_2))$\\
  $\textit{mkeps}(r_1 \cdot r_2)$  & $\dn$ & $Seq(\textit{mkeps}(r_1),\textit{mkeps}(r_2))$\\
  $\textit{mkeps}(r^*)$            & $\dn$ & $Stars\,[]$  \\
\end{tabular}
\end{center}

\noindent There is are no cases for $\ZERO$ and $c$, since
these regular expression cannot match the empty string. Note
also that in case of alternatives we give preference to the
regular expression on the left-hand side. This will become
important later on.

The second phase of the algorithm is organised so that it will
calculate a value for how the derivative regular expression
has matched a string. For this we need a function that
reverses this ``chopping off'' for values which we did in the
first phase for derivatives. The corresponding function is
called $\textit{inj}$ (short for injection). This function takes three
arguments: the first one is a regular expression for which we
want to calculate the value, the second is the character we
want to inject and the third argument is the value where we
will inject the character into. The result of this function is a
new value. The definition of $\textit{inj}$ is as follows: 

\begin{center}
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
  $\textit{inj}\,(c)\,c\,Empty$            & $\dn$ & $Char\,c$\\
  $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
  $\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
  $\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$         & $\dn$  & $Stars(\textit{inj}\,r\,c\,v\,::\,vs)$\\
\end{tabular}
\end{center}

\noindent This definition is by recursion on the regular
expression and by analysing the shape of the values. Therefore
there are three cases for sequence regular
expressions (for all possible shapes of the value we can encounter). The last
clause for the star regular expression returns a list where
the first element is $\textit{inj}\,r\,c\,v$ and the other elements are
$vs$. That means $\_\,::\,\_$ should be read as list cons.

To understand what is going on, it might be best to do some
example calculations and compare them with Figure~\ref{Sulz}.
For this note that we have not yet dealt with the need of
simplifying regular expressions (this will be a topic on its
own later). Suppose the regular expression is $a \cdot (b
\cdot c)$ and the input string is $abc$. The derivatives from
the first phase are as follows:

\begin{center}
\begin{tabular}{ll}
$r_1$: & $a \cdot (b \cdot c)$\\
$r_2$: & $\ONE \cdot (b \cdot c)$\\
$r_3$: & $(\ZERO \cdot (b \cdot c)) + (\ONE \cdot c)$\\
$r_4$: & $(\ZERO \cdot (b \cdot c)) + ((\ZERO \cdot c) + \ONE)$\\
\end{tabular}
\end{center}

\noindent According to the simple algorithm, we would test
whether $r_4$ is nullable, which in this case it indeed is.
This means we can use the function $\textit{mkeps}$ to calculate a
value for how $r_4$ was able to match the empty string.
Remember that this function gives preference for alternatives
on the left-hand side. However there is only $\ONE$ on the
very right-hand side of $r_4$ (underlined)

\begin{center}
$r_4$:\;$(\ZERO \cdot (b \cdot c)) + 
         ((\ZERO \cdot c) + \underline{\ONE})$\\
\end{center}

\noindent
that matches the empty string.
Therefore $\textit{mkeps}$ returns the value

\begin{center}
$v_4:\;Right(Right(Empty))$
\end{center}

\noindent If there had been a $\ONE$ on the left, then
$\textit{mkeps}$ would have returned something of the form
$\Left(\ldots)$. The point is that from this value we can
directly read off which part of $r_4$ matched the empty
string: take the right-alternative first, and then the
right-alternative again. 

Next we have to ``inject'' the last character, that is $c$ in
the running example, into this value $v_4$ in order to
calculate how $r_3$ could have matched the string $c$.
According to the definition of $\textit{inj}$ we obtain

\begin{center}
$v_3:\;Right(Seq(Empty, Char(c)))$
\end{center}

\noindent This is the correct result, because $r_3$ needs
to use the right-hand alternative, and then $\ONE$ needs
to match the empty string and $c$ needs to match $c$.
Next we need to inject back the letter $b$ into $v_3$. This
gives

\begin{center}
$v_2:\;Seq(Empty, Seq(Char(b), Char(c)))$
\end{center}

\noindent which is again the correct result for how $r_2$
matched the string $bc$. Finally we need to inject back the
letter $a$ into $v_2$ giving the final result

\begin{center}
$v_1:\;Seq(Char(a), Seq(Char(b), Char(c)))$
\end{center}

\noindent This now corresponds to how the regular
expression $a \cdot (b \cdot c)$ matched the string $abc$.

There are a few auxiliary functions that are of interest
when analysing this algorithm. One is called \emph{flatten},
written $|\_|$, which extracts the string ``underlying'' a 
value. It is defined recursively as

\begin{center}
\begin{tabular}{lcl}
  $|Empty|$     & $\dn$ & $[]$\\
  $|Char(c)|$   & $\dn$ & $[c]$\\
  $|\Left(v)|$   & $\dn$ & $|v|$\\
  $|Right(v)|$  & $\dn$ & $|v|$\\
  $|Seq(v_1,v_2)|$& $\dn$ & $|v_1| \,@\, |v_2|$\\
  $|Stars\,[v_1,\ldots ,v_n]|$ & $\dn$ & $|v_1| \,@\ldots @\, |v_n|$\\
\end{tabular}
\end{center}

\noindent Using flatten we can see what are the strings behind 
the values calculated by $\textit{mkeps}$ and $\textit{inj}$. In our running 
example:

\begin{center}
\begin{tabular}{ll}
$|v_4|$: & $[]$\\
$|v_3|$: & $c$\\
$|v_2|$: & $bc$\\
$|v_1|$: & $abc$
\end{tabular}
\end{center}

\noindent This indicates that $\textit{inj}$ indeed is injecting, or
adding, back a character into the value. 

There is a problem, however, with the described algorithm
so far: it is very slow. We need to include the simplification 
from Lecture 2. This is what we shall do next.


\subsubsection*{Simplification}

Generally the matching algorithms based on derivatives do
poorly unless the regular expressions are simplified after
each derivative step. But this is a bit more involved in the
algorithm of Sulzmann \& Lu. So what follows might require you
to read several times before it makes sense and also might
require that you do some example calculations yourself. As a
first example consider the last derivation step in our earlier
example:

\begin{equation}\label{regexfour}
r_4 = \textit{der}\,c\,r_3 = 
(\ZERO \cdot (b \cdot c)) + ((\ZERO \cdot c) + \ONE)
\end{equation}

\noindent Simplifying this regular expression would just give
us $\ONE$. Running $\textit{mkeps}$ with this $\ONE$ as
input, however, would give us with the value $Empty$ instead of

\[Right(Right(Empty))\]

\noindent
that was obtained  without the simplification. The problem  is we need
to  recreate this  more  complicated value,  rather  than just  return
$Empty$. This will require what I call \emph{rectification functions}.
They  need  to  be  calculated  whenever  a  regular  expression  gets
simplified.

Rectification functions take a value as argument and return a
(rectified) value. In the example above the argument would be $Empty$
and the output $Right(Right(Empty))$.  Before we define these
rectification functions in general, let us first take a look again at
our simplification rules:

\begin{center}
\begin{tabular}{l@{\hspace{2mm}}c@{\hspace{2mm}}l}
$r \cdot \ZERO$ & $\mapsto$ & $\ZERO$\\ 
$\ZERO \cdot r$ & $\mapsto$ & $\ZERO$\\ 
$r \cdot \ONE$ & $\mapsto$ & $r$\\ 
$\ONE \cdot r$ & $\mapsto$ & $r$\\ 
$r + \ZERO$ & $\mapsto$ & $r$\\ 
$\ZERO + r$ & $\mapsto$ & $r$\\ 
$r + r$ & $\mapsto$ & $r$\\ 
\end{tabular}
\end{center}

\noindent Applying them to $r_4$ in \eqref{regexfour} will require several nested
simplifications in order end up with just $\ONE$. However,
it is possible to apply them in a depth-first, or inside-out,
manner in order to calculate this simplified regular
expression.

The rectification we can implement by letting simp return not
just a (simplified) regular expression, but also a
rectification function. Let us consider the alternative case,
$r_1 + r_2$, first. By going depth-first, we first simplify
the component regular expressions $r_1$ and $r_2.$ This will
return simplified versions, say $r_{1s}$ and $r_{2s}$, of the
component regular expressions (if they can be simplified) but
also two rectification functions $f_{1s}$ and $f_{2s}$. We
need to assemble them in order to obtain a rectified value for
$r_1 + r_2$. In case $r_{1s}$ simplified to $\ZERO$, we
continue the derivative calculation with $r_{2s}$. The
Sulzmann \& Lu algorithm would return a corresponding value,
say $v_{2s}$. But now this value needs to be ``rectified'' to
the value 

\begin{center}
$Right(v_{2s})$
\end{center}

\noindent The reason is that we look for the value that tells
us how $r_1 + r_2$ could have matched the string, not just
$r_2$ or $r_{2s}$. Unfortunately, this is still not the right
value in general because there might be some simplifications
that happened inside $r_2$ and for which the simplification
function retuned also a rectification function $f_{2s}$. So in
fact we need to apply this one too which gives

\begin{center}
$Right(f_{2s}(v_{2s}))$
\end{center}

\noindent This is now the correct, or rectified, value. Since
the simplification will be done in the first phase of the
algorithm, but the rectification needs to be done to the
values in the second phase, it is advantageous to calculate
the rectification as a function, remember this function and
then apply the value to this function during the second phase.
So if we want to implement the rectification as function, we 
would need to return

\begin{center}
$\lambda v.\,Right(f_{2s}(v))$
\end{center}

\noindent which is the lambda-calculus notation for
a function that expects a value $v$ and returns everything
after the dot where $v$ is replaced by whatever value is 
given.

Let us package this idea with rectification functions
into a single function (still only considering the alternative
case):

\begin{center}
\begin{tabular}{ll}
$_1$ & $simp(r)$:\\
$_2$ & \quad case $r = r_1 + r_2$\\
$_3$ & \qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\
$_4$ & \qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\
$_5$ & \qquad case $r_{1s} = \ZERO$: 
       return $(r_{2s}, \lambda v. \,Right(f_{2s}(v)))$\\
$_6$ & \qquad case $r_{2s} = \ZERO$: 
       return $(r_{1s}, \lambda v. \,\Left(f_{1s}(v)))$\\
$_7$ & \qquad case $r_{1s} = r_{2s}$:
       return $(r_{1s}, \lambda v. \,\Left(f_{1s}(v)))$\\
$_8$ & \qquad otherwise: 
       return $(r_{1s} + r_{2s}, f_{alt}(f_{1s}, f_{2s}))$\\
\end{tabular}
\end{center}

\noindent We first recursively call the simplification with
$r_1$ and $r_2$ (Lines 3 and 4). This gives simplified regular expressions,
$r_{1s}$ and $r_{2s}$, as well as two rectification functions
$f_{1s}$ and $f_{2s}$. We next need to test whether the
simplified regular expressions are $\ZERO$ so as to make
further simplifications. In case $r_{1s}$ is $\ZERO$ (Line 5),
then we can return $r_{2s}$ (the other alternative). However
for this we need to build a corresponding rectification 
function, which as said above is

\begin{center}
$\lambda v.\,Right(f_{2s}(v))$
\end{center}

\noindent The case where $r_{2s} = \ZERO$ is similar:
We return $r_{1s}$ and rectify with $\Left(\_)$ and the
other calculated rectification function $f_{1s}$. This gives

\begin{center}
$\lambda v.\,\Left(f_{1s}(v))$
\end{center}

\noindent The next case where $r_{1s} = r_{2s}$ can be treated
like the one where $r_{2s} = \ZERO$. We return $r_{1s}$
and rectify with $\Left(\_)$ and so on.


The otherwise-case is slightly more complicated. In this case
neither $r_{1s}$ nor $r_{2s}$ are $\ZERO$ and also
$r_{1s} \not= r_{2s}$, which means no further simplification
can be applied. Accordingly, we return $r_{1s} + r_{2s}$ as
the simplified regular expression. In principle we also do not
have to do any rectification, because no simplification was
done in this case. But this is actually not true: There might
have been simplifications inside $r_{1s}$ and $r_{2s}$. We
therefore need to take into account the calculated
rectification functions $f_{1s}$ and $f_{2s}$. We can do this
by defining a rectification function $f_{alt}$ which takes two
rectification functions as arguments and applies them
according to whether the value is of the form $\Left(\_)$ or
$Right(\_)$:

\begin{center}
\begin{tabular}{l@{\hspace{1mm}}l}
$f_{alt}(f_1, f_2) \dn$\\
\qquad $\lambda v.\,$ case $v = \Left(v')$: 
      & return $\Left(f_1(v'))$\\
\qquad \phantom{$\lambda v.\,$} case $v = Right(v')$: 
      & return $Right(f_2(v'))$\\      
\end{tabular}
\end{center}

The other interesting case with simplification is the sequence
case. In this case the main simplification function is as
follows

\begin{center}
\begin{tabular}{l}
$simp(r)$:\hspace{5cm} (continued)\\
\quad case $r = r_1 \cdot r_2$\\
\qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\
\qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\
\qquad case $r_{1s} = \ZERO$: 
       return $(\ZERO, f_{error})$\\
\qquad case $r_{2s} = \ZERO$: 
       return $(\ZERO, f_{error})$\\
\qquad case $r_{1s} = \ONE$: 
return $(r_{2s}, \lambda v. \,Seq(f_{1s}(Empty), f_{2s}(v)))$\\
\qquad case $r_{2s} = \ONE$: 
return $(r_{1s}, \lambda v. \,Seq(f_{1s}(v), f_{2s}(Empty)))$\\
\qquad otherwise: 
       return $(r_{1s} \cdot r_{2s}, f_{seq}(f_{1s}, f_{2s}))$\\
\end{tabular}
\end{center}

\noindent whereby in the last line $f_{seq}$ is again pushing
the two rectification functions into the two components of the
Seq-value:

\begin{center}
\begin{tabular}{l@{\hspace{1mm}}l}
$f_{seq}(f_1, f_2) \dn$\\
\qquad $\lambda v.\,$ case $v = Seq(v_1, v_2)$: 
      & return $Seq(f_1(v_1), f_2(v_2))$\\
\end{tabular}
\end{center}

\noindent Note that in the case of $r_{1s} = \ZERO$
(similarly $r_{2s}$) we use the function $f_{error}$ for
rectification. If you think carefully, then you will realise
that this function will actually never been called. This is
because a sequence with $\ZERO$ will never recognise any
string and therefore the second phase of the algorithm would
never been called. The simplification function still expects
us to give a function. So in my own implementation I just
returned a function that raises an error. In the case
where $r_{1s} = \ONE$ (similarly $r_{2s}$) we have
to create a sequence where the first component is a rectified
version of $Empty$. Therefore we call $f_{1s}$ with $Empty$.

Since we only simplify regular expressions of the form $r_1 +
r_2$ and $r_1 \cdot r_2$ we do not have to do anything else
in the remaining cases. The rectification function will
be just the identity, which in lambda-calculus terms is
just

\begin{center}
$\lambda v.\,v$
\end{center}

\noindent This completes the high-level version of the simplification
function, which is shown again in Figure~\ref{simp}. The Scala code
for the simplification function is in Figure~\ref{simprect}.

\begin{figure}[t]
\begin{center}
\begin{tabular}{l}
$simp(r)$:\\
\quad case $r = r_1 + r_2$\\
\qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\
\qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\
\qquad case $r_{1s} = \ZERO$: 
       return $(r_{2s}, \lambda v. \,Right(f_{2s}(v)))$\\
\qquad case $r_{2s} = \ZERO$: 
       return $(r_{1s}, \lambda v. \,\Left(f_{1s}(v)))$\\
\qquad case $r_{1s} = r_{2s}$:
       return $(r_{1s}, \lambda v. \,\Left(f_{1s}(v)))$\\
\qquad otherwise: 
       return $(r_{1s} + r_{2s}, f_{alt}(f_{1s}, f_{2s}))$
       \medskip\\

\quad case $r = r_1 \cdot r_2$\\
\qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\
\qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\
\qquad case $r_{1s} = \ZERO$: 
       return $(\ZERO, f_{error})$\\
\qquad case $r_{2s} = \ZERO$: 
       return $(\ZERO, f_{error})$\\
\qquad case $r_{1s} = \ONE$: 
return $(r_{2s}, \lambda v. \,Seq(f_{1s}(Empty), f_{2s}(v)))$\\
\qquad case $r_{2s} = \ONE$: 
return $(r_{1s}, \lambda v. \,Seq(f_{1s}(v), f_{2s}(Empty)))$\\
\qquad otherwise: 
       return $(r_{1s} \cdot r_{2s}, f_{seq}(f_{1s}, f_{2s}))$
       \medskip\\

\quad otherwise:\\
\qquad return $(r, \lambda v.\,v)$\\
\end{tabular}
\end{center}
\caption{The simplification function that returns a simplified 
regular expression and a rectification function.\label{simp}}
\end{figure}

\begin{figure}[p]
\small  
\lstinputlisting[numbers=left,linebackgroundcolor=
   {\ifodd\value{lstnumber}\color{capri!3}\fi}]{../progs/app61.scala}

\caption{The Scala code for the simplification function. The
first part defines some auxillary functions for the rectification.
The second part give the simplification function.
\label{simprect}}
\end{figure}


We are now in the position to define a \emph{lexing function} as follows:

\begin{center}
\begin{tabular}{lcl}
$lex\,r\,[]$ & $\dn$ & if $\textit{nullable}(r)$ then $\textit{mkeps}(r)$\\
             &       & else $error$\\
$lex\,r\,c\!::\!s$ & $\dn$ & let 
   $(r_{simp}, f_{rect}) = simp(\textit{der}(c, r))$\\
& & $\textit{inj}\,r\,c\,f_{rect}(lex\,r_{simp}\,s)$              
\end{tabular}
\end{center}

\noindent This corresponds to the $matches$ function we have
seen in earlier lectures. In the first clause we are given an
empty string, $[]$, and need to test wether the regular
expression is $nullable$. If yes, we can proceed normally and
just return the value calculated by $\textit{mkeps}$. The second clause
is for strings where the first character is $c$, say, and the
rest of the string is $s$. We first build the derivative of
$r$ with respect to $c$; simplify the resulting regular
expression. We continue lexing with the simplified regular
expression and the string $s$. Whatever will be returned as
value, we sill need to rectify using the $f_{rect}$ from the
simplification and finally inject $c$ back into the
(rectified) value.


\subsubsection*{Records}

Remember we wanted to tokenize input strings, that means
splitting strings into their ``word'' components. Furthermore
we want to classify each token as being a keyword or
identifier and so on. For this one more feature will be
required, which I call a \emph{record} regular expression.
While values encode how a regular expression matches a string,
records can be used to ``focus'' on some particular parts of
the regular expression and ``forget'' about others. 

Let us look at an example. Suppose you have the regular
expression $a\cdot b + a\cdot c$. Clearly this regular expression can only
recognise two strings. But suppose you are not interested
whether it can recognise $ab$ or $ac$, but rather if it
matched, then what was the last character of the matched
string\ldots either $b$ or $c$. You can do this by annotating
the regular expression with a record, written in general
$(x:r)$, where $x$ is just an identifier (in my implementation
a plain string) and $r$ is a regular expression. A record will
be regarded as a regular expression. The extended definition
in Scala therefore looks as follows:

{\small\lstinputlisting[language=Scala, numbers=none,linebackgroundcolor=
                  {\ifodd\value{lstnumber}\color{capri!3}\fi}]
{../progs/app03.scala}}

\noindent Since we regard records as regular expressions we
need to extend the functions $\textit{nullable}$ and $\textit{der}$. Similarly
$\textit{mkeps}$ and $\textit{inj}$ need to be extended. This means we also need
to extend the definition of values, which in Scala looks as
follows:

{\small\lstinputlisting[language=Scala, numbers=none,linebackgroundcolor=
                  {\ifodd\value{lstnumber}\color{capri!3}\fi}]
{../progs/app04.scala}}

\noindent Let us now look at the purpose of records more
closely and let us return to our question whether the string
terminated in a $b$ or $c$. We can do this as follows: we
annotate the regular expression $ab + ac$ with a record
as follows

\begin{center}
$a\cdot (x:b) + a\cdot (x:c)$
\end{center}

\noindent This regular expression can still only recognise
the strings $ab$ and $ac$, but we can now use a function
that takes a value and returns all records. I call this
function \emph{env} for environment\ldots it builds a list
of identifiers associated with a string. This function
can be defined as follows:

\begin{center}
\begin{tabular}{lcl}
  $env(Empty)$     & $\dn$ & $[]$\\
  $env(Char(c))$   & $\dn$ & $[]$\\
  $env(\Left(v))$   & $\dn$ & $env(v)$\\
  $env(Right(v))$  & $\dn$ & $env(v)$\\
  $env(Seq(v_1,v_2))$& $\dn$ & $env(v_1) \,@\, env(v_2)$\\
  $env(Stars\,[v_1,\ldots ,v_n])$ & $\dn$ & 
     $env(v_1) \,@\ldots @\, env(v_n)$\\
  $env(Rec(x:v))$ & $\dn$ & $(x:|v|) :: env(v)$\\
\end{tabular}
\end{center}

\noindent where in the last clause we use the flatten function
defined earlier. As can be seen, the function $env$ ``picks''
out all underlying strings where a record is given. Since
there can be more than one, the environment will potentially
contain many ``records''. If we now postprocess the value
calculated by $lex$ extracting all records using $env$, we can
answer the question whether the last element in the string was
an $b$ or a $c$. Lets see this in action: if we use $a\cdot b
+ a\cdot c$ and $ac$ the calculated value will be

\begin{center}
$Right(Seq(Char(a), Char(c)))$
\end{center}

\noindent If we use instead $a\cdot (x:b) + a\cdot (x:c)$ and
use the $env$ function to extract the recording for 
$x$ we obtain

\begin{center}
$[(x:c)]$
\end{center}

\noindent If we had given the string $ab$ instead, then the
record would have been $[(x:b)]$. The fun starts if we 
iterate this. Consider the regular expression 

\begin{center}
$(a\cdot (x:b) + a\cdot (y:c))^*$
\end{center}

\noindent and the string $ababacabacab$. This string is 
clearly matched by the regular expression, but we are only
interested in the sequence of $b$s and $c$s. Using $env$
we obtain

\begin{center}
$[(x:b), (x:b), (y:c), (x:b), (y:c), (x:b)]$
\end{center}

\noindent While this feature might look silly, it is in fact
quite useful. For example if we want to match the name of
an email we might use the regular expression

\[
(name: [a\mbox{-}z0\mbox{-}9\_\!\_\,.-]^+)\cdot @\cdot 
(domain: [a\mbox{-}z0\mbox{-}9\,.-]^+)\cdot .\cdot 
(top\_level: [a\mbox{-}z\,.]^{\{2,6\}})
\]

\noindent Then if we match the email address

\[
\texttt{christian.urban@kcl.ac.uk}
\]

\noindent we can use the $env$ function and find out
what the name, domain and top-level part of the email
address are:

\begin{center}
$[(name:\texttt{christian.urban}), 
  (domain:\texttt{kcl}), 
  (top\_level:\texttt{ac.uk})]$
\end{center}

Recall that we want to lex a little programming language,
called the \emph{While}-language. A simple program in this
language is shown in Figure~\ref{while}. The main keywords in
this language are \pcode{while}, \pcode{if}, \pcode{then} and
\pcode{else}. As usual we have syntactic categories for
identifiers, operators, numbers and so on. For this we would
need to design the corresponding regular expressions to
recognise these syntactic categories. I let you do this design
task. Having these regular expressions at our disposal, we can
form the regular expression

\begin{figure}[t]
\begin{center}
\mbox{\lstinputlisting[language=while]{../progs/fib.while}}
\end{center}
\caption{The Fibonacci program in the While-language.\label{while}}
\end{figure}

\begin{center}
$\textit{WhileRegs} \;\dn\; 
\left(\begin{array}{l}
       (k, KeyWords)\; +\\
       (i, Ids)\;+\\
       (o, Ops)\;+ \\
       (n, Nums)\;+ \\
       (s, Semis)\;+ \\
       (p, (LParens + RParens))\;+\\ 
       (b, (Begin + End))\;+ \\
       (w, WhiteSpacess)
      \end{array}\right)^{\mbox{\LARGE{}*}}$
\end{center}

\noindent and ask the algorithm by Sulzmann \& Lu to lex, say
the following string

\begin{center}\large
\code{"if true then then 42 else +"}
\end{center}

\noindent
By using the records and extracting the environment, the 
result is the following list:

\begin{center}\tt
\begin{tabular}{l}
KEYWORD(if),\\ 
WHITESPACE,\\ 
IDENT(true),\\ 
WHITESPACE,\\ 
KEYWORD(then),\\ 
WHITESPACE,\\ 
KEYWORD(then),\\ 
WHITESPACE,\\ 
NUM(42),\\ 
WHITESPACE,\\ 
KEYWORD(else),\\ 
WHITESPACE,\\ 
OP(+)
\end{tabular}
\end{center}

\noindent Typically we are not interested in the whitespaces 
and comments and would filter them out: this gives

\begin{center}\tt
\begin{tabular}{l}
KEYWORD(if),\\ 
IDENT(true),\\ 
KEYWORD(then),\\ 
KEYWORD(then),\\ 
NUM(42),\\ 
KEYWORD(else),\\ 
OP(+)
\end{tabular}
\end{center}

\noindent
which will be the input for the next phase of our compiler.

\end{document}

%%% Local Variables: 
%%% mode: latex
%%% TeX-master: t
%%% End:
author	Christian Urban <urbanc@in.tum.de>
	Sun, 21 May 2017 00:43:02 +0100 (2017-05-20)
changeset 492	39b7ff2cf1bc
parent 490	4fee50f38305
child 520	2849c305b12d
permissions	-rw-r--r--