handouts/ho04.tex
author Christian Urban <christian dot urban at kcl dot ac dot uk>
Fri, 17 Oct 2014 13:44:50 +0100
changeset 285 8a222559278f
parent 284 0afe43616b6a
child 286 19020b75d75e
permissions -rw-r--r--
updated

\documentclass{article}
\usepackage{../style}
\usepackage{../langs}
\usepackage{../graphics}


\begin{document}

\section*{Handout 4 (Sulzmann \& Lu Algorithm)}

So far our algorithm based on derivatives was only able to say
yes or no depending on whether a string was matched by regular
expression or not. Often a more interesting question is to
find out \emph{how} a regular expression matched a string?
Answering this question will help us with the problem we are
after, namely tokenising an input string. The algorithm we
will be looking at for this was designed by Sulzmann \& Lu in
a rather recent paper. A link to it is provided on KEATS, in
case you are interested.\footnote{In my humble opinion this is
an interesting instance of the research literature: it
contains a very neat idea, but its presentation is rather
sloppy. In earlier versions of their paper, students and I
found several rather annoying typos in their examples and
definitions.}

In order to give an answer for how a regular expression
matched a string, Sulzmann and Lu introduce \emph{values}. A
value will be the output of the algorithm whenever the regular
expression matches the string. If not, an error will be
raised. Since the first phase of the algorithm by Sulzmann \&
Lu is identical to the derivative based matcher from the first
coursework, the function $nullable$ will be used to decide
whether as string is matched by a regular expression. If
$nullable$ says yes, then values are constructed that reflect
how the regular expression matched the string. The definitions
for regular expressions $r$ and values $v$ is shown next to
each other below:

\begin{center}
\begin{tabular}{cc}
\begin{tabular}{@{}rrl@{}}
\multicolumn{3}{c}{regular expressions}\\
  $r$ & $::=$  & $\varnothing$\\
      & $\mid$ & $\epsilon$   \\
      & $\mid$ & $c$          \\
      & $\mid$ & $r_1 \cdot r_2$\\
      & $\mid$ & $r_1 + r_2$   \\
  \\
      & $\mid$ & $r^*$         \\
\end{tabular}
&
\begin{tabular}{@{\hspace{0mm}}rrl@{}}
\multicolumn{3}{c}{values}\\
   $v$ & $::=$  & \\
      &        & $Empty$   \\
      & $\mid$ & $Char(c)$          \\
      & $\mid$ & $Seq(v_1,v_2)$\\
      & $\mid$ & $Left(v)$   \\
      & $\mid$ & $Right(v)$  \\
      & $\mid$ & $[v_1,\ldots\,v_n]$ \\
\end{tabular}
\end{tabular}
\end{center}

\noindent The point is that there is a very strong
correspondence between them. There is no value for the
$\varnothing$ regular expression, since it does not match any
string. Otherwise there is exactly one value corresponding to
each regular expression with the exception of $r_1 + r_2$
where there are two values, namely $Left(v)$ and $Right(v)$
corresponding to the two alternatives. Note that $r^*$ is
associated with a list of values, one for each copy of $r$
that was needed to match the string. This means we might also
return the empty list $[]$, if no copy was needed.

Graphically the algorithm by Sulzmann \& Lu can be represneted
by the picture in Figure~\ref{Sulz} where the path from the
left to the right involving $der/nullable$ is the first phase
of the algorithm and $mkeps/inj$, the path from right to left,
the second phase. This picture shows the steps required when a
regular expression, say $r_1$, matches the string $abc$. We
first build the three derivatives (according to $a$, $b$ and
$c$). We then use $nullable$ to find out whether the resulting
regular expression can match the empty string. If yes we call
the function $mkeps$.

\begin{figure}[t]
\begin{center}
\begin{tikzpicture}[scale=2,node distance=1.2cm,
                    every node/.style={minimum size=7mm}]
\node (r1)  {$r_1$};
\node (r2) [right=of r1]{$r_2$};
\draw[->,line width=1mm](r1)--(r2) node[above,midway] {$der\,a$};
\node (r3) [right=of r2]{$r_3$};
\draw[->,line width=1mm](r2)--(r3) node[above,midway] {$der\,b$};
\node (r4) [right=of r3]{$r_4$};
\draw[->,line width=1mm](r3)--(r4) node[above,midway] {$der\,c$};
\draw (r4) node[anchor=west] {\;\raisebox{3mm}{$nullable$}};
\node (v4) [below=of r4]{$v_4$};
\draw[->,line width=1mm](r4) -- (v4);
\node (v3) [left=of v4] {$v_3$};
\draw[->,line width=1mm](v4)--(v3) node[below,midway] {$inj\,c$};
\node (v2) [left=of v3]{$v_2$};
\draw[->,line width=1mm](v3)--(v2) node[below,midway] {$inj\,b$};
\node (v1) [left=of v2] {$v_1$};
\draw[->,line width=1mm](v2)--(v1) node[below,midway] {$inj\,a$};
\draw (r4) node[anchor=north west] {\;\raisebox{-8mm}{$mkeps$}};
\end{tikzpicture}
\end{center}
\caption{The two phases of the algorithm by Sulzmann \& Lu.
\label{Sulz}}
\end{figure}

The $mkeps$ function calculates a value for how a regular
expression has matched the empty string. Its definition
is as follows:

\begin{center}
\begin{tabular}{lcl}
  $mkeps(\epsilon)$       & $\dn$ & $Empty$\\
  $mkeps(r_1 + r_2)$      & $\dn$ & if $nullable(r_1)$  \\
                          &       & then $Left(mkeps(r_1))$\\
                          &       & else $Right(mkeps(r_2))$\\
  $mkeps(r_1 \cdot r_2)$  & $\dn$ & $Seq(mkeps(r_1),mkeps(r_2))$\\
  $mkeps(r^*)$            & $\dn$ & $[]$  \\
\end{tabular}
\end{center}

\noindent There are no cases for $\epsilon$ and $c$, since
these regular expression cannot match the empty string. Note
also that in case of alternatives we give preference to the
regular expression on the left-hand side. This will become
important later on.

The second phase of the algorithm is organised recursively
such that it will calculate a value for how the derivative
regular expression has matched a string whose first character
has been chopped off. Now we need a function that reverses
this ``chopping off'' for values. The corresponding function
is called $inj$ for injection. This function takes three
arguments: the first one is a regular expression for which we
want to calculate the value, the second is the character we
want to inject and the third argument is the value where we
will inject the character. The result of this function is a
new value. The definition of $inj$ is as follows: 

\begin{center}
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
  $inj\,(c)\,c\,Empty$            & $\dn$  & $Char\,c$\\
  $inj\,(r_1 + r_2)\,c\,Left(v)$  & $\dn$  & $Left(inj\,r_1\,c\,v)$\\
  $inj\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$  & $Right(inj\,r_2\,c\,v)$\\
  $inj\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$  & $Seq(inj\,r_1\,c\,v_1,v_2)$\\
  $inj\,(r_1 \cdot r_2)\,c\,Left(Seq(v_1,v_2))$ & $\dn$  & $Seq(inj\,r_1\,c\,v_1,v_2)$\\
  $inj\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(mkeps(r_1),inj\,r_2\,c\,v)$\\
  $inj\,(r^*)\,c\,Seq(v,vs)$         & $\dn$  & $inj\,r\,c\,v\,::\,vs$\\
\end{tabular}
\end{center}

\noindent This definition is by recursion on the regular
expression and by analysing the shape of the values. Therefore
there are, for example, three cases for sequnece regular
expressions. The last clause for the star regular expression
returns a list where the first element is $inj\,r\,c\,v$ and
the other elements are $vs$. That mean $\_\,::\,\_$ should be 
read as list cons.

To understand what is going on, it might be best to do some
example calculations and compare with Figure~\ref{Sulz}. For
this note that we have not yet dealt with the need of
simplifying regular expressions (this will be a topic on its
own later). Suppose the regular expression is $a \cdot (b
\cdot c)$ and the input string is $abc$. The derivatives from
the first phase are as follows:

\begin{center}
\begin{tabular}{ll}
$r_1$: & $a \cdot (b \cdot c)$\\
$r_2$: & $\epsilon \cdot (b \cdot c)$\\
$r_3$: & $(\varnothing \cdot (b \cdot c)) + (\epsilon \cdot c)$\\
$r_4$: & $(\varnothing \cdot (b \cdot c)) + ((\varnothing \cdot c) + \epsilon)$\\
\end{tabular}
\end{center}

\noindent According to the simple algorithm, we would test
whether $r_4$ is nullable, which in this case it is. This
means we can use the function $mkeps$ to calculate a value for
how $r_4$ was able to match the empty string. Remember that
this function gives preference for alternatives on the
left-hand side. However there is only $\epsilon$ on the very
right-hand side of $r_4$ that matches the empty string.
Therefore $mkeps$ returns the value

\begin{center}
$v_4:\;Right(Right(Empty))$
\end{center}

\noindent The point is that from this value we can directly
read off which part of $r_4$ matched the empty string. Next we
have to ``inject'' the last character, that is $c$ in the
running example, into this value $v_4$ in order to calculate
how $r_3$ could have matched the string $c$. According to the
definition of $inj$ we obtain

\begin{center}
$v_3:\;Right(Seq(Empty, Char(c)))$
\end{center}

\noindent This is the correct result, because $r_3$ needs
to use the right-hand alternative, and then $\epsilon$ needs
to match the empty string and $c$ needs to match $c$.
Next we need to inject back the letter $b$ into $v_3$. This
gives

\begin{center}
$v_2:\;Seq(Empty, Seq(Char(b), Char(c)))$
\end{center}

\noindent Finally we need to inject back the letter $a$ into
$v_2$ giving the final result

\begin{center}
$v_1:\;Seq(Char(a), Seq(Char(b), Char(c)))$
\end{center}

\noindent This now corresponds to how the regular
expression $a \cdot (b \cdot c)$ matched the string $abc$.

There are a few auxiliary functions that are of interest
in analysing this algorithm. One is called \emph{flatten},
written $|\_|$, which extracts the string ``underlying'' a 
value. It is defined recursively as

\begin{center}
\begin{tabular}{lcl}
  $|Empty|$     & $\dn$ & $[]$\\
  $|Char(c)|$   & $\dn$ & $[c]$\\
  $|Left(v)|$   & $\dn$ & $|v|$\\
  $|Right(v)|$  & $\dn$ & $|v|$\\
  $|Seq(v_1,v_2)|$& $\dn$ & $|v_1| \,@\, |v_2|$\\
  $|[v_1,\ldots ,v_n]|$ & $\dn$ & $|v_1| \,@\ldots @\, |v_n|$\\
\end{tabular}
\end{center}

\noindent Using flatten we can see what is the string behind 
the values calculated by $mkeps$ and $inj$ in our running 
example are:

\begin{center}
\begin{tabular}{ll}
$|v_4|$: & $[]$\\
$|v_3|$: & $c$\\
$|v_2|$: & $bc$\\
$|v_1|$: & $abc$
\end{tabular}
\end{center}

\noindent 
This indicates that $inj$ indeed is injecting, or adding, back
a character into the value.


\subsubsection*{Simplification}

Generally the matching algorithms based on derivatives do
poorly unless the regular expressions are simplified after
each derivatives step. But this is a bit more involved in 
algorithm of Sulzmann \& Lu. Consider the last derivation
step in our running example

\begin{center}
$r_4 = der\,c\,r_3 = (\varnothing \cdot (b \cdot c)) + ((\varnothing \cdot c) + \epsilon)$
\end{center}

\noindent Simplifying the result would just give us
$\epsilon$. Running $mkeps$ on this regular expression would
then provide us with $Empty$ instead of $Right(Right(Empty))$
that was obtained without the simplification. The problem is
we need to recreate this more complicated value, rather than 
just $Empty$.

This requires what I call \emph{rectification functions}. They
need to be calculated whenever a regular expression gets
simplified. Rectification functions take a value as argument
and return a (rectified) value. Our simplification rules so
far are

\begin{center}
\begin{tabular}{l@{\hspace{2mm}}c@{\hspace{2mm}}l}
$r \cdot \varnothing$ & $\mapsto$ & $\varnothing$\\ 
$\varnothing \cdot r$ & $\mapsto$ & $\varnothing$\\ 
$r \cdot \epsilon$ & $\mapsto$ & $r$\\ 
$\epsilon \cdot r$ & $\mapsto$ & $r$\\ 
$r + \varnothing$ & $\mapsto$ & $r$\\ 
$\varnothing + r$ & $\mapsto$ & $r$\\ 
$r + r$ & $\mapsto$ & $r$\\ 
\end{tabular}
\end{center}

\noindent Applying them to $r_4$ will require several nested
simplifications in order end up with just $\epsilon$. 

We can implement this by letting simp return not just a
(simplified) regular expression, but also a rectification
function. Let us consider the alternative case, say $r_1 +
r_2$, first. We would first simplify the component regular
expressions $r_1$ and $r_2.$ This will return simplified
versions (if they can be simplified), say $r_{1s}$ and
$r_{2s}$, but also two rectification functions $f_{1s}$ and
$f_{2s}$. We need to assemble them in order to obtain a
rectified value for $r_1 + r_2$. In case $r_{1s}$ simplified
to $\varnothing$, we would continue the derivative calculation
with $r_{2s}$. The Sulzmann \& Lu algorithm would return a
corresponding value, say $v_{2s}$. But now this needs to
be ``rectified'' to the value 

\begin{center}
$Right(v_{2s})$
\end{center}

\noindent Unfortunately, this is not enough because there
might be some simplifications that happened inside $r_2$ 
and for which the simplification function retuned also
a rectification function $f_{2s}$. So in fact we need to
apply this one too which gives

\begin{center}
$Right(f_{2s}(v_{2s}))$
\end{center}

\noindent So if we want to return this as function,
we would need to return

\begin{center}
$\lambda v.\,Right(f_{2s}(v))$
\end{center}

\noindent which is the lambda-calculus notation for
a function that expects a value $v$ and returns everything
after the dot where $v$ is replaced by whatever value is 
given.

Let us package these ideas into a single function (still only
considering the alternative case):

\begin{center}
\begin{tabular}{l}
$simp(r)$:\\
\quad case $r = r_1 + r_2$\\
\qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\
\qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\
\qquad case $r_{1s} = \varnothing$: 
       return $(r_{2s}, \lambda v. \,Right(f_{2s}(v)))$\\
\qquad case $r_{2s} = \varnothing$: 
       return $(r_{1s}, \lambda v. \,Left(f_{1s}(v)))$\\
\qquad otherwise: 
       return $(r_{1s} + r_{2s}, f_{alt}(f_{1s}, f_{2s}))$\\
\end{tabular}
\end{center}

\noindent We first recursively call the simlification with $r_1$
and $r_2$. This gives simplified regular expressions, $r_{1s}$
and $r_{2s}$, as well as two rectification functions $f_{1s}$
and $f_{2s}$. We next need to test whether the simplified 
regular expressions are $\varnothing$ so as to make further 
simplifications. In case $r_{1s}$ is $\varnothing$ then 
we can return $r_{2s}$ (the other alternative). However
we need to now build a rectification function, which as
said above is

\begin{center}
$\lambda v.\,Right(f_{2s}(v))$
\end{center}

\noindent The case where $r_{2s} = \varnothing$ is similar.
We return $r_{1s}$ but now have to rectify such that we return

\begin{center}
$\lambda v.\,Left(f_{1s}(v))$
\end{center}

\noindent Note that in this case we have to apply $f_{1s}$,
not $f_{2s}$, which is responsible to rectify the inner parts
of $v$. The otherwise-case is slightly interesting. In this
case neither $r_{1s}$ nor $r_{2s}$ are $\varnothing$ and no
further simplification can be applied. Accordingly, we return
$r_{1s} + r_{2s}$ as the simplified regular expression. In
principle we also do not have to do any rectification, because
no simplification was done in this case. But this is actually
not true: There might have been simplifications inside
$r_{1s}$ and $r_2s$. We therefore need to take into account
the calculated rectification functions $f_{1s}$ and $f_{2s}$.
We can do this by defining a rectification function $f_{alt}$
which takes two rectification functions as arguments

\begin{center}
\begin{tabular}{l@{\hspace{1mm}}l}
$f_{alt}(f_1, f_2) \dn$\\
\qquad $\lambda v.\,$ case $v = Left(v')$: 
      & return $Left(f_1(v'))$\\
\qquad \phantom{$\lambda v.\,$} case $v = Right(v')$: 
      & return $Right(f_2(v'))$\\      
\end{tabular}
\end{center}

\noindent In essence we need to apply in this case 
the appropriate rectification function to the inner part
of the value $v$, wherevy $v$ can only be of the form 
$Right(\_)$ or $Left(\_)$.

\subsubsection*{Records and Tokenisation}

\newpage
Algorithm by Sulzmann, Lexing

\end{document}

%%% Local Variables: 
%%% mode: latex
%%% TeX-master: t
%%% End: