afl-material: handouts/ho04.tex@0afe43616b6a


\documentclass{article}
\usepackage{../style}
\usepackage{../langs}
\usepackage{../graphics}


\begin{document}

\section*{Handout 4 (Sulzmann \& Lu Algorithm)}

So far our algorithm based on derivatives was only able to say
yes or no depending on whether a string was matched by regular
expression or not. Often a more interesting question is to
find out \emph{how} a regular expression matched a string?
Answering this question will help us with the problem we are
after, namely tokenising an input string. The algorithm we
will be looking at for this was designed by Sulzmann \& Lu in
a rather recent paper. A link to it is provided on KEATS, in
case you are interested.\footnote{In my humble opinion this is
an interesting instance of the research literature: it
contains a very neat idea, but its presentation is rather
sloppy. In earlier versions of their paper, students and I
found several rather annoying typos in their examples and
definitions.}

In order to give an answer for how a regular expression
matched a string, Sulzmann and Lu introduce \emph{values}. A
value will be the output of the algorithm whenever the regular
expression matches the string. If not, an error will be
raised. Since the first phase of the algorithm by Sulzmann \&
Lu is identical to the derivative based matcher from the first
coursework, the function $nullable$ will be used to decide
whether as string is matched by a regular expression. If
$nullable$ says yes, then values are constructed that reflect
how the regular expression matched the string. The definitions
for regular expressions $r$ and values $v$ is shown next to
each other below:

\begin{center}
\begin{tabular}{cc}
\begin{tabular}{@{}rrl@{}}
\multicolumn{3}{c}{regular expressions}\\
  $r$ & $::=$  & $\varnothing$\\
      & $\mid$ & $\epsilon$   \\
      & $\mid$ & $c$          \\
      & $\mid$ & $r_1 \cdot r_2$\\
      & $\mid$ & $r_1 + r_2$   \\
  \\
      & $\mid$ & $r^*$         \\
\end{tabular}
&
\begin{tabular}{@{\hspace{0mm}}rrl@{}}
\multicolumn{3}{c}{values}\\
   $v$ & $::=$  & \\
      &        & $Empty$   \\
      & $\mid$ & $Char(c)$          \\
      & $\mid$ & $Seq(v_1,v_2)$\\
      & $\mid$ & $Left(v)$   \\
      & $\mid$ & $Right(v)$  \\
      & $\mid$ & $[v_1,\ldots\,v_n]$ \\
\end{tabular}
\end{tabular}
\end{center}

\noindent The point is that there is a very strong
correspondence between them. There is no value for the
$\varnothing$ regular expression, since it does not match any
string. Otherwise there is exactly one value corresponding to
each regular expression with the exception of $r_1 + r_2$
where there are two values, namely $Left(v)$ and $Right(v)$
corresponding to the two alternatives. Note that $r^*$ is
associated with a list of values, one for each copy of $r$
that was needed to match the string. This means we might also
return the empty list $[]$, if no copy was needed.

Graphically the algorithm by Sulzmann \& Lu can be represneted
by the picture in Figure~\ref{Sulz} where the path from the
left to the right involving $der/nullable$ is the first phase
of the algorithm and $mkeps/inj$, the path from right to left,
the second phase. This picture shows the steps required when a
regular expression, say $r_1$, matches the string $abc$. We
first build the three derivatives (according to $a$, $b$ and
$c$). We then use $nullable$ to find out whether the resulting
regular expression can match the empty string. If yes we call
the function $mkeps$.

\begin{figure}[t]
\begin{center}
\begin{tikzpicture}[scale=2,node distance=1.2cm,
                    every node/.style={minimum size=7mm}]
\node (r1)  {$r_1$};
\node (r2) [right=of r1]{$r_2$};
\draw[->,line width=1mm](r1)--(r2) node[above,midway] {$der\,a$};
\node (r3) [right=of r2]{$r_3$};
\draw[->,line width=1mm](r2)--(r3) node[above,midway] {$der\,b$};
\node (r4) [right=of r3]{$r_4$};
\draw[->,line width=1mm](r3)--(r4) node[above,midway] {$der\,c$};
\draw (r4) node[anchor=west] {\;\raisebox{3mm}{$nullable$}};
\node (v4) [below=of r4]{$v_4$};
\draw[->,line width=1mm](r4) -- (v4);
\node (v3) [left=of v4] {$v_3$};
\draw[->,line width=1mm](v4)--(v3) node[below,midway] {$inj\,c$};
\node (v2) [left=of v3]{$v_2$};
\draw[->,line width=1mm](v3)--(v2) node[below,midway] {$inj\,b$};
\node (v1) [left=of v2] {$v_1$};
\draw[->,line width=1mm](v2)--(v1) node[below,midway] {$inj\,a$};
\draw (r4) node[anchor=north west] {\;\raisebox{-8mm}{$mkeps$}};
\end{tikzpicture}
\end{center}
\caption{The two phases of the algorithm by Sulzmann \& Lu.
\label{Sulz}}
\end{figure}

The $mkeps$ function calculates a value for how a regular
expression has matched the empty string. Its definition
is as follows:

\begin{center}
\begin{tabular}{lcl}
  $mkeps(\epsilon)$       & $\dn$ & $Empty$\\
  $mkeps(r_1 + r_2)$      & $\dn$ & if $nullable(r_1)$  \\
                          &       & then $Left(mkeps(r_1))$\\
                          &       & else $Right(mkeps(r_2))$\\
  $mkeps(r_1 \cdot r_2)$  & $\dn$ & $Seq(mkeps(r_1),mkeps(r_2))$\\
  $mkeps(r^*)$            & $\dn$ & $[]$  \\
\end{tabular}
\end{center}

\noindent There are no cases for $\epsilon$ and $c$, since
these regular expression cannot match the empty string. Note
also that in case of alternatives we give preference to the
regular expression on the left-hand side. This will become
important later on.

The second phase of the algorithm is organised recursively
such that it will calculate a value for how the derivative
regular expression has matched a string whose first character
has been chopped off. Now we need a function that reverses
this ``chopping off'' for values. The corresponding function
is called $inj$ for injection. This function takes three
arguments: the first one is a regular expression for which we
want to calculate the value, the second is the character we
want to inject and the third argument is the value where we
will inject the character. The result of this function is a
new value. The definition of $inj$ is as follows: 

\begin{center}
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
  $inj\,(c)\,c\,Empty$            & $\dn$  & $Char\,c$\\
  $inj\,(r_1 + r_2)\,c\,Left(v)$  & $\dn$  & $Left(inj\,r_1\,c\,v)$\\
  $inj\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$  & $Right(inj\,r_2\,c\,v)$\\
  $inj\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$  & $Seq(inj\,r_1\,c\,v_1,v_2)$\\
  $inj\,(r_1 \cdot r_2)\,c\,Left(Seq(v_1,v_2))$ & $\dn$  & $Seq(inj\,r_1\,c\,v_1,v_2)$\\
  $inj\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(mkeps(r_1),inj\,r_2\,c\,v)$\\
  $inj\,(r^*)\,c\,Seq(v,vs)$         & $\dn$  & $inj\,r\,c\,v\,::\,vs$\\
\end{tabular}
\end{center}

\noindent This definition is by recursion on the regular
expression and by analysing the shape of the values. Therefore
there are, for example, three cases for sequnece regular
expressions. The last clause for the star regular expression
returns a list where the first element is $inj\,r\,c\,v$ and
the other elements are $vs$. That mean $\_\,::\,\_$ should be 
read as list cons.

To understand what is going on, it might be best to do some
example calculations and compare with Figure~\ref{Sulz}. For
this note that we have not yet dealt with the need of
simplifying regular expressions (this will be a topic on its
own later). Suppose the regular expression is $a \cdot (b
\cdot c)$ and the input string is $abc$. The derivatives from
the first phase are as follows:

\begin{center}
\begin{tabular}{ll}
$r_1$: & $a \cdot (b \cdot c)$\\
$r_2$: & $\epsilon \cdot (b \cdot c)$\\
$r_3$: & $(\varnothing \cdot (b \cdot c)) + (\epsilon \cdot c)$\\
$r_4$: & $(\varnothing \cdot (b \cdot c)) + ((\varnothing \cdot c) + \epsilon)$\\
\end{tabular}
\end{center}

\noindent According to the simple algorithm, we would test
whether $r_4$ is nullable, which in this case it is. This
means we can use the function $mkeps$ to calculate a value for
how $r_4$ was able to match the empty string. Remember that
this function gives preference for alternatives on the
left-hand side. However there is only $\epsilon$ on the very
right-hand side of $r_4$ that matches the empty string.
Therefore $mkeps$ returns the value

\begin{center}
$v_4:\;Right(Right(Empty))$
\end{center}

\noindent The point is that from this value we can directly
read off which part of $r_4$ matched the empty string. Next we
have to ``inject'' the last character, that is $c$ in the
running example, into this value $v_4$ in order to calculate
how $r_3$ could have matched the string $c$. According to the
definition of $inj$ we obtain

\begin{center}
$v_3:\;Right(Seq(Empty, Char(c)))$
\end{center}

\noindent This is the correct result, because $r_3$ needs
to use the right-hand alternative, and then $\epsilon$ needs
to match the empty string and $c$ needs to match $c$.
Next we need to inject back the letter $b$ into $v_3$. This
gives

\begin{center}
$v_2:\;Seq(Empty, Seq(Char(b), Char(c)))$
\end{center}

\noindent Finally we need to inject back the letter $a$ into
$v_2$ giving the final result

\begin{center}
$v_1:\;Seq(Char(a), Seq(Char(b), Char(c)))$
\end{center}

\noindent This now corresponds to how the regular
expression $a \cdot (b \cdot c)$ matched the string $abc$.

There are a few auxiliary functions that are of interest
in analysing this algorithm. One is called \emph{flatten},
written $|\_|$, which extracts the string ``underlying'' a 
value. It is defined recursively as

\begin{center}
\begin{tabular}{lcl}
  $|Empty|$     & $\dn$ & $[]$\\
  $|Char(c)|$   & $\dn$ & $[c]$\\
  $|Left(v)|$   & $\dn$ & $|v|$\\
  $|Right(v)|$  & $\dn$ & $|v|$\\
  $|Seq(v_1,v_2)|$& $\dn$ & $|v_1| \,@\, |v_2|$\\
  $|[v_1,\ldots ,v_n]|$ & $\dn$ & $|v_1| \,@\ldots @\, |v_n|$\\
\end{tabular}
\end{center}

\noindent Using flatten we can see what is the string behind 
the values calculated by $mkeps$ and $inj$ in our running 
example are:

\begin{center}
\begin{tabular}{ll}
$|v_4|$: & $[]$\\
$|v_3|$: & $c$\\
$|v_2|$: & $bc$\\
$|v_1|$: & $abc$
\end{tabular}
\end{center}

\noindent 
This indicates that $inj$ indeed is injecting, or adding, back
a character into the value.


\subsubsection*{Simplification}

Generally the matching algorithms based on derivatives do
poorly unless the regular expressions are simplified after
each derivatives step. But this is a bit more involved in 
algorithm of Sulzmann \& Lu. Consider the last derivation
step in our running example

\begin{center}
$r_4 = der\,c\,r_3$
\end{center}

\noindent 
Simplifying the result would just give use $\epsilon$.
Running $mkeps$ on this regular expression would provide 
us with $Empty$ instead of $Right(Right(Empty))$ that
was obtained without the simplification. The problem is
we need to recreate this value, rather than just $Empty$.

This requires what I call \emph{rectification functions}. They
need to be calculated whenever a regular expression gets
simplified. Rectification functions take a value as argument
and return a (rectified) value. Our simplification rules so
far are

\begin{center}
\begin{tabular}{l@{\hspace{2mm}}c@{\hspace{2mm}}l}
$r \cdot \varnothing$ & $\mapsto$ & $\varnothing$\\ 
$\varnothing \cdot r$ & $\mapsto$ & $\varnothing$\\ 
$r \cdot \epsilon$ & $\mapsto$ & $r$\\ 
$\epsilon \cdot r$ & $\mapsto$ & $r$\\ 
$r + \varnothing$ & $\mapsto$ & $r$\\ 
$\varnothing + r$ & $\mapsto$ & $r$\\ 
$r + r$ & $\mapsto$ & $r$\\ 
\end{tabular}
\end{center}

\noindent Applying them to $r_4$ will require several nested
simplifications in order end up with just $\epsilon$. We can
implement this by letting simp return not just a (simplified)
regular expression, but also a rectification function. Let us
consider the alternative case, say $r_1 + r_2$, first. We
would first simplify the component regular expressions $r_1$
and $r_2.$ This will return simplified versions (if they can
be simplified), say $r_{1s}$ and $r_{2s}$, but also two
rectification functions $f_{1s}$ and $f_{2s}$. We need to
assemble them in order to obtain a rectified value for 
$r_1 + r_2$.

In case $r_{1s}$ simplified to $\varnothing$, we would
continue the derivative calculation with $r_{2s}$. The
Sulzmann \& Lu algorithm would return a corresponding value,
say $v_{2s}$.


\subsubsection*{Records and Tokenisation}

\newpage
Algorithm by Sulzmann, Lexing

\end{document}

%%% Local Variables: 
%%% mode: latex
%%% TeX-master: t
%%% End:
author	Christian Urban <christian dot urban at kcl dot ac dot uk>
	Fri, 17 Oct 2014 11:56:19 +0100
changeset 284	0afe43616b6a
parent 283	c14e5ebf0c3b
child 285	8a222559278f
permissions	-rw-r--r--