handouts/ho02.tex
author Christian Urban <christian dot urban at kcl dot ac dot uk>
Sun, 16 Nov 2014 18:05:14 +0000
changeset 307 1b86f6522697
parent 296 796b9b81ac8d
child 318 7975e4f0d4de
permissions -rw-r--r--
updated

\documentclass{article}
\usepackage{../style}
\usepackage{../langs}
\usepackage{../graphics}
\usepackage{../data}

\pgfplotsset{compat=1.11}
\begin{document}

\section*{Handout 2 (Regular Expression Matching)}

This lecture is about implementing a more efficient regular
expression matcher (the plots on the right)---more efficient
than the matchers from regular expression libraries in Ruby and
Python (the plots on the left). These plots show the running 
time for the evil regular expression $a?^{\{n\}}a^{\{n\}}$.
Note the different scales of the $x$-axes. 


\begin{center}
\begin{tabular}{@{}cc@{}}
\begin{tikzpicture}
\begin{axis}[xlabel={\pcode{a}s},ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=5cm, 
    legend entries={Python,Ruby},  
    legend pos=north west,
    legend cell align=left]
\addplot[blue,mark=*, mark options={fill=white}] 
  table {re-python.data};
\addplot[brown,mark=triangle*, mark options={fill=white}] 
  table {re-ruby.data};  
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}
  \begin{axis}[xlabel={\pcode{a}s},ylabel={time in secs},
    enlargelimits=false,
    xtick={0,3000,...,12000},
    xmax=12500,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=6.5cm,
    height=5cm]
\addplot[green,mark=square*,mark options={fill=white}] table {re2b.data};
\addplot[black,mark=square*,mark options={fill=white}] table {re3.data};
\end{axis}
\end{tikzpicture}
\end{tabular}
\end{center}\medskip


\noindent Having specified in the previous lecture what
problem our regular expression matcher, which we will call
\pcode{matches}, is supposed to solve, namely for any given
regular expression $r$ and string $s$ answer \textit{true} if
and only if

\[
s \in L(r)
\]

\noindent we can look at an algorithm to solve this problem.
Clearly we cannot use the function $L$ directly for this,
because in general the set of strings $L$ returns is infinite
(recall what $L(a^*)$ is). In such cases there is no way we
can implement an exhaustive test for whether a string is
member of this set or not. In contrast our matching algorithm
will mainly operate on the regular expression $r$ and string
$s$, which are both finite. Before we come to the matching
algorithm, however, let us have a closer look at what it means
when two regular expressions are equivalent.

\subsection*{Regular Expression Equivalences}

We already defined in Handout 1 what it means for two regular
expressions to be equivalent, namely if their meaning is the
same language:

\[
r_1 \equiv r_2 \;\dn\; L(r_1) = L(r_2)
\]

\noindent
It is relatively easy to verify that some concrete equivalences
hold, for example

\begin{center}
\begin{tabular}{rcl}
$(a + b)  + c$ & $\equiv$ & $a + (b + c)$\\
$a + a$        & $\equiv$ & $a$\\
$a + b$        & $\equiv$ & $b + a$\\
$(a \cdot b) \cdot c$ & $\equiv$ & $a \cdot (b \cdot c)$\\
$c \cdot (a + b)$ & $\equiv$ & $(c \cdot a) + (c \cdot b)$\\
\end{tabular}
\end{center}

\noindent
but also easy to verify that the following regular expressions
are \emph{not} equivalent

\begin{center}
\begin{tabular}{rcl}
$a \cdot a$ & $\not\equiv$ & $a$\\
$a + (b \cdot c)$ & $\not\equiv$ & $(a + b) \cdot (a + c)$\\
\end{tabular}
\end{center}

\noindent I leave it to you to verify these equivalences and
non-equivalences. It is also interesting to look at some
corner cases involving $\epsilon$ and $\varnothing$:

\begin{center}
\begin{tabular}{rcl}
$a \cdot \varnothing$ & $\not\equiv$ & $a$\\
$a + \epsilon$ & $\not\equiv$ & $a$\\
$\epsilon$ & $\equiv$ & $\varnothing^*$\\
$\epsilon^*$ & $\equiv$ & $\epsilon$\\
$\varnothing^*$ & $\not\equiv$ & $\varnothing$\\
\end{tabular}
\end{center}

\noindent Again I leave it to you to make sure you agree
with these equivalences and non-equivalences. 


For our matching algorithm however the following six
equivalences will play an important role:

\begin{center}
\begin{tabular}{rcl}
$r + \varnothing$  & $\equiv$ & $r$\\
$\varnothing + r$  & $\equiv$ & $r$\\
$r \cdot \epsilon$ & $\equiv$ & $r$\\
$\epsilon \cdot r$     & $\equiv$ & $r$\\
$r \cdot \varnothing$ & $\equiv$ & $\varnothing$\\
$\varnothing \cdot r$ & $\equiv$ & $\varnothing$\\
$r + r$ & $\equiv$ & $r$
\end{tabular}
\end{center}

\noindent which always hold no matter what the regular
expression $r$ looks like. The first two are easy to verify since
$L(\varnothing)$ is the empty set. The next two are also easy 
to verify since $L(\epsilon) = \{[]\}$ and appending the empty 
string to every string of another set, leaves the set 
unchanged. Be careful to fully comprehend the fifth and 
sixth equivalence: if you concatenate two sets of strings
and one is the empty set, then the concatenation will also be
the empty set. Check the definition of \pcode{_ @ _}.
The last equivalence is again trivial.

What will be important later on is that we can orient these
equivalences and read them from left to right. In this way we
can view them as \emph{simplification rules}. Suppose for 
example the regular expression 

\begin{equation}
(r_1 + \varnothing) \cdot \epsilon + ((\epsilon + r_2) + r_3) \cdot (r_4 \cdot \varnothing)
\label{big}
\end{equation}

\noindent If we can find an equivalent regular expression that
is simpler (smaller for example), then this might potentially 
make our matching algorithm run faster. The reason is that 
whether a string $s$ is in $L(r)$ or in $L(r')$ with $r\equiv r'$
will always give the same answer. In the example above you 
will see that the regular expression is equivalent to $r_1$
if you iteratively apply the simplification rules from above:

\begin{center}
\begin{tabular}{ll}
 & $(r_1 + \varnothing) \cdot \epsilon + ((\epsilon + r_2) + r_3) \cdot 
(\underline{r_4 \cdot \varnothing})$\smallskip\\
$\equiv$ & $(r_1 + \varnothing) \cdot \epsilon + \underline{((\epsilon + r_2) + r_3) \cdot 
\varnothing}$\smallskip\\
$\equiv$ & $\underline{(r_1 + \varnothing) \cdot \epsilon} + \varnothing$\smallskip\\
$\equiv$ & $(\underline{r_1 + \varnothing}) + \varnothing$\smallskip\\
$\equiv$ & $\underline{r_1 + \varnothing}$\smallskip\\
$\equiv$ & $r_1$\
\end{tabular}
\end{center}

\noindent In each step, I underlined where a simplification
rule is applied. Our matching algorithm in the next section
will often generate such ``useless'' $\epsilon$s and
$\varnothing$s, therefore simplifying them away will make the
algorithm quite a bit faster.

\subsection*{The Matching Algorithm}

The algorithm we will define below consists of two parts. One
is the function $nullable$ which takes a regular expression as
argument and decides whether it can match the empty string
(this means it returns a boolean in Scala). This can be easily
defined recursively as follows:

\begin{center}
\begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}}
$nullable(\varnothing)$      & $\dn$ & $\textit{false}$\\
$nullable(\epsilon)$         & $\dn$ & $true$\\
$nullable(c)$                & $\dn$ & $\textit{false}$\\
$nullable(r_1 + r_2)$     & $\dn$ &  $nullable(r_1) \vee nullable(r_2)$\\ 
$nullable(r_1 \cdot r_2)$ & $\dn$ &  $nullable(r_1) \wedge nullable(r_2)$\\
$nullable(r^*)$              & $\dn$ & $true$ \\
\end{tabular}
\end{center}

\noindent The idea behind this function is that the following
property holds:

\[
nullable(r) \;\;\text{if and only if}\;\; []\in L(r)
\]

\noindent Note on the left-hand side we have a function we can
implement; on the right we have its specification (which we
cannot implement in a programming language).

The other function of our matching algorithm calculates a
\emph{derivative} of a regular expression. This is a function
which will take a regular expression, say $r$, and a
character, say $c$, as argument and return a new regular
expression. Be careful that the intuition behind this function
is not so easy to grasp on first reading. Essentially this
function solves the following problem: if $r$ can match a
string of the form $c\!::\!s$, what does the regular
expression look like that can match just $s$. The definition
of this function is as follows:

\begin{center}
\begin{tabular}{l@ {\hspace{2mm}}c@ {\hspace{2mm}}l}
  $der\, c\, (\varnothing)$      & $\dn$ & $\varnothing$\\
  $der\, c\, (\epsilon)$         & $\dn$ & $\varnothing$ \\
  $der\, c\, (d)$                & $\dn$ & if $c = d$ then $\epsilon$ else $\varnothing$\\
  $der\, c\, (r_1 + r_2)$        & $\dn$ & $der\, c\, r_1 + der\, c\, r_2$\\
  $der\, c\, (r_1 \cdot r_2)$  & $\dn$  & if $nullable (r_1)$\\
  & & then $(der\,c\,r_1) \cdot r_2 + der\, c\, r_2$\\ 
  & & else $(der\, c\, r_1) \cdot r_2$\\
  $der\, c\, (r^*)$          & $\dn$ & $(der\,c\,r) \cdot (r^*)$
  \end{tabular}
\end{center}

\noindent The first two clauses can be rationalised as
follows: recall that $der$ should calculate a regular
expression, if the ``input'' regular expression can match a
string of the form $c\!::\!s$. Since neither $\varnothing$ nor
$\epsilon$ can match such a string we return $\varnothing$. In
the third case we have to make a case-distinction: In case the
regular expression is $c$, then clearly it can recognise a
string of the form $c\!::\!s$, just that $s$ is the empty
string. Therefore we return the $\epsilon$-regular expression.
In the other case we again return $\varnothing$ since no
string of the $c\!::\!s$ can be matched. Next come the
recursive cases. Fortunately, the $+$-case is still relatively
straightforward: all strings of the form $c\!::\!s$ are either
matched by the regular expression $r_1$ or $r_2$. So we just
have to recursively call $der$ with these two regular
expressions and compose the results again with $+$. Yes, makes
sense? The $\cdot$-case is more complicated: if $r_1\cdot r_2$
matches a string of the form $c\!::\!s$, then the first part
must be matched by $r_1$. Consequently, it makes sense to
construct the regular expression for $s$ by calling $der$ with
$r_1$ and ``appending'' $r_2$. There is however one exception
to this simple rule: if $r_1$ can match the empty string, then
all of $c\!::\!s$ is matched by $r_2$. So in case $r_1$ is
nullable (that is can match the empty string) we have to allow
the choice $der\,c\,r_2$ for calculating the regular
expression that can match $s$. Therefore we have to 
add the regular expression $der\,c\,r_2$.
The $*$-case is again simple:
if $r^*$ matches a string of the form $c\!::\!s$, then the
first part must be ``matched'' by a single copy of $r$.
Therefore we call recursively $der\,c\,r$ and ``append'' $r^*$
in order to match the rest of $s$.

If this did not make sense, here is another way to rationalise
the definition of $der$ by considering the following operation
on sets:

\[
Der\,c\,A\;\dn\;\{s\,|\,c\!::\!s \in A\}
\]

\noindent This operation essentially transforms a set of
strings $A$ by filtering out all strings that do not start
with $c$ and then strips off the $c$ from all the remaining
strings. For example suppose $A = \{f\!oo, bar, f\!rak\}$ then
\[ Der\,f\,A = \{oo, rak\}\quad,\quad Der\,b\,A = \{ar\} \quad
\text{and} \quad Der\,a\,A = \varnothing \]
 
\noindent
Note that in the last case $Der$ is empty, because no string in $A$
starts with $a$. With this operation we can state the following
property about $der$:

\[
L(der\,c\,r) = Der\,c\,(L(r))
\]

\noindent
This property clarifies what regular expression $der$ calculates,
namely take the set of strings that $r$ can match (that is $L(r)$),
filter out all strings not starting with $c$ and strip off the $c$
from the remaining strings---this is exactly the language that
$der\,c\,r$ can match.

If we want to find out whether the string $abc$ is matched by
the regular expression $r_1$ then we can iteratively apply $der$
as follows

\begin{center}
\begin{tabular}{rll}
Input: $r_1$, $abc$\medskip\\
Step 1: & build derivative of $a$ and $r_1$ & $(r_2 = der\,a\,r_1)$\smallskip\\
Step 2: & build derivative of $b$ and $r_2$ & $(r_3 = der\,b\,r_2)$\smallskip\\
Step 3: & build derivative of $c$ and $r_3$ & $(r_4 = der\,b\,r_3)$\smallskip\\
Step 4: & the string is exhausted; test & ($nullable(r_4)$)\\
        & whether $r_4$ can recognise the\\
        & empty string\smallskip\\
Output: & result of the test $\Rightarrow true \,\text{or}\, \textit{false}$\\        
\end{tabular}
\end{center}

\noindent Again the operation $Der$ might help to rationalise
this algorithm. We want to know whether $abc \in L(r_1)$. We
do not know yet. But lets assume it is. Then $Der\,a\,L(r_1)$
builds the set where all the strings not starting with $a$ are
filtered out. Of the remaining strings, the $a$ is stripped
off. Then we continue with filtering out all strings not
starting with $b$ and stripping off the $b$ from the remaining
strings, that means we build $Der\,b\,(Der\,a\,(L(r_1)))$.
Finally we filter out all strings not starting with $c$ and
strip off $c$ from the remaining string. This is
$Der\,c\,(Der\,b\,(Der\,a\,(L(r))))$. Now if $abc$ was in the 
original set ($L(r_1)$), then in $Der\,c\,(Der\,b\,(Der\,a\,(L(r))))$ 
must be the empty string. If not then $abc$ was not in the 
language we started with.

Our matching algorithm using $der$ and $nullable$ works
similarly, just using regular expression instead of sets. For
this we need to extend the notion of derivatives from
characters to strings. This can be done using the following
function, taking a string and regular expression as input and
a regular expression as output.

\begin{center}
\begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {\hspace{-10mm}}l@ {}}
  $\textit{ders}\, []\, r$     & $\dn$ & $r$ & \\
  $\textit{ders}\, (c\!::\!s)\, r$ & $\dn$ & $\textit{ders}\,s\,(der\,c\,r)$ & \\
  \end{tabular}
\end{center}

\noindent This function essentially iterates $der$ taking one
character at the time from the original string until it is
exhausted. Having $ders$ in place, we can finally define our
matching algorithm:

\[
matches\,s\,r = nullable(ders\,s\,r)
\]

\noindent
We can claim that

\[
matches\,s\,r\quad\text{if and only if}\quad s\in L(r)
\]

\noindent holds, which means our algorithm satisfies the
specification. Of course we can claim many things\ldots
whether the claim holds any water is a different question,
which for example is the point of the Strand-2 Coursework.

This algorithm was introduced by Janus Brzozowski in 1964. Its
main attractions are simplicity and being fast, as well as
being easily extendable for other regular expressions such as
$r^{\{n\}}$, $r^?$, $\sim{}r$ and so on (this is subject of
Strand-1 Coursework 1).

\subsection*{The Matching Algorithm in Scala}

Another attraction of the algorithm is that it can be easily
implemented in a functional programming language, like Scala.
Given the implementation of regular expressions in Scala shown
in the first lecture and handout, the functions and subfunctions
for \pcode{matches} are shown in Figure~\ref{scala1}.

\begin{figure}[p]
\lstinputlisting{../progs/app5.scala}
\caption{Scala implementation of the nullable and 
  derivatives functions.\label{scala1}}
\end{figure}

For running the algorithm with our favourite example, the evil
regular expression $a?^{\{n\}}a^{\{n\}}$, we need to implement
the optional regular expression and the exactly $n$-times
regular expression. This can be done with the translations

\lstinputlisting[numbers=none]{../progs/app51.scala}

\noindent Running the matcher with the example, we find it is
slightly worse then the matcher in Ruby and Python.
Ooops\ldots

\begin{center}
\begin{tikzpicture}
\begin{axis}[
    xlabel={\pcode{a}s},
    ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=30,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=6cm,
    height=5cm, 
    legend entries={Python,Ruby,Scala V1},  
    legend pos=outer north east,
    legend cell align=left  
]
\addplot[blue,mark=*, mark options={fill=white}] 
  table {re-python.data};
\addplot[brown,mark=pentagon*, mark options={fill=white}] 
  table {re-ruby.data};  
\addplot[red,mark=triangle*,mark options={fill=white}] 
  table {re1.data};  
\end{axis}
\end{tikzpicture}
\end{center}

\noindent Analysing this failure a bit we notice that 
for $a^{\{n\}}$ we generate quite big regular expressions:

\begin{center}
\begin{tabular}{rl}
1: & $a$\\
2: & $a\cdot a$\\
3: & $a\cdot a\cdot a$\\
& \ldots\\
13: & $a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a$\\
& \ldots
\end{tabular}
\end{center}

\noindent Our algorithm traverses such regular expressions at
least once every time a derivative is calculated. So having
large regular expressions will cause problems. This problem
is aggravated by $a?$ being represented as $a + \epsilon$.

We can fix this by having an explicit constructor for 
$r^{\{n\}}$. In Scala we would introduce a constructor like

\begin{center}
\code{case class NTIMES(r: Rexp, n: Int) extends Rexp}
\end{center}

\noindent With this we have a constant ``size'' regular
expression for our running example no matter how large $n$ 
is. This means we have to also add cases for $nullable$ and
$der$. Does the change have any effect?

\begin{center}
\begin{tikzpicture}
\begin{axis}[
    xlabel={\pcode{a}s},
    ylabel={time in secs},
    enlargelimits=false,
    xtick={0,100,...,1000},
    xmax=1000,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=9.5cm,
    height=5cm, 
    legend entries={Python,Ruby,Scala V1,Scala V2},  
    legend pos=outer north east,
    legend cell align=left  
]
\addplot[blue,mark=*, mark options={fill=white}] 
  table {re-python.data};
\addplot[brown,mark=pentagon*, mark options={fill=white}] 
  table {re-ruby.data};  
\addplot[red,mark=triangle*,mark options={fill=white}] 
  table {re1.data};  
\addplot[green,mark=square*,mark options={fill=white}] 
  table {re2b.data};
\end{axis}
\end{tikzpicture}
\end{center}

\noindent Now we are talking business! The modified matcher 
can within 30 seconds handle regular expressions up to
$n = 950$ before a StackOverflow is raised.

The moral is that our algorithm is rather sensitive to the
size of regular expressions it needs to handle. This is of
course obvious because both $nullable$ and $der$ need to
traverse the whole regular expression. There seems to be one
more source of making the algorithm run faster. The derivative
function often produces ``useless'' $\varnothing$s and
$\epsilon$s. To see this, consider $r = ((a \cdot b) + b)^*$
and the following two derivatives

\begin{center}
\begin{tabular}{l}
$der\,a\,r = ((\epsilon \cdot b) + \varnothing) \cdot r$\\
$der\,b\,r = ((\varnothing \cdot b) + \epsilon)\cdot r$\\
$der\,c\,r = ((\varnothing \cdot b) + \varnothing)\cdot r$
\end{tabular}
\end{center}

\noindent 
If we simplify them according to the simple rules from the
beginning, we can replace the right-hand sides by the 
smaller equivalent regular expressions

\begin{center}
\begin{tabular}{l}
$der\,a\,r \equiv b \cdot r$\\
$der\,b\,r \equiv r$\\
$der\,c\,r \equiv \varnothing$
\end{tabular}
\end{center}

\noindent I leave it to you to contemplate whether such a
simplification can have any impact on the correctness of our
algorithm (will it change any answers?). Figure~\ref{scala2}
gives a simplification function that recursively traverses a
regular expression and simplifies it according to the rules
given at the beginning. There are only rules for $+$, $\cdot$
and $n$-times (the latter because we added it in the second
version of our matcher). There is no rule for a star, because
empirical data and also a little thought showed that
simplifying under a star is waste of computation time. The
simplification function will be called after every derivation.
This additional step removes all the ``junk'' the derivative
function introduced. Does this improve the speed? You bet!!

\begin{figure}[p]
\lstinputlisting{../progs/app6.scala}
\caption{The simplification function and modified 
\texttt{ders}-function.\label{scala2}}
\end{figure}

\begin{center}
\begin{tikzpicture}
\begin{axis}[xlabel={\pcode{a}s},ylabel={time in secs},
    enlargelimits=false,
    xtick={0,2000,...,12000},
    xmax=12000,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=9cm,
    height=4cm, 
    legend entries={Scala V2,Scala V3}]
\addplot[green,mark=square*,mark options={fill=white}] table {re2b.data};
\addplot[black,mark=square*,mark options={fill=white}] table {re3.data};
\end{axis}
\end{tikzpicture}
\end{center}

\end{document}




%%% Local Variables: 
%%% mode: latex
%%% TeX-master: t
%%% End: