handouts/ho02.tex
author Christian Urban <christian.urban@kcl.ac.uk>
Fri, 24 Oct 2025 10:52:05 +0100
changeset 1018 ab6c61f82c91
parent 1004 99e89ad35d76
permissions -rw-r--r--
updated

% !TEX program = xelatex
\documentclass{article}
\usepackage{../style}
\usepackage{../langs}
\usepackage{../graphics}
\usepackage{../data}


\begin{document}
\fnote{\copyright{} Christian Urban, King's College London, 
  2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023, 2024}


\section*{Handout 2 (Regular Expression Matching)}

%\noindent
%{\bf Checklist}
%
%\begin{itemize}
%  \item You have understood the derivative-based matching algorithm.
%  \item You know how the derivative is related to the meaning of regular
%  expressions.
%  \item You can extend the algorithm to non-basic regular expressions.
%\end{itemize}\bigskip\bigskip\bigskip

\noindent
This lecture is about implementing a more efficient regular expression
matcher (the plots on the right below)---more efficient than the
matchers from regular expression libraries in Ruby, Python, JavaScript, Swift
and Java (the plots on the left).\footnote{Have a look at KEATS: students
last year contributed also data for the Dart language.}\medskip

\noindent
To start with let us look more closely at the experimental data: The
first pair of plots shows the running time for the regular expression
$(a^*)^*\cdot b$ and strings composed of $n$ \pcode{a}s, like
\[
\underbrace{\pcode{a}\ldots\pcode{a}}_{n} 
\]

\noindent
This means the regular expression actually does not match the strings.
The second pair of plots shows the running time for the regular
expressions of the form $a^?{}^{\{n\}}\cdot a^{\{n\}}$ and corresponding
strings composed of $n$ \pcode{a}s (this time the regular expressions
match the strings).  To see the substantial differences in the left and
right plots below, note the different scales of the $x$-axis.

  
\begin{center}
Graphs: $(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$
\begin{tabular}{@{}cc@{}}
\begin{tikzpicture}[baseline=(current bounding box.north)]
  \begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,0.0)}},
    ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4.5cm, 
    legend entries={Java 8, Python, JavaScript, Swift},  
    legend pos=north west,
    legend cell align=left]
\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
\addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
\addplot[magenta,mark=*, mark options={fill=white}] table {re-swift.data};
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}[baseline=(current bounding box.north)]
  \begin{axis}[
    xlabel={$n$},
    x label style={at={(1.1,0.0)}},
    %%xtick={0,1000000,...,5000000}, 
    ylabel={time in secs},
    enlargelimits=false,
    ymax=35,
    ytick={0,5,...,30},
    axis lines=left,
    %scaled ticks=false,
    width=6.5cm,
    height=4.5cm,
    legend entries={Our matcher},  
    legend pos=north east,
    legend cell align=left]
%\addplot[green,mark=square*,mark options={fill=white}] table {re2a.data};    
\addplot[black,mark=square*,mark options={fill=white}] table {re3a.data};
\end{axis}
\end{tikzpicture}
\end{tabular}
\end{center}\bigskip

\begin{center}
Graphs: $a^{?\{n\}} \cdot a^{\{n\}}$ and strings $\underbrace{a\ldots a}_{n}$\\
\begin{tabular}{@{}cc@{}}
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,0.0)}},
    ylabel={\small time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4.55cm, 
    legend entries={Python,Ruby},  
    legend pos=north west,
    legend cell align=left]
\addplot[blue,mark=*, mark options={fill=white}] table {re-python.data};
\addplot[brown,mark=triangle*, mark options={fill=white}] table {re-ruby.data};  
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}
  \begin{axis}[
    xlabel={$n$},
    x label style={at={(1.1,0.05)}},
    ylabel={\small time in secs},
    enlargelimits=false,
    xtick={0,2500,...,11000},
    xmax=12000,
    ymax=35,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=6.5cm,
    height=4.5cm,
    legend entries={Our matcher},  
    legend pos=north east,
    legend cell align=left]
%\addplot[green,mark=square*,mark options={fill=white}] table {re2.data};
\addplot[black,mark=square*,mark options={fill=white}] table {re3.data};
\end{axis}
\end{tikzpicture}
\end{tabular}
\end{center}
\bigskip

\noindent
In what follows we will use these regular expressions and strings as
running examples. There will be several versions (V1, V2, V3,\ldots)
of our matcher.\footnote{The corresponding files are
  \texttt{re1.sc}, \texttt{re2.sc} and so on. As usual, you can
  find the code on KEATS.}

Having specified in the previous lecture what
problem our regular expression matcher is supposed to solve,
namely for any given regular expression $r$ and string $s$
answer \textit{true} if and only if

\[
s \in L(r)
\]

\noindent we can look for an algorithm to solve this problem. Clearly
we cannot use the function $L$ directly for this, because in general
the set of strings $L$ returns is infinite (recall what $L(a^*)$ is).
In such cases there is no way we can implement an exhaustive test for
whether a string is member of this set or not. In contrast our
matching algorithm will operate on the regular expression $r$ and
string $s$, only, which are both finite objects. Before we explain 
the matching algorithm, let us have a closer look at what it
means when two regular expressions are equivalent.

\subsection*{Regular Expression Equivalences}

We already defined in Handout 1 what it means for two regular
expressions to be equivalent, namely whether their 
\emph{meaning} is the same language:

\[
r_1 \equiv r_2 \;\dn\; L(r_1) = L(r_2)
\]

\noindent
It is relatively easy to verify that some concrete equivalences
hold, for example

\begin{center}
\begin{tabular}{rcl}
$(a + b)  + c$ & $\equiv$ & $a + (b + c)$\\
$a + a$        & $\equiv$ & $a$\\
$a + b$        & $\equiv$ & $b + a$\\
$(a \cdot b) \cdot c$ & $\equiv$ & $a \cdot (b \cdot c)$\\
$c \cdot (a + b)$ & $\equiv$ & $(c \cdot a) + (c \cdot b)$\\
\end{tabular}
\end{center}

\noindent
but also easy to verify that the following regular expressions
are \emph{not} equivalent

\begin{center}
\begin{tabular}{rcl}
$a \cdot a$ & $\not\equiv$ & $a$\\
$a + (b \cdot c)$ & $\not\equiv$ & $(a + b) \cdot (a + c)$\\
\end{tabular}
\end{center}

\noindent I leave it to you to verify these equivalences and
non-equivalences. It is also interesting to look at some
corner cases involving $\ONE$ and $\ZERO$:

\begin{center}
\begin{tabular}{rcl}
$a \cdot \ZERO$ & $\not\equiv$ & $a$\\
$a + \ONE$ & $\not\equiv$ & $a$\\
$\ONE$ & $\equiv$ & $\ZERO^*$\\
$\ONE^*$ & $\equiv$ & $\ONE$\\
$\ZERO^*$ & $\not\equiv$ & $\ZERO$\\
\end{tabular}
\end{center}

\noindent Again I leave it to you to make sure you agree
with these equivalences and non-equivalences. 


For our matching algorithm however the following seven
equivalences will play an important role:

\begin{center}
\begin{tabular}{rcl}
$r + \ZERO$  & $\equiv$ & $r$\\
$\ZERO + r$  & $\equiv$ & $r$\\
$r \cdot \ONE$ & $\equiv$ & $r$\\
$\ONE \cdot r$     & $\equiv$ & $r$\\
$r \cdot \ZERO$ & $\equiv$ & $\ZERO$\\
$\ZERO \cdot r$ & $\equiv$ & $\ZERO$\\
$r + r$ & $\equiv$ & $r$
\end{tabular}
\end{center}

\noindent They always hold no matter what the regular expression $r$
looks like. The first two are easy to verify since $L(\ZERO)$ is the
empty set. The next two are also easy to verify since $L(\ONE) =
\{[]\}$ and appending the empty string to every string of another set,
leaves the set unchanged. Be careful to fully comprehend the fifth and
sixth equivalence: if you concatenate two sets of strings and one is
the empty set, then the concatenation will also be the empty set. To
see this, check the definition of $\_ @ \_$ for sets. The last
equivalence is again trivial.

What will be critical later on is that we can orient these
equivalences and read them from left to right. In this way we
can view them as \emph{simplification rules}. Consider for 
example the regular expression 

\begin{equation}
(r_1 + \ZERO) \cdot \ONE + ((\ONE + r_2) + r_3) \cdot (r_4 \cdot \ZERO)
\label{bbbig}
\end{equation}

\noindent If we can find an equivalent regular expression that is
simpler (that usually means smaller), then this might potentially make
our matching algorithm run faster. We can look for such a simpler, but
equivalent, regular expression $r'$ because whether a string $s$ is in
$L(r)$ or in $L(r')$ does not matter as long as $r\equiv r'$. Yes? \footnote{You have checked this for yourself? Your friendly lecturer might talk rubbish\ldots{}one never knows.}

In the example above you will see that the regular expression in
\eqref{bbbig} is equivalent to just $r_1$. You can verify this by
iteratively applying the simplification rules from above:

\begin{center}
\begin{tabular}{ll}
 & $(r_1 + \ZERO) \cdot \ONE + ((\ONE + r_2) + r_3) \cdot 
(\underline{r_4 \cdot \ZERO})$\smallskip\\
$\equiv$ & $(r_1 + \ZERO) \cdot \ONE + \underline{((\ONE + r_2) + r_3) \cdot 
\ZERO}$\smallskip\\
$\equiv$ & $\underline{(r_1 + \ZERO) \cdot \ONE} + \ZERO$\smallskip\\
$\equiv$ & $(\underline{r_1 + \ZERO}) + \ZERO$\smallskip\\
$\equiv$ & $\underline{r_1 + \ZERO}$\smallskip\\
$\equiv$ & $r_1$\
\end{tabular}
\end{center}

\noindent In each step, I underlined where a simplification
rule is applied. Our matching algorithm in the next section
will often generate such ``useless'' $\ONE$s and
$\ZERO$s, therefore simplifying them away will make the
algorithm quite a bit faster.

Finally here are three equivalences between regular expressions which are
not so obvious:

\begin{center}
\begin{tabular}{rcl}
$r^*$  & $\equiv$ & $\ONE + r\cdot r^*$\\
$(r_1 + r_2)^*$  & $\equiv$ & $r_1^* \cdot (r_2\cdot r_1^*)^*$\\
$(r_1 \cdot r_2)^*$ & $\equiv$ & $\ONE + r_1\cdot (r_2 \cdot r_1)^* \cdot r_2$\\
\end{tabular}
\end{center}

\noindent
We will not use them in our algorithm, but feel free to convince
yourself that they actually hold. As an aside, there has been a lot of
research on questions like: Can one always decide whether two regular
expressions are equivalent or not? What does an algorithm look like to
decide this efficiently? Surprisingly, many of such questions
turn out to be non-trivial problems.


\subsection*{The Matching Algorithm}

The regular expression matching algorithm we will introduce below
consists of two parts: One is the function $\textit{nullable}$ which
takes a regular expression as an argument and decides whether it can
match the empty string (this means it returns a boolean in
Scala). This can be easily defined recursively as follows:

\begin{center}
\begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}}
$\textit{nullable}(\ZERO)$      & $\dn$ & $\textit{false}$\\
$\textit{nullable}(\ONE)$         & $\dn$ & $\textit{true}$\\
$\textit{nullable}(c)$                & $\dn$ & $\textit{false}$\\
$\textit{nullable}(r_1 + r_2)$     & $\dn$ &  $\textit{nullable}(r_1) \vee \textit{nullable}(r_2)$\\ 
$\textit{nullable}(r_1 \cdot r_2)$ & $\dn$ &  $\textit{nullable}(r_1) \wedge \textit{nullable}(r_2)$\\
$\textit{nullable}(r^*)$              & $\dn$ & $\textit{true}$ \\
\end{tabular}
\end{center}

\noindent The idea behind this function is that the following
property holds:

\[
\textit{nullable}(r) \;\;\text{if and only if}\;\; []\in L(r)
\]

\noindent Note on the left-hand side of the if-and-only-if we have a
function we can implement, for example in Scala; on the right we have
its specification (which we cannot implement in a programming language).

The other function of our matching algorithm calculates a
\emph{derivative} of a regular expression. This is a function
which will take a regular expression, say $r$, and a
character, say $c$, as arguments and returns a new regular
expression. Be mindful that the intuition behind this function
is not so easy to grasp on first reading. Essentially this
function solves the following problem: if $r$ can match a
string of the form $c\!::\!s$, what does a regular
expression look like that can match just $s$? The definition
of this function is as follows:

\begin{center}
\begin{tabular}{l@ {\hspace{2mm}}c@ {\hspace{2mm}}l}
  $\textit{der}\, c\, (\ZERO)$      & $\dn$ & $\ZERO$\\
  $\textit{der}\, c\, (\ONE)$         & $\dn$ & $\ZERO$ \\
  $\textit{der}\, c\, (d)$                & $\dn$ & if $c = d$ then $\ONE$ else $\ZERO$\\
  $\textit{der}\, c\, (r_1 + r_2)$        & $\dn$ & $\textit{der}\, c\, r_1 + \textit{der}\, c\, r_2$\\
  $\textit{der}\, c\, (r_1 \cdot r_2)$  & $\dn$  & if $\textit{nullable} (r_1)$\\
  & & then $(\textit{der}\,c\,r_1) \cdot r_2 + \textit{der}\, c\, r_2$\\ 
  & & else $(\textit{der}\, c\, r_1) \cdot r_2$\\
  $\textit{der}\, c\, (r^*)$          & $\dn$ & $(\textit{der}\,c\,r) \cdot (r^*)$
  \end{tabular}
\end{center}

\noindent The first two clauses can be rationalised as follows: recall
that $\textit{der}$ should calculate a regular expression so that
provided the ``input'' regular expression can match a string of the
form $c\!::\!s$, we want a regular expression for $s$. Since neither
$\ZERO$ nor $\ONE$ can match a string of the form $c\!::\!s$, we
return $\ZERO$. In the third case we have to make a case-distinction:
In case the regular expression is $c$, then clearly it can recognise a
string of the form $c\!::\!s$, just that $s$ is the empty
string. Therefore we return the $\ONE$-regular expression in this
case, as it can match the empty string.  In the other case we again
return $\ZERO$ since no string of the $c\!::\!s$ can be matched. Next
come the recursive cases, which are a bit more involved. Fortunately,
the $+$-case is still relatively straightforward: all strings of the
form $c\!::\!s$ are either matched by the regular expression $r_1$ or
$r_2$. So we just have to recursively call $\textit{der}$ with these
two regular expressions and compose the results again with $+$. Makes
sense?


The $\cdot$-case is more complicated: if $r_1\cdot r_2$
matches a string of the form $c\!::\!s$, then the first part
must be matched by $r_1$. Consequently, it makes sense to
construct the regular expression for $s$ by calling $\textit{der}$ with
$r_1$ and ``appending'' $r_2$. There is however one exception
to this simple rule: if $r_1$ can match the empty string, then
all of $c\!::\!s$ is matched by $r_2$. Therefore in case $r_1$ is
nullable (that is can match the empty string) we have to allow
the choice $\textit{der}\,c\,r_2$ for calculating the regular
expression that can match $s$. This means we have to add the
regular expression $\textit{der}\,c\,r_2$ in the result. The $*$-case
is again simple: if $r^*$ matches a string of the form
$c\!::\!s$, then the first part must be ``matched'' by a
single copy of $r$. Therefore we call recursively $\textit{der}\,c\,r$
and ``append'' $r^*$ in order to match the rest of $s$. Still
makes sense?

If all this did not make sense yet, here is another way to explain the
definition of $\textit{der}$ by considering the following operation on
sets:

\begin{equation}\label{Der}
\textit{Der}\,c\,A\;\dn\;\{s\,|\,c\!::\!s \in A\}
\end{equation}

\noindent This operation essentially transforms a set of
strings $A$ by filtering out all strings that do not start
with $c$ and then strips off the $c$ from all the remaining
strings. For example suppose $A = \{f\!oo, bar, f\!rak\}$ then

\[ \textit{Der}\,f\,A = \{oo, rak\}\quad,\quad 
   \textit{Der}\,b\,A = \{ar\} \quad \text{and} \quad 
   \textit{Der}\,a\,A = \{\} 
\]
 
\noindent
Note that in the last case $\textit{Der}$ is empty, because no string in $A$
starts with $a$. With this operation we can state the following
property about $\textit{der}$:

\[
L(\textit{der}\,c\,r) = \textit{Der}\,c\,(L(r))
\]

\noindent
This property clarifies what regular expression $\textit{der}$ calculates,
namely take the set of strings that $r$ can match (that is $L(r)$),
filter out all strings not starting with $c$ and strip off the $c$
from the remaining strings---this is exactly the language that
$\textit{der}\,c\,r$ can match.

If we want to find out whether the string $abc$ is matched by
the regular expression $r_1$ then we can iteratively apply $\textit{der}$
as follows

\begin{center}
\begin{tabular}{rll}
Input: $r_1$, $abc$\medskip\\
Step 1: & build derivative of $a$ and $r_1$ & $(r_2 = \textit{der}\,a\,r_1)$\smallskip\\
Step 2: & build derivative of $b$ and $r_2$ & $(r_3 = \textit{der}\,b\,r_2)$\smallskip\\
Step 3: & build derivative of $c$ and $r_3$ & $(r_4 = \textit{der}\,c\,r_3)$\smallskip\\
Step 4: & the string is exhausted: & $(\textit{nullable}(r_4))$\\
        & test whether $r_4$ can recognise the\\
        & empty string\smallskip\\
Output: & result of this test $\Rightarrow \textit{true} \,\text{or}\, \textit{false}$\\        
\end{tabular}
\end{center}

\noindent Again the operation $\textit{Der}$ might help to rationalise
this algorithm. We want to know whether $abc \in L(r_1)$. We
do not know yet---but let us assume it is. Then $\textit{Der}\,a\,L(r_1)$
builds the set where all the strings not starting with $a$ are
filtered out. Of the remaining strings, the $a$ is stripped
off. So we should still have $bc$ in the set.
Then we continue with filtering out all strings not
starting with $b$ and stripping off the $b$ from the remaining
strings, that means we build $\textit{Der}\,b\,(\textit{Der}\,a\,(L(r_1)))$.
Finally we filter out all strings not starting with $c$ and
strip off $c$ from the remaining string. This is
$\textit{Der}\,c\,(\textit{Der}\,b\,(\textit{Der}\,a\,(L(r_1))))$. Now if $abc$ was in the 
original set ($L(r_1)$), then $\textit{Der}\,c\,(\textit{Der}\,b\,(\textit{Der}\,a\,(L(r_1))))$ 
must contain the empty string. If not, then $abc$ was not in the 
language we started with.

Our matching algorithm using $\textit{der}$ and $\textit{nullable}$ works
similarly, just using regular expressions instead of sets. In order to
define our algorithm we need to extend the notion of derivatives from single
characters to strings. This can be done using the following
function, taking a string and a regular expression as input and
a regular expression as output.

\begin{center}
\begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {\hspace{-10mm}}l@ {}}
  $\textit{ders}\, []\, r$     & $\dn$ & $r$ & \\
  $\textit{ders}\, (c\!::\!s)\, r$ & $\dn$ & $\textit{ders}\,s\,(\textit{der}\,c\,r)$ & \\
  \end{tabular}
\end{center}

\noindent This function iterates $\textit{der}$ taking one character at
the time from the original string until the string is exhausted.
Having $\textit{der}s$ in place, we can finally define our matching
algorithm:

\[
\textit{matcher}\,r\,s \dn \textit{nullable}(\textit{ders}\,s\,r)
\]

\noindent
and we can claim that

\[
\textit{matcher}\,r\,s\quad\text{if and only if}\quad s\in L(r)
\]

\noindent holds, which means our algorithm satisfies the
specification. Of course we can claim many things\ldots
whether the claim holds any water is a different question.

This algorithm was introduced by Janusz Brzozowski in 1964, but 
is more widely known only in the last 10 or so years. Its
main attractions are simplicity and being fast, as well as
being easily extendible for other regular expressions such as
$r^{\{n\}}$, $r^?$, $\sim{}r$ and so on (this is subject of
Coursework 1). 

\subsection*{The Matching Algorithm in Scala}

Another attraction of the algorithm is that it can be easily
implemented in a functional programming language, like Scala.
Given the implementation of regular expressions in Scala shown
in the first lecture and handout, the functions and subfunctions
for \pcode{matcher} are shown in Figure~\ref{scala1}.

\begin{figure}[p]
  \lstinputlisting[numbers=left,linebackgroundcolor=
                  {\ifodd\value{lstnumber}\color{capri!3}\fi}]
                  {../progs/app5.scala}
\caption{A Scala implementation of the \textit{nullable} and 
  the derivative function. These functions are easy to
  implement in functional programming languages. This is because pattern 
  matching and recursion allow us to mimic the mathematical
  definitions very closely. Nearly all functional
  programming languages support pattern matching and
  recursion out of the box.\label{scala1}}
\end{figure}


%Remember our second example involving the regular expression
%$(a^*)^* \cdot b$ which could not match strings of $n$ \texttt{a}s. 
%Java needed around 30 seconds to find this out a string with $n=28$.
%It seems our algorithm is doing rather well in comparison:
%
%\begin{center}
%\begin{tikzpicture}
%\begin{axis}[
%    title={Graph: $(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$},
%    xlabel={$n$},
%    x label style={at={(1.05,0.0)}},
%    ylabel={time in secs},
%    enlargelimits=false,
%    xtick={0,1000,...,6500},
%    xmax=6800,
%    ytick={0,5,...,30},
%    ymax=34,
%    scaled ticks=false,
%    axis lines=left,
%    width=8cm,
%    height=4.5cm, 
%    legend entries={Java,Scala V1},  
%    legend pos=north east,
%    legend cell align=left]
%\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
%\addplot[red,mark=triangle*,mark options={fill=white}] table {re1a.data};
%\end{axis}
%\end{tikzpicture}
%\end{center}
%
%\noindent
%This is not an error: it hardly takes more than half a second for
%strings up to the length of 6500. After that we receive a
%StackOverflow exception, but still\ldots

For running the algorithm with our first example, the evil
regular expression $a^?{}^{\{n\}}\cdot a^{\{n\}}$, we need to implement
the optional regular expression and the `exactly $n$-times
regular expression'. This can be done with the translations

\lstinputlisting[numbers=none]{../progs/app51.scala}

\noindent Running the matcher with this example, we find it is
slightly worse then the matcher in Ruby and Python.
Ooops\ldots

\begin{center}
\begin{tikzpicture}
\begin{axis}[    
    title={Graph: $a^{?\{n\}} \cdot a^{\{n\}}$ and strings $\underbrace{a\ldots a}_{n}$},
    xlabel={$n$},
    x label style={at={(1.05,0.0)}},
    ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=32,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=6cm,
    height=5cm, 
    legend entries={Python,Ruby,Scala V1},  
    legend pos=outer north east,
    legend cell align=left]
\addplot[blue,mark=*, mark options={fill=white}] table {re-python.data};
\addplot[brown,mark=pentagon*, mark options={fill=white}] table {re-ruby.data};  
\addplot[red,mark=triangle*,mark options={fill=white}] table {re1.data};  
\end{axis}
\end{tikzpicture}
\end{center}

\noindent Analysing this failure we notice that for $a^{\{n\}}$, for
example, we generate quite big regular expressions:

\begin{center}
\begin{tabular}{rl}
1: & $a$\\
2: & $a\cdot a$\\
3: & $a\cdot a\cdot a$\\
& \ldots\\
13: & $a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a$\\
& \ldots
\end{tabular}
\end{center}

\noindent Our algorithm traverses such regular expressions at
least once every time a derivative is calculated. So having
large regular expressions will cause problems. This problem
is aggravated by $a^?$ being represented as $a + \ONE$.

We can however fix this easily by having an explicit constructor for
$r^{\{n\}}$. In Scala we would introduce a constructor like

\begin{center}
\code{case NTIMES(r: Rexp, n: Int)}
\end{center}

\noindent With this fix we have a constant ``size'' regular expression
for our running example no matter how large $n$ is (see the
\texttt{size} section in the implementations).  This means we have to
also add cases for \pcode{NTIMES} in the functions $\textit{nullable}$
and $\textit{der}$. Does the change have any effect?

\begin{center}
\begin{tikzpicture}
\begin{axis}[
    title={Graph: $a^{?\{n\}} \cdot a^{\{n\}}$ and strings $\underbrace{a\ldots a}_{n}$},
    xlabel={$n$},
    x label style={at={(1.01,0.0)}},
    ylabel={time in secs},
    enlargelimits=false,
    xtick={0,200,...,1100},
    xmax=1200,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=10cm,
    height=5cm, 
    legend entries={Python,Ruby,Scala V1,Scala V2},  
    legend pos=outer north east,
    legend cell align=left]
\addplot[blue,mark=*, mark options={fill=white}] table {re-python.data};
\addplot[brown,mark=pentagon*, mark options={fill=white}] table {re-ruby.data};  
\addplot[red,mark=triangle*,mark options={fill=white}] table {re1.data};  
\addplot[green,mark=square*,mark options={fill=white}] table {re2.data};
\end{axis}
\end{tikzpicture}
\end{center}

\noindent Now we are talking business! The modified matcher can within
25 seconds handle regular expressions up to $n = 1,100$ before a
StackOverflow is raised. Recall that Python and Ruby (and our first
version, Scala V1) could only handle $n = 27$ or so in 30
seconds. We have not tried our algorithm on the second example $(a^*)^* \cdot
b$---I leave this to you.


The moral is that our algorithm is rather sensitive to the
size of regular expressions it needs to handle. This is of
course obvious because both $\textit{nullable}$ and $\textit{der}$ frequently
need to traverse the whole regular expression. There seems,
however, one more issue for making the algorithm run faster.
The derivative function often produces ``useless''
$\ZERO$s and $\ONE$s. To see this, consider $r = ((a
\cdot b) + b)^*$ and the following three derivatives

\begin{center}
\begin{tabular}{l}
$\textit{der}\,a\,r = ((\ONE \cdot b) + \ZERO) \cdot r$\\
$\textit{der}\,b\,r = ((\ZERO \cdot b) + \ONE)\cdot r$\\
$\textit{der}\,c\,r = ((\ZERO \cdot b) + \ZERO)\cdot r$
\end{tabular}
\end{center}

\noindent 
If we simplify them according to the simplification rules from the
beginning, we can replace the right-hand sides by the smaller
equivalent regular expressions

\begin{center}
\begin{tabular}{l}
$\textit{der}\,a\,r \equiv b \cdot r$\\
$\textit{der}\,b\,r \equiv r$\\
$\textit{der}\,c\,r \equiv \ZERO$
\end{tabular}
\end{center}

\noindent I leave it to you to contemplate whether such a
simplification can have any impact on the correctness of our algorithm
(will it change any answers?). Figure~\ref{scala2} gives a
simplification function that recursively traverses a regular
expression and simplifies it according to the rules given at the
beginning. There are only rules for $+$ and $\cdot$. There is 
no simplification rule for a star, because
empirical data and also a little thought showed that simplifying under
a star is a waste of computation time. The simplification function
will be called after every derivation.  This additional step removes
all the ``junk'' the derivative function introduced. Does this improve
the speed? You bet!!

\begin{figure}[p]
\lstinputlisting[numbers=left,linebackgroundcolor=
  {\ifodd\value{lstnumber}\color{capri!3}\fi}]
                {../progs/app6.scala}
\caption{The simplification function and modified 
\texttt{ders}-function; this function now
calls \texttt{der} first, but then simplifies
the resulting derivative regular expressions before
building the next derivative, see
Line~24.\label{scala2}}
\end{figure}

\begin{center}
\begin{tikzpicture}
\begin{axis}[
    title={Graph: $a^{?\{n\}} \cdot a^{\{n\}}$ and strings $\underbrace{a\ldots a}_{n}$},
    xlabel={$n$},
    x label style={at={(1.04,0.0)}},
    ylabel={time in secs},
    enlargelimits=false,
    xtick={0,2500,...,10000},
    xmax=12000,
    ytick={0,5,...,30},
    ymax=32,
    scaled ticks=false,
    axis lines=left,
    width=9cm,
    height=5cm, 
    legend entries={Scala V2,Scala V3},
    legend pos=outer north east,
    legend cell align=left]
\addplot[green,mark=square*,mark options={fill=white}] table {re2.data};
\addplot[black,mark=square*,mark options={fill=white}] table {re3.data};
\end{axis}
\end{tikzpicture}
\end{center}

\noindent
To recap, Python and Ruby needed approximately 30 seconds to match a
string of 28 \texttt{a}s and the regular expression $a^{?\{n\}} \cdot
a^{\{n\}}$.  We need a third of this time to do the same with strings
up to 11,000 \texttt{a}s.  Similarly, Java 8 and Python needed 30
seconds to find out the regular expression $(a^*)^* \cdot b$ does not
match the string of 28 \texttt{a}s. In Java 9 and later this has been 
cranked up to 39,000 \texttt{a}s, but we can do the same in the same 
amount of time for strings composed of nearly 6,000,000 \texttt{a}s. 
This is shown in the following plot.


\begin{center}
\begin{tikzpicture}
\begin{axis}[
    title={Graph: $(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$},
    xlabel={$n$},
    ylabel={time in secs},
    enlargelimits=false,
    ymax=35,
    ytick={0,5,...,30},
    axis lines=left,
    %%scaled ticks=false,
    x label style={at={(1.09,0.0)}},
    %%xmax=7700000,
    width=9cm,
    height=5cm, 
    legend entries={Scala V3},
    legend pos=outer north east,
    legend cell align=left]
%\addplot[green,mark=square*,mark options={fill=white}] table {re2a.data};
\addplot[black,mark=square*,mark options={fill=white}] table {re3a.data};
\end{axis}
\end{tikzpicture}
\end{center}

\subsection*{Epilogue}

(23/Aug/2016) I found another place where this algorithm can
be sped up (this idea is not integrated with what is coming next, but
I present it nonetheless). The idea is to not define \texttt{ders}
so that it iterates the derivative character-by-character, but in bigger
chunks. The resulting code for \texttt{ders2} looks as follows:

\lstinputlisting[numbers=none]{../progs/app52.scala} 

\noindent
I have not fully understood why this version is much faster,
but it seems it is a combination of the clauses for \texttt{ALT}
and \texttt{SEQ}. In the latter case we call \texttt{der} with 
a single character and this potentially produces an alternative.
The derivative of such an alternative can then be more efficiently  
calculated by \texttt{ders2} since it pushes a whole string
under an \texttt{ALT}. The numbers are that in the second case  
$(a^*)^* \cdot b$ both versions are pretty much the same, but in the 
first case $a^{?\{n\}} \cdot a^{\{n\}}$ the improvement gives 
another factor of 100 speedup. Nice!

\begin{center}
\begin{tabular}{cc}
\begin{tikzpicture}
\begin{axis}[
    title={Graph: $a^{?\{n\}} \cdot a^{\{n\}}$ and strings $\underbrace{a\ldots a}_{n}$},
    xlabel={$n$},
    x label style={at={(1.04,0.0)}},
    ylabel={time in secs},
    enlargelimits=false,
    xmax=7100000,
    ytick={0,5,...,30},
    ymax=33,
    %scaled ticks=false,
    axis lines=left,
    width=5.3cm,
    height=5cm, 
    legend entries={Scala V3, Scala V4},
    legend style={at={(0.1,-0.2)},anchor=north}]
\addplot[black,mark=square*,mark options={fill=white}] table {re3.data};
\addplot[purple,mark=square*,mark options={fill=white}] table {re4.data};
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}
\begin{axis}[
    title={Graph: $(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$},
    xlabel={$n$},
    x label style={at={(1.09,0.0)}},
    ylabel={time in secs},
    enlargelimits=false,
    xmax=8200000,
    ytick={0,5,...,30},
    ymax=33,
    %scaled ticks=false,
    axis lines=left,
    width=5.3cm,
    height=5cm, 
    legend entries={Scala V3, Scala V4},
    legend style={at={(0.1,-0.2)},anchor=north}]
\addplot[black,mark=square*,mark options={fill=white}] table {re3a.data};
\addplot[purple,mark=square*,mark options={fill=white}] table {re4a.data};
\end{axis}
\end{tikzpicture}
\end{tabular}
\end{center}


\section*{Proofs}

You might not like doing proofs. But they serve a very
important purpose in Computer Science: How can we be sure that
our algorithm matches its specification? We can try to test
the algorithm, but that often overlooks corner cases and an
exhaustive testing is impossible (since there are infinitely
many inputs). Proofs allow us to ensure that an algorithm
really meets its specification. 

For the programs we look at in this module, the proofs will
mostly by some form of induction. Remember that regular
expressions are defined as 

\begin{center}
\begin{tabular}{r@{\hspace{1mm}}r@{\hspace{1mm}}l@{\hspace{13mm}}l}
  $r$ & $::=$ &   $\ZERO$         & nothing\\
        & $\mid$ & $\ONE$           & empty string / \texttt{""} / []\\
        & $\mid$ & $c$                  & single character\\
        & $\mid$ & $r_1 + r_2$          & alternative / choice\\
        & $\mid$ & $r_1 \cdot r_2$      & sequence\\
        & $\mid$ & $r^*$                & star (zero or more)\\
  \end{tabular}
\end{center}

\noindent If you want to show a property $P(r)$ for \emph{all} 
regular expressions $r$, then you have to follow essentially 
the recipe:

\begin{itemize}
\item $P$ has to hold for $\ZERO$, $\ONE$ and $c$
 (these are the base cases).
\item $P$ has to hold for $r_1 + r_2$ under the assumption 
   that $P$ already holds for $r_1$ and $r_2$.
\item $P$ has to hold for $r_1 \cdot r_2$ under the 
  assumption that $P$ already holds for $r_1$ and $r_2$.
\item $P$ has to hold for $r^*$ under the assumption 
  that $P$ already holds for $r$.
\end{itemize}

\noindent 
A simple proof is for example showing the following 
property:

\begin{equation}
\textit{nullable}(r) \;\;\text{if and only if}\;\; []\in L(r)
\label{nullableprop}
\end{equation}

\noindent
Let us say that this property is $P(r)$, then the first case
we need to check is whether $P(\ZERO)$ (see recipe 
above). So we have to show that

\[
\textit{nullable}(\ZERO) \;\;\text{if and only if}\;\; 
[]\in L(\ZERO)
\]

\noindent whereby $\textit{nullable}(\ZERO)$ is by definition of
the function $\textit{nullable}$ always $\textit{false}$. We also have
that $L(\ZERO)$ is by definition $\{\}$. It is
impossible that the empty string $[]$ is in the empty set.
Therefore also the right-hand side is false. Consequently we
verified this case: both sides are false. We would still need
to do this for $P(\ONE)$ and $P(c)$. I leave this to
you to verify.

Next we need to check the inductive cases, for example
$P(r_1 + r_2)$, which is

\begin{equation}
\textit{nullable}(r_1 + r_2) \;\;\text{if and only if}\;\; 
[]\in L(r_1 + r_2)
\label{propalt}
\end{equation}

\noindent The difference to the base cases is that in the inductive
cases we can already assume we proved $P$ for the components, that is
we can assume.

\begin{center}
\begin{tabular}{l}
$\textit{nullable}(r_1) \;\;\text{if and only if}\;\; []\in L(r_1)$ and\\
$\textit{nullable}(r_2) \;\;\text{if and only if}\;\; []\in L(r_2)$\\
\end{tabular}
\end{center}

\noindent These are called the induction hypotheses. To check this 
case, we can start from $\textit{nullable}(r_1 + r_2)$, which by 
definition of $\textit{nullable}$ is

\[
\textit{nullable}(r_1) \vee \textit{nullable}(r_2)
\]

\noindent Using the two induction hypotheses from above,
we can transform this into 

\[
[] \in L(r_1) \vee []\in L(r_2)
\]

\noindent We just replaced the $\textit{nullable}(\ldots)$ parts by
the equivalent $[] \in L(\ldots)$ from the induction 
hypotheses. A bit of thinking convinces you that if
$[] \in L(r_1) \vee []\in L(r_2)$ then the empty string
must be in the union $L(r_1)\cup L(r_2)$, that is

\[
[] \in L(r_1)\cup L(r_2)
\]

\noindent but this is by definition of $L$ exactly $[] \in L(r_1 +
r_2)$, which we needed to establish according to statement in
\eqref{propalt}. What we have shown is that starting from
$\textit{nullable}(r_1 + r_2)$ we have done equivalent transformations
to end up with $[] \in L(r_1 + r_2)$. Consequently we have established
that $P(r_1 + r_2)$ holds.

In order to complete the proof we would now need to look 
at the cases \mbox{$P(r_1\cdot r_2)$} and $P(r^*)$. Again I let you
check the details.

You might also have to do induction proofs over strings. 
That means you want to establish a property $P(s)$ for all
strings $s$. For this remember strings are lists of 
characters. These lists can be either the empty list or a
list of the form $c::s$. If you want to perform an induction
proof for strings you need to consider the cases

\begin{itemize}
\item $P$ has to hold for $[]$ (this is the base case).
\item $P$ has to hold for $c::s$ under the assumption 
   that $P$ already holds for $s$.
\end{itemize}

\noindent
Given this recipe, I let you show

\begin{equation}
\textit{Ders}\,s\,(L(r)) = L(\textit{ders}\,s\,r)
\label{dersprop}
\end{equation}

\noindent by induction on $s$. Recall $\textit{Der}$ is defined for 
character---see \eqref{Der}; $\textit{Ders}$ is similar, but for strings:

\[
\textit{Ders}\,s\,A\;\dn\;\{s'\,|\,s @ s' \in A\}
\]

\noindent In this proof you can assume the following property
for $der$ and $\textit{Der}$ has already been proved, that is you can
assume

\[
L(\textit{der}\,c\,r) = \textit{Der}\,c\,(L(r))
\]

\noindent holds (this would be of course another property that needs
to be proved in a side-lemma by induction on $r$). This is a bit
more challenging, but not impossible.

To sum up, using reasoning like the one shown above allows us 
to show the correctness of our algorithm. To see this,
start from the specification

\[
s \in L(r)
\]

\noindent That is the problem we want to solve. Thinking a 
little, you will see that this problem is equivalent to the 
following problem

\begin{equation}
[] \in \textit{Ders}\,s\,(L(r))
\label{dersstep}
\end{equation}

\noindent You agree?  But we have shown above in \eqref{dersprop},
that the $\textit{Ders}$ can be replaced by
$L(\textit{ders}\ldots)$. That means \eqref{dersstep} is equivalent to

\begin{equation}
[] \in L(\textit{ders}\,s\,r)
\label{prefinalstep}
\end{equation}

\noindent We have also shown that testing whether the empty
string is in a language is equivalent to the $\textit{nullable}$
function; see \eqref{nullableprop}. That means
\eqref{prefinalstep} is equivalent with
 
\[
\textit{nullable}(\textit{ders}\,s\,r)
\] 

\noindent But this is just the definition of $matcher$

\[
matcher\,s\,r \dn nullable(\textit{ders}\,s\,r)
\]

\noindent In effect we have shown

\[
matcher\,s\,r\;\;\text{if and only if}\;\;
s\in L(r)
\]

\noindent which is the property we set out to prove: our algorithm
meets its specification. To have done so, requires a few induction
proofs about strings and regular expressions. Following the
\emph{induction recipes} is already a big step in actually performing
these proofs.  If you do not believe it, proofs have helped me to make
sure my code is correct and in several instances prevented me of
letting slip embarrassing mistakes into the `wild'. In fact I have
found a number of mistakes in the brilliant work by Sulzmann and Lu,
which we are going to have a look at in Lecture 4. But in Lecture 3 we
should first find out why on earth are existing regular expression
matchers so abysmally slow. Are the people in Python, Ruby, Swift,
JavaScript, Java and also in Rust\footnote{Interestingly the regex
  engine in Rust says it guarantees a linear behaviour when deciding
  when a regular expression matches a
  string. \here{https://youtu.be/3N_ywhx6_K0?t=31}} just idiots? For
example could it possibly be that what we have implemented here in
Scala is faster than the regex engine that has been implemented in
Rust? See you next week\ldots

\end{document}



% !TeX program = latexmk -xelatex
%%% Local Variables: 
%%% mode: latex
%%% TeX-master: t
%%% End: