% !TEX program = xelatex+ −
\documentclass{article}+ −
\usepackage{../style}+ −
\usepackage{../langs}+ −
\usepackage{../graphics}+ −
\usepackage{../data}+ −
+ −
+ −
\begin{document}+ −
\fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017, 2018, 2019}+ −
+ −
+ −
\section*{Handout 2 (Regular Expression Matching)}+ −
+ −
This lecture is about implementing a more efficient regular expression+ −
matcher (the plots on the right below)---more efficient than the+ −
matchers from regular expression libraries in Ruby, Python, JavaScript+ −
and Java (the plots on the left). For this consider some experimental+ −
data: The first pair of plots shows the running time for the+ −
regular expression $(a^*)^*\cdot b$ and strings composed of $n$+ −
\pcode{a}s (meaning this regular expression actually does not match+ −
the strings). The second pair of plots shows the running time for the+ −
regular expressions $a^?{}^{\{n\}}\cdot a^{\{n\}}$ and strings also+ −
composed of $n$ \pcode{a}s (this time the regular expressions match+ −
the strings). To see the substantial differences in the left and+ −
right plots below, note the different scales of the $x$-axes.+ −
+ −
+ −
\begin{center}+ −
Graphs: $(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$+ −
\begin{tabular}{@{}cc@{}}+ −
\begin{tikzpicture}[baseline=(current bounding box.north)]+ −
\begin{axis}[+ −
xlabel={$n$},+ −
x label style={at={(1.05,0.0)}},+ −
ylabel={time in secs},+ −
enlargelimits=false,+ −
xtick={0,5,...,30},+ −
xmax=33,+ −
ymax=35,+ −
ytick={0,5,...,30},+ −
scaled ticks=false,+ −
axis lines=left,+ −
width=5cm,+ −
height=5cm, + −
legend entries={Java 8, Python, JavaScript}, + −
legend pos=north west,+ −
legend cell align=left]+ −
\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};+ −
\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};+ −
\addplot[red,mark=*, mark options={fill=white}] table {re-js.data};+ −
\end{axis}+ −
\end{tikzpicture}+ −
&+ −
\begin{tikzpicture}[baseline=(current bounding box.north)]+ −
\begin{axis}[+ −
xlabel={$n$},+ −
x label style={at={(1.1,0.0)}},+ −
%%xtick={0,1000000,...,5000000}, + −
ylabel={time in secs},+ −
enlargelimits=false,+ −
ymax=35,+ −
ytick={0,5,...,30},+ −
axis lines=left,+ −
%scaled ticks=false,+ −
width=6.5cm,+ −
height=5cm,+ −
legend entries={Our matcher}, + −
legend pos=north east,+ −
legend cell align=left]+ −
%\addplot[green,mark=square*,mark options={fill=white}] table {re2a.data}; + −
\addplot[black,mark=square*,mark options={fill=white}] table {re3a.data};+ −
\end{axis}+ −
\end{tikzpicture}+ −
\end{tabular}+ −
\end{center}\bigskip+ −
+ −
\begin{center}+ −
Graphs: $a^{?\{n\}} \cdot a^{\{n\}}$ and strings $\underbrace{a\ldots a}_{n}$\\+ −
\begin{tabular}{@{}cc@{}}+ −
\begin{tikzpicture}+ −
\begin{axis}[+ −
xlabel={$n$},+ −
x label style={at={(1.05,0.0)}},+ −
ylabel={\small time in secs},+ −
enlargelimits=false,+ −
xtick={0,5,...,30},+ −
xmax=33,+ −
ymax=35,+ −
ytick={0,5,...,30},+ −
scaled ticks=false,+ −
axis lines=left,+ −
width=5cm,+ −
height=5cm, + −
legend entries={Python,Ruby}, + −
legend pos=north west,+ −
legend cell align=left]+ −
\addplot[blue,mark=*, mark options={fill=white}] table {re-python.data};+ −
\addplot[brown,mark=triangle*, mark options={fill=white}] table {re-ruby.data}; + −
\end{axis}+ −
\end{tikzpicture}+ −
&+ −
\begin{tikzpicture}+ −
\begin{axis}[+ −
xlabel={$n$},+ −
x label style={at={(1.1,0.05)}},+ −
ylabel={\small time in secs},+ −
enlargelimits=false,+ −
xtick={0,2500,...,11000},+ −
xmax=12000,+ −
ymax=35,+ −
ytick={0,5,...,30},+ −
scaled ticks=false,+ −
axis lines=left,+ −
width=6.5cm,+ −
height=5cm,+ −
legend entries={Our matcher}, + −
legend pos=north east,+ −
legend cell align=left]+ −
%\addplot[green,mark=square*,mark options={fill=white}] table {re2.data};+ −
\addplot[black,mark=square*,mark options={fill=white}] table {re3.data};+ −
\end{axis}+ −
\end{tikzpicture}+ −
\end{tabular}+ −
\end{center}+ −
\bigskip+ −
+ −
\noindent+ −
In what follows we will use these regular expressions and strings as+ −
running examples. There will be several versions (V1, V2, V3,\ldots)+ −
of our matcher.\footnote{The corresponding files are+ −
\texttt{re1.scala}, \texttt{re2.scala} and so on. As usual, you can+ −
find the code on KEATS.}\bigskip+ −
+ −
\noindent+ −
Having specified in the previous lecture what+ −
problem our regular expression matcher is supposed to solve,+ −
namely for any given regular expression $r$ and string $s$+ −
answer \textit{true} if and only if+ −
+ −
\[+ −
s \in L(r)+ −
\]+ −
+ −
\noindent we can look for an algorithm to solve this problem. Clearly+ −
we cannot use the function $L$ directly for this, because in general+ −
the set of strings $L$ returns is infinite (recall what $L(a^*)$ is).+ −
In such cases there is no way we can implement an exhaustive test for+ −
whether a string is member of this set or not. In contrast our+ −
matching algorithm will operate on the regular expression $r$ and+ −
string $s$, only, which are both finite objects. Before we explain + −
the matching algorithm, let us have a closer look at what it+ −
means when two regular expressions are equivalent.+ −
+ −
\subsection*{Regular Expression Equivalences}+ −
+ −
We already defined in Handout 1 what it means for two regular+ −
expressions to be equivalent, namely if their meaning is the+ −
same language:+ −
+ −
\[+ −
r_1 \equiv r_2 \;\dn\; L(r_1) = L(r_2)+ −
\]+ −
+ −
\noindent+ −
It is relatively easy to verify that some concrete equivalences+ −
hold, for example+ −
+ −
\begin{center}+ −
\begin{tabular}{rcl}+ −
$(a + b) + c$ & $\equiv$ & $a + (b + c)$\\+ −
$a + a$ & $\equiv$ & $a$\\+ −
$a + b$ & $\equiv$ & $b + a$\\+ −
$(a \cdot b) \cdot c$ & $\equiv$ & $a \cdot (b \cdot c)$\\+ −
$c \cdot (a + b)$ & $\equiv$ & $(c \cdot a) + (c \cdot b)$\\+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent+ −
but also easy to verify that the following regular expressions+ −
are \emph{not} equivalent+ −
+ −
\begin{center}+ −
\begin{tabular}{rcl}+ −
$a \cdot a$ & $\not\equiv$ & $a$\\+ −
$a + (b \cdot c)$ & $\not\equiv$ & $(a + b) \cdot (a + c)$\\+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent I leave it to you to verify these equivalences and+ −
non-equivalences. It is also interesting to look at some+ −
corner cases involving $\ONE$ and $\ZERO$:+ −
+ −
\begin{center}+ −
\begin{tabular}{rcl}+ −
$a \cdot \ZERO$ & $\not\equiv$ & $a$\\+ −
$a + \ONE$ & $\not\equiv$ & $a$\\+ −
$\ONE$ & $\equiv$ & $\ZERO^*$\\+ −
$\ONE^*$ & $\equiv$ & $\ONE$\\+ −
$\ZERO^*$ & $\not\equiv$ & $\ZERO$\\+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent Again I leave it to you to make sure you agree+ −
with these equivalences and non-equivalences. + −
+ −
+ −
For our matching algorithm however the following seven+ −
equivalences will play an important role:+ −
+ −
\begin{center}+ −
\begin{tabular}{rcl}+ −
$r + \ZERO$ & $\equiv$ & $r$\\+ −
$\ZERO + r$ & $\equiv$ & $r$\\+ −
$r \cdot \ONE$ & $\equiv$ & $r$\\+ −
$\ONE \cdot r$ & $\equiv$ & $r$\\+ −
$r \cdot \ZERO$ & $\equiv$ & $\ZERO$\\+ −
$\ZERO \cdot r$ & $\equiv$ & $\ZERO$\\+ −
$r + r$ & $\equiv$ & $r$+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent which always hold no matter what the regular expression $r$+ −
looks like. The first two are easy to verify since $L(\ZERO)$ is the+ −
empty set. The next two are also easy to verify since $L(\ONE) =+ −
\{[]\}$ and appending the empty string to every string of another set,+ −
leaves the set unchanged. Be careful to fully comprehend the fifth and+ −
sixth equivalence: if you concatenate two sets of strings and one is+ −
the empty set, then the concatenation will also be the empty set. To+ −
see this, check the definition of $\_ @ \_$ for sets. The last+ −
equivalence is again trivial.+ −
+ −
What will be important later on is that we can orient these+ −
equivalences and read them from left to right. In this way we+ −
can view them as \emph{simplification rules}. Consider for + −
example the regular expression + −
+ −
\begin{equation}+ −
(r_1 + \ZERO) \cdot \ONE + ((\ONE + r_2) + r_3) \cdot (r_4 \cdot \ZERO)+ −
\label{big}+ −
\end{equation}+ −
+ −
\noindent If we can find an equivalent regular expression that is+ −
simpler (that usually means smaller), then this might potentially make+ −
our matching algorithm run faster. We can look for such a simpler+ −
regular expression $r'$ because whether a string $s$ is in $L(r)$ or+ −
in $L(r')$ with $r\equiv r'$ will always give the same answer. Yes?+ −
+ −
In the example above you will see that the regular expression is+ −
equivalent to just $r_1$. You can verify this by iteratively applying+ −
the simplification rules from above:+ −
+ −
\begin{center}+ −
\begin{tabular}{ll}+ −
& $(r_1 + \ZERO) \cdot \ONE + ((\ONE + r_2) + r_3) \cdot + −
(\underline{r_4 \cdot \ZERO})$\smallskip\\+ −
$\equiv$ & $(r_1 + \ZERO) \cdot \ONE + \underline{((\ONE + r_2) + r_3) \cdot + −
\ZERO}$\smallskip\\+ −
$\equiv$ & $\underline{(r_1 + \ZERO) \cdot \ONE} + \ZERO$\smallskip\\+ −
$\equiv$ & $(\underline{r_1 + \ZERO}) + \ZERO$\smallskip\\+ −
$\equiv$ & $\underline{r_1 + \ZERO}$\smallskip\\+ −
$\equiv$ & $r_1$\+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent In each step, I underlined where a simplification+ −
rule is applied. Our matching algorithm in the next section+ −
will often generate such ``useless'' $\ONE$s and+ −
$\ZERO$s, therefore simplifying them away will make the+ −
algorithm quite a bit faster.+ −
+ −
Finally here are three equivalences between regular expressions which are+ −
not so obvious:+ −
+ −
\begin{center}+ −
\begin{tabular}{rcl}+ −
$r^*$ & $\equiv$ & $1 + r\cdot r^*$\\+ −
$(r_1 + r_2)^*$ & $\equiv$ & $r_1^* \cdot (r_2\cdot r_1^*)^*$\\+ −
$(r_1 \cdot r_2)^*$ & $\equiv$ & $1 + r_1\cdot (r_2 \cdot r_1)^* \cdot r_2$\\+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent+ −
We will not use them in our algorithm, but feel free to convince yourself+ −
that they hold. As an aside, there has been a lot of research about+ −
questions like: Can one always decide when two regular expressions are+ −
equivalent or not? What does an algorithm look like to decide this+ −
efficiently? So in general it is not a trivial problem.+ −
+ −
\subsection*{The Matching Algorithm}+ −
+ −
The algorithm we will define below consists of two parts. One+ −
is the function $\textit{nullable}$ which takes a regular expression as+ −
argument and decides whether it can match the empty string+ −
(this means it returns a boolean in Scala). This can be easily+ −
defined recursively as follows:+ −
+ −
\begin{center}+ −
\begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}}+ −
$\textit{nullable}(\ZERO)$ & $\dn$ & $\textit{false}$\\+ −
$\textit{nullable}(\ONE)$ & $\dn$ & $\textit{true}$\\+ −
$\textit{nullable}(c)$ & $\dn$ & $\textit{false}$\\+ −
$\textit{nullable}(r_1 + r_2)$ & $\dn$ & $\textit{nullable}(r_1) \vee \textit{nullable}(r_2)$\\ + −
$\textit{nullable}(r_1 \cdot r_2)$ & $\dn$ & $\textit{nullable}(r_1) \wedge \textit{nullable}(r_2)$\\+ −
$\textit{nullable}(r^*)$ & $\dn$ & $\textit{true}$ \\+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent The idea behind this function is that the following+ −
property holds:+ −
+ −
\[+ −
\textit{nullable}(r) \;\;\text{if and only if}\;\; []\in L(r)+ −
\]+ −
+ −
\noindent Note on the left-hand side of the if-and-only-if we+ −
have a function we can implement; on the right we have its+ −
specification (which we cannot implement in a programming+ −
language).+ −
+ −
The other function of our matching algorithm calculates a+ −
\emph{derivative} of a regular expression. This is a function+ −
which will take a regular expression, say $r$, and a+ −
character, say $c$, as arguments and returns a new regular+ −
expression. Be mindful that the intuition behind this function+ −
is not so easy to grasp on first reading. Essentially this+ −
function solves the following problem: if $r$ can match a+ −
string of the form $c\!::\!s$, what does a regular+ −
expression look like that can match just $s$? The definition+ −
of this function is as follows:+ −
+ −
\begin{center}+ −
\begin{tabular}{l@ {\hspace{2mm}}c@ {\hspace{2mm}}l}+ −
$\textit{der}\, c\, (\ZERO)$ & $\dn$ & $\ZERO$\\+ −
$\textit{der}\, c\, (\ONE)$ & $\dn$ & $\ZERO$ \\+ −
$\textit{der}\, c\, (d)$ & $\dn$ & if $c = d$ then $\ONE$ else $\ZERO$\\+ −
$\textit{der}\, c\, (r_1 + r_2)$ & $\dn$ & $\textit{der}\, c\, r_1 + \textit{der}\, c\, r_2$\\+ −
$\textit{der}\, c\, (r_1 \cdot r_2)$ & $\dn$ & if $\textit{nullable} (r_1)$\\+ −
& & then $(\textit{der}\,c\,r_1) \cdot r_2 + \textit{der}\, c\, r_2$\\ + −
& & else $(\textit{der}\, c\, r_1) \cdot r_2$\\+ −
$\textit{der}\, c\, (r^*)$ & $\dn$ & $(\textit{der}\,c\,r) \cdot (r^*)$+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent The first two clauses can be rationalised as+ −
follows: recall that $\textit{der}$ should calculate a regular+ −
expression so that given the ``input'' regular expression can+ −
match a string of the form $c\!::\!s$, we want a regular+ −
expression for $s$. Since neither $\ZERO$ nor $\ONE$+ −
can match a string of the form $c\!::\!s$, we return+ −
$\ZERO$. In the third case we have to make a+ −
case-distinction: In case the regular expression is $c$, then+ −
clearly it can recognise a string of the form $c\!::\!s$, just+ −
that $s$ is the empty string. Therefore we return the+ −
$\ONE$-regular expression. In the other case we again+ −
return $\ZERO$ since no string of the $c\!::\!s$ can be+ −
matched. Next come the recursive cases, which are a bit more+ −
involved. Fortunately, the $+$-case is still relatively+ −
straightforward: all strings of the form $c\!::\!s$ are either+ −
matched by the regular expression $r_1$ or $r_2$. So we just+ −
have to recursively call $\textit{der}$ with these two regular+ −
expressions and compose the results again with $+$. Makes+ −
sense? + −
+ −
The $\cdot$-case is more complicated: if $r_1\cdot r_2$+ −
matches a string of the form $c\!::\!s$, then the first part+ −
must be matched by $r_1$. Consequently, it makes sense to+ −
construct the regular expression for $s$ by calling $\textit{der}$ with+ −
$r_1$ and ``appending'' $r_2$. There is however one exception+ −
to this simple rule: if $r_1$ can match the empty string, then+ −
all of $c\!::\!s$ is matched by $r_2$. So in case $r_1$ is+ −
nullable (that is can match the empty string) we have to allow+ −
the choice $\textit{der}\,c\,r_2$ for calculating the regular+ −
expression that can match $s$. Therefore we have to add the+ −
regular expression $\textit{der}\,c\,r_2$ in the result. The $*$-case+ −
is again simple: if $r^*$ matches a string of the form+ −
$c\!::\!s$, then the first part must be ``matched'' by a+ −
single copy of $r$. Therefore we call recursively $\textit{der}\,c\,r$+ −
and ``append'' $r^*$ in order to match the rest of $s$. Still+ −
makes sense?+ −
+ −
If all this did not make sense yet, here is another way to explain the+ −
definition of $\textit{der}$ by considering the following operation on+ −
sets:+ −
+ −
\begin{equation}\label{Der}+ −
\textit{Der}\,c\,A\;\dn\;\{s\,|\,c\!::\!s \in A\}+ −
\end{equation}+ −
+ −
\noindent This operation essentially transforms a set of+ −
strings $A$ by filtering out all strings that do not start+ −
with $c$ and then strips off the $c$ from all the remaining+ −
strings. For example suppose $A = \{f\!oo, bar, f\!rak\}$ then+ −
+ −
\[ \textit{Der}\,f\,A = \{oo, rak\}\quad,\quad + −
\textit{Der}\,b\,A = \{ar\} \quad \text{and} \quad + −
\textit{Der}\,a\,A = \{\} + −
\]+ −
+ −
\noindent+ −
Note that in the last case $\textit{Der}$ is empty, because no string in $A$+ −
starts with $a$. With this operation we can state the following+ −
property about $\textit{der}$:+ −
+ −
\[+ −
L(\textit{der}\,c\,r) = \textit{Der}\,c\,(L(r))+ −
\]+ −
+ −
\noindent+ −
This property clarifies what regular expression $\textit{der}$ calculates,+ −
namely take the set of strings that $r$ can match (that is $L(r)$),+ −
filter out all strings not starting with $c$ and strip off the $c$+ −
from the remaining strings---this is exactly the language that+ −
$\textit{der}\,c\,r$ can match.+ −
+ −
If we want to find out whether the string $abc$ is matched by+ −
the regular expression $r_1$ then we can iteratively apply $\textit{der}$+ −
as follows+ −
+ −
\begin{center}+ −
\begin{tabular}{rll}+ −
Input: $r_1$, $abc$\medskip\\+ −
Step 1: & build derivative of $a$ and $r_1$ & $(r_2 = \textit{der}\,a\,r_1)$\smallskip\\+ −
Step 2: & build derivative of $b$ and $r_2$ & $(r_3 = \textit{der}\,b\,r_2)$\smallskip\\+ −
Step 3: & build derivative of $c$ and $r_3$ & $(r_4 = \textit{der}\,c\,r_3)$\smallskip\\+ −
Step 4: & the string is exhausted: & $(\textit{nullable}(r_4))$\\+ −
& test whether $r_4$ can recognise the\\+ −
& empty string\smallskip\\+ −
Output: & result of this test $\Rightarrow \textit{true} \,\text{or}\, \textit{false}$\\ + −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent Again the operation $\textit{Der}$ might help to rationalise+ −
this algorithm. We want to know whether $abc \in L(r_1)$. We+ −
do not know yet---but let us assume it is. Then $\textit{Der}\,a\,L(r_1)$+ −
builds the set where all the strings not starting with $a$ are+ −
filtered out. Of the remaining strings, the $a$ is stripped+ −
off. So we should still have $bc$ in the set.+ −
Then we continue with filtering out all strings not+ −
starting with $b$ and stripping off the $b$ from the remaining+ −
strings, that means we build $\textit{Der}\,b\,(\textit{Der}\,a\,(L(r_1)))$.+ −
Finally we filter out all strings not starting with $c$ and+ −
strip off $c$ from the remaining string. This is+ −
$\textit{Der}\,c\,(\textit{Der}\,b\,(\textit{Der}\,a\,(L(r_1))))$. Now if $abc$ was in the + −
original set ($L(r_1)$), then $\textit{Der}\,c\,(\textit{Der}\,b\,(\textit{Der}\,a\,(L(r_1))))$ + −
must contain the empty string. If not, then $abc$ was not in the + −
language we started with.+ −
+ −
Our matching algorithm using $\textit{der}$ and $\textit{nullable}$ works+ −
similarly, just using regular expressions instead of sets. In order to+ −
define our algorithm we need to extend the notion of derivatives from single+ −
characters to strings. This can be done using the following+ −
function, taking a string and a regular expression as input and+ −
a regular expression as output.+ −
+ −
\begin{center}+ −
\begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {\hspace{-10mm}}l@ {}}+ −
$\textit{ders}\, []\, r$ & $\dn$ & $r$ & \\+ −
$\textit{ders}\, (c\!::\!s)\, r$ & $\dn$ & $\textit{ders}\,s\,(\textit{der}\,c\,r)$ & \\+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent This function iterates $\textit{der}$ taking one character at+ −
the time from the original string until the string is exhausted.+ −
Having $\textit{der}s$ in place, we can finally define our matching+ −
algorithm:+ −
+ −
\[+ −
\textit{matches}\,r\,s \dn \textit{nullable}(\textit{ders}\,s\,r)+ −
\]+ −
+ −
\noindent+ −
and we can claim that+ −
+ −
\[+ −
\textit{matches}\,r\,s\quad\text{if and only if}\quad s\in L(r)+ −
\]+ −
+ −
\noindent holds, which means our algorithm satisfies the+ −
specification. Of course we can claim many things\ldots+ −
whether the claim holds any water is a different question,+ −
which for example is the point of the Strand-2 Coursework.+ −
+ −
This algorithm was introduced by Janusz Brzozowski in 1964, but + −
is more widely known only in the last 10 or so years. Its+ −
main attractions are simplicity and being fast, as well as+ −
being easily extendible for other regular expressions such as+ −
$r^{\{n\}}$, $r^?$, $\sim{}r$ and so on (this is subject of+ −
Strand-1 Coursework 1). + −
+ −
\subsection*{The Matching Algorithm in Scala}+ −
+ −
Another attraction of the algorithm is that it can be easily+ −
implemented in a functional programming language, like Scala.+ −
Given the implementation of regular expressions in Scala shown+ −
in the first lecture and handout, the functions and subfunctions+ −
for \pcode{matches} are shown in Figure~\ref{scala1}.+ −
+ −
\begin{figure}[p]+ −
\lstinputlisting[numbers=left,linebackgroundcolor=+ −
{\ifodd\value{lstnumber}\color{capri!3}\fi}]+ −
{../progs/app5.scala}+ −
\caption{A Scala implementation of \textit{nullable} and + −
derivative function. These functions are easy to+ −
implement in functional programming languages. This is because pattern + −
matching and recursion allow us to mimic the mathematical+ −
definitions very closely. Nearly all functional+ −
programming languages support pattern matching and+ −
recursion out of the box.\label{scala1}}+ −
\end{figure}+ −
+ −
+ −
%Remember our second example involving the regular expression+ −
%$(a^*)^* \cdot b$ which could not match strings of $n$ \texttt{a}s. + −
%Java needed around 30 seconds to find this out a string with $n=28$.+ −
%It seems our algorithm is doing rather well in comparison:+ −
%+ −
%\begin{center}+ −
%\begin{tikzpicture}+ −
%\begin{axis}[+ −
% title={Graph: $(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$},+ −
% xlabel={$n$},+ −
% x label style={at={(1.05,0.0)}},+ −
% ylabel={time in secs},+ −
% enlargelimits=false,+ −
% xtick={0,1000,...,6500},+ −
% xmax=6800,+ −
% ytick={0,5,...,30},+ −
% ymax=34,+ −
% scaled ticks=false,+ −
% axis lines=left,+ −
% width=8cm,+ −
% height=4.5cm, + −
% legend entries={Java,Scala V1}, + −
% legend pos=north east,+ −
% legend cell align=left]+ −
%\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};+ −
%\addplot[red,mark=triangle*,mark options={fill=white}] table {re1a.data};+ −
%\end{axis}+ −
%\end{tikzpicture}+ −
%\end{center}+ −
%+ −
%\noindent+ −
%This is not an error: it hardly takes more than half a second for+ −
%strings up to the length of 6500. After that we receive a+ −
%StackOverflow exception, but still\ldots+ −
+ −
For running the algorithm with our first example, the evil+ −
regular expression $a^?{}^{\{n\}}\cdot a^{\{n\}}$, we need to implement+ −
the optional regular expression and the `exactly $n$-times+ −
regular expression'. This can be done with the translations+ −
+ −
\lstinputlisting[numbers=none]{../progs/app51.scala}+ −
+ −
\noindent Running the matcher with this example, we find it is+ −
slightly worse then the matcher in Ruby and Python.+ −
Ooops\ldots+ −
+ −
\begin{center}+ −
\begin{tikzpicture}+ −
\begin{axis}[ + −
title={Graph: $a^{?\{n\}} \cdot a^{\{n\}}$ and strings $\underbrace{a\ldots a}_{n}$},+ −
xlabel={$n$},+ −
x label style={at={(1.05,0.0)}},+ −
ylabel={time in secs},+ −
enlargelimits=false,+ −
xtick={0,5,...,30},+ −
xmax=32,+ −
ytick={0,5,...,30},+ −
scaled ticks=false,+ −
axis lines=left,+ −
width=6cm,+ −
height=5cm, + −
legend entries={Python,Ruby,Scala V1}, + −
legend pos=outer north east,+ −
legend cell align=left]+ −
\addplot[blue,mark=*, mark options={fill=white}] table {re-python.data};+ −
\addplot[brown,mark=pentagon*, mark options={fill=white}] table {re-ruby.data}; + −
\addplot[red,mark=triangle*,mark options={fill=white}] table {re1.data}; + −
\end{axis}+ −
\end{tikzpicture}+ −
\end{center}+ −
+ −
\noindent Analysing this failure we notice that for $a^{\{n\}}$, for+ −
example, we generate quite big regular expressions:+ −
+ −
\begin{center}+ −
\begin{tabular}{rl}+ −
1: & $a$\\+ −
2: & $a\cdot a$\\+ −
3: & $a\cdot a\cdot a$\\+ −
& \ldots\\+ −
13: & $a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a$\\+ −
& \ldots+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent Our algorithm traverses such regular expressions at+ −
least once every time a derivative is calculated. So having+ −
large regular expressions will cause problems. This problem+ −
is aggravated by $a^?$ being represented as $a + \ONE$.+ −
+ −
We can however fix this easily by having an explicit constructor for+ −
$r^{\{n\}}$. In Scala we would introduce a constructor like+ −
+ −
\begin{center}+ −
\code{case class NTIMES(r: Rexp, n: Int) extends Rexp}+ −
\end{center}+ −
+ −
\noindent With this fix we have a constant ``size'' regular expression+ −
for our running example no matter how large $n$ is (see the+ −
\texttt{size} section in the implementations). This means we have to+ −
also add cases for \pcode{NTIMES} in the functions $\textit{nullable}$+ −
and $\textit{der}$. Does the change have any effect?+ −
+ −
\begin{center}+ −
\begin{tikzpicture}+ −
\begin{axis}[+ −
title={Graph: $a^{?\{n\}} \cdot a^{\{n\}}$ and strings $\underbrace{a\ldots a}_{n}$},+ −
xlabel={$n$},+ −
x label style={at={(1.01,0.0)}},+ −
ylabel={time in secs},+ −
enlargelimits=false,+ −
xtick={0,200,...,1100},+ −
xmax=1200,+ −
ytick={0,5,...,30},+ −
scaled ticks=false,+ −
axis lines=left,+ −
width=10cm,+ −
height=5cm, + −
legend entries={Python,Ruby,Scala V1,Scala V2}, + −
legend pos=outer north east,+ −
legend cell align=left]+ −
\addplot[blue,mark=*, mark options={fill=white}] table {re-python.data};+ −
\addplot[brown,mark=pentagon*, mark options={fill=white}] table {re-ruby.data}; + −
\addplot[red,mark=triangle*,mark options={fill=white}] table {re1.data}; + −
\addplot[green,mark=square*,mark options={fill=white}] table {re2.data};+ −
\end{axis}+ −
\end{tikzpicture}+ −
\end{center}+ −
+ −
\noindent Now we are talking business! The modified matcher can within+ −
25 seconds handle regular expressions up to $n = 1,100$ before a+ −
StackOverflow is raised. Recall that Python and Ruby (and our first+ −
version, Scala V1) could only handle $n = 27$ or so in 30+ −
seconds. We have not tried our algorithm on the second example $(a^*)^* \cdot+ −
b$---I leave this to you.+ −
+ −
+ −
The moral is that our algorithm is rather sensitive to the+ −
size of regular expressions it needs to handle. This is of+ −
course obvious because both $\textit{nullable}$ and $\textit{der}$ frequently+ −
need to traverse the whole regular expression. There seems,+ −
however, one more issue for making the algorithm run faster.+ −
The derivative function often produces ``useless''+ −
$\ZERO$s and $\ONE$s. To see this, consider $r = ((a+ −
\cdot b) + b)^*$ and the following three derivatives+ −
+ −
\begin{center}+ −
\begin{tabular}{l}+ −
$\textit{der}\,a\,r = ((\ONE \cdot b) + \ZERO) \cdot r$\\+ −
$\textit{der}\,b\,r = ((\ZERO \cdot b) + \ONE)\cdot r$\\+ −
$\textit{der}\,c\,r = ((\ZERO \cdot b) + \ZERO)\cdot r$+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent + −
If we simplify them according to the simplification rules from the+ −
beginning, we can replace the right-hand sides by the smaller+ −
equivalent regular expressions+ −
+ −
\begin{center}+ −
\begin{tabular}{l}+ −
$\textit{der}\,a\,r \equiv b \cdot r$\\+ −
$\textit{der}\,b\,r \equiv r$\\+ −
$\textit{der}\,c\,r \equiv \ZERO$+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent I leave it to you to contemplate whether such a+ −
simplification can have any impact on the correctness of our algorithm+ −
(will it change any answers?). Figure~\ref{scala2} gives a+ −
simplification function that recursively traverses a regular+ −
expression and simplifies it according to the rules given at the+ −
beginning. There are only rules for $+$ and $\cdot$. There is + −
no simplification rule for a star, because+ −
empirical data and also a little thought showed that simplifying under+ −
a star is a waste of computation time. The simplification function+ −
will be called after every derivation. This additional step removes+ −
all the ``junk'' the derivative function introduced. Does this improve+ −
the speed? You bet!!+ −
+ −
\begin{figure}[p]+ −
\lstinputlisting[numbers=left,linebackgroundcolor=+ −
{\ifodd\value{lstnumber}\color{capri!3}\fi}]+ −
{../progs/app6.scala}+ −
\caption{The simplification function and modified + −
\texttt{ders}-function; this function now+ −
calls \texttt{der} first, but then simplifies+ −
the resulting derivative regular expressions before+ −
building the next derivative, see+ −
Line~24.\label{scala2}}+ −
\end{figure}+ −
+ −
\begin{center}+ −
\begin{tikzpicture}+ −
\begin{axis}[+ −
title={Graph: $a^{?\{n\}} \cdot a^{\{n\}}$ and strings $\underbrace{a\ldots a}_{n}$},+ −
xlabel={$n$},+ −
x label style={at={(1.04,0.0)}},+ −
ylabel={time in secs},+ −
enlargelimits=false,+ −
xtick={0,2500,...,10000},+ −
xmax=12000,+ −
ytick={0,5,...,30},+ −
ymax=32,+ −
scaled ticks=false,+ −
axis lines=left,+ −
width=9cm,+ −
height=5cm, + −
legend entries={Scala V2,Scala V3},+ −
legend pos=outer north east,+ −
legend cell align=left]+ −
\addplot[green,mark=square*,mark options={fill=white}] table {re2.data};+ −
\addplot[black,mark=square*,mark options={fill=white}] table {re3.data};+ −
\end{axis}+ −
\end{tikzpicture}+ −
\end{center}+ −
+ −
\noindent+ −
To recap, Python and Ruby needed approximately 30 seconds to match a+ −
string of 28 \texttt{a}s and the regular expression $a^{?\{n\}} \cdot+ −
a^{\{n\}}$. We need a third of this time to do the same with strings+ −
up to 11,000 \texttt{a}s. Similarly, Java 8 and Python needed 30+ −
seconds to find out the regular expression $(a^*)^* \cdot b$ does not+ −
match the string of 28 \texttt{a}s. In Java 9 and later this has been + −
cranked up to 39,000 \texttt{a}s, but we can do the same in the same + −
amount of time for strings composed of nearly 6,000,000 \texttt{a}s. + −
This is shown in the following plot.+ −
+ −
+ −
\begin{center}+ −
\begin{tikzpicture}+ −
\begin{axis}[+ −
title={Graph: $(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$},+ −
xlabel={$n$},+ −
ylabel={time in secs},+ −
enlargelimits=false,+ −
ymax=35,+ −
ytick={0,5,...,30},+ −
axis lines=left,+ −
%%scaled ticks=false,+ −
x label style={at={(1.09,0.0)}},+ −
%%xmax=7700000,+ −
width=9cm,+ −
height=5cm, + −
legend entries={Scala V3},+ −
legend pos=outer north east,+ −
legend cell align=left]+ −
%\addplot[green,mark=square*,mark options={fill=white}] table {re2a.data};+ −
\addplot[black,mark=square*,mark options={fill=white}] table {re3a.data};+ −
\end{axis}+ −
\end{tikzpicture}+ −
\end{center}+ −
+ −
\subsection*{Epilogue}+ −
+ −
(23/Aug/2016) I found another place where this algorithm can+ −
be sped up (this idea is not integrated with what is coming next, but+ −
I present it nonetheless). The idea is to not define \texttt{ders}+ −
that it iterates the derivative character-by-character, but in bigger+ −
chunks. The resulting code for \texttt{ders2} looks as follows:+ −
+ −
\lstinputlisting[numbers=none]{../progs/app52.scala} + −
+ −
\noindent+ −
I have not fully understood why this version is much faster,+ −
but it seems it is a combination of the clauses for \texttt{ALT}+ −
and \texttt{SEQ}. In the latter case we call \texttt{der} with + −
a single character and this potentially produces an alternative.+ −
The derivative of such an alternative can then be more efficiently + −
calculated by \texttt{ders2} since it pushes a whole string+ −
under an \texttt{ALT}. The numbers are that in the second case + −
$(a^*)^* \cdot b$ both versions are pretty much the same, but in the + −
first case $a^{?\{n\}} \cdot a^{\{n\}}$ the improvement gives + −
another factor of 100 speedup. Nice!+ −
+ −
\begin{center}+ −
\begin{tabular}{cc}+ −
\begin{tikzpicture}+ −
\begin{axis}[+ −
title={Graph: $a^{?\{n\}} \cdot a^{\{n\}}$ and strings $\underbrace{a\ldots a}_{n}$},+ −
xlabel={$n$},+ −
x label style={at={(1.04,0.0)}},+ −
ylabel={time in secs},+ −
enlargelimits=false,+ −
xmax=7100000,+ −
ytick={0,5,...,30},+ −
ymax=33,+ −
%scaled ticks=false,+ −
axis lines=left,+ −
width=5.3cm,+ −
height=5cm, + −
legend entries={Scala V3, Scala V4},+ −
legend style={at={(0.1,-0.2)},anchor=north}]+ −
\addplot[black,mark=square*,mark options={fill=white}] table {re3.data};+ −
\addplot[purple,mark=square*,mark options={fill=white}] table {re4.data};+ −
\end{axis}+ −
\end{tikzpicture}+ −
&+ −
\begin{tikzpicture}+ −
\begin{axis}[+ −
title={Graph: $(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$},+ −
xlabel={$n$},+ −
x label style={at={(1.09,0.0)}},+ −
ylabel={time in secs},+ −
enlargelimits=false,+ −
xmax=8200000,+ −
ytick={0,5,...,30},+ −
ymax=33,+ −
%scaled ticks=false,+ −
axis lines=left,+ −
width=5.3cm,+ −
height=5cm, + −
legend entries={Scala V3, Scala V4},+ −
legend style={at={(0.1,-0.2)},anchor=north}]+ −
\addplot[black,mark=square*,mark options={fill=white}] table {re3a.data};+ −
\addplot[purple,mark=square*,mark options={fill=white}] table {re4a.data};+ −
\end{axis}+ −
\end{tikzpicture}+ −
\end{tabular}+ −
\end{center}+ −
+ −
+ −
\section*{Proofs}+ −
+ −
You might not like doing proofs. But they serve a very+ −
important purpose in Computer Science: How can we be sure that+ −
our algorithm matches its specification? We can try to test+ −
the algorithm, but that often overlooks corner cases and an+ −
exhaustive testing is impossible (since there are infinitely+ −
many inputs). Proofs allow us to ensure that an algorithm+ −
really meets its specification. + −
+ −
For the programs we look at in this module, the proofs will+ −
mostly by some form of induction. Remember that regular+ −
expressions are defined as + −
+ −
\begin{center}+ −
\begin{tabular}{r@{\hspace{1mm}}r@{\hspace{1mm}}l@{\hspace{13mm}}l}+ −
$r$ & $::=$ & $\ZERO$ & nothing\\+ −
& $\mid$ & $\ONE$ & empty string / \texttt{""} / []\\+ −
& $\mid$ & $c$ & single character\\+ −
& $\mid$ & $r_1 + r_2$ & alternative / choice\\+ −
& $\mid$ & $r_1 \cdot r_2$ & sequence\\+ −
& $\mid$ & $r^*$ & star (zero or more)\\+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent If you want to show a property $P(r)$ for \emph{all} + −
regular expressions $r$, then you have to follow essentially + −
the recipe:+ −
+ −
\begin{itemize}+ −
\item $P$ has to hold for $\ZERO$, $\ONE$ and $c$+ −
(these are the base cases).+ −
\item $P$ has to hold for $r_1 + r_2$ under the assumption + −
that $P$ already holds for $r_1$ and $r_2$.+ −
\item $P$ has to hold for $r_1 \cdot r_2$ under the + −
assumption that $P$ already holds for $r_1$ and $r_2$.+ −
\item $P$ has to hold for $r^*$ under the assumption + −
that $P$ already holds for $r$.+ −
\end{itemize}+ −
+ −
\noindent + −
A simple proof is for example showing the following + −
property:+ −
+ −
\begin{equation}+ −
\textit{nullable}(r) \;\;\text{if and only if}\;\; []\in L(r)+ −
\label{nullableprop}+ −
\end{equation}+ −
+ −
\noindent+ −
Let us say that this property is $P(r)$, then the first case+ −
we need to check is whether $P(\ZERO)$ (see recipe + −
above). So we have to show that+ −
+ −
\[+ −
\textit{nullable}(\ZERO) \;\;\text{if and only if}\;\; + −
[]\in L(\ZERO)+ −
\]+ −
+ −
\noindent whereby $\textit{nullable}(\ZERO)$ is by definition of+ −
the function $\textit{nullable}$ always $\textit{false}$. We also have+ −
that $L(\ZERO)$ is by definition $\{\}$. It is+ −
impossible that the empty string $[]$ is in the empty set.+ −
Therefore also the right-hand side is false. Consequently we+ −
verified this case: both sides are false. We would still need+ −
to do this for $P(\ONE)$ and $P(c)$. I leave this to+ −
you to verify.+ −
+ −
Next we need to check the inductive cases, for example+ −
$P(r_1 + r_2)$, which is+ −
+ −
\begin{equation}+ −
\textit{nullable}(r_1 + r_2) \;\;\text{if and only if}\;\; + −
[]\in L(r_1 + r_2)+ −
\label{propalt}+ −
\end{equation}+ −
+ −
\noindent The difference to the base cases is that in the inductive+ −
cases we can already assume we proved $P$ for the components, that is+ −
we can assume.+ −
+ −
\begin{center}+ −
\begin{tabular}{l}+ −
$\textit{nullable}(r_1) \;\;\text{if and only if}\;\; []\in L(r_1)$ and\\+ −
$\textit{nullable}(r_2) \;\;\text{if and only if}\;\; []\in L(r_2)$\\+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent These are called the induction hypotheses. To check this + −
case, we can start from $\textit{nullable}(r_1 + r_2)$, which by + −
definition of $\textit{nullable}$ is+ −
+ −
\[+ −
\textit{nullable}(r_1) \vee \textit{nullable}(r_2)+ −
\]+ −
+ −
\noindent Using the two induction hypotheses from above,+ −
we can transform this into + −
+ −
\[+ −
[] \in L(r_1) \vee []\in(r_2)+ −
\]+ −
+ −
\noindent We just replaced the $\textit{nullable}(\ldots)$ parts by+ −
the equivalent $[] \in L(\ldots)$ from the induction + −
hypotheses. A bit of thinking convinces you that if+ −
$[] \in L(r_1) \vee []\in L(r_2)$ then the empty string+ −
must be in the union $L(r_1)\cup L(r_2)$, that is+ −
+ −
\[+ −
[] \in L(r_1)\cup L(r_2)+ −
\]+ −
+ −
\noindent but this is by definition of $L$ exactly $[] \in L(r_1 ++ −
r_2)$, which we needed to establish according to statement in+ −
\eqref{propalt}. What we have shown is that starting from+ −
$\textit{nullable}(r_1 + r_2)$ we have done equivalent transformations+ −
to end up with $[] \in L(r_1 + r_2)$. Consequently we have established+ −
that $P(r_1 + r_2)$ holds.+ −
+ −
In order to complete the proof we would now need to look + −
at the cases \mbox{$P(r_1\cdot r_2)$} and $P(r^*)$. Again I let you+ −
check the details.+ −
+ −
You might also have to do induction proofs over strings. + −
That means you want to establish a property $P(s)$ for all+ −
strings $s$. For this remember strings are lists of + −
characters. These lists can be either the empty list or a+ −
list of the form $c::s$. If you want to perform an induction+ −
proof for strings you need to consider the cases+ −
+ −
\begin{itemize}+ −
\item $P$ has to hold for $[]$ (this is the base case).+ −
\item $P$ has to hold for $c::s$ under the assumption + −
that $P$ already holds for $s$.+ −
\end{itemize}+ −
+ −
\noindent+ −
Given this recipe, I let you show+ −
+ −
\begin{equation}+ −
\textit{Ders}\,s\,(L(r)) = L(\textit{ders}\,s\,r)+ −
\label{dersprop}+ −
\end{equation}+ −
+ −
\noindent by induction on $s$. Recall $\textit{Der}$ is defined for + −
character---see \eqref{Der}; $\textit{Ders}$ is similar, but for strings:+ −
+ −
\[+ −
\textit{Ders}\,s\,A\;\dn\;\{s'\,|\,s @ s' \in A\}+ −
\]+ −
+ −
\noindent In this proof you can assume the following property+ −
for $der$ and $\textit{Der}$ has already been proved, that is you can+ −
assume+ −
+ −
\[+ −
L(\textit{der}\,c\,r) = \textit{Der}\,c\,(L(r))+ −
\]+ −
+ −
\noindent holds (this would be of course another property that needs+ −
to be proved in a side-lemma by induction on $r$). This is a bit+ −
more challenging, but not impossible.+ −
+ −
To sum up, using reasoning like the one shown above allows us + −
to show the correctness of our algorithm. To see this,+ −
start from the specification+ −
+ −
\[+ −
s \in L(r)+ −
\]+ −
+ −
\noindent That is the problem we want to solve. Thinking a + −
little, you will see that this problem is equivalent to the + −
following problem+ −
+ −
\begin{equation}+ −
[] \in \textit{Ders}\,s\,(L(r))+ −
\label{dersstep}+ −
\end{equation}+ −
+ −
\noindent You agree? But we have shown above in \eqref{dersprop},+ −
that the $\textit{Ders}$ can be replaced by+ −
$L(\textit{ders}\ldots)$. That means \eqref{dersstep} is equivalent to+ −
+ −
\begin{equation}+ −
[] \in L(\textit{ders}\,s\,r)+ −
\label{prefinalstep}+ −
\end{equation}+ −
+ −
\noindent We have also shown that testing whether the empty+ −
string is in a language is equivalent to the $\textit{nullable}$+ −
function; see \eqref{nullableprop}. That means+ −
\eqref{prefinalstep} is equivalent with+ −
+ −
\[+ −
\textit{nullable}(\textit{ders}\,s\,r)+ −
\] + −
+ −
\noindent But this is just the definition of $matches$+ −
+ −
\[+ −
matches\,s\,r \dn nullable(\textit{ders}\,s\,r)+ −
\]+ −
+ −
\noindent In effect we have shown+ −
+ −
\[+ −
matches\,s\,r\;\;\text{if and only if}\;\;+ −
s\in L(r)+ −
\]+ −
+ −
\noindent which is the property we set out to prove: our algorithm+ −
meets its specification. To have done so, requires a few induction+ −
proofs about strings and regular expressions. Following the \emph{induction+ −
recipes} is already a big step in actually performing these proofs.+ −
If you do not believe it, proofs have helped me to make sure my code+ −
is correct and in several instances prevented me of letting slip+ −
embarrassing mistakes into the `wild'. + −
+ −
\end{document}+ −
+ −
+ −
+ −
% !TeX program = latexmk -xelatex+ −
%%% Local Variables: + −
%%% mode: latex+ −
%%% TeX-master: t+ −
%%% End: + −