--- a/handouts/ho01.tex Mon Aug 22 23:05:43 2016 +0200
+++ b/handouts/ho01.tex Tue Aug 23 12:26:13 2016 +0200
@@ -24,7 +24,7 @@
%% emacs regexes
%% https://www.gnu.org/software/emacs/manual/html_node/elisp/Regular-Expressions.html
-%% reasons for a new prgramming language
+%% reasons for a new prgramming language
%% http://beautifulracket.com
\begin{document}
@@ -182,13 +182,17 @@
are many libraries implementing regular expressions. I am sure
you have come across them before (remember PRA?). Why on earth
then is there any interest in studying them again in depth in
-this module? Well, one answer is in the following graph about
-regular expression matching in Python and in Ruby.
+this module? Well, one answer is in the following two graphs about
+regular expression matching in Python, Ruby and Java.
\begin{center}
+\begin{tabular}{@{}cc@{}}
\begin{tikzpicture}
\begin{axis}[
- xlabel={strings of {\tt a}s},
+ title={Graph: $\texttt{a?\{n\}\,a{\{n\}}}$ and strings
+ $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
+ xlabel={$n$},
+ x label style={at={(1.05,0.0)}},
ylabel={time in secs},
enlargelimits=false,
xtick={0,5,...,30},
@@ -197,7 +201,7 @@
ytick={0,5,...,30},
scaled ticks=false,
axis lines=left,
- width=7cm,
+ width=5.5cm,
height=4.5cm,
legend entries={Python,Ruby},
legend pos=north west,
@@ -208,17 +212,43 @@
table {re-ruby.data};
\end{axis}
\end{tikzpicture}
+&
+\begin{tikzpicture}
+\begin{axis}[
+ title={Graph: $\texttt{(a*)*\,b}$ and strings
+ $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
+ xlabel={$n$},
+ x label style={at={(1.05,0.0)}},
+ ylabel={time in secs},
+ enlargelimits=false,
+ xtick={0,5,...,30},
+ xmax=33,
+ ymax=35,
+ ytick={0,5,...,30},
+ scaled ticks=false,
+ axis lines=left,
+ width=5.5cm,
+ height=4.5cm,
+ legend entries={Java},
+ legend pos=north west,
+ legend cell align=left]
+\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
+\end{axis}
+\end{tikzpicture}
+\end{tabular}
\end{center}
-\noindent This graph shows that Python needs approximately 29
+\noindent This first graph shows that Python needs approximately 29
seconds for finding out whether a string of 28 \texttt{a}s
-matches the regular expression \texttt{[a?]\{28\}[a]\{28\}}.
+matches the regular expression \texttt{a?\{28\}\,a\{28\}}.
Ruby is even slightly worse.\footnote{In this example Ruby
uses the slightly different regular expression
\texttt{a?a?a?...a?a?aaa...aa}, where the \texttt{a?} and
\texttt{a} each occur $n$ times. More such test cases can be
-found at \url{http://www.computerbytesman.com/redos/}.}
-Admittedly, this regular expression is carefully chosen to
+found at \url{http://www.computerbytesman.com/redos/}.} Simlarly, Java needs approximately
+30 seconds to find out that the regular expression $\texttt{(a*)*\,b}$
+does not match strings of 28 \texttt{a}s.
+Admittedly, these regular expressions are carefully chosen to
exhibit this exponential behaviour, but similar ones occur
more often than one wants in ``real life''. For example, on 20
July 2016 a similar regular expression brought the webpage
@@ -229,17 +259,17 @@
\end{center}
\noindent
-Such troublesome regular expressions are sometimes called
-\emph{evil regular expressions} because they have the
-potential to make regular expression matching engines to
-topple over, like in Python and Ruby. The problem with evil
-regular expressions is that they can have some serious
-consequences, for example, if you use them in your
-web-application. The reason is that hackers can look for these
-instances where the matching engine behaves badly and then
-mount a nice DoS-attack against your application. These
-attacks are already have their own name: \emph{Regular
-Expression Denial of Service Attacks (ReDoS)}.
+Such troublesome regular expressions are sometimes called \emph{evil
+ regular expressions} because they have the potential to make regular
+expression matching engines to topple over, like in Python, Ruby and
+Java. This `toopling over' is also sometimes called \emph{catastrophic
+ backtracking}. The problem with evil regular expressions is that
+they can have some serious consequences, for example, if you use them
+in your web-application. The reason is that hackers can look for these
+instances where the matching engine behaves badly and then mount a
+nice DoS-attack against your application. These attacks are already
+have their own name: \emph{Regular Expression Denial of Service
+ Attacks (ReDoS)}.
It will be instructive to look behind the ``scenes'' to find
out why Python and Ruby (and others) behave so badly when
@@ -254,23 +284,53 @@
\begin{center}
\begin{tikzpicture}
\begin{axis}[
- xlabel={strings of \pcode{a}s},
+ title={Graph: $\texttt{a?\{n\}\,a{\{n\}}}$ and strings
+ $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
+ xlabel={$n$},
+ x label style={at={(1.05,0.0)}},
ylabel={time in secs},
enlargelimits=false,
xtick={0,3000,...,12000},
- xmax=12500,
- ymax=35,
- ytick={0,5,...,30},
+ xmax=13000,
+ ymax=12,
+ ytick={0,5,10},
scaled ticks=false,
axis lines=left,
- width=9cm,
- height=4.5cm]
-\addplot[green,mark=square*,mark options={fill=white}] table {re2b.data};
+ width=7cm,
+ height=4.5cm,
+ legend entries={Our Algorithm V1, Our Algorithm V2},
+ legend pos=outer north east]
+\addplot[green,mark=square*,mark options={fill=white}] table {re2.data};
\addplot[black,mark=square*,mark options={fill=white}] table {re3.data};
\end{axis}
\end{tikzpicture}
\end{center}
+\noindent And in the case of the regular expression $\texttt{(a*)*\,b}$
+and strings of \texttt{a}s we will beat Java by factor of
+approximately 1,000,000 (note the scale on the $x$-axis).
+
+\begin{center}
+\begin{tikzpicture}
+ \begin{axis}[
+ title={Graph: $\texttt{(a*)*\,b}$ and strings
+ $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
+ xlabel={$n$},
+ x label style={at={(1.05,0.0)}},
+ ylabel={time in secs},
+ enlargelimits=false,
+ ymax=12,
+ ytick={0,5,10},
+ axis lines=left,
+ width=7cm,
+ height=4.5cm,
+ legend entries={Our Algorithm V2},
+ legend pos=outer north east]
+\addplot[black,mark=square*,mark options={fill=white}] table {re3a.data};
+\end{axis}
+\end{tikzpicture}
+\end{center}
+
\subsection*{Basic Regular Expressions}
The regular expressions shown earlier for Scala, we