diff -r 065ca01b62ae -r 4ae59fd3b174 handouts/ho01.tex --- a/handouts/ho01.tex Mon Aug 22 23:05:43 2016 +0200 +++ b/handouts/ho01.tex Tue Aug 23 12:26:13 2016 +0200 @@ -24,7 +24,7 @@ %% emacs regexes %% https://www.gnu.org/software/emacs/manual/html_node/elisp/Regular-Expressions.html -%% reasons for a new prgramming language +%% reasons for a new prgramming language %% http://beautifulracket.com \begin{document} @@ -182,13 +182,17 @@ are many libraries implementing regular expressions. I am sure you have come across them before (remember PRA?). Why on earth then is there any interest in studying them again in depth in -this module? Well, one answer is in the following graph about -regular expression matching in Python and in Ruby. +this module? Well, one answer is in the following two graphs about +regular expression matching in Python, Ruby and Java. \begin{center} +\begin{tabular}{@{}cc@{}} \begin{tikzpicture} \begin{axis}[ - xlabel={strings of {\tt a}s}, + title={Graph: $\texttt{a?\{n\}\,a{\{n\}}}$ and strings + $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$}, + xlabel={$n$}, + x label style={at={(1.05,0.0)}}, ylabel={time in secs}, enlargelimits=false, xtick={0,5,...,30}, @@ -197,7 +201,7 @@ ytick={0,5,...,30}, scaled ticks=false, axis lines=left, - width=7cm, + width=5.5cm, height=4.5cm, legend entries={Python,Ruby}, legend pos=north west, @@ -208,17 +212,43 @@ table {re-ruby.data}; \end{axis} \end{tikzpicture} +& +\begin{tikzpicture} +\begin{axis}[ + title={Graph: $\texttt{(a*)*\,b}$ and strings + $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$}, + xlabel={$n$}, + x label style={at={(1.05,0.0)}}, + ylabel={time in secs}, + enlargelimits=false, + xtick={0,5,...,30}, + xmax=33, + ymax=35, + ytick={0,5,...,30}, + scaled ticks=false, + axis lines=left, + width=5.5cm, + height=4.5cm, + legend entries={Java}, + legend pos=north west, + legend cell align=left] +\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data}; +\end{axis} +\end{tikzpicture} +\end{tabular} \end{center} -\noindent This graph shows that Python needs approximately 29 +\noindent This first graph shows that Python needs approximately 29 seconds for finding out whether a string of 28 \texttt{a}s -matches the regular expression \texttt{[a?]\{28\}[a]\{28\}}. +matches the regular expression \texttt{a?\{28\}\,a\{28\}}. Ruby is even slightly worse.\footnote{In this example Ruby uses the slightly different regular expression \texttt{a?a?a?...a?a?aaa...aa}, where the \texttt{a?} and \texttt{a} each occur $n$ times. More such test cases can be -found at \url{http://www.computerbytesman.com/redos/}.} -Admittedly, this regular expression is carefully chosen to +found at \url{http://www.computerbytesman.com/redos/}.} Simlarly, Java needs approximately +30 seconds to find out that the regular expression $\texttt{(a*)*\,b}$ +does not match strings of 28 \texttt{a}s. +Admittedly, these regular expressions are carefully chosen to exhibit this exponential behaviour, but similar ones occur more often than one wants in ``real life''. For example, on 20 July 2016 a similar regular expression brought the webpage @@ -229,17 +259,17 @@ \end{center} \noindent -Such troublesome regular expressions are sometimes called -\emph{evil regular expressions} because they have the -potential to make regular expression matching engines to -topple over, like in Python and Ruby. The problem with evil -regular expressions is that they can have some serious -consequences, for example, if you use them in your -web-application. The reason is that hackers can look for these -instances where the matching engine behaves badly and then -mount a nice DoS-attack against your application. These -attacks are already have their own name: \emph{Regular -Expression Denial of Service Attacks (ReDoS)}. +Such troublesome regular expressions are sometimes called \emph{evil + regular expressions} because they have the potential to make regular +expression matching engines to topple over, like in Python, Ruby and +Java. This `toopling over' is also sometimes called \emph{catastrophic + backtracking}. The problem with evil regular expressions is that +they can have some serious consequences, for example, if you use them +in your web-application. The reason is that hackers can look for these +instances where the matching engine behaves badly and then mount a +nice DoS-attack against your application. These attacks are already +have their own name: \emph{Regular Expression Denial of Service + Attacks (ReDoS)}. It will be instructive to look behind the ``scenes'' to find out why Python and Ruby (and others) behave so badly when @@ -254,23 +284,53 @@ \begin{center} \begin{tikzpicture} \begin{axis}[ - xlabel={strings of \pcode{a}s}, + title={Graph: $\texttt{a?\{n\}\,a{\{n\}}}$ and strings + $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$}, + xlabel={$n$}, + x label style={at={(1.05,0.0)}}, ylabel={time in secs}, enlargelimits=false, xtick={0,3000,...,12000}, - xmax=12500, - ymax=35, - ytick={0,5,...,30}, + xmax=13000, + ymax=12, + ytick={0,5,10}, scaled ticks=false, axis lines=left, - width=9cm, - height=4.5cm] -\addplot[green,mark=square*,mark options={fill=white}] table {re2b.data}; + width=7cm, + height=4.5cm, + legend entries={Our Algorithm V1, Our Algorithm V2}, + legend pos=outer north east] +\addplot[green,mark=square*,mark options={fill=white}] table {re2.data}; \addplot[black,mark=square*,mark options={fill=white}] table {re3.data}; \end{axis} \end{tikzpicture} \end{center} +\noindent And in the case of the regular expression $\texttt{(a*)*\,b}$ +and strings of \texttt{a}s we will beat Java by factor of +approximately 1,000,000 (note the scale on the $x$-axis). + +\begin{center} +\begin{tikzpicture} + \begin{axis}[ + title={Graph: $\texttt{(a*)*\,b}$ and strings + $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$}, + xlabel={$n$}, + x label style={at={(1.05,0.0)}}, + ylabel={time in secs}, + enlargelimits=false, + ymax=12, + ytick={0,5,10}, + axis lines=left, + width=7cm, + height=4.5cm, + legend entries={Our Algorithm V2}, + legend pos=outer north east] +\addplot[black,mark=square*,mark options={fill=white}] table {re3a.data}; +\end{axis} +\end{tikzpicture} +\end{center} + \subsection*{Basic Regular Expressions} The regular expressions shown earlier for Scala, we