afl-material: comparison handouts/ho01.tex

equal deleted inserted replaced

-:065ca01b62ae
+:4ae59fd3b174
 % http://regexr.com
 %% emacs regexes
 %% https://www.gnu.org/software/emacs/manual/html_node/elisp/Regular-Expressions.html
-%% reasons for a new prgramming language
+%%  reasons for a new prgramming language
 %% http://beautifulracket.com
 \begin{document}
 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016}
 and they have been object of intense study since then. They
 are nowadays pretty much ubiquitous in computer science. There
 are many libraries implementing regular expressions. I am sure
 you have come across them before (remember PRA?). Why on earth
 then is there any interest in studying them again in depth in
-this module? Well, one answer is in the following graph about
+this module? Well, one answer is in the following two graphs about
-regular expression matching in Python and in Ruby.
+regular expression matching in Python, Ruby and Java.
 \begin{center}
+\begin{tabular}{@{}cc@{}}
 \begin{tikzpicture}
 \begin{axis}[
-xlabel={strings of {\tt a}s},
+title={Graph: $\texttt{a?\{n\}\,a{\{n\}}}$ and strings
+$\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
+xlabel={$n$},
+x label style={at={(1.05,0.0)}},
 ylabel={time in secs},
 enlargelimits=false,
 xtick={0,5,...,30},
 xmax=33,
 ymax=35,
 ytick={0,5,...,30},
 scaled ticks=false,
 axis lines=left,
-width=7cm,
+width=5.5cm,
 height=4.5cm,
 legend entries={Python,Ruby},
 legend pos=north west,
 legend cell align=left]
 \addplot[blue,mark=*, mark options={fill=white}]
 table {re-python.data};
 \addplot[brown,mark=triangle*, mark options={fill=white}]
 table {re-ruby.data};
 \end{axis}
 \end{tikzpicture}
-\end{center}
+&
+\begin{tikzpicture}
-\noindent This graph shows that Python needs approximately 29
+\begin{axis}[
+title={Graph: $\texttt{(a*)*\,b}$ and strings
+$\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
+xlabel={$n$},
+x label style={at={(1.05,0.0)}},
+ylabel={time in secs},
+enlargelimits=false,
+xtick={0,5,...,30},
+xmax=33,
+ymax=35,
+ytick={0,5,...,30},
+scaled ticks=false,
+axis lines=left,
+width=5.5cm,
+height=4.5cm,
+legend entries={Java},
+legend pos=north west,
+legend cell align=left]
+\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
+\end{axis}
+\end{tikzpicture}
+\end{tabular}
+\end{center}
+\noindent This first graph shows that Python needs approximately 29
 seconds for finding out whether a string of 28 \texttt{a}s
-matches the regular expression \texttt{[a?]\{28\}[a]\{28\}}.
+matches the regular expression \texttt{a?\{28\}\,a\{28\}}.
 Ruby is even slightly worse.\footnote{In this example Ruby
 uses the slightly different regular expression
 \texttt{a?a?a?...a?a?aaa...aa}, where the \texttt{a?} and
 \texttt{a} each occur $n$ times. More such test cases can be
-found at \url{http://www.computerbytesman.com/redos/}.}
+found at \url{http://www.computerbytesman.com/redos/}.} Simlarly, Java needs approximately
-Admittedly, this regular expression is carefully chosen to
+30 seconds to find out that the regular expression $\texttt{(a*)*\,b}$
+does not match strings of 28 \texttt{a}s.
+Admittedly, these regular expressions are carefully chosen to
 exhibit this exponential behaviour, but similar ones occur
 more often than one wants in ``real life''. For example, on 20
 July 2016 a similar regular expression brought the webpage
 \href{http://stackexchange.com}{Stack Exchange} to its knees:
 \begin{center}
 \url{http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}
 \end{center}
 \noindent
-Such troublesome regular expressions are sometimes called
+Such troublesome regular expressions are sometimes called \emph{evil
-\emph{evil regular expressions} because they have the
+regular expressions} because they have the potential to make regular
-potential to make regular expression matching engines to
+expression matching engines to topple over, like in Python, Ruby and
-topple over, like in Python and Ruby. The problem with evil
+Java. This `toopling over' is also sometimes called \emph{catastrophic
-regular expressions is that they can have some serious
+backtracking}.  The problem with evil regular expressions is that
-consequences, for example, if you use them in your
+they can have some serious consequences, for example, if you use them
-web-application. The reason is that hackers can look for these
+in your web-application. The reason is that hackers can look for these
-instances where the matching engine behaves badly and then
+instances where the matching engine behaves badly and then mount a
-mount a nice DoS-attack against your application. These
+nice DoS-attack against your application. These attacks are already
-attacks are already have their own name: \emph{Regular
+have their own name: \emph{Regular Expression Denial of Service
-Expression Denial of Service Attacks (ReDoS)}.
+Attacks (ReDoS)}.
 It will be instructive to look behind the ``scenes'' to find
 out why Python and Ruby (and others) behave so badly when
 matching with evil regular expressions. But we will also look
 at a relatively simple algorithm that solves this problem much
 up to 12,000 in less than 10(!) seconds, see the graph below:
 \begin{center}
 \begin{tikzpicture}
 \begin{axis}[
-xlabel={strings of \pcode{a}s},
+title={Graph: $\texttt{a?\{n\}\,a{\{n\}}}$ and strings
+$\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
+xlabel={$n$},
+x label style={at={(1.05,0.0)}},
 ylabel={time in secs},
 enlargelimits=false,
 xtick={0,3000,...,12000},
-xmax=12500,
+xmax=13000,
-ymax=35,
+ymax=12,
-ytick={0,5,...,30},
+ytick={0,5,10},
 scaled ticks=false,
 axis lines=left,
-width=9cm,
+width=7cm,
-height=4.5cm]
+height=4.5cm,
-\addplot[green,mark=square*,mark options={fill=white}] table {re2b.data};
+legend entries={Our Algorithm V1, Our Algorithm V2},
+legend pos=outer north east]
+\addplot[green,mark=square*,mark options={fill=white}] table {re2.data};
 \addplot[black,mark=square*,mark options={fill=white}] table {re3.data};
+\end{axis}
+\end{tikzpicture}
+\end{center}
+\noindent And in the case of the regular expression $\texttt{(a*)*\,b}$
+and strings of \texttt{a}s we will beat Java by factor of
+approximately 1,000,000 (note the scale on the $x$-axis).
+\begin{center}
+\begin{tikzpicture}
+\begin{axis}[
+title={Graph: $\texttt{(a*)*\,b}$ and strings
+$\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
+xlabel={$n$},
+x label style={at={(1.05,0.0)}},
+ylabel={time in secs},
+enlargelimits=false,
+ymax=12,
+ytick={0,5,10},
+axis lines=left,
+width=7cm,
+height=4.5cm,
+legend entries={Our Algorithm V2},
+legend pos=outer north east]
+\addplot[black,mark=square*,mark options={fill=white}] table {re3a.data};
 \end{axis}
 \end{tikzpicture}
 \end{center}
 \subsection*{Basic Regular Expressions}

changeset 415	4ae59fd3b174
parent 407	4b454a6d1814
child 416	357c395ae838