handouts/ho01.tex
changeset 415 4ae59fd3b174
parent 407 4b454a6d1814
child 416 357c395ae838
equal deleted inserted replaced
414:065ca01b62ae 415:4ae59fd3b174
    22 % http://regexr.com
    22 % http://regexr.com
    23 
    23 
    24 %% emacs regexes
    24 %% emacs regexes
    25 %% https://www.gnu.org/software/emacs/manual/html_node/elisp/Regular-Expressions.html
    25 %% https://www.gnu.org/software/emacs/manual/html_node/elisp/Regular-Expressions.html
    26 
    26 
    27 %% reasons for a new prgramming language
    27 %%  reasons for a new prgramming language
    28 %% http://beautifulracket.com
    28 %% http://beautifulracket.com
    29 
    29 
    30 \begin{document}
    30 \begin{document}
    31 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016}
    31 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016}
    32 
    32 
   180 and they have been object of intense study since then. They
   180 and they have been object of intense study since then. They
   181 are nowadays pretty much ubiquitous in computer science. There
   181 are nowadays pretty much ubiquitous in computer science. There
   182 are many libraries implementing regular expressions. I am sure
   182 are many libraries implementing regular expressions. I am sure
   183 you have come across them before (remember PRA?). Why on earth
   183 you have come across them before (remember PRA?). Why on earth
   184 then is there any interest in studying them again in depth in
   184 then is there any interest in studying them again in depth in
   185 this module? Well, one answer is in the following graph about
   185 this module? Well, one answer is in the following two graphs about
   186 regular expression matching in Python and in Ruby.
   186 regular expression matching in Python, Ruby and Java.
   187 
   187 
   188 \begin{center}
   188 \begin{center}
       
   189 \begin{tabular}{@{}cc@{}}
   189 \begin{tikzpicture}
   190 \begin{tikzpicture}
   190 \begin{axis}[
   191 \begin{axis}[
   191     xlabel={strings of {\tt a}s},
   192     title={Graph: $\texttt{a?\{n\}\,a{\{n\}}}$ and strings 
       
   193            $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
       
   194     xlabel={$n$},
       
   195     x label style={at={(1.05,0.0)}},
   192     ylabel={time in secs},
   196     ylabel={time in secs},
   193     enlargelimits=false,
   197     enlargelimits=false,
   194     xtick={0,5,...,30},
   198     xtick={0,5,...,30},
   195     xmax=33,
   199     xmax=33,
   196     ymax=35,
   200     ymax=35,
   197     ytick={0,5,...,30},
   201     ytick={0,5,...,30},
   198     scaled ticks=false,
   202     scaled ticks=false,
   199     axis lines=left,
   203     axis lines=left,
   200     width=7cm,
   204     width=5.5cm,
   201     height=4.5cm, 
   205     height=4.5cm, 
   202     legend entries={Python,Ruby},  
   206     legend entries={Python,Ruby},  
   203     legend pos=north west,
   207     legend pos=north west,
   204     legend cell align=left]
   208     legend cell align=left]
   205 \addplot[blue,mark=*, mark options={fill=white}] 
   209 \addplot[blue,mark=*, mark options={fill=white}] 
   206   table {re-python.data};
   210   table {re-python.data};
   207 \addplot[brown,mark=triangle*, mark options={fill=white}] 
   211 \addplot[brown,mark=triangle*, mark options={fill=white}] 
   208   table {re-ruby.data};  
   212   table {re-ruby.data};  
   209 \end{axis}
   213 \end{axis}
   210 \end{tikzpicture}
   214 \end{tikzpicture}
   211 \end{center}
   215 &
   212 
   216 \begin{tikzpicture}
   213 \noindent This graph shows that Python needs approximately 29
   217 \begin{axis}[
       
   218     title={Graph: $\texttt{(a*)*\,b}$ and strings 
       
   219            $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
       
   220     xlabel={$n$},
       
   221     x label style={at={(1.05,0.0)}},
       
   222     ylabel={time in secs},
       
   223     enlargelimits=false,
       
   224     xtick={0,5,...,30},
       
   225     xmax=33,
       
   226     ymax=35,
       
   227     ytick={0,5,...,30},
       
   228     scaled ticks=false,
       
   229     axis lines=left,
       
   230     width=5.5cm,
       
   231     height=4.5cm, 
       
   232     legend entries={Java},  
       
   233     legend pos=north west,
       
   234     legend cell align=left]
       
   235 \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
       
   236 \end{axis}
       
   237 \end{tikzpicture}
       
   238 \end{tabular}
       
   239 \end{center}
       
   240 
       
   241 \noindent This first graph shows that Python needs approximately 29
   214 seconds for finding out whether a string of 28 \texttt{a}s
   242 seconds for finding out whether a string of 28 \texttt{a}s
   215 matches the regular expression \texttt{[a?]\{28\}[a]\{28\}}.
   243 matches the regular expression \texttt{a?\{28\}\,a\{28\}}.
   216 Ruby is even slightly worse.\footnote{In this example Ruby
   244 Ruby is even slightly worse.\footnote{In this example Ruby
   217 uses the slightly different regular expression
   245 uses the slightly different regular expression
   218 \texttt{a?a?a?...a?a?aaa...aa}, where the \texttt{a?} and
   246 \texttt{a?a?a?...a?a?aaa...aa}, where the \texttt{a?} and
   219 \texttt{a} each occur $n$ times. More such test cases can be
   247 \texttt{a} each occur $n$ times. More such test cases can be
   220 found at \url{http://www.computerbytesman.com/redos/}.}
   248 found at \url{http://www.computerbytesman.com/redos/}.} Simlarly, Java needs approximately
   221 Admittedly, this regular expression is carefully chosen to
   249 30 seconds to find out that the regular expression $\texttt{(a*)*\,b}$
       
   250 does not match strings of 28 \texttt{a}s.
       
   251 Admittedly, these regular expressions are carefully chosen to
   222 exhibit this exponential behaviour, but similar ones occur
   252 exhibit this exponential behaviour, but similar ones occur
   223 more often than one wants in ``real life''. For example, on 20
   253 more often than one wants in ``real life''. For example, on 20
   224 July 2016 a similar regular expression brought the webpage
   254 July 2016 a similar regular expression brought the webpage
   225 \href{http://stackexchange.com}{Stack Exchange} to its knees:
   255 \href{http://stackexchange.com}{Stack Exchange} to its knees:
   226 
   256 
   227 \begin{center}
   257 \begin{center}
   228 \url{http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}
   258 \url{http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}
   229 \end{center}
   259 \end{center}
   230 
   260 
   231 \noindent
   261 \noindent
   232 Such troublesome regular expressions are sometimes called
   262 Such troublesome regular expressions are sometimes called \emph{evil
   233 \emph{evil regular expressions} because they have the
   263   regular expressions} because they have the potential to make regular
   234 potential to make regular expression matching engines to
   264 expression matching engines to topple over, like in Python, Ruby and
   235 topple over, like in Python and Ruby. The problem with evil
   265 Java. This `toopling over' is also sometimes called \emph{catastrophic
   236 regular expressions is that they can have some serious
   266   backtracking}.  The problem with evil regular expressions is that
   237 consequences, for example, if you use them in your
   267 they can have some serious consequences, for example, if you use them
   238 web-application. The reason is that hackers can look for these
   268 in your web-application. The reason is that hackers can look for these
   239 instances where the matching engine behaves badly and then
   269 instances where the matching engine behaves badly and then mount a
   240 mount a nice DoS-attack against your application. These
   270 nice DoS-attack against your application. These attacks are already
   241 attacks are already have their own name: \emph{Regular
   271 have their own name: \emph{Regular Expression Denial of Service
   242 Expression Denial of Service Attacks (ReDoS)}.
   272   Attacks (ReDoS)}.
   243 
   273 
   244 It will be instructive to look behind the ``scenes'' to find
   274 It will be instructive to look behind the ``scenes'' to find
   245 out why Python and Ruby (and others) behave so badly when
   275 out why Python and Ruby (and others) behave so badly when
   246 matching with evil regular expressions. But we will also look
   276 matching with evil regular expressions. But we will also look
   247 at a relatively simple algorithm that solves this problem much
   277 at a relatively simple algorithm that solves this problem much
   252 up to 12,000 in less than 10(!) seconds, see the graph below:
   282 up to 12,000 in less than 10(!) seconds, see the graph below:
   253 
   283 
   254 \begin{center}
   284 \begin{center}
   255 \begin{tikzpicture}
   285 \begin{tikzpicture}
   256   \begin{axis}[
   286   \begin{axis}[
   257     xlabel={strings of \pcode{a}s},
   287     title={Graph: $\texttt{a?\{n\}\,a{\{n\}}}$ and strings 
       
   288            $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
       
   289     xlabel={$n$},
       
   290     x label style={at={(1.05,0.0)}},
   258     ylabel={time in secs},
   291     ylabel={time in secs},
   259     enlargelimits=false,
   292     enlargelimits=false,
   260     xtick={0,3000,...,12000},
   293     xtick={0,3000,...,12000},
   261     xmax=12500,
   294     xmax=13000,
   262     ymax=35,
   295     ymax=12,
   263     ytick={0,5,...,30},
   296     ytick={0,5,10},
   264     scaled ticks=false,
   297     scaled ticks=false,
   265     axis lines=left,
   298     axis lines=left,
   266     width=9cm,
   299     width=7cm,
   267     height=4.5cm]
   300     height=4.5cm,
   268 \addplot[green,mark=square*,mark options={fill=white}] table {re2b.data};
   301     legend entries={Our Algorithm V1, Our Algorithm V2},
       
   302     legend pos=outer north east]
       
   303 \addplot[green,mark=square*,mark options={fill=white}] table {re2.data};
   269 \addplot[black,mark=square*,mark options={fill=white}] table {re3.data};
   304 \addplot[black,mark=square*,mark options={fill=white}] table {re3.data};
       
   305 \end{axis}
       
   306 \end{tikzpicture}
       
   307 \end{center}
       
   308 
       
   309 \noindent And in the case of the regular expression $\texttt{(a*)*\,b}$
       
   310 and strings of \texttt{a}s we will beat Java by factor of
       
   311 approximately 1,000,000 (note the scale on the $x$-axis). 
       
   312 
       
   313 \begin{center}
       
   314 \begin{tikzpicture}
       
   315   \begin{axis}[
       
   316     title={Graph: $\texttt{(a*)*\,b}$ and strings 
       
   317            $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$},
       
   318     xlabel={$n$},
       
   319     x label style={at={(1.05,0.0)}},
       
   320     ylabel={time in secs},
       
   321     enlargelimits=false,
       
   322     ymax=12,
       
   323     ytick={0,5,10},
       
   324     axis lines=left,
       
   325     width=7cm,
       
   326     height=4.5cm,
       
   327     legend entries={Our Algorithm V2},
       
   328     legend pos=outer north east]
       
   329 \addplot[black,mark=square*,mark options={fill=white}] table {re3a.data};
   270 \end{axis}
   330 \end{axis}
   271 \end{tikzpicture}
   331 \end{tikzpicture}
   272 \end{center}
   332 \end{center}
   273 
   333 
   274 \subsection*{Basic Regular Expressions}
   334 \subsection*{Basic Regular Expressions}