handouts/ho01.tex
changeset 252 e8ef8f38ca84
parent 250 b79e704acb72
child 258 1e4da6d2490c
equal deleted inserted replaced
251:5b5a68df6d16 252:e8ef8f38ca84
   147 there any interest in studying them again in depth in this
   147 there any interest in studying them again in depth in this
   148 module? Well, one answer is in the following graph about
   148 module? Well, one answer is in the following graph about
   149 regular expression matching in Python and in Ruby.
   149 regular expression matching in Python and in Ruby.
   150 
   150 
   151 \begin{center}
   151 \begin{center}
   152 \begin{tikzpicture}[y=.1cm, x=.15cm]
   152 \begin{tikzpicture}[y=.09cm, x=.15cm]
   153  	%axis
   153  	%axis
   154 	\draw (0,0) -- coordinate (x axis mid) (30,0);
   154 	\draw (0,0) -- coordinate (x axis mid) (30,0);
   155    \draw (0,0) -- coordinate (y axis mid) (0,30);
   155    \draw (0,0) -- coordinate (y axis mid) (0,30);
   156    %ticks
   156    %ticks
   157    \foreach \x in {0,5,...,30}
   157    \foreach \x in {0,5,...,30}
   182 seconds for finding out whether a string of 28 \texttt{a}s
   182 seconds for finding out whether a string of 28 \texttt{a}s
   183 matches the regular expression \texttt{[a?]\{28\}[a]\{28\}}.
   183 matches the regular expression \texttt{[a?]\{28\}[a]\{28\}}.
   184 Ruby is even slightly worse.\footnote{In this example Ruby
   184 Ruby is even slightly worse.\footnote{In this example Ruby
   185 uses the slightly different regular expression
   185 uses the slightly different regular expression
   186 \texttt{a?a?a?...a?a?aaa...aa}, where the \texttt{a?} and
   186 \texttt{a?a?a?...a?a?aaa...aa}, where the \texttt{a?} and
   187 \texttt{a} each occur $n$ times.} Admittedly, this regular
   187 \texttt{a} each occur $n$ times. More test results can be
   188 expression is carefully chosen to exhibit this exponential
   188 found at \url{http://www.computerbytesman.com/redos/}.}
   189 behaviour, but similar ones occur more often than one wants in
   189 Admittedly, this regular expression is carefully chosen to
   190 ``real life''. They are sometimes called \emph{evil regular
   190 exhibit this exponential behaviour, but similar ones occur
   191 expressions} because they have the potential to make regular
   191 more often than one wants in ``real life''. They are sometimes
   192 expression matching engines to topple over, like in Python and
   192 called \emph{evil regular expressions} because they have the
   193 Ruby. The problem with evil regular expressions is that they
   193 potential to make regular expression matching engines to
   194 can have some serious consequences, for example, if you use
   194 topple over, like in Python and Ruby. The problem with evil
   195 them in your web-application. The reason is that hackers can
   195 regular expressions is that they can have some serious
   196 look for these instances where the matching engine behaves
   196 consequences, for example, if you use them in your
   197 badly and then mount a nice DoS-attack against your
   197 web-application. The reason is that hackers can look for these
   198 application.
   198 instances where the matching engine behaves badly and then
       
   199 mount a nice DoS-attack against your application. These
       
   200 attacks are already have their own name: 
       
   201 \emph{Regular Expression Denial of Servive Attack (ReDoS)}.
   199 
   202 
   200 It will be instructive to look behind the ``scenes'' to find
   203 It will be instructive to look behind the ``scenes'' to find
   201 out why Python and Ruby (and others) behave so badly when
   204 out why Python and Ruby (and others) behave so badly when
   202 matching with evil regular expressions. But we will also look
   205 matching with evil regular expressions. But we will also look
   203 at a relatively simple algorithm that solves this problem much
   206 at a relatively simple algorithm that solves this problem much
   206 process strings of approximately 1,000 \texttt{a}s in 30
   209 process strings of approximately 1,000 \texttt{a}s in 30
   207 seconds, while the second version will even be able to process
   210 seconds, while the second version will even be able to process
   208 up to 12,000 in less than 10(!) seconds, see the graph below:
   211 up to 12,000 in less than 10(!) seconds, see the graph below:
   209 
   212 
   210 \begin{center}
   213 \begin{center}
   211 \begin{tikzpicture}[y=.1cm, x=.0006cm]
   214 \begin{tikzpicture}[y=.09cm, x=.0006cm]
   212  	%axis
   215  	%axis
   213 	\draw (0,0) -- coordinate (x axis mid) (12000,0);
   216 	\draw (0,0) -- coordinate (x axis mid) (12000,0);
   214    \draw (0,0) -- coordinate (y axis mid) (0,30);
   217    \draw (0,0) -- coordinate (y axis mid) (0,30);
   215    %ticks
   218    %ticks
   216    \foreach \x in {0,2000,...,12000}
   219    \foreach \x in {0,2000,...,12000}