147 there any interest in studying them again in depth in this |
147 there any interest in studying them again in depth in this |
148 module? Well, one answer is in the following graph about |
148 module? Well, one answer is in the following graph about |
149 regular expression matching in Python and in Ruby. |
149 regular expression matching in Python and in Ruby. |
150 |
150 |
151 \begin{center} |
151 \begin{center} |
152 \begin{tikzpicture}[y=.1cm, x=.15cm] |
152 \begin{tikzpicture}[y=.09cm, x=.15cm] |
153 %axis |
153 %axis |
154 \draw (0,0) -- coordinate (x axis mid) (30,0); |
154 \draw (0,0) -- coordinate (x axis mid) (30,0); |
155 \draw (0,0) -- coordinate (y axis mid) (0,30); |
155 \draw (0,0) -- coordinate (y axis mid) (0,30); |
156 %ticks |
156 %ticks |
157 \foreach \x in {0,5,...,30} |
157 \foreach \x in {0,5,...,30} |
182 seconds for finding out whether a string of 28 \texttt{a}s |
182 seconds for finding out whether a string of 28 \texttt{a}s |
183 matches the regular expression \texttt{[a?]\{28\}[a]\{28\}}. |
183 matches the regular expression \texttt{[a?]\{28\}[a]\{28\}}. |
184 Ruby is even slightly worse.\footnote{In this example Ruby |
184 Ruby is even slightly worse.\footnote{In this example Ruby |
185 uses the slightly different regular expression |
185 uses the slightly different regular expression |
186 \texttt{a?a?a?...a?a?aaa...aa}, where the \texttt{a?} and |
186 \texttt{a?a?a?...a?a?aaa...aa}, where the \texttt{a?} and |
187 \texttt{a} each occur $n$ times.} Admittedly, this regular |
187 \texttt{a} each occur $n$ times. More test results can be |
188 expression is carefully chosen to exhibit this exponential |
188 found at \url{http://www.computerbytesman.com/redos/}.} |
189 behaviour, but similar ones occur more often than one wants in |
189 Admittedly, this regular expression is carefully chosen to |
190 ``real life''. They are sometimes called \emph{evil regular |
190 exhibit this exponential behaviour, but similar ones occur |
191 expressions} because they have the potential to make regular |
191 more often than one wants in ``real life''. They are sometimes |
192 expression matching engines to topple over, like in Python and |
192 called \emph{evil regular expressions} because they have the |
193 Ruby. The problem with evil regular expressions is that they |
193 potential to make regular expression matching engines to |
194 can have some serious consequences, for example, if you use |
194 topple over, like in Python and Ruby. The problem with evil |
195 them in your web-application. The reason is that hackers can |
195 regular expressions is that they can have some serious |
196 look for these instances where the matching engine behaves |
196 consequences, for example, if you use them in your |
197 badly and then mount a nice DoS-attack against your |
197 web-application. The reason is that hackers can look for these |
198 application. |
198 instances where the matching engine behaves badly and then |
|
199 mount a nice DoS-attack against your application. These |
|
200 attacks are already have their own name: |
|
201 \emph{Regular Expression Denial of Servive Attack (ReDoS)}. |
199 |
202 |
200 It will be instructive to look behind the ``scenes'' to find |
203 It will be instructive to look behind the ``scenes'' to find |
201 out why Python and Ruby (and others) behave so badly when |
204 out why Python and Ruby (and others) behave so badly when |
202 matching with evil regular expressions. But we will also look |
205 matching with evil regular expressions. But we will also look |
203 at a relatively simple algorithm that solves this problem much |
206 at a relatively simple algorithm that solves this problem much |
206 process strings of approximately 1,000 \texttt{a}s in 30 |
209 process strings of approximately 1,000 \texttt{a}s in 30 |
207 seconds, while the second version will even be able to process |
210 seconds, while the second version will even be able to process |
208 up to 12,000 in less than 10(!) seconds, see the graph below: |
211 up to 12,000 in less than 10(!) seconds, see the graph below: |
209 |
212 |
210 \begin{center} |
213 \begin{center} |
211 \begin{tikzpicture}[y=.1cm, x=.0006cm] |
214 \begin{tikzpicture}[y=.09cm, x=.0006cm] |
212 %axis |
215 %axis |
213 \draw (0,0) -- coordinate (x axis mid) (12000,0); |
216 \draw (0,0) -- coordinate (x axis mid) (12000,0); |
214 \draw (0,0) -- coordinate (y axis mid) (0,30); |
217 \draw (0,0) -- coordinate (y axis mid) (0,30); |
215 %ticks |
218 %ticks |
216 \foreach \x in {0,2000,...,12000} |
219 \foreach \x in {0,2000,...,12000} |