180 and they have been object of intense study since then. They |
180 and they have been object of intense study since then. They |
181 are nowadays pretty much ubiquitous in computer science. There |
181 are nowadays pretty much ubiquitous in computer science. There |
182 are many libraries implementing regular expressions. I am sure |
182 are many libraries implementing regular expressions. I am sure |
183 you have come across them before (remember PRA?). Why on earth |
183 you have come across them before (remember PRA?). Why on earth |
184 then is there any interest in studying them again in depth in |
184 then is there any interest in studying them again in depth in |
185 this module? Well, one answer is in the following graph about |
185 this module? Well, one answer is in the following two graphs about |
186 regular expression matching in Python and in Ruby. |
186 regular expression matching in Python, Ruby and Java. |
187 |
187 |
188 \begin{center} |
188 \begin{center} |
|
189 \begin{tabular}{@{}cc@{}} |
189 \begin{tikzpicture} |
190 \begin{tikzpicture} |
190 \begin{axis}[ |
191 \begin{axis}[ |
191 xlabel={strings of {\tt a}s}, |
192 title={Graph: $\texttt{a?\{n\}\,a{\{n\}}}$ and strings |
|
193 $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$}, |
|
194 xlabel={$n$}, |
|
195 x label style={at={(1.05,0.0)}}, |
192 ylabel={time in secs}, |
196 ylabel={time in secs}, |
193 enlargelimits=false, |
197 enlargelimits=false, |
194 xtick={0,5,...,30}, |
198 xtick={0,5,...,30}, |
195 xmax=33, |
199 xmax=33, |
196 ymax=35, |
200 ymax=35, |
197 ytick={0,5,...,30}, |
201 ytick={0,5,...,30}, |
198 scaled ticks=false, |
202 scaled ticks=false, |
199 axis lines=left, |
203 axis lines=left, |
200 width=7cm, |
204 width=5.5cm, |
201 height=4.5cm, |
205 height=4.5cm, |
202 legend entries={Python,Ruby}, |
206 legend entries={Python,Ruby}, |
203 legend pos=north west, |
207 legend pos=north west, |
204 legend cell align=left] |
208 legend cell align=left] |
205 \addplot[blue,mark=*, mark options={fill=white}] |
209 \addplot[blue,mark=*, mark options={fill=white}] |
206 table {re-python.data}; |
210 table {re-python.data}; |
207 \addplot[brown,mark=triangle*, mark options={fill=white}] |
211 \addplot[brown,mark=triangle*, mark options={fill=white}] |
208 table {re-ruby.data}; |
212 table {re-ruby.data}; |
209 \end{axis} |
213 \end{axis} |
210 \end{tikzpicture} |
214 \end{tikzpicture} |
211 \end{center} |
215 & |
212 |
216 \begin{tikzpicture} |
213 \noindent This graph shows that Python needs approximately 29 |
217 \begin{axis}[ |
|
218 title={Graph: $\texttt{(a*)*\,b}$ and strings |
|
219 $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$}, |
|
220 xlabel={$n$}, |
|
221 x label style={at={(1.05,0.0)}}, |
|
222 ylabel={time in secs}, |
|
223 enlargelimits=false, |
|
224 xtick={0,5,...,30}, |
|
225 xmax=33, |
|
226 ymax=35, |
|
227 ytick={0,5,...,30}, |
|
228 scaled ticks=false, |
|
229 axis lines=left, |
|
230 width=5.5cm, |
|
231 height=4.5cm, |
|
232 legend entries={Java}, |
|
233 legend pos=north west, |
|
234 legend cell align=left] |
|
235 \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data}; |
|
236 \end{axis} |
|
237 \end{tikzpicture} |
|
238 \end{tabular} |
|
239 \end{center} |
|
240 |
|
241 \noindent This first graph shows that Python needs approximately 29 |
214 seconds for finding out whether a string of 28 \texttt{a}s |
242 seconds for finding out whether a string of 28 \texttt{a}s |
215 matches the regular expression \texttt{[a?]\{28\}[a]\{28\}}. |
243 matches the regular expression \texttt{a?\{28\}\,a\{28\}}. |
216 Ruby is even slightly worse.\footnote{In this example Ruby |
244 Ruby is even slightly worse.\footnote{In this example Ruby |
217 uses the slightly different regular expression |
245 uses the slightly different regular expression |
218 \texttt{a?a?a?...a?a?aaa...aa}, where the \texttt{a?} and |
246 \texttt{a?a?a?...a?a?aaa...aa}, where the \texttt{a?} and |
219 \texttt{a} each occur $n$ times. More such test cases can be |
247 \texttt{a} each occur $n$ times. More such test cases can be |
220 found at \url{http://www.computerbytesman.com/redos/}.} |
248 found at \url{http://www.computerbytesman.com/redos/}.} Simlarly, Java needs approximately |
221 Admittedly, this regular expression is carefully chosen to |
249 30 seconds to find out that the regular expression $\texttt{(a*)*\,b}$ |
|
250 does not match strings of 28 \texttt{a}s. |
|
251 Admittedly, these regular expressions are carefully chosen to |
222 exhibit this exponential behaviour, but similar ones occur |
252 exhibit this exponential behaviour, but similar ones occur |
223 more often than one wants in ``real life''. For example, on 20 |
253 more often than one wants in ``real life''. For example, on 20 |
224 July 2016 a similar regular expression brought the webpage |
254 July 2016 a similar regular expression brought the webpage |
225 \href{http://stackexchange.com}{Stack Exchange} to its knees: |
255 \href{http://stackexchange.com}{Stack Exchange} to its knees: |
226 |
256 |
227 \begin{center} |
257 \begin{center} |
228 \url{http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016} |
258 \url{http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016} |
229 \end{center} |
259 \end{center} |
230 |
260 |
231 \noindent |
261 \noindent |
232 Such troublesome regular expressions are sometimes called |
262 Such troublesome regular expressions are sometimes called \emph{evil |
233 \emph{evil regular expressions} because they have the |
263 regular expressions} because they have the potential to make regular |
234 potential to make regular expression matching engines to |
264 expression matching engines to topple over, like in Python, Ruby and |
235 topple over, like in Python and Ruby. The problem with evil |
265 Java. This `toopling over' is also sometimes called \emph{catastrophic |
236 regular expressions is that they can have some serious |
266 backtracking}. The problem with evil regular expressions is that |
237 consequences, for example, if you use them in your |
267 they can have some serious consequences, for example, if you use them |
238 web-application. The reason is that hackers can look for these |
268 in your web-application. The reason is that hackers can look for these |
239 instances where the matching engine behaves badly and then |
269 instances where the matching engine behaves badly and then mount a |
240 mount a nice DoS-attack against your application. These |
270 nice DoS-attack against your application. These attacks are already |
241 attacks are already have their own name: \emph{Regular |
271 have their own name: \emph{Regular Expression Denial of Service |
242 Expression Denial of Service Attacks (ReDoS)}. |
272 Attacks (ReDoS)}. |
243 |
273 |
244 It will be instructive to look behind the ``scenes'' to find |
274 It will be instructive to look behind the ``scenes'' to find |
245 out why Python and Ruby (and others) behave so badly when |
275 out why Python and Ruby (and others) behave so badly when |
246 matching with evil regular expressions. But we will also look |
276 matching with evil regular expressions. But we will also look |
247 at a relatively simple algorithm that solves this problem much |
277 at a relatively simple algorithm that solves this problem much |
252 up to 12,000 in less than 10(!) seconds, see the graph below: |
282 up to 12,000 in less than 10(!) seconds, see the graph below: |
253 |
283 |
254 \begin{center} |
284 \begin{center} |
255 \begin{tikzpicture} |
285 \begin{tikzpicture} |
256 \begin{axis}[ |
286 \begin{axis}[ |
257 xlabel={strings of \pcode{a}s}, |
287 title={Graph: $\texttt{a?\{n\}\,a{\{n\}}}$ and strings |
|
288 $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$}, |
|
289 xlabel={$n$}, |
|
290 x label style={at={(1.05,0.0)}}, |
258 ylabel={time in secs}, |
291 ylabel={time in secs}, |
259 enlargelimits=false, |
292 enlargelimits=false, |
260 xtick={0,3000,...,12000}, |
293 xtick={0,3000,...,12000}, |
261 xmax=12500, |
294 xmax=13000, |
262 ymax=35, |
295 ymax=12, |
263 ytick={0,5,...,30}, |
296 ytick={0,5,10}, |
264 scaled ticks=false, |
297 scaled ticks=false, |
265 axis lines=left, |
298 axis lines=left, |
266 width=9cm, |
299 width=7cm, |
267 height=4.5cm] |
300 height=4.5cm, |
268 \addplot[green,mark=square*,mark options={fill=white}] table {re2b.data}; |
301 legend entries={Our Algorithm V1, Our Algorithm V2}, |
|
302 legend pos=outer north east] |
|
303 \addplot[green,mark=square*,mark options={fill=white}] table {re2.data}; |
269 \addplot[black,mark=square*,mark options={fill=white}] table {re3.data}; |
304 \addplot[black,mark=square*,mark options={fill=white}] table {re3.data}; |
|
305 \end{axis} |
|
306 \end{tikzpicture} |
|
307 \end{center} |
|
308 |
|
309 \noindent And in the case of the regular expression $\texttt{(a*)*\,b}$ |
|
310 and strings of \texttt{a}s we will beat Java by factor of |
|
311 approximately 1,000,000 (note the scale on the $x$-axis). |
|
312 |
|
313 \begin{center} |
|
314 \begin{tikzpicture} |
|
315 \begin{axis}[ |
|
316 title={Graph: $\texttt{(a*)*\,b}$ and strings |
|
317 $\underbrace{\texttt{a}\ldots \texttt{a}}_{n}$}, |
|
318 xlabel={$n$}, |
|
319 x label style={at={(1.05,0.0)}}, |
|
320 ylabel={time in secs}, |
|
321 enlargelimits=false, |
|
322 ymax=12, |
|
323 ytick={0,5,10}, |
|
324 axis lines=left, |
|
325 width=7cm, |
|
326 height=4.5cm, |
|
327 legend entries={Our Algorithm V2}, |
|
328 legend pos=outer north east] |
|
329 \addplot[black,mark=square*,mark options={fill=white}] table {re3a.data}; |
270 \end{axis} |
330 \end{axis} |
271 \end{tikzpicture} |
331 \end{tikzpicture} |
272 \end{center} |
332 \end{center} |
273 |
333 |
274 \subsection*{Basic Regular Expressions} |
334 \subsection*{Basic Regular Expressions} |