220 under a certain class of inputs. |
218 under a certain class of inputs. |
221 %However, , this is not the case for $\mathbf{all}$ inputs. |
219 %However, , this is not the case for $\mathbf{all}$ inputs. |
222 %TODO: get source for SNORT/BRO's regex matching engine/speed |
220 %TODO: get source for SNORT/BRO's regex matching engine/speed |
223 |
221 |
224 |
222 |
225 Take $(a^*)^*\,b$ and ask whether |
223 Consider $(a^*)^*\,b$ and ask whether |
226 strings of the form $aa..a$ match this regular |
224 strings of the form $aa..a$ can be matched by this regular |
227 expression. Obviously this is not the case---the expected $b$ in the last |
225 expression. Obviously this is not the case---the expected $b$ in the last |
228 position is missing. One would expect that modern regular expression |
226 position is missing. One would expect that modern regular expression |
229 matching engines can find this out very quickly. Alas, if one tries |
227 matching engines can find this out very quickly. Surprisingly, if one tries |
230 this example in JavaScript, Python or Java 8, even with strings of a small |
228 this example in JavaScript, Python or Java 8, even with small strings, |
231 length, say around 30 $a$'s, |
229 say of lenght of around 30 $a$'s, |
232 the decision takes crazy time to finish (graph \ref{fig:aStarStarb}). |
230 the decision takes an absurd time to finish (see graphs in figure \ref{fig:aStarStarb}). |
233 This is clearly exponential behaviour, and |
231 This is clearly exponential behaviour, and |
234 is triggered by some relatively simple regex patterns. |
232 is triggered by some relatively simple regular expressions. |
235 Java 9 and newer |
233 Java 9 and newer |
236 versions improves this behaviour, but is still slow compared |
234 versions improve this behaviour somewhat, but is still slow compared |
237 with the approach we are going to use. |
235 with the approach we are going to use in this thesis. |
238 |
|
239 |
236 |
240 |
237 |
241 |
238 |
242 This superlinear blowup in regular expression engines |
239 This superlinear blowup in regular expression engines |
243 had repeatedly caused grief in real life that they |
240 had repeatedly caused grief in ``real life'' where it is |
244 get a name for them--``catastrophic backtracking''. |
241 given the name ``catastrophic backtracking'' or ``evil'' regular expressions. |
245 For example, on 20 July 2016 one evil |
242 For example, on 20 July 2016 one evil |
246 regular expression brought the webpage |
243 regular expression brought the webpage |
247 \href{http://stackexchange.com}{Stack Exchange} to its |
244 \href{http://stackexchange.com}{Stack Exchange} to its |
248 knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}(Last accessed in 2019)} |
245 knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}(Last accessed in 2019)} |
249 In this instance, a regular expression intended to just trim white |
246 In this instance, a regular expression intended to just trim white |
353 ytick={0,5,...,30}, |
329 ytick={0,5,...,30}, |
354 scaled ticks=false, |
330 scaled ticks=false, |
355 axis lines=left, |
331 axis lines=left, |
356 width=5cm, |
332 width=5cm, |
357 height=4cm, |
333 height=4cm, |
|
334 legend entries={Dart}, |
|
335 legend pos=north west, |
|
336 legend cell align=left] |
|
337 \addplot[green,mark=*, mark options={fill=white}] table {re-dart.data}; |
|
338 \end{axis} |
|
339 \end{tikzpicture}\\ |
|
340 \begin{tikzpicture} |
|
341 \begin{axis}[ |
|
342 xlabel={$n$}, |
|
343 x label style={at={(1.05,-0.05)}}, |
|
344 ylabel={time in secs}, |
|
345 enlargelimits=false, |
|
346 xtick={0,5,...,30}, |
|
347 xmax=33, |
|
348 ymax=35, |
|
349 ytick={0,5,...,30}, |
|
350 scaled ticks=false, |
|
351 axis lines=left, |
|
352 width=5cm, |
|
353 height=4cm, |
358 legend entries={Swift}, |
354 legend entries={Swift}, |
359 legend pos=north west, |
355 legend pos=north west, |
360 legend cell align=left] |
356 legend cell align=left] |
361 \addplot[purple,mark=*, mark options={fill=white}] table {re-swift.data}; |
357 \addplot[purple,mark=*, mark options={fill=white}] table {re-swift.data}; |
362 \end{axis} |
358 \end{axis} |
363 \end{tikzpicture} |
359 \end{tikzpicture} |
364 & \\ |
360 & |
365 \multicolumn{3}{c}{Graphs} |
361 \begin{tikzpicture} |
|
362 \begin{axis}[ |
|
363 xlabel={$n$}, |
|
364 x label style={at={(1.05,-0.05)}}, |
|
365 %ylabel={time in secs}, |
|
366 enlargelimits=true, |
|
367 %xtick={0,5000,...,40000}, |
|
368 %xmax=40000, |
|
369 %ymax=35, |
|
370 restrict x to domain*=0:40000, |
|
371 restrict y to domain*=0:35, |
|
372 %ytick={0,5,...,30}, |
|
373 %scaled ticks=false, |
|
374 axis lines=left, |
|
375 width=5cm, |
|
376 height=4cm, |
|
377 legend entries={Java9+}, |
|
378 legend pos=north west, |
|
379 legend cell align=left] |
|
380 \addplot[orange,mark=*, mark options={fill=white}] table {re-java9.data}; |
|
381 \end{axis} |
|
382 \end{tikzpicture}\\ |
|
383 \multicolumn{2}{c}{Graphs} |
366 \end{tabular} |
384 \end{tabular} |
367 \caption{Graphs showing runtime for matching $(a^*)^*\,b$ with strings |
385 \caption{Graphs showing runtime for matching $(a^*)^*\,b$ with strings |
368 of the form $\protect\underbrace{aa..a}_{n}$ in various existing regular expression libraries. |
386 of the form $\protect\underbrace{aa..a}_{n}$ in various existing regular expression libraries. |
369 The reason for their superlinear behaviour is that they do a depth-first-search. |
387 The reason for their superlinear behaviour is that they do a depth-first-search |
370 If the string does not match, the engine starts to explore all possibilities. |
388 using NFAs. |
|
389 If the string does not match, the regular expression matching |
|
390 engine starts to explore all possibilities. |
371 }\label{fig:aStarStarb} |
391 }\label{fig:aStarStarb} |
372 \end{figure}\afterpage{\clearpage} |
392 \end{figure}\afterpage{\clearpage} |
373 |
393 |
374 A more recent example is a global outage of all Cloudflare servers on 2 July |
394 A more recent example is a global outage of all Cloudflare servers on 2 July |
375 2019. A poorly written regular expression exhibited exponential |
395 2019. A poorly written regular expression exhibited catastrophic backtracking |
376 behaviour and exhausted CPUs that serve HTTP traffic. Although the outage |
396 and exhausted CPUs that serve HTTP traffic. Although the outage |
377 had several causes, at the heart was a regular expression that |
397 had several causes, at the heart was a regular expression that |
378 was used to monitor network |
398 was used to monitor network |
379 traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}(Last accessed in 2022)} |
399 traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/}(Last accessed in 2022)} |
380 These problems with regular expressions |
400 These problems with regular expressions |
381 are not isolated events that happen |
401 are not isolated events that happen |
382 very occasionally, but actually widespread. |
402 very occasionally, but actually widespread. |
383 They occur so often that they get a |
403 They occur so often that they have a |
384 name--Regular-Expression-Denial-Of-Service (ReDoS) |
404 name: Regular-Expression-Denial-Of-Service (ReDoS) |
385 attack. |
405 attack. |
386 \citeauthor{Davis18} detected more |
406 \citeauthor{Davis18} detected more |
387 than 1000 super-linear (SL) regular expressions |
407 than 1000 evil regular expressions |
388 in Node.js, Python core libraries, and npm and pypi. |
408 in Node.js, Python core libraries, npm and in pypi. |
389 They therefore concluded that evil regular expressions |
409 They therefore concluded that evil regular expressions |
390 are problems "more than a parlour trick", but one that |
410 are real problems rather than "a parlour trick". |
391 requires |
|
392 more research attention. |
|
393 |
411 |
394 This work aims to address this issue |
412 This work aims to address this issue |
395 with the help of formal proofs. |
413 with the help of formal proofs. |
396 We offer a lexing algorithm based |
414 We describe a lexing algorithm based |
397 on Brzozowski derivatives with certified correctness (in |
415 on Brzozowski derivatives with verified correctness (in |
398 Isabelle/HOL) |
416 Isabelle/HOL) |
399 and finiteness property. |
417 and a finiteness property. |
400 Such properties guarantee the absence of |
418 Such properties %guarantee the absence of |
401 catastrophic backtracking in most cases. |
419 are an important step in preventing |
|
420 catastrophic backtracking once and for all. |
402 We will give more details in the next sections |
421 We will give more details in the next sections |
403 on (i) why the slow cases in graph \ref{fig:aStarStarb} |
422 on (i) why the slow cases in graph \ref{fig:aStarStarb} |
404 can occur |
423 can occur in traditional regular expression engines |
405 and (ii) why we choose our |
424 and (ii) why we choose our |
406 approach (Brzozowski derivatives and formal proofs). |
425 approach based on Brzozowski derivatives and formal proofs. |
407 |
426 |
408 |
427 |
409 \section{Regex, and the Problems with Regex Matchers} |
428 \section{Preliminaries}%Regex, and the Problems with Regex Matchers} |
410 Regular expressions and regular expression matchers |
429 Regular expressions and regular expression matchers |
411 have of course been studied for many, many years. |
430 have of course been studied for many, many years. |
412 Theoretical results in automata theory say |
431 Theoretical results in automata theory state |
413 that basic regular expression matching should be linear |
432 that basic regular expression matching should be linear |
414 w.r.t the input. |
433 w.r.t the input. |
415 This assumes that the regular expression |
434 This assumes that the regular expression |
416 $r$ was pre-processed and turned into a |
435 $r$ was pre-processed and turned into a |
417 deterministic finite automata (DFA) before matching, |
436 deterministic finite automaton (DFA) before matching\cite{Sakarovitch2009}. |
418 which could be exponential\cite{Sakarovitch2009}. |
|
419 By basic we mean textbook definitions such as the one |
437 By basic we mean textbook definitions such as the one |
420 below, involving only characters, alternatives, |
438 below, involving only regular expressions for characters, alternatives, |
421 sequences, and Kleene stars: |
439 sequences, and Kleene stars: |
422 \[ |
440 \[ |
423 r ::= \ZERO | \ONE | c | r_1 + r_2 | r_1 \cdot r_2 | r^* |
441 r ::= c | r_1 + r_2 | r_1 \cdot r_2 | r^* |
424 \] |
442 \] |
425 Modern regular expression matchers used by programmers, |
443 Modern regular expression matchers used by programmers, |
426 however, |
444 however, |
427 support richer constructs such as bounded repetitions |
445 support much richer constructs, such as bounded repetitions |
428 and back-references. |
446 and back-references. |
429 To differentiate, people use the word \emph{regex} to refer |
447 To differentiate, we use the word \emph{regex} to refer |
430 to those expressions with richer constructs while reserving the |
448 to those expressions with richer constructs while reserving the |
431 term \emph{regular expression} |
449 term \emph{regular expression} |
432 for the more traditional meaning in formal languages theory. |
450 for the more traditional meaning in formal languages theory. |
433 We follow this convention |
451 We follow this convention |
434 in this thesis. |
452 in this thesis. |
435 In the future, we aim to support all the popular features of regexes, |
453 In the future, we aim to support all the popular features of regexes, |
436 but for this work we mainly look at regular expressions. |
454 but for this work we mainly look at basic regular expressions |
|
455 and bounded repetitions. |
437 |
456 |
438 |
457 |
439 |
458 |
440 %Most modern regex libraries |
459 %Most modern regex libraries |
441 %the so-called PCRE standard (Peral Compatible Regular Expressions) |
460 %the so-called PCRE standard (Peral Compatible Regular Expressions) |
442 %has the back-references |
461 %has the back-references |
443 Regexes come with a lot of constructs |
462 Regexes come with a number of constructs |
444 beyond the basic ones |
|
445 that make it more convenient for |
463 that make it more convenient for |
446 programmers to write regular expressions. |
464 programmers to write regular expressions. |
447 Depending on the types of these constructs |
465 Depending on the types of constructs |
448 the task of matching and lexing with them |
466 the task of matching and lexing with them |
449 will have different levels of complexity increase. |
467 will have different levels of complexity. |
450 Some of those constructs are syntactic sugars that are |
468 Some of those constructs are just syntactic sugars that are |
451 simply short hand notations |
469 simply short hand notations |
452 that save the programmers a few keystrokes. |
470 that save the programmers a few keystrokes. |
453 These will not cause trouble for regex libraries. |
471 These will not cause problems for regex libraries. |
454 |
|
455 \noindent |
|
456 For example the |
472 For example the |
457 non-binary alternative involving three or more choices: |
473 non-binary alternative involving three or more choices just means: |
458 \[ |
474 \[ |
459 (a | b | c) \stackrel{means}{=} ((a + b)+ c) |
475 (a | b | c) \stackrel{means}{=} ((a + b)+ c) |
460 \] |
476 \] |
461 the range operator $-$ used to express the alternative |
477 Similarly, the range operator used to express the alternative |
462 of all characters between its operands in a concise way: |
478 of all characters between its operands is just a concise way: |
463 \[ |
479 \[ |
464 [0~-9]\stackrel{means}{=} (0 | 1 | \ldots | 9 ) \; \text{(all number digits)} |
480 [0~-9]\stackrel{means}{=} (0 | 1 | \ldots | 9 ) \; \text{(all number digits)} |
465 \] |
481 \] |
466 and the |
482 for an alternative. The |
467 wildcard character $.$ used to refer to any single character: |
483 wildcard character $.$ is used to refer to any single character, |
468 \[ |
484 \[ |
469 . \stackrel{means}{=} [0-9a-zA-Z+-()*\&\ldots] |
485 . \stackrel{means}{=} [0-9a-zA-Z+-()*\&\ldots] |
470 \] |
486 \] |
471 |
487 except the newline. |
472 \noindent |
488 |
473 \subsection{Bounded Repetitions} |
489 \subsection{Bounded Repetitions} |
474 Some of those constructs do make the expressions much |
490 More interesting are bounded repetitions, which can |
|
491 make the regular expressions much |
475 more compact. |
492 more compact. |
476 For example, the bounded regular expressions |
493 There are |
477 (where $n$ and $m$ are constant natural numbers) |
494 $r^{\{n\}}$, $r^{\{\ldots m\}}$, $r^{\{n\ldots \}}$ and $r^{\{n\ldots m\}}$ |
478 $r^{\{n\}}$, $r^{\{\ldots m\}}$, $r^{\{n\ldots \}}$ and $r^{\{n\ldots m\}}$, |
495 (where $n$ and $m$ are constant natural numbers). |
479 defined as |
496 Like the star regular expressions, the set of strings or language |
|
497 a bounded regular expression can match |
|
498 is defined using the power operation on sets: |
480 \begin{center} |
499 \begin{center} |
481 \begin{tabular}{lcl} |
500 \begin{tabular}{lcl} |
482 $L \; r^{\{n\}}$ & $\dn$ & $(L \; r)^n$\\ |
501 $L \; r^{\{n\}}$ & $\dn$ & $(L \; r)^n$\\ |
483 $L \; r^{\{\ldots m\}}$ & $\dn$ & $\bigcup_{0 \leq i \leq m}. (L \; r)^i$\\ |
502 $L \; r^{\{\ldots m\}}$ & $\dn$ & $\bigcup_{0 \leq i \leq m}. (L \; r)^i$\\ |
484 $L \; r^{\{n\ldots \}}$ & $\dn$ & $\bigcup_{n \leq i}. (L \; r)^i$\\ |
503 $L \; r^{\{n\ldots \}}$ & $\dn$ & $\bigcup_{n \leq i}. (L \; r)^i$\\ |
485 $L \; r^{\{n \ldots m\}}$ & $\dn$ & $\bigcup_{n \leq i \leq m}. (L \; r)^i$ |
504 $L \; r^{\{n \ldots m\}}$ & $\dn$ & $\bigcup_{n \leq i \leq m}. (L \; r)^i$ |
486 \end{tabular} |
505 \end{tabular} |
487 \end{center} |
506 \end{center} |
488 are exponentially smaller compared with |
507 The attraction of bounded repetitions is that they can be |
489 their unfolded form: for example $r^{\{n\}}$ |
508 used to avoid a blow up: for example $r^{\{n\}}$ |
490 as opposed to |
509 is a shorthand for |
491 \[ |
510 \[ |
492 \underbrace{r\ldots r}_\text{n copies of r}. |
511 \underbrace{r\ldots r}_\text{n copies of r}. |
493 \] |
512 \] |
494 %Therefore, a naive algorithm that simply unfolds |
513 %Therefore, a naive algorithm that simply unfolds |
495 %them into their desugared forms |
514 %them into their desugared forms |
496 %will suffer from at least an exponential runtime increase. |
515 %will suffer from at least an exponential runtime increase. |
497 |
516 |
498 The problem here is that tools based on the classic notion of |
517 |
499 automata need to expand $r^{n}$ into $n$ connected |
518 The problem with matching |
|
519 is that tools based on the classic notion of |
|
520 automata need to expand $r^{\{n\}}$ into $n$ connected |
500 copies of the automaton for $r$. This leads to very inefficient matching |
521 copies of the automaton for $r$. This leads to very inefficient matching |
501 algorithms or algorithms that consume large amounts of memory. |
522 algorithms or algorithms that consume large amounts of memory. |
502 Implementations using $\DFA$s will |
523 Implementations using $\DFA$s will |
503 either become excruciatingly slow |
524 either become excruciatingly slow |
504 (for example Verbatim++\cite{Verbatimpp}) or get |
525 (for example Verbatim++\cite{Verbatimpp}) or get |
613 can be above 1000, and in the built-in Rust regular expression library |
622 can be above 1000, and in the built-in Rust regular expression library |
614 expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message |
623 expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message |
615 for being too big. |
624 for being too big. |
616 As Becchi and Crawley\cite{Becchi08} have pointed out, |
625 As Becchi and Crawley\cite{Becchi08} have pointed out, |
617 the reason for these restrictions |
626 the reason for these restrictions |
618 are that they simulate a non-deterministic finite |
627 is that they simulate a non-deterministic finite |
619 automata (NFA) with a breadth-first search. |
628 automata (NFA) with a breadth-first search. |
620 This way the number of active states could |
629 This way the number of active states could |
621 be equal to the counter number. |
630 be equal to the counter number. |
622 When the counters are large, |
631 When the counters are large, |
623 the memory requirement could become |
632 the memory requirement could become |
624 infeasible, and the pattern needs to be rejected straight away. |
633 infeasible, and a regex engine |
|
634 like Go will reject this pattern straight away. |
625 \begin{figure}[H] |
635 \begin{figure}[H] |
626 \begin{center} |
636 \begin{center} |
627 \begin{tikzpicture} [node distance = 2cm, on grid, auto] |
637 \begin{tikzpicture} [node distance = 2cm, on grid, auto] |
628 |
638 |
629 \node (q0) [state, initial] {$0$}; |
639 \node (q0) [state, initial] {$0$}; |
630 \node (q1) [state, right = of q0] {$1$}; |
640 \node (q1) [state, right = of q0] {$1$}; |
631 \node (q2) [state, right = of q1] {$2$}; |
641 %\node (q2) [state, right = of q1] {$2$}; |
632 \node (qdots) [right = of q2] {$\ldots$}; |
642 \node (qdots) [right = of q1] {$\ldots$}; |
633 \node (qn) [state, right = of qdots] {$n$}; |
643 \node (qn) [state, right = of qdots] {$n$}; |
634 \node (qn1) [state, right = of qn] {$n+1$}; |
644 \node (qn1) [state, right = of qn] {$n+1$}; |
635 \node (qn2) [state, right = of qn1] {$n+2$}; |
645 \node (qn2) [state, right = of qn1] {$n+2$}; |
636 \node (qn3) [state, accepting, right = of qn2] {$n+3$}; |
646 \node (qn3) [state, accepting, right = of qn2] {$n+3$}; |
637 |
647 |
638 \path [-stealth, thick] |
648 \path [-stealth, thick] |
639 (q0) edge [loop above] node {a} () |
649 (q0) edge [loop above] node {a} () |
640 (q0) edge node {a} (q1) |
650 (q0) edge node {a} (q1) |
641 (q1) edge node {.} (q2) |
651 %(q1) edge node {.} (q2) |
642 (q2) edge node {.} (qdots) |
652 (q1) edge node {.} (qdots) |
643 (qdots) edge node {.} (qn) |
653 (qdots) edge node {.} (qn) |
644 (qn) edge node {.} (qn1) |
654 (qn) edge node {.} (qn1) |
645 (qn1) edge node {b} (qn2) |
655 (qn1) edge node {b} (qn2) |
646 (qn2) edge node {$c$} (qn3); |
656 (qn2) edge node {$c$} (qn3); |
647 \end{tikzpicture} |
657 \end{tikzpicture} |