419 but they occur actually often enough that they have a |
419 but they occur actually often enough that they have a |
420 name: Regular-Expression-Denial-Of-Service (ReDoS) |
420 name: Regular-Expression-Denial-Of-Service (ReDoS) |
421 attacks. |
421 attacks. |
422 Davis et al. \cite{Davis18} detected more |
422 Davis et al. \cite{Davis18} detected more |
423 than 1000 evil regular expressions |
423 than 1000 evil regular expressions |
424 in Node.js, Python core libraries, npm and in pypi. |
424 in Node.js, Python core libraries, npm and pypi. |
425 They therefore concluded that evil regular expressions |
425 They therefore concluded that evil regular expressions |
426 are real problems rather than just "a parlour trick". |
426 are real problems rather than just "a parlour trick". |
427 |
427 |
428 This work aims to address this issue |
428 This work aims to address this issue |
429 with the help of formal proofs. |
429 with the help of formal proofs. |
430 We describe a lexing algorithm based |
430 We describe a lexing algorithm based |
431 on Brzozowski derivatives with verified correctness (in |
431 on Brzozowski derivatives with verified correctness |
432 Isabelle/HOL) |
432 and a finiteness property for the size of derivatives |
433 and a finiteness property for the size of derivatives. |
433 (which are all done in Isabelle/HOL). |
434 Such properties %guarantee the absence of |
434 Such properties %guarantee the absence of |
435 are an important step in preventing |
435 are an important step in preventing |
436 catastrophic backtracking once and for all. |
436 catastrophic backtracking once and for all. |
437 We will give more details in the next sections |
437 We will give more details in the next sections |
438 on (i) why the slow cases in graph \ref{fig:aStarStarb} |
438 on (i) why the slow cases in graph \ref{fig:aStarStarb} |
441 approach based on Brzozowski derivatives and formal proofs. |
441 approach based on Brzozowski derivatives and formal proofs. |
442 |
442 |
443 |
443 |
444 \section{Preliminaries}%Regex, and the Problems with Regex Matchers} |
444 \section{Preliminaries}%Regex, and the Problems with Regex Matchers} |
445 Regular expressions and regular expression matchers |
445 Regular expressions and regular expression matchers |
446 have of course been studied for many, many years. |
446 have clearly been studied for many, many years. |
447 Theoretical results in automata theory state |
447 Theoretical results in automata theory state |
448 that basic regular expression matching should be linear |
448 that basic regular expression matching should be linear |
449 w.r.t the input. |
449 w.r.t the input. |
450 This assumes that the regular expression |
450 This assumes that the regular expression |
451 $r$ was pre-processed and turned into a |
451 $r$ was pre-processed and turned into a |
452 deterministic finite automaton (DFA) before matching~\cite{Sakarovitch2009}. |
452 deterministic finite automaton (DFA) before matching \cite{Sakarovitch2009}. |
453 By basic we mean textbook definitions such as the one |
453 By basic we mean textbook definitions such as the one |
454 below, involving only regular expressions for characters, alternatives, |
454 below, involving only regular expressions for characters, alternatives, |
455 sequences, and Kleene stars: |
455 sequences, and Kleene stars: |
456 \[ |
456 \[ |
457 r ::= c | r_1 + r_2 | r_1 \cdot r_2 | r^* |
457 r ::= c | r_1 + r_2 | r_1 \cdot r_2 | r^* |
458 \] |
458 \] |
459 Modern regular expression matchers used by programmers, |
459 Modern regular expression matchers used by programmers, |
460 however, |
460 however, |
461 support much richer constructs, such as bounded repetitions |
461 support much richer constructs, such as bounded repetitions, |
|
462 negations, |
462 and back-references. |
463 and back-references. |
463 To differentiate, we use the word \emph{regex} to refer |
464 To differentiate, we use the word \emph{regex} to refer |
464 to those expressions with richer constructs while reserving the |
465 to those expressions with richer constructs while reserving the |
465 term \emph{regular expression} |
466 term \emph{regular expression} |
466 for the more traditional meaning in formal languages theory. |
467 for the more traditional meaning in formal languages theory. |
479 that make it more convenient for |
480 that make it more convenient for |
480 programmers to write regular expressions. |
481 programmers to write regular expressions. |
481 Depending on the types of constructs |
482 Depending on the types of constructs |
482 the task of matching and lexing with them |
483 the task of matching and lexing with them |
483 will have different levels of complexity. |
484 will have different levels of complexity. |
484 Some of those constructs are just syntactic sugars that are |
485 Some of those constructs are syntactic sugars that are |
485 simply short hand notations |
486 simply short hand notations |
486 that save the programmers a few keystrokes. |
487 that save the programmers a few keystrokes. |
487 These will not cause problems for regex libraries. |
488 These will not cause problems for regex libraries. |
488 For example the |
489 For example the |
489 non-binary alternative involving three or more choices just means: |
490 non-binary alternative involving three or more choices just means: |
490 \[ |
491 \[ |
491 (a | b | c) \stackrel{means}{=} ((a + b)+ c) |
492 (a | b | c) \stackrel{means}{=} ((a + b)+ c) |
492 \] |
493 \] |
493 Similarly, the range operator used to express the alternative |
494 Similarly, the range operator |
494 of all characters between its operands is just a concise way: |
495 %used to express the alternative |
|
496 %of all characters between its operands, |
|
497 is just a concise way |
|
498 of expressing an alternative of consecutive characters: |
495 \[ |
499 \[ |
496 [0~-9]\stackrel{means}{=} (0 | 1 | \ldots | 9 ) \; \text{(all number digits)} |
500 [0~-9]\stackrel{means}{=} (0 | 1 | \ldots | 9 ) |
497 \] |
501 \] |
498 for an alternative. The |
502 for an alternative. The |
499 wildcard character $.$ is used to refer to any single character, |
503 wildcard character '$.$' is used to refer to any single character, |
500 \[ |
504 \[ |
501 . \stackrel{means}{=} [0-9a-zA-Z+-()*\&\ldots] |
505 . \stackrel{means}{=} [0-9a-zA-Z+-()*\&\ldots] |
502 \] |
506 \] |
503 except the newline. |
507 except the newline. |
504 |
508 |
505 \subsection{Bounded Repetitions} |
509 \subsection{Bounded Repetitions} |
506 More interesting are bounded repetitions, which can |
510 More interesting are bounded repetitions, which can |
507 make the regular expressions much |
511 make the regular expressions much |
508 more compact. |
512 more compact. |
509 There are |
513 Normally there are four kinds of bounded repetitions: |
510 $r^{\{n\}}$, $r^{\{\ldots m\}}$, $r^{\{n\ldots \}}$ and $r^{\{n\ldots m\}}$ |
514 $r^{\{n\}}$, $r^{\{\ldots m\}}$, $r^{\{n\ldots \}}$ and $r^{\{n\ldots m\}}$ |
511 (where $n$ and $m$ are constant natural numbers). |
515 (where $n$ and $m$ are constant natural numbers). |
512 Like the star regular expressions, the set of strings or language |
516 Like the star regular expressions, the set of strings or language |
513 a bounded regular expression can match |
517 a bounded regular expression can match |
514 is defined using the power operation on sets: |
518 is defined using the power operation on sets: |
519 $L \; r^{\{n\ldots \}}$ & $\dn$ & $\bigcup_{n \leq i}. (L \; r)^i$\\ |
523 $L \; r^{\{n\ldots \}}$ & $\dn$ & $\bigcup_{n \leq i}. (L \; r)^i$\\ |
520 $L \; r^{\{n \ldots m\}}$ & $\dn$ & $\bigcup_{n \leq i \leq m}. (L \; r)^i$ |
524 $L \; r^{\{n \ldots m\}}$ & $\dn$ & $\bigcup_{n \leq i \leq m}. (L \; r)^i$ |
521 \end{tabular} |
525 \end{tabular} |
522 \end{center} |
526 \end{center} |
523 The attraction of bounded repetitions is that they can be |
527 The attraction of bounded repetitions is that they can be |
524 used to avoid a blow up: for example $r^{\{n\}}$ |
528 used to avoid a size blow up: for example $r^{\{n\}}$ |
525 is a shorthand for |
529 is a shorthand for |
|
530 the much longer regular expression: |
526 \[ |
531 \[ |
527 \underbrace{r\ldots r}_\text{n copies of r}. |
532 \underbrace{r\ldots r}_\text{n copies of r}. |
528 \] |
533 \] |
529 %Therefore, a naive algorithm that simply unfolds |
534 %Therefore, a naive algorithm that simply unfolds |
530 %them into their desugared forms |
535 %them into their desugared forms |
531 %will suffer from at least an exponential runtime increase. |
536 %will suffer from at least an exponential runtime increase. |
532 |
537 |
533 |
538 |
534 The problem with matching |
539 The problem with matching |
|
540 such bounded repetitions |
535 is that tools based on the classic notion of |
541 is that tools based on the classic notion of |
536 automata need to expand $r^{\{n\}}$ into $n$ connected |
542 automata need to expand $r^{\{n\}}$ into $n$ connected |
537 copies of the automaton for $r$. This leads to very inefficient matching |
543 copies of the automaton for $r$. This leads to very inefficient matching |
538 algorithms or algorithms that consume large amounts of memory. |
544 algorithms or algorithms that consume large amounts of memory. |
539 Implementations using $\DFA$s will |
545 Implementations using $\DFA$s will |
|
546 in such situations |
540 either become excruciatingly slow |
547 either become excruciatingly slow |
541 (for example Verbatim++~\cite{Verbatimpp}) or get |
548 (for example Verbatim++ \cite{Verbatimpp}) or run |
542 out of memory errors (for example $\mathit{LEX}$ and |
549 out of memory (for example $\mathit{LEX}$ and |
543 $\mathit{JFLEX}$\footnote{which are lexer generators |
550 $\mathit{JFLEX}$\footnote{LEX and JFLEX are lexer generators |
544 in C and JAVA that generate $\mathit{DFA}$-based |
551 in C and JAVA that generate $\mathit{DFA}$-based |
545 lexers. The user provides a set of regular expressions |
552 lexers. The user provides a set of regular expressions |
546 and configurations to them, and then |
553 and configurations, and then |
547 gets an output program encoding a minimized $\mathit{DFA}$ |
554 gets an output program encoding a minimized $\mathit{DFA}$ |
548 that can be compiled and run. |
555 that can be compiled and run. |
549 When given the above countdown regular expression, |
556 When given the above countdown regular expression, |
550 a small $n$ (a few dozen) would result in a |
557 a small $n$ (say 20) would result in a program representing a |
551 determinised automata |
558 DFA |
552 with millions of states.}) for large counters. |
559 with millions of states.}) for large counters. |
553 A classic example for this phenomenon is the regular expression $(a+b)^* a (a+b)^{n}$ |
560 A classic example for this phenomenon is the regular expression $(a+b)^* a (a+b)^{n}$ |
554 where the minimal DFA requires at least $2^{n+1}$ states. |
561 where the minimal DFA requires at least $2^{n+1}$ states. |
555 For example, when $n$ is equal to 2, |
562 For example, when $n$ is equal to 2, |
556 The corresponding $\mathit{NFA}$ looks like: |
563 the corresponding $\mathit{NFA}$ looks like: |
|
564 \vspace{6mm} |
557 \begin{center} |
565 \begin{center} |
558 \begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto] |
566 \begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto] |
559 \node[state,initial] (q_0) {$q_0$}; |
567 \node[state,initial] (q_0) {$q_0$}; |
560 \node[state, red] (q_1) [right=of q_0] {$q_1$}; |
568 \node[state, red] (q_1) [right=of q_0] {$q_1$}; |
561 \node[state, red] (q_2) [right=of q_1] {$q_2$}; |
569 \node[state, red] (q_2) [right=of q_1] {$q_2$}; |
565 edge [loop below] node {a,b} () |
573 edge [loop below] node {a,b} () |
566 (q_1) edge node {a,b} (q_2) |
574 (q_1) edge node {a,b} (q_2) |
567 (q_2) edge node {a,b} (q_3); |
575 (q_2) edge node {a,b} (q_3); |
568 \end{tikzpicture} |
576 \end{tikzpicture} |
569 \end{center} |
577 \end{center} |
570 when turned into a DFA by the subset construction |
578 and when turned into a DFA by the subset construction |
571 requires at least $2^3$ states.\footnote{The |
579 requires at least $2^3$ states.\footnote{The |
572 red states are "countdown states" which counts down |
580 red states are "countdown states" which count down |
573 the number of characters needed in addition to the current |
581 the number of characters needed in addition to the current |
574 string to make a successful match. |
582 string to make a successful match. |
575 For example, state $q_1$ indicates a match that has |
583 For example, state $q_1$ indicates a match that has |
576 gone past the $(a|b)^*$ part of $(a|b)^*a(a|b)^{\{2\}}$, |
584 gone past the $(a|b)^*$ part of $(a|b)^*a(a|b)^{\{2\}}$, |
577 and just consumed the "delimiter" $a$ in the middle, and |
585 and just consumed the "delimiter" $a$ in the middle, and |
578 need to match 2 more iterations of $(a|b)$ to complete. |
586 needs to match 2 more iterations of $(a|b)$ to complete. |
579 State $q_2$ on the other hand, can be viewed as a state |
587 State $q_2$ on the other hand, can be viewed as a state |
580 after $q_1$ has consumed 1 character, and just waits |
588 after $q_1$ has consumed 1 character, and just waits |
581 for 1 more character to complete. |
589 for 1 more character to complete. |
582 $q_3$ is the last state, requiring 0 more character and is accepting. |
590 The state $q_3$ is the last (accepting) state, requiring 0 |
|
591 more characters. |
583 Depending on the suffix of the |
592 Depending on the suffix of the |
584 input string up to the current read location, |
593 input string up to the current read location, |
585 the states $q_1$ and $q_2$, $q_3$ |
594 the states $q_1$ and $q_2$, $q_3$ |
586 may or may |
595 may or may |
587 not be active, independent from each other. |
596 not be active. |
588 A $\mathit{DFA}$ for such an $\mathit{NFA}$ would |
597 A $\mathit{DFA}$ for such an $\mathit{NFA}$ would |
589 contain at least $2^3$ non-equivalent states that cannot be merged, |
598 contain at least $2^3$ non-equivalent states that cannot be merged, |
590 because the subset construction during determinisation will generate |
599 because the subset construction during determinisation will generate |
591 all the elements in the power set $\mathit{Pow}\{q_1, q_2, q_3\}$. |
600 all the elements in the power set $\mathit{Pow}\{q_1, q_2, q_3\}$. |
592 Generalizing this to regular expressions with larger |
601 Generalizing this to regular expressions with larger |
593 bounded repetitions number, we have that |
602 bounded repetitions number, we have that |
594 regexes shaped like $r^*ar^{\{n\}}$ when converted to $\mathit{DFA}$s |
603 regexes shaped like $r^*ar^{\{n\}}$ when converted to $\mathit{DFA}$s |
595 would require at least $2^{n+1}$ states, if $r$ itself contains |
604 would require at least $2^{n+1}$ states, if $r$ itself contains |
596 more than 1 string. |
605 more than 1 string. |
597 This is to represent all different |
606 This is to represent all different |
598 scenarios which "countdown" states are active.} |
607 scenarios in which "countdown" states are active.} |
599 |
608 |
600 |
609 |
601 Bounded repetitions are very important because they |
610 Bounded repetitions are important because they |
602 tend to occur a lot in practical use, |
611 tend to occur frequently in practical use, |
603 for example in the regex library RegExLib, |
612 for example in the regex library RegExLib, in |
604 the rules library of Snort \cite{Snort1999}\footnote{ |
613 the rules library of Snort \cite{Snort1999}\footnote{ |
605 Snort is a network intrusion detection (NID) tool |
614 Snort is a network intrusion detection (NID) tool |
606 for monitoring network traffic. |
615 for monitoring network traffic. |
607 The network security community curates a list |
616 The network security community curates a list |
608 of malicious patterns written as regexes, |
617 of malicious patterns written as regexes, |
613 According to Bj\"{o}rklund et al \cite{xml2015}, |
622 According to Bj\"{o}rklund et al \cite{xml2015}, |
614 more than half of the |
623 more than half of the |
615 XSDs they found on the Maven.org central repository |
624 XSDs they found on the Maven.org central repository |
616 have bounded regular expressions in them. |
625 have bounded regular expressions in them. |
617 Often the counters are quite large, with the largest being |
626 Often the counters are quite large, with the largest being |
618 approximately up to ten million. |
627 close to ten million. |
619 An example XSD they gave |
628 A smaller sample XSD they gave |
620 is: |
629 is: |
621 \begin{verbatim} |
630 \begin{verbatim} |
622 <sequence minOccurs="0" maxOccurs="65535"> |
631 <sequence minOccurs="0" maxOccurs="65535"> |
623 <element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/> |
632 <element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/> |
624 <element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/> |
633 <element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/> |
625 </sequence> |
634 </sequence> |
626 \end{verbatim} |
635 \end{verbatim} |
627 This can be seen as the expression |
636 This can be seen as the regex |
628 $(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves |
637 $(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves |
629 regular expressions |
638 regular expressions |
630 satisfying certain constraints (such as |
639 satisfying certain constraints (such as |
631 satisfying the floating point number format). |
640 satisfying the floating point number format). |
632 It is therefore quite unsatisfying that |
641 It is therefore quite unsatisfying that |
636 For example, in the regular expression matching library in the Go |
645 For example, in the regular expression matching library in the Go |
637 language the regular expression $a^{1001}$ is not permitted, because no counter |
646 language the regular expression $a^{1001}$ is not permitted, because no counter |
638 can be above 1000, and in the built-in Rust regular expression library |
647 can be above 1000, and in the built-in Rust regular expression library |
639 expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message |
648 expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message |
640 for being too big. |
649 for being too big. |
641 As Becchi and Crawley~\cite{Becchi08} have pointed out, |
650 As Becchi and Crawley \cite{Becchi08} have pointed out, |
642 the reason for these restrictions |
651 the reason for these restrictions |
643 is that they simulate a non-deterministic finite |
652 is that they simulate a non-deterministic finite |
644 automata (NFA) with a breadth-first search. |
653 automata (NFA) with a breadth-first search. |
645 This way the number of active states could |
654 This way the number of active states could |
646 be equal to the counter number. |
655 be equal to the counter number. |
647 When the counters are large, |
656 When the counters are large, |
648 the memory requirement could become |
657 the memory requirement could become |
649 infeasible, and a regex engine |
658 infeasible, and a regex engine |
650 like Go will reject this pattern straight away. |
659 like in Go will reject this pattern straight away. |
651 \begin{figure}[H] |
660 \begin{figure}[H] |
652 \begin{center} |
661 \begin{center} |
653 \begin{tikzpicture} [node distance = 2cm, on grid, auto] |
662 \begin{tikzpicture} [node distance = 2cm, on grid, auto] |
654 |
663 |
655 \node (q0) [state, initial] {$0$}; |
664 \node (q0) [state, initial] {$0$}; |
700 counters \cite{Turo_ov__2020}. |
709 counters \cite{Turo_ov__2020}. |
701 These solutions can be quite efficient, |
710 These solutions can be quite efficient, |
702 with the ability to process |
711 with the ability to process |
703 gigabytes of strings input per second |
712 gigabytes of strings input per second |
704 even with large counters \cite{Becchi08}. |
713 even with large counters \cite{Becchi08}. |
705 But formal reasoning about these automata especially in Isabelle |
714 These practical solutions do not come with |
706 can be challenging |
715 formal guarantees, and as pointed out by |
707 and un-intuitive. |
716 Kuklewicz \cite{KuklewiczHaskell}, can be error-prone. |
708 Therefore, we take correctness and runtime claims made about these solutions |
717 %But formal reasoning about these automata especially in Isabelle |
709 with a grain of salt. |
718 %can be challenging |
|
719 %and un-intuitive. |
|
720 %Therefore, we take correctness and runtime claims made about these solutions |
|
721 %with a grain of salt. |
710 |
722 |
711 In the work reported in \cite{FoSSaCS2023} and here, |
723 In the work reported in \cite{FoSSaCS2023} and here, |
712 we add better support using derivatives |
724 we add better support using derivatives |
713 for bounded regular expressions $r^{\{n\}}$. |
725 for bounded regular expression $r^{\{n\}}$. |
714 The results |
726 Our results |
715 extend straightforwardly to |
727 extend straightforwardly to |
716 repetitions with an interval such as |
728 repetitions with intervals such as |
717 $r^{\{n\ldots m\}}$. |
729 $r^{\{n\ldots m\}}$. |
718 The merit of Brzozowski derivatives (more on this later) |
730 The merit of Brzozowski derivatives (more on this later) |
719 on this problem is that |
731 on this problem is that |
720 it can be naturally extended to support bounded repetitions. |
732 it can be naturally extended to support bounded repetitions. |
721 Moreover these extensions are still made up of only |
733 Moreover these extensions are still made up of only small |
722 inductive datatypes and recursive functions, |
734 inductive datatypes and recursive functions, |
723 making it handy to deal with them in theorem provers. |
735 making it handy to deal with them in theorem provers. |
724 %The point here is that Brzozowski derivatives and the algorithms by Sulzmann and Lu can be |
736 %The point here is that Brzozowski derivatives and the algorithms by Sulzmann and Lu can be |
725 %straightforwardly extended to deal with bounded regular expressions |
737 %straightforwardly extended to deal with bounded regular expressions |
726 %and moreover the resulting code still consists of only simple |
738 %and moreover the resulting code still consists of only simple |