179 them---\emph{evil regular expressions}. In empiric work, Davis et al |
182 them---\emph{evil regular expressions}. In empiric work, Davis et al |
180 report that they have found thousands of such evil regular expressions |
183 report that they have found thousands of such evil regular expressions |
181 in the JavaScript and Python ecosystems \cite{Davis18}. |
184 in the JavaScript and Python ecosystems \cite{Davis18}. |
182 |
185 |
183 This exponential blowup in matching algorithms sometimes causes |
186 This exponential blowup in matching algorithms sometimes causes |
184 considerable disturbance in real life: for example on 20 July 2016 one evil |
187 considerable grief in real life: for example on 20 July 2016 one evil |
185 regular expression brought the webpage |
188 regular expression brought the webpage |
186 \href{http://stackexchange.com}{Stack Exchange} to its |
189 \href{http://stackexchange.com}{Stack Exchange} to its |
187 knees.\footnote{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016} |
190 knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}} |
188 In this instance, a regular expression intended to just trim white |
191 In this instance, a regular expression intended to just trim white |
189 spaces from the beginning and the end of a line actually consumed |
192 spaces from the beginning and the end of a line actually consumed |
190 massive amounts of CPU-resources and because of this the web servers |
193 massive amounts of CPU-resources and because of this the web servers |
191 ground to a halt. This happened when a post with 20,000 white spaces was |
194 ground to a halt. This happened when a post with 20,000 white spaces was |
192 submitted, but importantly the white spaces were neither at the |
195 submitted, but importantly the white spaces were neither at the |
394 \end{tabular} |
398 \end{tabular} |
395 \end{center} |
399 \end{center} |
396 |
400 |
397 \noindent |
401 \noindent |
398 There is no value corresponding to $\ZERO$; $\Empty$ corresponds to |
402 There is no value corresponding to $\ZERO$; $\Empty$ corresponds to |
399 $\ONE$; $\Seq$ to the sequence regular expression and so on. The idea of |
403 $\ONE$; $\Char$ to the character regular expression; $\Seq$ to the |
400 values is to encode parse trees. To see this, suppose a \emph{flatten} |
404 sequence regular expression and so on. The idea of values is to encode |
401 operation, written $|v|$, which we can use to extract the underlying |
405 parse trees. To see this, suppose a \emph{flatten} operation, written |
402 string of a value $v$. For example, $|\mathit{Seq} \, (\textit{Char x}) \, |
406 $|v|$. We use this function to extract the underlying string of a value |
403 (\textit{Char y})|$ is the string $xy$. We omit the straightforward |
407 $v$. For example, $|\mathit{Seq} \, (\textit{Char x}) \, (\textit{Char |
404 definition of flatten. Using flatten, we can describe how values encode |
408 y})|$ is the string $xy$. Using flatten, we can describe how values |
405 parse trees: $\Seq\,v_1\, v_2$ encodes how the string $|v_1| @ |v_2|$ |
409 encode parse trees: $\Seq\,v_1\, v_2$ encodes how the string $|v_1| @ |
406 matches the regex $r_1 \cdot r_2$: $r_1$ matches the substring $|v_1|$ and, |
410 |v_2|$ matches the regex $r_1 \cdot r_2$ whereby $r_1$ matches the |
407 respectively, $r_2$ matches the substring $|v_2|$. Exactly how these two are matched |
411 substring $|v_1|$ and, respectively, $r_2$ matches the substring |
408 is contained in the sub-structure of $v_1$ and $v_2$. |
412 $|v_2|$. Exactly how these two are matched is contained in the |
|
413 sub-structure of $v_1$ and $v_2$. |
409 |
414 |
410 To give a concrete example of how value works, consider the string $xy$ |
415 To give a concrete example of how value works, consider the string $xy$ |
411 and the regular expression $(x + (y + xy))^*$. We can view this regular |
416 and the regular expression $(x + (y + xy))^*$. We can view this regular |
412 expression as a tree and if the string $xy$ is matched by two Star |
417 expression as a tree and if the string $xy$ is matched by two Star |
413 ``iterations'', then the $x$ is matched by the left-most alternative in |
418 ``iterations'', then the $x$ is matched by the left-most alternative in |
479 also that in case of alternatives we give preference to the |
484 also that in case of alternatives we give preference to the |
480 regular expression on the left-hand side. This will become |
485 regular expression on the left-hand side. This will become |
481 important later on. |
486 important later on. |
482 |
487 |
483 After this, we inject back the characters one by one in order to build |
488 After this, we inject back the characters one by one in order to build |
484 the parse tree $v_i$ for how the regex $r_i$ matches the string |
489 the parse tree $v_i$ for how the regex $r_i$ matches the string $s_i$ |
485 $s_i$ ($s_i = c_i \ldots c_n$ ) from the previous parse tree $v_{i+1}$. After injecting back $n$ characters, we |
490 ($s_i = c_i \ldots c_n$ ) from the previous parse tree $v_{i+1}$. After |
486 get the parse tree for how $r_0$ matches $s$, exactly as we wanted. |
491 injecting back $n$ characters, we get the parse tree for how $r_0$ |
487 |
492 matches $s$. For this Sulzmann and Lu defined a function that reverses |
488 We need a function that |
493 the ``chopping off'' of characters during the derivative phase. The |
489 reverses the "chopping off" for values which we did in the |
494 corresponding function is called $\textit{inj}$; it takes three |
490 first phase for derivatives. The corresponding function is |
495 arguments: the first one is a regular expression ${r_{i-1}}$, before the |
491 called $\textit{inj}$ (short for injection). This function takes three |
496 character is chopped off, the second is a character ${c_{i-1}}$, the |
492 arguments: the first one is a regular expression ${r_{i-1}}$, |
497 character we want to inject and the third argument is the value |
493 before the character is chopped off, the second is ${c_{i-1}}$, |
498 $textit{v_i}$, into which one wants to inject the character (it |
494 the character |
499 corresponds to the regular expression after the character has been |
495 we |
500 chopped off). The result of this function is a new value. The definition |
496 want to inject and the third argument is the value $textit{v_i}$, where we |
501 of $\textit{inj}$ is as follows: |
497 will inject the character(it corresponds to the regular expression |
|
498 after the character |
|
499 has been chopped off). The result of this function is a |
|
500 new value. The definition of $\textit{inj}$ is as follows: |
|
501 |
502 |
502 \begin{center} |
503 \begin{center} |
503 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l} |
504 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l} |
504 $\textit{inj}\,(c)\,c\,Empty$ & $\dn$ & $Char\,c$\\ |
505 $\textit{inj}\,(c)\,c\,Empty$ & $\dn$ & $Char\,c$\\ |
505 $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\ |
506 $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\ |
509 $\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$ & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\ |
510 $\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$ & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\ |
510 $\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$ & $\dn$ & $Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\ |
511 $\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$ & $\dn$ & $Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\ |
511 \end{tabular} |
512 \end{tabular} |
512 \end{center} |
513 \end{center} |
513 |
514 |
514 \noindent This definition is by recursion on the regular |
515 \noindent This definition is by recursion on the ``shape'' of regular |
515 expressions' and values' shapes. |
516 expressions and values. To understands this definition better consider |
516 To understands this definition in a more vivid way, the reader |
517 the situation when we build the derivative on regular expression $r_{i-1}$. |
517 might imagine this: when we do the derivative on regular expression |
518 For this we chop off a character from $r_{i-1}$ to form $r_i$. This leaves a |
518 $r_{i-1}$, we chop off a character from $r_{i-1}$ to form $r_i$. |
519 ``hole'' in $r_i$. In order to calculate a value for $r_{i-1}$, we need to |
519 This leaves a "hole" on $r_i$. In order to calculate a value for |
520 locate where that hole is. We can find this location information by |
520 $r_{i-1}$, we need to locate where that hole is. We can find this location |
521 comparing $r_{i-1}$ and $v_i$. For instance, if $r_{i-1}$ is of shape |
521 informtation by comparing $r_{i-1}$ and $v_i$. |
522 $r_a \cdot r_b$, and $v_i$ is of shape $\textit{Left}(\textit{Seq}(v_a, |
522 For instance, if $r_{i-1}$ is of shape $r_a \cdot r_b$, and $v_i$ |
523 v_b))$, we know immediately that |
523 is of shape $\textit{Left}(\textit{Seq}(v_a, v_b))$, we know immediately that |
524 % |
524 \[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b \,+\, r_b\backslash c\], |
525 \[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b \,+\, r_b\backslash c,\] |
|
526 |
|
527 \noindent |
525 otherwise if $r_a$ is not nullable, |
528 otherwise if $r_a$ is not nullable, |
526 \[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b\], |
529 \[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b,\] |
527 the value $v_i$ should be of shape $Seq(\ldots)$, contradicting the fact that |
530 |
528 $v_i$ is actually $Left(\ldots)$. Furthermore, since $v_i$ is of shape |
531 \noindent |
529 $Left(\ldots)$ instead of $Right(\ldots)$, we know that the left |
532 the value $v_i$ should be of shape $\Seq(\ldots)$, contradicting the fact that |
|
533 $v_i$ is actually $\Left(\ldots)$. Furthermore, since $v_i$ is of shape |
|
534 $\Left(\ldots)$ instead of $\Right(\ldots)$, we know that the left |
530 branch is taken instead of the right one. We have therefore found out |
535 branch is taken instead of the right one. We have therefore found out |
531 that the hole will be on $r_a$. So we recursively call $\textit{inj} |
536 that the hole will be on $r_a$. So we recursively call $\inj\, |
532 r_a c v_a$ to fill that hole in $v_a$. After injection, the value |
537 r_a\,c\,v_a$ in order to fill that hole in $v_a$. After injection, the value |
533 $v_i$ for $r_i = r_a \cdot r_b$ should be $\textit{Seq}(\textit{inj} |
538 $v_i$ for $r_i = r_a \cdot r_b$ should be $\Seq\,(\inj\,r_a\,c\,v_a)\,v_b$. |
534 r_a c v_a, v_b)$. |
|
535 Other clauses can be understood in a similar way. |
539 Other clauses can be understood in a similar way. |
536 |
540 |
537 To understand what is going on, it might be best to do some |
541 Let us do some further examples involving $\inj$: Suppose we have the |
538 example calculations and compare them with Figure~\eqref{graph:2}. |
542 regular expression $((((a+b)+ab)+c)+abc)^*$ and want to match it against |
539 % A correctness proof using induction can be routinely established. |
543 the string $abc$ (when $abc$ is written as a regular expression, the most |
540 Suppose we have a regular expression $((((a+b)+ab)+c)+abc)^*$ and want to |
544 standard way of expressing it should be $a \cdot (b \cdot c)$. We omit |
541 match it against the string $abc$(when $abc$ is written as a regular |
545 the parentheses and dots here for readability). By POSIX rules the lexer |
542 expression, the most standard way of expressing it should be $a \cdot (b |
546 should go for the longest matching, i.e. it should match the string |
543 \cdot c)$. We omit the parentheses and dots here for readability). By |
547 $abc$ in one star iteration, using the longest string $abc$ in the |
544 POSIX rules the lexer should go for the longest matching, i.e. it should |
548 sub-expression $a+b+ab+c+abc$ (we use $r$ to denote this sub-expression |
545 match the string $abc$ in one star iteration, using the longest string |
549 for conciseness). Here is how the lexer achieves a parse tree for this |
546 $abc$ in the sub-expression $a+b+ab+c+abc$(we use $r$ to denote this |
550 matching. First, build successive derivatives until we exhaust the |
547 sub-expression for conciseness). Here is how the lexer achieves a parse |
551 string, as illustrated here (we simplified some regular expressions like |
548 tree for this matching. First, build successive derivatives until we |
552 $0 \cdot b$ to $0$ for conciseness; we also omit some parentheses if |
549 exhaust the string, as illustrated here(we simplified some regular |
553 they are clear from the context): |
550 expressions like $0 \cdot b$ to $0$ for conciseness. We also omit |
|
551 some parentheses if they are clear from the context.): |
|
552 %Similarly, we allow |
554 %Similarly, we allow |
553 %$\textit{ALT}$ to take a list of regular expressions as an argument |
555 %$\textit{ALT}$ to take a list of regular expressions as an argument |
554 %instead of just 2 operands to reduce the nested depth of |
556 %instead of just 2 operands to reduce the nested depth of |
555 %$\textit{ALT}$ |
557 %$\textit{ALT}$ |
556 \begin{center} |
558 \begin{center} |
557 \[ r^* \xrightarrow{\backslash a} r_1 = (1+0+1 \cdot b + 0 + 1 \cdot b \cdot c) \cdot r^* \xrightarrow{\backslash b}\] |
559 \begin{tabular}{lcl} |
558 \[r_2 = (0+0+1 \cdot 1 + 0 + 1 \cdot 1 \cdot c) \cdot r^* +(0+1+0 + 0 + 0) \cdot r^* \xrightarrow{\backslash c}\] |
560 $r^*$ & $\xrightarrow{\backslash a}$ & $r_1 = (1+0+1 \cdot b + 0 + 1 \cdot b \cdot c) \cdot r^*$\\ |
559 \[r_3 = (( 0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* + (0+0+0 + 1 + 0) \cdot r^*) +((0+1+0 + 0 + 0) \cdot r^*+(0+0+0 + 1 + 0) \cdot r^* ) |
561 & $\xrightarrow{\backslash b}$ & $r_2 = (0+0+1 \cdot 1 + 0 + 1 \cdot 1 \cdot c) \cdot r^* +(0+1+0 + 0 + 0) \cdot r^*$\\ |
560 \] |
562 & $\xrightarrow{\backslash c}$ & $r_3 = ((0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* + (0+0+0 + 1 + 0) \cdot r^*) + $\\ |
|
563 & & $\phantom{r_3 = (} ((0+1+0 + 0 + 0) \cdot r^* + (0+0+0 + 1 + 0) \cdot r^* )$ |
|
564 \end{tabular} |
561 \end{center} |
565 \end{center} |
|
566 |
|
567 \noindent |
562 Now instead when $nullable$ gives a $yes$ on $r_3$, we call $mkeps$ |
568 Now instead when $nullable$ gives a $yes$ on $r_3$, we call $mkeps$ |
563 to construct a parse tree for how $r_3$ matched the string $abc$. |
569 to construct a parse tree for how $r_3$ matched the string $abc$. |
564 $mkeps$ gives the following value $v_3$: |
570 $mkeps$ gives the following value $v_3$: |
565 \[$Left(Left(Seq(Right(Seq(Empty, Seq(Empty, Empty))), Stars [])))$\] |
571 \[$Left(Left(Seq(Right(Seq(Empty, Seq(Empty, Empty))), Stars [])))$\] |
566 The outer 2 $Left$ tells us the first nullable part hides in the term |
572 The outer 2 $Left$ tells us the first nullable part hides in the term |
610 Which one of these alternatives is chosen later does not affect their relative position because our algorithm |
616 Which one of these alternatives is chosen later does not affect their relative position because our algorithm |
611 does not change this order. If this parsing information can be determined and does |
617 does not change this order. If this parsing information can be determined and does |
612 not change because of later derivatives, |
618 not change because of later derivatives, |
613 there is no point in traversing this information twice. This leads to a new approach of lexing-- if we store the information for parse trees in the corresponding regular expression pieces, update this information when we do derivative operation on them, and collect the information when finished with derivatives and calling $mkeps$ for deciding which branch is POSIX, we can generate the parse tree in one pass, instead of doing an n-step backward transformation.This leads to Sulzmann and Lu's novel idea of using bit-codes on derivatives. |
619 there is no point in traversing this information twice. This leads to a new approach of lexing-- if we store the information for parse trees in the corresponding regular expression pieces, update this information when we do derivative operation on them, and collect the information when finished with derivatives and calling $mkeps$ for deciding which branch is POSIX, we can generate the parse tree in one pass, instead of doing an n-step backward transformation.This leads to Sulzmann and Lu's novel idea of using bit-codes on derivatives. |
614 |
620 |
615 In the next section, we shall focus on the bit-coded algorithm and the natural |
621 In the next section, we shall focus on the bit-coded algorithm and the |
616 process of simplification of regular expressions using bit-codes, which is needed in |
622 process of simplification of regular expressions. This is needed in |
617 order to obtain \emph{fast} versions of the Brzozowski's, and Sulzmann |
623 order to obtain \emph{fast} versions of the Brzozowski's, and Sulzmann |
618 and Lu's algorithms. This is where the PhD-project hopes to advance |
624 and Lu's algorithms. This is where the PhD-project aims to advance the |
619 the state-of-the-art. |
625 state-of-the-art. |
620 |
626 |
621 |
627 |
622 \section{Simplification of Regular Expressions} |
628 \section{Simplification of Regular Expressions} |
623 Using bit-codes to guide parsing is not a new idea. |
629 |
624 It was applied to context free grammars and then adapted by Henglein and Nielson for efficient regular expression parsing \cite{nielson11bcre}. Sulzmann and Lu took a step further by integrating bitcodes into derivatives. |
630 |
625 |
631 Using bit-codes to guide parsing is not a novel idea. It was applied to |
626 The argument for complicating the data structures from basic regular expressions to those with bitcodes |
632 context free grammars and then adapted by Henglein and Nielson for |
627 is that we can introduce simplification without making the algorithm crash or overly complex to reason about. |
633 efficient regular expression parsing \cite{nielson11bcre}. Sulzmann and |
628 The reason why we need simplification is due to the shortcoming of the previous algorithm. |
634 Lu took a step further by integrating bitcodes into derivatives. |
629 The main drawback of building successive derivatives according to |
635 |
630 Brzozowski's definition is that they can grow very quickly in size. |
636 The argument for complicating the data structures from basic regular |
631 This is mainly due to the fact that the derivative operation generates |
637 expressions to those with bitcodes is that we can introduce |
632 often ``useless'' $\ZERO$s and $\ONE$s in derivatives. As a result, |
638 simplification without making the algorithm crash or overly complex to |
633 if implemented naively both algorithms by Brzozowski and by Sulzmann |
639 reason about. The reason why we need simplification is due to the |
634 and Lu are excruciatingly slow. For example when starting with the |
640 shortcoming of the previous algorithm. The main drawback of building |
635 regular expression $(a + aa)^*$ and building 12 successive derivatives |
641 successive derivatives according to Brzozowski's definition is that they |
636 w.r.t.~the character $a$, one obtains a derivative regular expression |
642 can grow very quickly in size. This is mainly due to the fact that the |
637 with more than 8000 nodes (when viewed as a tree). Operations like |
643 derivative operation generates often ``useless'' $\ZERO$s and $\ONE$s in |
638 derivative and $\nullable$ need to traverse such trees and |
644 derivatives. As a result, if implemented naively both algorithms by |
639 consequently the bigger the size of the derivative the slower the |
645 Brzozowski and by Sulzmann and Lu are excruciatingly slow. For example |
640 algorithm. Fortunately, one can simplify regular expressions after |
646 when starting with the regular expression $(a + aa)^*$ and building 12 |
641 each derivative step. Various simplifications of regular expressions |
647 successive derivatives w.r.t.~the character $a$, one obtains a |
642 are possible, such as the simplifications of $\ZERO + r$, |
648 derivative regular expression with more than 8000 nodes (when viewed as |
643 $r + \ZERO$, $\ONE\cdot r$, $r \cdot \ONE$, and $r + r$ to just |
649 a tree). Operations like derivative and $\nullable$ need to traverse |
644 $r$. These simplifications do not affect the answer for whether a |
650 such trees and consequently the bigger the size of the derivative the |
645 regular expression matches a string or not, but fortunately also do |
651 slower the algorithm. Fortunately, one can simplify regular expressions |
646 not affect the POSIX strategy of how regular expressions match |
652 after each derivative step. Various simplifications of regular |
647 strings---although the latter is much harder to establish. Some |
653 expressions are possible, such as the simplifications of $\ZERO + r$, $r |
648 initial results in this regard have been obtained in |
654 + \ZERO$, $\ONE\cdot r$, $r \cdot \ONE$, and $r + r$ to just $r$. These |
649 \cite{AusafDyckhoffUrban2016}. However, what has not been achieved yet |
655 simplifications do not affect the answer for whether a regular |
650 is a very tight bound for the size. Such a tight bound is suggested by |
656 expression matches a string or not, but fortunately also do not affect |
651 work of Antimirov who proved that (partial) derivatives can be bound |
657 the POSIX strategy of how regular expressions match strings---although |
652 by the number of characters contained in the initial regular |
658 the latter is much harder to establish. Some initial results in this |
653 expression \cite{Antimirov95}. |
659 regard have been obtained in \cite{AusafDyckhoffUrban2016}. However, |
654 |
660 what has not been achieved yet is a very tight bound for the size. Such |
655 Antimirov defined the "partial derivatives" of regular expressions to be this: |
661 a tight bound is suggested by work of Antimirov who proved that |
|
662 (partial) derivatives can be bound by the number of characters contained |
|
663 in the initial regular expression \cite{Antimirov95}. |
|
664 |
|
665 Antimirov defined the \emph{partial derivatives} of regular expressions to be this: |
656 %TODO definition of partial derivatives |
666 %TODO definition of partial derivatives |
657 \begin{center} |
667 \begin{center} |
658 \begin{tabular}{lcl} |
668 \begin{tabular}{lcl} |
659 $\textit{pder} \; c \; 0$ & $\dn$ & $\emptyset$\\ |
669 $\textit{pder} \; c \; 0$ & $\dn$ & $\emptyset$\\ |
660 $\textit{pder} \; c \; 1$ & $\dn$ & $\emptyset$ \\ |
670 $\textit{pder} \; c \; 1$ & $\dn$ & $\emptyset$ \\ |