70 regular expression matches a string, but in case it matches also |
70 regular expression matches a string, but in case it matches also |
71 answers with \emph{how} it matches the string. This is important for |
71 answers with \emph{how} it matches the string. This is important for |
72 applications such as lexing (tokenising a string). The problem is to |
72 applications such as lexing (tokenising a string). The problem is to |
73 make the algorithm by Sulzmann and Lu fast on all inputs without |
73 make the algorithm by Sulzmann and Lu fast on all inputs without |
74 breaking its correctness. We have already developed some |
74 breaking its correctness. We have already developed some |
75 simplification rules for this, but have not proved yet that they |
75 simplification rules for this, but have not yet proved that they |
76 preserve the correctness of the algorithm. We also have not yet |
76 preserve the correctness of the algorithm. We also have not yet |
77 looked at extended regular expressions, such as bounded repetitions, |
77 looked at extended regular expressions, such as bounded repetitions, |
78 negation and back-references. |
78 negation and back-references. |
79 \end{abstract} |
79 \end{abstract} |
80 |
80 |
179 them---\emph{evil regular expressions}. In empiric work, Davis et al |
179 them---\emph{evil regular expressions}. In empiric work, Davis et al |
180 report that they have found thousands of such evil regular expressions |
180 report that they have found thousands of such evil regular expressions |
181 in the JavaScript and Python ecosystems \cite{Davis18}. |
181 in the JavaScript and Python ecosystems \cite{Davis18}. |
182 |
182 |
183 This exponential blowup in matching algorithms sometimes causes |
183 This exponential blowup in matching algorithms sometimes causes |
184 considerable grief in real life: for example on 20 July 2016 one evil |
184 considerable disturbance in real life: for example on 20 July 2016 one evil |
185 regular expression brought the webpage |
185 regular expression brought the webpage |
186 \href{http://stackexchange.com}{Stack Exchange} to its |
186 \href{http://stackexchange.com}{Stack Exchange} to its |
187 knees.\footnote{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016} |
187 knees.\footnote{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016} |
188 In this instance, a regular expression intended to just trim white |
188 In this instance, a regular expression intended to just trim white |
189 spaces from the beginning and the end of a line actually consumed |
189 spaces from the beginning and the end of a line actually consumed |
405 parse trees: $\Seq\,v_1\, v_2$ encodes how the string $|v_1| @ |v_2|$ |
405 parse trees: $\Seq\,v_1\, v_2$ encodes how the string $|v_1| @ |v_2|$ |
406 matches the regex $r_1 \cdot r_2$: $r_1$ matches the substring $|v_1|$ and, |
406 matches the regex $r_1 \cdot r_2$: $r_1$ matches the substring $|v_1|$ and, |
407 respectively, $r_2$ matches the substring $|v_2|$. Exactly how these two are matched |
407 respectively, $r_2$ matches the substring $|v_2|$. Exactly how these two are matched |
408 is contained in the sub-structure of $v_1$ and $v_2$. |
408 is contained in the sub-structure of $v_1$ and $v_2$. |
409 |
409 |
410 To give a concrete example of how value work, consider the string $xy$ |
410 To give a concrete example of how value works, consider the string $xy$ |
411 and the regular expression $(x + (y + xy))^*$. We can view this regular |
411 and the regular expression $(x + (y + xy))^*$. We can view this regular |
412 expression as a tree and if the string $xy$ is matched by two Star |
412 expression as a tree and if the string $xy$ is matched by two Star |
413 ``iterations'', then the $x$ is matched by the left-most alternative in |
413 ``iterations'', then the $x$ is matched by the left-most alternative in |
414 this tree and the $y$ by the right-left alternative. This suggests to |
414 this tree and the $y$ by the right-left alternative. This suggests to |
415 record this matching as |
415 record this matching as |
438 algorithm by a second phase (the first phase being building successive |
438 algorithm by a second phase (the first phase being building successive |
439 derivatives---see \eqref{graph:*}). In this second phase, a POSIX value |
439 derivatives---see \eqref{graph:*}). In this second phase, a POSIX value |
440 is generated assuming the regular expression matches the string. |
440 is generated assuming the regular expression matches the string. |
441 Pictorially, the algorithm is as follows: |
441 Pictorially, the algorithm is as follows: |
442 |
442 |
443 \begin{center} |
443 \begin{equation}\label{graph:2} |
444 \begin{tikzcd} |
444 \begin{tikzcd} |
445 r_0 \arrow[r, "\backslash c_0"] \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\ |
445 r_0 \arrow[r, "\backslash c_0"] \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\ |
446 v_0 & v_1 \arrow[l,"inj_{r_0} c_0"] & v_2 \arrow[l, "inj_{r_1} c_1"] & v_n \arrow[l, dashed] |
446 v_0 & v_1 \arrow[l,"inj_{r_0} c_0"] & v_2 \arrow[l, "inj_{r_1} c_1"] & v_n \arrow[l, dashed] |
447 \end{tikzcd} |
447 \end{tikzcd} |
448 \end{center} |
448 \end{equation} |
449 |
449 |
450 \noindent |
450 \noindent |
451 For the convenience, we shall employ the following notations: the regular expression we |
451 For convenience, we shall employ the following notations: the regular expression we |
452 start with is $r_0$, and the given string $s$ is composed of characters $c_0 c_1 |
452 start with is $r_0$, and the given string $s$ is composed of characters $c_0 c_1 |
453 \ldots c_n$. First, we build the derivatives $r_1$, $r_2$, \ldots according to |
453 \ldots c_n$. First, we build the derivatives $r_1$, $r_2$, \ldots according to |
454 the characters $c_0$, $c_1$,\ldots until we exhaust the string and |
454 the characters $c_0$, $c_1$,\ldots until we exhaust the string and |
455 obtain at the derivative $r_n$. We test whether this derivative is |
455 obtain at the derivative $r_n$. We test whether this derivative is |
456 $\textit{nullable}$ or not. If not, we know the string does not match |
456 $\textit{nullable}$ or not. If not, we know the string does not match |
471 $\mkeps(r_1\cdot r_2)$ & $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\ |
471 $\mkeps(r_1\cdot r_2)$ & $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\ |
472 $mkeps(r^*)$ & $\dn$ & $\Stars\,[]$ |
472 $mkeps(r^*)$ & $\dn$ & $\Stars\,[]$ |
473 \end{tabular} |
473 \end{tabular} |
474 \end{center} |
474 \end{center} |
475 |
475 |
476 After this, we inject back the characters one by one in order to build |
476 |
|
477 \noindent There are no cases for $\ZERO$ and $c$, since |
|
478 these regular expression cannot match the empty string. Note |
|
479 also that in case of alternatives we give preference to the |
|
480 regular expression on the left-hand side. This will become |
|
481 important later on. |
|
482 |
|
483 After this, we inject back the characters one by one in order to build |
477 the parse tree $v_i$ for how the regex $r_i$ matches the string |
484 the parse tree $v_i$ for how the regex $r_i$ matches the string |
478 $s_i$ ($s_i = c_i \ldots c_n$ ) from the previous parse tree $v_{i+1}$. After injecting back $n$ characters, we |
485 $s_i$ ($s_i = c_i \ldots c_n$ ) from the previous parse tree $v_{i+1}$. After injecting back $n$ characters, we |
479 get the parse tree for how $r_0$ matches $s$, exactly as we wanted. |
486 get the parse tree for how $r_0$ matches $s$, exactly as we wanted. |
|
487 |
|
488 We need a function that |
|
489 reverses the "chopping off" for values which we did in the |
|
490 first phase for derivatives. The corresponding function is |
|
491 called $\textit{inj}$ (short for injection). This function takes three |
|
492 arguments: the first one is a regular expression $\textit{r_{i-1}}$ |
|
493 before the character is chopped off, the second is $\textit{c_{i-1}}$, |
|
494 the character |
|
495 we |
|
496 want to inject and the third argument is the value $textit{v_i}$, where we |
|
497 will inject the character(it corresponds to the regular expression |
|
498 after the character |
|
499 has been chopped off). The result of this function is a |
|
500 new value. The definition of $\textit{inj}$ is as follows: |
|
501 |
|
502 \begin{center} |
|
503 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l} |
|
504 $\textit{inj}\,(c)\,c\,Empty$ & $\dn$ & $Char\,c$\\ |
|
505 $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\ |
|
506 $\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\ |
|
507 $\textit{inj}\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$ & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\ |
|
508 $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$ & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\ |
|
509 $\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$ & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\ |
|
510 $\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$ & $\dn$ & $Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\ |
|
511 \end{tabular} |
|
512 \end{center} |
|
513 |
|
514 \noindent This definition is by recursion on the regular |
|
515 expressions' and values' shapes. |
|
516 To understands this definition in a more vivid way, the reader |
|
517 might imagine this: when we do the derivative on regular expression |
|
518 $r_{i-1}$, we chop off a character from $r_{i-1}$ to form $r_i$. |
|
519 This leaves a "hole" on $r_i$. In order to calculate a value for |
|
520 $r_{i-1}$, we need to locate where that hole is. We can find this location |
|
521 informtaion by comparing $r_{i-1}$ and $v_i$. |
|
522 For instance, if $r_{i-1}$ is of shape $r_a \cdot r_b$, and $v_i$ |
|
523 is of shape $\textit{Left(Seq(v_a, v_b))}$, we know immediately that |
|
524 \[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b \,+\, r_b\backslash c\], |
|
525 otherwise if $r_a$ is not nullable, |
|
526 \[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b\], |
|
527 the value $v_i$ should be of shape $Seq(\ldots)$, contradicting the fact that |
|
528 $v_i$ is actually $Left(\ldots)$. Furthermore, since $v_i$ is of shape |
|
529 $Left(\ldots)$ instead of $Right(\ldots)$, we know that the left |
|
530 branch is taken instead of the right one. We have therefore found out |
|
531 that the hole will be on $r_a$. So we recursively call $\textit{inj} |
|
532 r_a c v_a$ to fill that hole in $v_a$. After injection, the value |
|
533 $v_i$ for $r_i = r_a \cdot \rb$ should be $\textit{Seq}(\textit{inj} |
|
534 r_a c v_a, v_b)$. |
|
535 |
|
536 To understand what is going on, it might be best to do some |
|
537 example calculations and compare them with Figure~\eqref{graph:2}. |
|
538 |
|
539 |
|
540 |
480 A correctness proof using induction can be routinely established. |
541 A correctness proof using induction can be routinely established. |
481 |
542 |
482 It is instructive to see how this algorithm works by a little example. |
543 It is instructive to see how this algorithm works by a little example. |
483 Suppose we have a regular expression $(a+b+ab+c+abc)^*$ and we want to |
544 Suppose we have a regular expression $(a+b+ab+c+abc)^*$ and we want to |
484 match it against the string $abc$(when $abc$ is written as a regular |
545 match it against the string $abc$(when $abc$ is written as a regular |