ninems/ninems.tex
changeset 59 8ff7b7508824
parent 58 f0360e17080e
child 60 c737a0259194
equal deleted inserted replaced
58:f0360e17080e 59:8ff7b7508824
    70   regular expression matches a string, but in case it matches also
    70   regular expression matches a string, but in case it matches also
    71   answers with \emph{how} it matches the string.  This is important for
    71   answers with \emph{how} it matches the string.  This is important for
    72   applications such as lexing (tokenising a string). The problem is to
    72   applications such as lexing (tokenising a string). The problem is to
    73   make the algorithm by Sulzmann and Lu fast on all inputs without
    73   make the algorithm by Sulzmann and Lu fast on all inputs without
    74   breaking its correctness. We have already developed some
    74   breaking its correctness. We have already developed some
    75   simplification rules for this, but have not proved yet that they
    75   simplification rules for this, but have not yet proved that they
    76   preserve the correctness of the algorithm. We also have not yet
    76   preserve the correctness of the algorithm. We also have not yet
    77   looked at extended regular expressions, such as bounded repetitions,
    77   looked at extended regular expressions, such as bounded repetitions,
    78   negation and back-references.
    78   negation and back-references.
    79 \end{abstract}
    79 \end{abstract}
    80 
    80 
   179 them---\emph{evil regular expressions}. In empiric work, Davis et al
   179 them---\emph{evil regular expressions}. In empiric work, Davis et al
   180 report that they have found thousands of such evil regular expressions
   180 report that they have found thousands of such evil regular expressions
   181 in the JavaScript and Python ecosystems \cite{Davis18}.
   181 in the JavaScript and Python ecosystems \cite{Davis18}.
   182 
   182 
   183 This exponential blowup in matching algorithms sometimes causes
   183 This exponential blowup in matching algorithms sometimes causes
   184 considerable grief in real life: for example on 20 July 2016 one evil
   184 considerable disturbance in real life: for example on 20 July 2016 one evil
   185 regular expression brought the webpage
   185 regular expression brought the webpage
   186 \href{http://stackexchange.com}{Stack Exchange} to its
   186 \href{http://stackexchange.com}{Stack Exchange} to its
   187 knees.\footnote{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}
   187 knees.\footnote{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}
   188 In this instance, a regular expression intended to just trim white
   188 In this instance, a regular expression intended to just trim white
   189 spaces from the beginning and the end of a line actually consumed
   189 spaces from the beginning and the end of a line actually consumed
   405 parse trees: $\Seq\,v_1\, v_2$ encodes how the string $|v_1| @ |v_2|$
   405 parse trees: $\Seq\,v_1\, v_2$ encodes how the string $|v_1| @ |v_2|$
   406 matches the regex $r_1 \cdot r_2$: $r_1$ matches the substring $|v_1|$ and,
   406 matches the regex $r_1 \cdot r_2$: $r_1$ matches the substring $|v_1|$ and,
   407 respectively, $r_2$ matches the substring $|v_2|$. Exactly how these two are matched
   407 respectively, $r_2$ matches the substring $|v_2|$. Exactly how these two are matched
   408 is contained in the sub-structure of $v_1$ and $v_2$. 
   408 is contained in the sub-structure of $v_1$ and $v_2$. 
   409 
   409 
   410  To give a concrete example of how value work, consider the string $xy$
   410  To give a concrete example of how value works, consider the string $xy$
   411 and the regular expression $(x + (y + xy))^*$. We can view this regular
   411 and the regular expression $(x + (y + xy))^*$. We can view this regular
   412 expression as a tree and if the string $xy$ is matched by two Star
   412 expression as a tree and if the string $xy$ is matched by two Star
   413 ``iterations'', then the $x$ is matched by the left-most alternative in
   413 ``iterations'', then the $x$ is matched by the left-most alternative in
   414 this tree and the $y$ by the right-left alternative. This suggests to
   414 this tree and the $y$ by the right-left alternative. This suggests to
   415 record this matching as
   415 record this matching as
   438 algorithm by a second phase (the first phase being building successive
   438 algorithm by a second phase (the first phase being building successive
   439 derivatives---see \eqref{graph:*}). In this second phase, a POSIX value 
   439 derivatives---see \eqref{graph:*}). In this second phase, a POSIX value 
   440 is generated assuming the regular expression matches  the string. 
   440 is generated assuming the regular expression matches  the string. 
   441 Pictorially, the algorithm is as follows:
   441 Pictorially, the algorithm is as follows:
   442 
   442 
   443 \begin{center}
   443 \begin{equation}\label{graph:2}
   444 \begin{tikzcd}
   444 \begin{tikzcd}
   445 r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
   445 r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
   446 v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]         
   446 v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]         
   447 \end{tikzcd}
   447 \end{tikzcd}
   448 \end{center}
   448 \end{equation}
   449 
   449 
   450 \noindent
   450 \noindent
   451 For the convenience, we shall employ the following notations: the regular expression we
   451 For convenience, we shall employ the following notations: the regular expression we
   452 start with is $r_0$, and the given string $s$ is composed of characters $c_0 c_1
   452 start with is $r_0$, and the given string $s$ is composed of characters $c_0 c_1
   453 \ldots c_n$. First, we build the derivatives $r_1$, $r_2$, \ldots  according to
   453 \ldots c_n$. First, we build the derivatives $r_1$, $r_2$, \ldots  according to
   454 the characters $c_0$, $c_1$,\ldots  until we exhaust the string and
   454 the characters $c_0$, $c_1$,\ldots  until we exhaust the string and
   455 obtain at the derivative $r_n$. We test whether this derivative is
   455 obtain at the derivative $r_n$. We test whether this derivative is
   456 $\textit{nullable}$ or not. If not, we know the string does not match
   456 $\textit{nullable}$ or not. If not, we know the string does not match
   471 			$\mkeps(r_1\cdot r_2)$ 	& $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
   471 			$\mkeps(r_1\cdot r_2)$ 	& $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
   472 			$mkeps(r^*)$	        & $\dn$ & $\Stars\,[]$
   472 			$mkeps(r^*)$	        & $\dn$ & $\Stars\,[]$
   473 		\end{tabular}
   473 		\end{tabular}
   474 	\end{center}
   474 	\end{center}
   475 
   475 
   476  After this, we inject back the characters one by one in order to build
   476 
       
   477 \noindent There are no cases for $\ZERO$ and $c$, since
       
   478 these regular expression cannot match the empty string. Note
       
   479 also that in case of alternatives we give preference to the
       
   480 regular expression on the left-hand side. This will become
       
   481 important later on.
       
   482 
       
   483 After this, we inject back the characters one by one in order to build
   477 the parse tree $v_i$ for how the regex $r_i$ matches the string
   484 the parse tree $v_i$ for how the regex $r_i$ matches the string
   478 $s_i$ ($s_i = c_i \ldots c_n$ ) from the previous parse tree $v_{i+1}$. After injecting back $n$ characters, we
   485 $s_i$ ($s_i = c_i \ldots c_n$ ) from the previous parse tree $v_{i+1}$. After injecting back $n$ characters, we
   479 get the parse tree for how $r_0$ matches $s$, exactly as we wanted.
   486 get the parse tree for how $r_0$ matches $s$, exactly as we wanted.
       
   487 
       
   488 We need a function that
       
   489 reverses the "chopping off" for values which we did in the
       
   490 first phase for derivatives. The corresponding function is
       
   491 called $\textit{inj}$ (short for injection). This function takes three
       
   492 arguments: the first one is a regular expression $\textit{r_{i-1}}$ 
       
   493 before the character is chopped off, the second is  $\textit{c_{i-1}}$,
       
   494 the character
       
   495 we
       
   496 want to inject and the third argument is the value $textit{v_i}$, where we
       
   497 will inject the character(it corresponds to the regular expression
       
   498  after the character
       
   499 has been chopped off). The result of this function is a
       
   500 new value. The definition of $\textit{inj}$ is as follows: 
       
   501 
       
   502 \begin{center}
       
   503 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
       
   504   $\textit{inj}\,(c)\,c\,Empty$            & $\dn$ & $Char\,c$\\
       
   505   $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
       
   506   $\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\
       
   507   $\textit{inj}\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
       
   508   $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
       
   509   $\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
       
   510   $\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$         & $\dn$  & $Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
       
   511 \end{tabular}
       
   512 \end{center}
       
   513 
       
   514 \noindent This definition is by recursion on the regular
       
   515 expressions' and values' shapes. 
       
   516 To understands this definition in a more vivid way, the reader 
       
   517 might imagine this: when we do the derivative on regular expression 
       
   518 $r_{i-1}$, we chop off a character from $r_{i-1}$ to form $r_i$.
       
   519 This leaves a "hole" on $r_i$. In order to calculate a value for
       
   520 $r_{i-1}$, we need to locate where that hole is. We can find this location
       
   521 informtaion by comparing $r_{i-1}$ and $v_i$.
       
   522 For instance, if $r_{i-1}$ is of shape $r_a \cdot r_b$, and $v_i$
       
   523 is of shape $\textit{Left(Seq(v_a, v_b))}$, we know immediately that 
       
   524 \[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b \,+\, r_b\backslash c\],
       
   525 otherwise if $r_a$ is not nullable,
       
   526 \[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b\],
       
   527 the value $v_i$ should be of shape $Seq(\ldots)$, contradicting the fact that
       
   528 $v_i$ is actually $Left(\ldots)$. Furthermore, since $v_i$ is of shape
       
   529 $Left(\ldots)$ instead of $Right(\ldots)$, we know that the left
       
   530 branch is taken instead of the right one. We have therefore found out 
       
   531 that the hole will be on $r_a$. So we recursively call $\textit{inj} 
       
   532 r_a c v_a$ to fill that hole in $v_a$. After injection, the value 
       
   533 $v_i$ for $r_i = r_a \cdot \rb$ should be $\textit{Seq}(\textit{inj}
       
   534 r_a c v_a, v_b)$.
       
   535 
       
   536 To understand what is going on, it might be best to do some
       
   537 example calculations and compare them with Figure~\eqref{graph:2}.
       
   538 
       
   539 
       
   540 
   480  A correctness proof using induction can be routinely established.
   541  A correctness proof using induction can be routinely established.
   481 
   542 
   482 It is instructive to see how this algorithm works by a little example.
   543 It is instructive to see how this algorithm works by a little example.
   483 Suppose we have a regular expression $(a+b+ab+c+abc)^*$ and we want to
   544 Suppose we have a regular expression $(a+b+ab+c+abc)^*$ and we want to
   484 match it against the string $abc$(when $abc$ is written as a regular
   545 match it against the string $abc$(when $abc$ is written as a regular