cst_tests: comparison ninems/ninems.tex

equal deleted inserted replaced

-:f0360e17080e
+:8ff7b7508824
 regular expression matches a string, but in case it matches also
 answers with \emph{how} it matches the string.  This is important for
 applications such as lexing (tokenising a string). The problem is to
 make the algorithm by Sulzmann and Lu fast on all inputs without
 breaking its correctness. We have already developed some
-simplification rules for this, but have not proved yet that they
+simplification rules for this, but have not yet proved that they
 preserve the correctness of the algorithm. We also have not yet
 looked at extended regular expressions, such as bounded repetitions,
 negation and back-references.
 \end{abstract}
 them---\emph{evil regular expressions}. In empiric work, Davis et al
 report that they have found thousands of such evil regular expressions
 in the JavaScript and Python ecosystems \cite{Davis18}.
 This exponential blowup in matching algorithms sometimes causes
-considerable grief in real life: for example on 20 July 2016 one evil
+considerable disturbance in real life: for example on 20 July 2016 one evil
 regular expression brought the webpage
 \href{http://stackexchange.com}{Stack Exchange} to its
 knees.\footnote{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}
 In this instance, a regular expression intended to just trim white
 spaces from the beginning and the end of a line actually consumed
 parse trees: $\Seq\,v_1\, v_2$ encodes how the string $|v_1| @ |v_2|$
 matches the regex $r_1 \cdot r_2$: $r_1$ matches the substring $|v_1|$ and,
 respectively, $r_2$ matches the substring $|v_2|$. Exactly how these two are matched
 is contained in the sub-structure of $v_1$ and $v_2$.
-To give a concrete example of how value work, consider the string $xy$
+To give a concrete example of how value works, consider the string $xy$
 and the regular expression $(x + (y + xy))^*$. We can view this regular
 expression as a tree and if the string $xy$ is matched by two Star
 ``iterations'', then the $x$ is matched by the left-most alternative in
 this tree and the $y$ by the right-left alternative. This suggests to
 record this matching as
 algorithm by a second phase (the first phase being building successive
 derivatives---see \eqref{graph:*}). In this second phase, a POSIX value
 is generated assuming the regular expression matches  the string.
 Pictorially, the algorithm is as follows:
-\begin{center}
+\begin{equation}\label{graph:2}
 \begin{tikzcd}
 r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
 v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]
 \end{tikzcd}
-\end{center}
+\end{equation}
 \noindent
-For the convenience, we shall employ the following notations: the regular expression we
+For convenience, we shall employ the following notations: the regular expression we
 start with is $r_0$, and the given string $s$ is composed of characters $c_0 c_1
 \ldots c_n$. First, we build the derivatives $r_1$, $r_2$, \ldots  according to
 the characters $c_0$, $c_1$,\ldots  until we exhaust the string and
 obtain at the derivative $r_n$. We test whether this derivative is
 $\textit{nullable}$ or not. If not, we know the string does not match
 			$\mkeps(r_1\cdot r_2)$ 	& $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
 			$mkeps(r^*)$	        & $\dn$ & $\Stars\,[]$
 		\end{tabular}
 	\end{center}
-After this, we inject back the characters one by one in order to build
+\noindent There are no cases for $\ZERO$ and $c$, since
+these regular expression cannot match the empty string. Note
+also that in case of alternatives we give preference to the
+regular expression on the left-hand side. This will become
+important later on.
+After this, we inject back the characters one by one in order to build
 the parse tree $v_i$ for how the regex $r_i$ matches the string
 $s_i$ ($s_i = c_i \ldots c_n$ ) from the previous parse tree $v_{i+1}$. After injecting back $n$ characters, we
 get the parse tree for how $r_0$ matches $s$, exactly as we wanted.
+We need a function that
+reverses the "chopping off" for values which we did in the
+first phase for derivatives. The corresponding function is
+called $\textit{inj}$ (short for injection). This function takes three
+arguments: the first one is a regular expression $\textit{r_{i-1}}$
+before the character is chopped off, the second is  $\textit{c_{i-1}}$,
+the character
+we
+want to inject and the third argument is the value $textit{v_i}$, where we
+will inject the character(it corresponds to the regular expression
+after the character
+has been chopped off). The result of this function is a
+new value. The definition of $\textit{inj}$ is as follows:
+\begin{center}
+\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
+$\textit{inj}\,(c)\,c\,Empty$            & $\dn$ & $Char\,c$\\
+$\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
+$\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\
+$\textit{inj}\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
+$\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
+$\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
+$\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$         & $\dn$  & $Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
+\end{tabular}
+\end{center}
+\noindent This definition is by recursion on the regular
+expressions' and values' shapes.
+To understands this definition in a more vivid way, the reader
+might imagine this: when we do the derivative on regular expression
+$r_{i-1}$, we chop off a character from $r_{i-1}$ to form $r_i$.
+This leaves a "hole" on $r_i$. In order to calculate a value for
+$r_{i-1}$, we need to locate where that hole is. We can find this location
+informtaion by comparing $r_{i-1}$ and $v_i$.
+For instance, if $r_{i-1}$ is of shape $r_a \cdot r_b$, and $v_i$
+is of shape $\textit{Left(Seq(v_a, v_b))}$, we know immediately that
+\[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b \,+\, r_b\backslash c\],
+otherwise if $r_a$ is not nullable,
+\[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b\],
+the value $v_i$ should be of shape $Seq(\ldots)$, contradicting the fact that
+$v_i$ is actually $Left(\ldots)$. Furthermore, since $v_i$ is of shape
+$Left(\ldots)$ instead of $Right(\ldots)$, we know that the left
+branch is taken instead of the right one. We have therefore found out
+that the hole will be on $r_a$. So we recursively call $\textit{inj}
+r_a c v_a$ to fill that hole in $v_a$. After injection, the value
+$v_i$ for $r_i = r_a \cdot \rb$ should be $\textit{Seq}(\textit{inj}
+r_a c v_a, v_b)$.
+To understand what is going on, it might be best to do some
+example calculations and compare them with Figure~\eqref{graph:2}.
 A correctness proof using induction can be routinely established.
 It is instructive to see how this algorithm works by a little example.
 Suppose we have a regular expression $(a+b+ab+c+abc)^*$ and we want to
 match it against the string $abc$(when $abc$ is written as a regular

changeset 59	8ff7b7508824
parent 58	f0360e17080e
child 60	c737a0259194