cst_tests: comparison ninems/ninems.tex

equal deleted inserted replaced

-:60cb82639691
+:9b48724ec609
 regular expression matches a string, but in case it matches also
 \emph{how} it matches the string.  This is important for
 applications such as lexing (tokenising a string). The problem is to
 make the algorithm by Sulzmann and Lu fast on all inputs without
 breaking its correctness. We have already developed some
-simplification rules for this, but have not proved that they
+simplification rules for this, but have not proved yet that they
 preserve the correctness of the algorithm. We also have not yet
 looked at extended regular expressions, such as bounded repetitions,
 negation and back-references.
 \end{abstract}
 frequent enough that a separate name has been created for
 them---\emph{evil regular expressions}. In empiric work, Davis et al
 report that they have found thousands of such evil regular expressions
 in the JavaScript and Python ecosystems \cite{Davis18}.
-This exponential blowup sometimes causes real pain in real life:
+This exponential blowup sometimes causes pain in real life:
 for example on 20 July 2016 one evil regular expression brought the
-webpage \href{http://stackexchange.com}{Stack Exchange} to its knees \footnote{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}.
+webpage \href{http://stackexchange.com}{Stack Exchange} to its
+knees.\footnote{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}
 In this instance, a regular expression intended to just trim white
 spaces from the beginning and the end of a line actually consumed
 massive amounts of CPU-resources and because of this the web servers
 ground to a halt. This happened when a post with 20,000 white spaces
 was submitted, but importantly the white spaces were neither at the
 The underlying problem is that many ``real life'' regular expression
 matching engines do not use DFAs for matching. This is because they
 support regular expressions that are not covered by the classical
 automata theory, and in this more general setting there are quite a
 few research questions still unanswered and fast algorithms still need
-to be developed.
+to be developed (for example how to include bounded repetitions, negation
+and  back-references).
 There is also another under-researched problem to do with regular
 expressions and lexing, i.e.~the process of breaking up strings into
 sequences of tokens according to some regular expressions. In this
 setting one is not just interested in whether or not a regular
 tokenised? Both regular expressions, $r_{key}$ and $r_{id}$, would
 ``fire''---so is it an identifier or a keyword?  While in applications
 there is a well-known strategy to decide these questions, called POSIX
 matching, only relatively recently precise definitions of what POSIX
 matching actually means have been formalised
-\cite{AusafDyckhoffUrban2016,OkuiSuzuki2010,Vansummeren2006}. Roughly,
+\cite{AusafDyckhoffUrban2016,OkuiSuzuki2010,Vansummeren2006}.
-POSIX matching means matching the longest initial substring.
+Such a definition has also been given by Sulzmann and  Lu \cite{Sulzmann2014}, but the
+corresponding correctness proof turned out to be  faulty \cite{AusafDyckhoffUrban2016}.
+Roughly, POSIX matching means matching the longest initial substring.
 In the case of a tie, the initial submatch is chosen according to some priorities attached to the
 regular expressions (e.g.~keywords have a higher priority than
 identifiers). This sounds rather simple, but according to Grathwohl et
 al \cite[Page 36]{CrashCourse2014} this is not the case. They wrote:
 Our focus is on an algorithm introduced by Sulzmann and Lu in 2014 for
 regular expression matching according to the POSIX strategy
 \cite{Sulzmann2014}. Their algorithm is based on an older algorithm by
 Brzozowski from 1964 where he introduced the notion of derivatives of
 regular expressions \cite{Brzozowski1964}. We shall briefly explain
-the algorithms next.
+this algorithms next.
-\section{The Algorithms by  Brzozowski, and Sulzmann and Lu}
+\section{The Algorithm by Brzozowski based on Derivatives of Regular
+Expressions}
 Suppose (basic) regular expressions are given by the following grammar:
 \[			r ::=   \ZERO \mid  \ONE
 			 \mid  c
 			 \mid  r_1 \cdot r_2
 			 \mid  r_1 + r_2
 			 \mid r^*
 \]
 \noindent
-The intended meaning of the constructors is as usual: $\ZERO$
+The intended meaning of the constructors is as follows: $\ZERO$
 cannot match any string, $\ONE$ can match the empty string, the
 character regular expression $c$ can match the character $c$, and so
 on.
 The brilliant contribution by Brzozowski is the notion of
 	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
 	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
 \end{tabular}
 \end{center}
-\noindent
+%Assuming the classic notion of a
-%Assuming the classic notion of a
 %\emph{language} of a regular expression, written $L(\_)$, t
 \noindent
 The main property of the derivative operation is that
 \begin{center}
 $c\!::\!s \in L(r)$ holds
 if and only if $s \in L(r\backslash c)$.
 \end{center}
 \noindent
 For us the main advantage is that derivatives can be
 straightforwardly implemented in any functional programming language,
 and are easily definable and reasoned about in theorem provers---the
 definitions just consist of inductive datatypes and simple recursive
 functions. Moreover, the notion of derivatives can be easily
 generalised to cover extended regular expression constructors such as
 expression $r$, build the derivatives of $r$ w.r.t.\ (in succession)
 all the characters of the string $s$. Finally, test whether the
 resulting regular expression can match the empty string.  If yes, then
 $r$ matches $s$, and no in the negative case. To implement this idea
 we can generalise the derivative operation to strings like this:
 \begin{center}
 \begin{tabular}{lcl}
 $r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\
 $r \backslash [\,] $ & $\dn$ & $r$
 \end{tabular}
 \end{center}
 \noindent
-Using this definition we obtain a simple and elegant regular
+and then define as  regular-expression matching algorithm:
-expression matching algorithm:
 \[
 match\;s\;r \;\dn\; nullable(r\backslash s)
 \]
 \noindent
-Pictorially this algorithm can be illustrated as follows:
+This algorithm can be illustrated as follows:
+\begin{equation}\label{graph:*}
-\begin{center}
+\begin{tikzcd}
-\begin{tikzcd}\label{graph:*}
+r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
-r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed]  & r_n  \arrow[r,"nullable?"] & Yes/No
 \end{tikzcd}
-\end{center}
+\end{equation}
 \noindent
-where we start with  a regular expression  $r_0$, build successive derivatives
+where we start with  a regular expression  $r_0$, build successive
-until we exhaust the string and then use \textit{nullable} to test whether the
+derivatives until we exhaust the string and then use \textit{nullable}
-result can match the empty string. It can  be relatively  easily shown that this
+to test whether the result can match the empty string. It can  be
-matcher is correct.
+relatively  easily shown that this matcher is correct  (that is given
+$s$ and $r$, it generates YES if and only if $s \in L(r)$).
+\section{Values and the Algorithm by Sulzmann and Lu}
 One limitation, however, of Brzozowski's algorithm is that it only
 produces a YES/NO answer for whether a string is being matched by a
 regular expression.  Sulzmann and Lu~\cite{Sulzmann2014} extended this
 algorithm to allow generation of an actual matching, called a
-\emph{value}.
+\emph{value}. Values and regular expressions correspond to each
+other as illustrated in the following table:
 \begin{center}
 	\begin{tabular}{c@{\hspace{20mm}}c}
 		\begin{tabular}{@{}rrl@{}}
 			\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
 		\end{tabular}
 	\end{tabular}
 \end{center}
 \noindent
-Values and regular expressions correspond to each other as illustrated by placing corresponding values next to the regular expressions.
+The idea of values is to express parse trees. Suppose a flatten
-The idea of values is to express parse trees.
+operation, written $|v|$, which we can use to extract the underlying
-Suppose by using a flatten operation, written $|v|$, we can extract the underlying string of v.
+string of $v$. For example, $|\mathit{Seq} \, (\textit{Char x}) \,
-For example, $|\mathit{Seq} \, \mathit{Char(c)} \, \mathit{Char(d)}|$ = $cd$. We omit this straightforward definition.
+(\textit{Char y})|$ is the string $xy$. We omit the straightforward
-Using this flatten notation, we now elaborate how values can express parse trees. $\Seq\,v_1\, v_2$ tells us how the string $|v_1| \cdot |v_2|$ matches the regex $r_1 \cdot r_2$: $r_1$ matches $|v_1|$ and respectively $r_2$ matches $|v_2|$. Exactly how these two are matched are contained in the sub-structure of $v_1$ and $v_2$.
+definition of flatten. Using flatten, we can describe how
+values encode parse trees: $\Seq\,v_1\, v_2$ tells us how the
-To give a concrete example of how value works, consider the string $xy$ and the
+string $|v_1| @ |v_2|$ matches the regex $r_1 \cdot r_2$: $r_1$
-regular expression $(x + (y + xy))^*$. We can view this regular
+matches $|v_1|$ and, respectively, $r_2$ matches $|v_2|$. Exactly how
+these two are matched is contained in the sub-structure of $v_1$ and
+$v_2$.
+To give a concrete example of how value works, consider the string $xy$
+and the regular expression $(x + (y + xy))^*$. We can view this regular
 expression as a tree and if the string $xy$ is matched by two Star
-``iterations'', then the $x$ is matched by the left-most alternative
+``iterations'', then the $x$ is matched by the left-most alternative in
-in this tree and the $y$ by the right-left alternative. This suggests
+this tree and the $y$ by the right-left alternative. This suggests to
-to record this matching as
+record this matching as
 \begin{center}
 $\Stars\,[\Left\,(\Char\,x), \Right(\Left(\Char\,y))]$
 \end{center}
 and $\Seq$ indicates that $xy$ is matched by a sequence regular
 expression.
 The contribution of Sulzmann and Lu is an extension of Brzozowski's
 algorithm by a second phase (the first phase being building successive
-derivatives). In this second phase, for every successful match the
+derivatives---see \eqref{graph:*}). In this second phase, a POSIX value
-corresponding POSIX value is computed. Pictorially, the Sulzmann and Lu algorithm looks like the following diagram(the working flow of the simple matching algorithm that just gives a $YES/NO$ answer is given before \ref{graph:*}):\\
+is generated assuming the regular expression matches  the string.
+Pictorially, the algorithm as follows:
+\begin{center}
 \begin{tikzcd}
 r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
 v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]
 \end{tikzcd}
+\end{center}
-We shall briefly explain this interesting process.\\ For the convenience of explanation, we have the following notations: the regular expression $r$ used for matching is also called $r_0$ and the string $s$ is composed of $n$ characters $c_0 c_1 ... c_{n-1}$.
+\noindent
-First, we do the derivative operation on $r_0$, $r_1$, ..., using characters $c_0$, $c_1$, ...  until we get the final derivative $r_n$.We test whether it is $nullable$ or not. If no we know immediately the string does not match the regex. If yes, we start building the parse tree incrementally. We first call $mkeps$(which stands for make epsilon--make the parse tree for how the empty word matched the empty regular expression epsilon) to construct the parse tree $v_n$ for how the final derivative $r_n$ matches the empty string:
+We shall briefly explain this algorithm. For the convenience of
+explanation, we have the following notations: the regular expression we
+start with is $r_0$ and the string $s$ is composed characters $c_0 c_1
+\ldots c_n$. First, we build the derivatives $r_1$, $r_2$, \ldots, using
+the characters $c_0$, $c_1$,\ldots  until we exhaust the string and
+arrive at the derivative $r_n$. We test whether this derivative is
+$\textit{nullable}$ or not. If not, we know the string does not match
+$r$ and no value needs to be generated. If yes, we start building the
+parse tree incrementally by \emph{injecting} back the characters into
+the values $v_n, \ldots, v_0$. We first call the function
+$\textit{mkeps}$, which builds the parse tree for how the empty string
+is matched the empty regular expression $r_n$. This function is defined
+as
 $mkeps $ $1 \,[] $ $= Empty$
 ......
-After this, we inject back the characters one by one, in reverse order as they were chopped off, to build the parse tree $v_i$ for how the regex $r_i$ matches the string $s_i$($s_i$ means the string s with the first $i$ characters being chopped off) from the previous parse tree. After $n$ transformations, we get the parse tree for how $r_0$ matches $s$, exactly as we wanted.
+After this, we inject back the characters one by one in order to build
-An inductive proof can be routinely established.
+the parse tree $v_i$ for how the regex $r_i$ matches the string
+$s_i$ ($s_i$ means the string s with the first $i$ characters being
+chopped off) from the previous parse tree. After $n$ transformations, we
+get the parse tree for how $r_0$ matches $s$, exactly as we wanted. An
+inductive proof can be routinely established.
 It is instructive to see how it works by a little example. Suppose we have a regular expression $(a+b+ab+c+abc)*$ and we want to match it against the string $abc$. By POSIX rules the lexer should go for the longest matching, i.e. it should match the string $abc$ in one star iteration, using the longest string $abc$ in the sub-expression $a+b+ab+c+abc$(we use $r$ to denote this sub-expression for conciseness). Here is how the lexer achieves a parse tree for this matching.
 First, we build successive derivatives until we exhaust the string, as illustrated here( we omitted some parenthesis for better readability):
 \[ r^* \xrightarrow{\backslash a} r_1 = (1+0+1 \cdot b + 0 + 1 \cdot b \cdot c) \cdot r* \xrightarrow{\backslash b}\]

changeset 46	9b48724ec609
parent 45	60cb82639691
child 47	d2a7e87ea6e1