--- a/ninems/ninems.tex Thu Jul 04 10:19:35 2019 +0100
+++ b/ninems/ninems.tex Thu Jul 04 11:04:02 2019 +0100
@@ -72,7 +72,7 @@
applications such as lexing (tokenising a string). The problem is to
make the algorithm by Sulzmann and Lu fast on all inputs without
breaking its correctness. We have already developed some
- simplification rules for this, but have not proved that they
+ simplification rules for this, but have not proved yet that they
preserve the correctness of the algorithm. We also have not yet
looked at extended regular expressions, such as bounded repetitions,
negation and back-references.
@@ -182,9 +182,10 @@
report that they have found thousands of such evil regular expressions
in the JavaScript and Python ecosystems \cite{Davis18}.
-This exponential blowup sometimes causes real pain in real life:
+This exponential blowup sometimes causes pain in real life:
for example on 20 July 2016 one evil regular expression brought the
-webpage \href{http://stackexchange.com}{Stack Exchange} to its knees \footnote{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}.
+webpage \href{http://stackexchange.com}{Stack Exchange} to its
+knees.\footnote{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}
In this instance, a regular expression intended to just trim white
spaces from the beginning and the end of a line actually consumed
massive amounts of CPU-resources and because of this the web servers
@@ -198,7 +199,8 @@
support regular expressions that are not covered by the classical
automata theory, and in this more general setting there are quite a
few research questions still unanswered and fast algorithms still need
-to be developed.
+to be developed (for example how to include bounded repetitions, negation
+and back-references).
There is also another under-researched problem to do with regular
expressions and lexing, i.e.~the process of breaking up strings into
@@ -219,8 +221,10 @@
there is a well-known strategy to decide these questions, called POSIX
matching, only relatively recently precise definitions of what POSIX
matching actually means have been formalised
-\cite{AusafDyckhoffUrban2016,OkuiSuzuki2010,Vansummeren2006}. Roughly,
-POSIX matching means matching the longest initial substring.
+\cite{AusafDyckhoffUrban2016,OkuiSuzuki2010,Vansummeren2006}.
+Such a definition has also been given by Sulzmann and Lu \cite{Sulzmann2014}, but the
+corresponding correctness proof turned out to be faulty \cite{AusafDyckhoffUrban2016}.
+Roughly, POSIX matching means matching the longest initial substring.
In the case of a tie, the initial submatch is chosen according to some priorities attached to the
regular expressions (e.g.~keywords have a higher priority than
identifiers). This sounds rather simple, but according to Grathwohl et
@@ -242,9 +246,10 @@
\cite{Sulzmann2014}. Their algorithm is based on an older algorithm by
Brzozowski from 1964 where he introduced the notion of derivatives of
regular expressions \cite{Brzozowski1964}. We shall briefly explain
-the algorithms next.
+this algorithms next.
-\section{The Algorithms by Brzozowski, and Sulzmann and Lu}
+\section{The Algorithm by Brzozowski based on Derivatives of Regular
+Expressions}
Suppose (basic) regular expressions are given by the following grammar:
\[ r ::= \ZERO \mid \ONE
@@ -255,7 +260,7 @@
\]
\noindent
-The intended meaning of the constructors is as usual: $\ZERO$
+The intended meaning of the constructors is as follows: $\ZERO$
cannot match any string, $\ONE$ can match the empty string, the
character regular expression $c$ can match the character $c$, and so
on.
@@ -296,12 +301,9 @@
\end{tabular}
\end{center}
-\noindent
-
+%Assuming the classic notion of a
+%\emph{language} of a regular expression, written $L(\_)$, t
-
- %Assuming the classic notion of a
-%\emph{language} of a regular expression, written $L(\_)$, t
\noindent
The main property of the derivative operation is that
@@ -311,7 +313,7 @@
\end{center}
\noindent
- For us the main advantage is that derivatives can be
+For us the main advantage is that derivatives can be
straightforwardly implemented in any functional programming language,
and are easily definable and reasoned about in theorem provers---the
definitions just consist of inductive datatypes and simple recursive
@@ -329,6 +331,7 @@
resulting regular expression can match the empty string. If yes, then
$r$ matches $s$, and no in the negative case. To implement this idea
we can generalise the derivative operation to strings like this:
+
\begin{center}
\begin{tabular}{lcl}
$r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\
@@ -337,32 +340,36 @@
\end{center}
\noindent
-Using this definition we obtain a simple and elegant regular
-expression matching algorithm:
+and then define as regular-expression matching algorithm:
\[
match\;s\;r \;\dn\; nullable(r\backslash s)
\]
\noindent
-Pictorially this algorithm can be illustrated as follows:
-
-\begin{center}
-\begin{tikzcd}\label{graph:*}
-r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[r,"nullable?"] & Yes/No
+This algorithm can be illustrated as follows:
+\begin{equation}\label{graph:*}
+\begin{tikzcd}
+r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
\end{tikzcd}
-\end{center}
+\end{equation}
\noindent
-where we start with a regular expression $r_0$, build successive derivatives
-until we exhaust the string and then use \textit{nullable} to test whether the
-result can match the empty string. It can be relatively easily shown that this
-matcher is correct.
+where we start with a regular expression $r_0$, build successive
+derivatives until we exhaust the string and then use \textit{nullable}
+to test whether the result can match the empty string. It can be
+relatively easily shown that this matcher is correct (that is given
+$s$ and $r$, it generates YES if and only if $s \in L(r)$).
+
+
+\section{Values and the Algorithm by Sulzmann and Lu}
One limitation, however, of Brzozowski's algorithm is that it only
produces a YES/NO answer for whether a string is being matched by a
regular expression. Sulzmann and Lu~\cite{Sulzmann2014} extended this
algorithm to allow generation of an actual matching, called a
-\emph{value}.
+\emph{value}. Values and regular expressions correspond to each
+other as illustrated in the following table:
+
\begin{center}
\begin{tabular}{c@{\hspace{20mm}}c}
@@ -391,18 +398,23 @@
\end{center}
\noindent
-Values and regular expressions correspond to each other as illustrated by placing corresponding values next to the regular expressions.
-The idea of values is to express parse trees.
-Suppose by using a flatten operation, written $|v|$, we can extract the underlying string of v.
-For example, $|\mathit{Seq} \, \mathit{Char(c)} \, \mathit{Char(d)}|$ = $cd$. We omit this straightforward definition.
-Using this flatten notation, we now elaborate how values can express parse trees. $\Seq\,v_1\, v_2$ tells us how the string $|v_1| \cdot |v_2|$ matches the regex $r_1 \cdot r_2$: $r_1$ matches $|v_1|$ and respectively $r_2$ matches $|v_2|$. Exactly how these two are matched are contained in the sub-structure of $v_1$ and $v_2$.
+The idea of values is to express parse trees. Suppose a flatten
+operation, written $|v|$, which we can use to extract the underlying
+string of $v$. For example, $|\mathit{Seq} \, (\textit{Char x}) \,
+(\textit{Char y})|$ is the string $xy$. We omit the straightforward
+definition of flatten. Using flatten, we can describe how
+values encode parse trees: $\Seq\,v_1\, v_2$ tells us how the
+string $|v_1| @ |v_2|$ matches the regex $r_1 \cdot r_2$: $r_1$
+matches $|v_1|$ and, respectively, $r_2$ matches $|v_2|$. Exactly how
+these two are matched is contained in the sub-structure of $v_1$ and
+$v_2$.
- To give a concrete example of how value works, consider the string $xy$ and the
-regular expression $(x + (y + xy))^*$. We can view this regular
+ To give a concrete example of how value works, consider the string $xy$
+and the regular expression $(x + (y + xy))^*$. We can view this regular
expression as a tree and if the string $xy$ is matched by two Star
-``iterations'', then the $x$ is matched by the left-most alternative
-in this tree and the $y$ by the right-left alternative. This suggests
-to record this matching as
+``iterations'', then the $x$ is matched by the left-most alternative in
+this tree and the $y$ by the right-left alternative. This suggests to
+record this matching as
\begin{center}
$\Stars\,[\Left\,(\Char\,x), \Right(\Left(\Char\,y))]$
@@ -426,23 +438,42 @@
The contribution of Sulzmann and Lu is an extension of Brzozowski's
algorithm by a second phase (the first phase being building successive
-derivatives). In this second phase, for every successful match the
-corresponding POSIX value is computed. Pictorially, the Sulzmann and Lu algorithm looks like the following diagram(the working flow of the simple matching algorithm that just gives a $YES/NO$ answer is given before \ref{graph:*}):\\
+derivatives---see \eqref{graph:*}). In this second phase, a POSIX value
+is generated assuming the regular expression matches the string.
+Pictorially, the algorithm as follows:
+
+\begin{center}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"] \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
v_0 & v_1 \arrow[l,"inj_{r_0} c_0"] & v_2 \arrow[l, "inj_{r_1} c_1"] & v_n \arrow[l, dashed]
\end{tikzcd}
-
+\end{center}
-We shall briefly explain this interesting process.\\ For the convenience of explanation, we have the following notations: the regular expression $r$ used for matching is also called $r_0$ and the string $s$ is composed of $n$ characters $c_0 c_1 ... c_{n-1}$.
-First, we do the derivative operation on $r_0$, $r_1$, ..., using characters $c_0$, $c_1$, ... until we get the final derivative $r_n$.We test whether it is $nullable$ or not. If no we know immediately the string does not match the regex. If yes, we start building the parse tree incrementally. We first call $mkeps$(which stands for make epsilon--make the parse tree for how the empty word matched the empty regular expression epsilon) to construct the parse tree $v_n$ for how the final derivative $r_n$ matches the empty string:
+\noindent
+We shall briefly explain this algorithm. For the convenience of
+explanation, we have the following notations: the regular expression we
+start with is $r_0$ and the string $s$ is composed characters $c_0 c_1
+\ldots c_n$. First, we build the derivatives $r_1$, $r_2$, \ldots, using
+the characters $c_0$, $c_1$,\ldots until we exhaust the string and
+arrive at the derivative $r_n$. We test whether this derivative is
+$\textit{nullable}$ or not. If not, we know the string does not match
+$r$ and no value needs to be generated. If yes, we start building the
+parse tree incrementally by \emph{injecting} back the characters into
+the values $v_n, \ldots, v_0$. We first call the function
+$\textit{mkeps}$, which builds the parse tree for how the empty string
+is matched the empty regular expression $r_n$. This function is defined
+as
$mkeps $ $1 \,[] $ $= Empty$
......
- After this, we inject back the characters one by one, in reverse order as they were chopped off, to build the parse tree $v_i$ for how the regex $r_i$ matches the string $s_i$($s_i$ means the string s with the first $i$ characters being chopped off) from the previous parse tree. After $n$ transformations, we get the parse tree for how $r_0$ matches $s$, exactly as we wanted.
-An inductive proof can be routinely established.
+ After this, we inject back the characters one by one in order to build
+the parse tree $v_i$ for how the regex $r_i$ matches the string
+$s_i$ ($s_i$ means the string s with the first $i$ characters being
+chopped off) from the previous parse tree. After $n$ transformations, we
+get the parse tree for how $r_0$ matches $s$, exactly as we wanted. An
+inductive proof can be routinely established.
It is instructive to see how it works by a little example. Suppose we have a regular expression $(a+b+ab+c+abc)*$ and we want to match it against the string $abc$. By POSIX rules the lexer should go for the longest matching, i.e. it should match the string $abc$ in one star iteration, using the longest string $abc$ in the sub-expression $a+b+ab+c+abc$(we use $r$ to denote this sub-expression for conciseness). Here is how the lexer achieves a parse tree for this matching.
First, we build successive derivatives until we exhaust the string, as illustrated here( we omitted some parenthesis for better readability):