--- a/ninems/ninems.tex Sat Jul 06 23:34:27 2019 +0100
+++ b/ninems/ninems.tex Sun Jul 07 18:43:13 2019 +0100
@@ -70,7 +70,7 @@
regular expression matching based on the notion of derivatives of
regular expressions. In 2014, Sulzmann and Lu extended this
algorithm to not just give a YES/NO answer for whether or not a
- regular expression matches a string, but in case it matches also
+ regular expression matches a string, but in case it does also
answers with \emph{how} it matches the string. This is important for
applications such as lexing (tokenising a string). The problem is to
make the algorithm by Sulzmann and Lu fast on all inputs without
@@ -93,7 +93,7 @@
Unfortunately these preconceptions are not supported by evidence: Take
for example the regular expression $(a^*)^*\,b$ and ask whether
strings of the form $aa..a$ match this regular
-expression. Obviously they do not match---the expected $b$ in the last
+expression. Obviously not---the expected $b$ in the last
position is missing. One would expect that modern regular expression
matching engines can find this out very quickly. Alas, if one tries
this example in JavaScript, Python or Java 8 with strings like 28
@@ -171,14 +171,14 @@
\end{center}
\noindent These are clearly abysmal and possibly surprising results. One
-would expect these systems doing much better than that---after all,
+would expect these systems to do much better than that---after all,
given a DFA and a string, deciding whether a string is matched by this
DFA should be linear.
Admittedly, the regular expression $(a^*)^*\,b$ is carefully chosen to
-exhibit this ``exponential behaviour''. Unfortunately, such regular
-expressions are not just a few ``outliers'', but actually they are
-frequent enough that a separate name has been created for
+exhibit this exponential behaviour. Unfortunately, such regular
+expressions are not just a few outliers. They are actually
+frequent enough to have a separate name created for
them---\emph{evil regular expressions}. In empiric work, Davis et al
report that they have found thousands of such evil regular expressions
in the JavaScript and Python ecosystems \cite{Davis18}.
@@ -190,8 +190,8 @@
knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}}
In this instance, a regular expression intended to just trim white
spaces from the beginning and the end of a line actually consumed
-massive amounts of CPU-resources and because of this the web servers
-ground to a halt. This happened when a post with 20,000 white spaces was
+massive amounts of CPU-resources---causing web servers to
+grind to a halt. This happened when a post with 20,000 white spaces was
submitted, but importantly the white spaces were neither at the
beginning nor at the end. As a result, the regular expression matching
engine needed to backtrack over many choices. The underlying problem is
@@ -202,13 +202,13 @@
unanswered and fast algorithms still need to be developed (for example
how to treat bounded repetitions, negation and back-references
efficiently).
-
+%question: dfa can have exponential states. isn't this the actual reason why they do not use dfas?
+%how do they avoid dfas exponential states if they use them for fast matching?
There is also another under-researched problem to do with regular
expressions and lexing, i.e.~the process of breaking up strings into
sequences of tokens according to some regular expressions. In this
setting one is not just interested in whether or not a regular
-expression matches a string, but if it matches also in \emph{how} it
-matches the string. Consider for example a regular expression
+expression matches a string, but also in \emph{how}. Consider for example a regular expression
$r_{key}$ for recognising keywords such as \textit{if}, \textit{then}
and so on; and a regular expression $r_{id}$ for recognising
identifiers (say, a single character followed by characters or
@@ -221,12 +221,12 @@
``fire''---so is it an identifier or a keyword? While in applications
there is a well-known strategy to decide these questions, called POSIX
matching, only relatively recently precise definitions of what POSIX
-matching actually means have been formalised
+matching actually means has been formalised
\cite{AusafDyckhoffUrban2016,OkuiSuzuki2010,Vansummeren2006}.
Such a definition has also been given by Sulzmann and Lu \cite{Sulzmann2014}, but the
corresponding correctness proof turned out to be faulty \cite{AusafDyckhoffUrban2016}.
Roughly, POSIX matching means matching the longest initial substring.
-In the case of a tie, the initial submatch is chosen according to some priorities attached to the
+In the case of a tie, the initial sub-match is chosen according to some priorities attached to the
regular expressions (e.g.~keywords have a higher priority than
identifiers). This sounds rather simple, but according to Grathwohl et
al \cite[Page 36]{CrashCourse2014} this is not the case. They wrote:
@@ -347,7 +347,7 @@
\]
\noindent
-This algorithm can be illustrated graphically as follows
+This algorithm looks graphically as follows:
\begin{equation}\label{graph:*}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
@@ -359,7 +359,7 @@
derivatives until we exhaust the string and then use \textit{nullable}
to test whether the result can match the empty string. It can be
relatively easily shown that this matcher is correct (that is given
-an $s$ and a $r$, it generates YES if and only if $s \in L(r)$).
+an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$).
\section{Values and the Algorithm by Sulzmann and Lu}
@@ -399,18 +399,20 @@
\end{center}
\noindent
-There is no value corresponding to $\ZERO$; $\Empty$ corresponds to
+No value corresponds to $\ZERO$; $\Empty$ corresponds to
$\ONE$; $\Char$ to the character regular expression; $\Seq$ to the
sequence regular expression and so on. The idea of values is to encode
-parse trees. To see this, suppose a \emph{flatten} operation, written
+parse trees for how the sub-parts of a regular expression matches
+the sub-parts of a string. To see this, suppose a \emph{flatten} operation, written
$|v|$. We use this function to extract the underlying string of a value
$v$. For example, $|\mathit{Seq} \, (\textit{Char x}) \, (\textit{Char
y})|$ is the string $xy$. Using flatten, we can describe how values
-encode parse trees: $\Seq\,v_1\, v_2$ encodes how the string $|v_1| @
+encode parse trees: $\Seq\,v_1\, v_2$ encodes a tree with 2 children nodes
+that tells how the string $|v_1| @
|v_2|$ matches the regex $r_1 \cdot r_2$ whereby $r_1$ matches the
substring $|v_1|$ and, respectively, $r_2$ matches the substring
$|v_2|$. Exactly how these two are matched is contained in the
-sub-structure of $v_1$ and $v_2$.
+children nodes $v_1$ and $v_2$ of parent $\textit{Seq}$ .
To give a concrete example of how value works, consider the string $xy$
and the regular expression $(x + (y + xy))^*$. We can view this regular
@@ -424,8 +426,8 @@
\end{center}
\noindent
-where $\Stars$ records how many
-iterations were used; and $\Left$, respectively $\Right$, which
+where $\Stars [\ldots]$ records all the
+iterations; and $\Left$, respectively $\Right$, which
alternative is used. The value for
matching $xy$ in a single ``iteration'', i.e.~the POSIX value,
would look as follows
@@ -455,7 +457,7 @@
\noindent
For convenience, we shall employ the following notations: the regular expression we
start with is $r_0$, and the given string $s$ is composed of characters $c_0 c_1
-\ldots c_n$. In the first phase, we build the derivatives $r_1$, $r_2$, \ldots according to
+\ldots c_{n-1}$. In the first phase, we build the derivatives $r_1$, $r_2$, \ldots according to
the characters $c_0$, $c_1$,\ldots until we exhaust the string and
obtain at the derivative $r_n$. We test whether this derivative is
$\textit{nullable}$ or not. If not, we know the string does not match
@@ -487,7 +489,7 @@
After this, we inject back the characters one by one in order to build
the parse tree $v_i$ for how the regex $r_i$ matches the string $s_i$
-($s_i = c_i \ldots c_n$ ) from the previous parse tree $v_{i+1}$. After
+($s_i = c_i \ldots c_{n-1}$ ) from the previous parse tree $v_{i+1}$. After
injecting back $n$ characters, we get the parse tree for how $r_0$
matches $s$. For this Sulzmann and Lu defined a function that reverses
the ``chopping off'' of characters during the derivative phase. The
@@ -495,7 +497,7 @@
arguments: the first one is a regular expression ${r_{i-1}}$, before the
character is chopped off, the second is a character ${c_{i-1}}$, the
character we want to inject and the third argument is the value
-$textit{v_i}$, into which one wants to inject the character (it
+${v_i}$, into which one wants to inject the character (it
corresponds to the regular expression after the character has been
chopped off). The result of this function is a new value. The definition
of $\textit{inj}$ is as follows:
@@ -516,11 +518,12 @@
expressions and values. To understands this definition better consider
the situation when we build the derivative on regular expression $r_{i-1}$.
For this we chop off a character from $r_{i-1}$ to form $r_i$. This leaves a
-``hole'' in $r_i$. In order to calculate a value for $r_{i-1}$, we need to
-locate where that hole is. We can find this location information by
+``hole'' in $r_i$ and its corresponding value $v_i$
+. To calculate $v_{i-1}$, we need to
+locate where that hole is and fill it.
+We can find this location by
comparing $r_{i-1}$ and $v_i$. For instance, if $r_{i-1}$ is of shape
-$r_a \cdot r_b$, and $v_i$ is of shape $\textit{Left}(\textit{Seq}(v_a,
-v_b))$, we know immediately that
+$r_a \cdot r_b$, and $v_i$ is of shape $\Left(Seq(v_1,v_2))$, we know immediately that
%
\[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b \,+\, r_b\backslash c,\]
@@ -529,12 +532,16 @@
\[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b,\]
\noindent
-the value $v_i$ should be of shape $\Seq(\ldots)$, contradicting the fact that
-$v_i$ is actually $\Left(\ldots)$. Furthermore, since $v_i$ is of shape
+the value $v_i$ should be $\Seq(\ldots)$, contradicting the fact that
+$v_i$ is actually of shape $\Left(\ldots)$. Furthermore, since $v_i$ is of shape
$\Left(\ldots)$ instead of $\Right(\ldots)$, we know that the left
-branch is taken instead of the right one. We have therefore found out
+branch of \[ (r_a \cdot r_b)\backslash c =
+\bold{\underline{ (r_a\backslash c) \cdot r_b} }\,+\, r_b\backslash c,\](underlined)
+ is taken instead of the right one. This means $c$ is chopped off
+from $r_a$ rather than $r_b$.
+We have therefore found out
that the hole will be on $r_a$. So we recursively call $\inj\,
-r_a\,c\,v_a$ in order to fill that hole in $v_a$. After injection, the value
+r_a\,c\,v_a$ to fill that hole in $v_a$. After injection, the value
$v_i$ for $r_i = r_a \cdot r_b$ should be $\Seq\,(\inj\,r_a\,c\,v_a)\,v_b$.
Other clauses can be understood in a similar way.
@@ -542,11 +549,12 @@
regular expression $((((a+b)+ab)+c)+abc)^*$ and want to match it against
the string $abc$ (when $abc$ is written as a regular expression, the most
standard way of expressing it should be $a \cdot (b \cdot c)$. We omit
-the parentheses and dots here for readability). By POSIX rules the lexer
+the parentheses and dots here for readability). By POSIX rules the algorithm
should go for the longest matching, i.e. it should match the string
$abc$ in one star iteration, using the longest string $abc$ in the
sub-expression $a+b+ab+c+abc$ (we use $r$ to denote this sub-expression
-for conciseness). Here is how the lexer achieves a parse tree for this
+for conciseness).
+Here is how the lexer achieves a parse tree for this
matching. First, build successive derivatives until we exhaust the
string, as illustrated here (we simplified some regular expressions like
$0 \cdot b$ to $0$ for conciseness; we also omit some parentheses if