ninems/ninems.tex
changeset 64 afd0d702a4fe
parent 63 d3c22f809dde
child 65 df2e0faccb23
--- a/ninems/ninems.tex	Sat Jul 06 23:34:27 2019 +0100
+++ b/ninems/ninems.tex	Sun Jul 07 18:43:13 2019 +0100
@@ -70,7 +70,7 @@
   regular expression matching based on the notion of derivatives of
   regular expressions. In 2014, Sulzmann and Lu extended this
   algorithm to not just give a YES/NO answer for whether or not a
-  regular expression matches a string, but in case it matches also
+  regular expression matches a string, but in case it does also
   answers with \emph{how} it matches the string.  This is important for
   applications such as lexing (tokenising a string). The problem is to
   make the algorithm by Sulzmann and Lu fast on all inputs without
@@ -93,7 +93,7 @@
 Unfortunately these preconceptions are not supported by evidence: Take
 for example the regular expression $(a^*)^*\,b$ and ask whether
 strings of the form $aa..a$ match this regular
-expression. Obviously they do not match---the expected $b$ in the last
+expression. Obviously not---the expected $b$ in the last
 position is missing. One would expect that modern regular expression
 matching engines can find this out very quickly. Alas, if one tries
 this example in JavaScript, Python or Java 8 with strings like 28
@@ -171,14 +171,14 @@
 \end{center}  
 
 \noindent These are clearly abysmal and possibly surprising results. One
-would expect these systems doing much better than that---after all,
+would expect these systems to do  much better than that---after all,
 given a DFA and a string, deciding whether a string is matched by this
 DFA should be linear.
 
 Admittedly, the regular expression $(a^*)^*\,b$ is carefully chosen to
-exhibit this ``exponential behaviour''.  Unfortunately, such regular
-expressions are not just a few ``outliers'', but actually they are
-frequent enough that a separate name has been created for
+exhibit this exponential behaviour.  Unfortunately, such regular
+expressions are not just a few outliers. They are actually 
+frequent enough to have a separate name created for
 them---\emph{evil regular expressions}. In empiric work, Davis et al
 report that they have found thousands of such evil regular expressions
 in the JavaScript and Python ecosystems \cite{Davis18}.
@@ -190,8 +190,8 @@
 knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}}
 In this instance, a regular expression intended to just trim white
 spaces from the beginning and the end of a line actually consumed
-massive amounts of CPU-resources and because of this the web servers
-ground to a halt. This happened when a post with 20,000 white spaces was
+massive amounts of CPU-resources---causing web servers to
+grind to a halt. This happened when a post with 20,000 white spaces was
 submitted, but importantly the white spaces were neither at the
 beginning nor at the end. As a result, the regular expression matching
 engine needed to backtrack over many choices. The underlying problem is
@@ -202,13 +202,13 @@
 unanswered and fast algorithms still need to be developed (for example
 how to treat bounded repetitions, negation and  back-references
 efficiently).
-
+%question: dfa can have exponential states. isn't this the actual reason why they do not use dfas?
+%how do they avoid dfas exponential states if they use them for fast matching?
 There is also another under-researched problem to do with regular
 expressions and lexing, i.e.~the process of breaking up strings into
 sequences of tokens according to some regular expressions. In this
 setting one is not just interested in whether or not a regular
-expression matches a string, but if it matches also in \emph{how} it
-matches the string.  Consider for example a regular expression
+expression matches a string, but also in \emph{how}.  Consider for example a regular expression
 $r_{key}$ for recognising keywords such as \textit{if}, \textit{then}
 and so on; and a regular expression $r_{id}$ for recognising
 identifiers (say, a single character followed by characters or
@@ -221,12 +221,12 @@
 ``fire''---so is it an identifier or a keyword?  While in applications
 there is a well-known strategy to decide these questions, called POSIX
 matching, only relatively recently precise definitions of what POSIX
-matching actually means have been formalised
+matching actually means has been formalised
 \cite{AusafDyckhoffUrban2016,OkuiSuzuki2010,Vansummeren2006}. 
 Such a definition has also been given by Sulzmann and  Lu \cite{Sulzmann2014}, but the
 corresponding correctness proof turned out to be  faulty \cite{AusafDyckhoffUrban2016}.
 Roughly, POSIX matching means matching the longest initial substring.
-In the case of a tie, the initial submatch is chosen according to some priorities attached to the
+In the case of a tie, the initial sub-match is chosen according to some priorities attached to the
 regular expressions (e.g.~keywords have a higher priority than
 identifiers). This sounds rather simple, but according to Grathwohl et
 al \cite[Page 36]{CrashCourse2014} this is not the case. They wrote:
@@ -347,7 +347,7 @@
 \]
 
 \noindent
-This algorithm can be illustrated graphically as follows
+This algorithm looks graphically as follows:
 \begin{equation}\label{graph:*}
 \begin{tikzcd}
 r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
@@ -359,7 +359,7 @@
 derivatives until we exhaust the string and then use \textit{nullable}
 to test whether the result can match the empty string. It can  be
 relatively  easily shown that this matcher is correct  (that is given
-an $s$ and a $r$, it generates YES if and only if $s \in L(r)$).
+an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$).
 
  
 \section{Values and the Algorithm by Sulzmann and Lu}
@@ -399,18 +399,20 @@
 \end{center}
 
 \noindent
-There is no value  corresponding to $\ZERO$; $\Empty$ corresponds to
+No value  corresponds to $\ZERO$; $\Empty$ corresponds to
 $\ONE$; $\Char$ to the character regular expression; $\Seq$ to the
 sequence regular expression and so on. The idea of values is to encode
-parse trees. To see this, suppose a \emph{flatten} operation, written
+parse trees for how the sub-parts of a regular expression matches
+the sub-parts of a string. To see this, suppose a \emph{flatten} operation, written
 $|v|$. We use this function to extract the underlying string of a value
 $v$. For example, $|\mathit{Seq} \, (\textit{Char x}) \, (\textit{Char
 y})|$ is the string $xy$.  Using flatten, we can describe how values
-encode parse trees: $\Seq\,v_1\, v_2$ encodes how the string $|v_1| @
+encode parse trees: $\Seq\,v_1\, v_2$ encodes a tree with 2 children nodes
+that tells how the string $|v_1| @
 |v_2|$ matches the regex $r_1 \cdot r_2$ whereby $r_1$ matches the
 substring $|v_1|$ and, respectively, $r_2$ matches the substring
 $|v_2|$. Exactly how these two are matched is contained in the
-sub-structure of $v_1$ and $v_2$. 
+children nodes $v_1$ and $v_2$ of parent $\textit{Seq}$ . 
 
  To give a concrete example of how value works, consider the string $xy$
 and the regular expression $(x + (y + xy))^*$. We can view this regular
@@ -424,8 +426,8 @@
 \end{center}
 
 \noindent
-where $\Stars$ records how many
-iterations were used; and $\Left$, respectively $\Right$, which
+where $\Stars [\ldots]$ records all the
+iterations; and $\Left$, respectively $\Right$, which
 alternative is used. The value for
 matching $xy$ in a single ``iteration'', i.e.~the POSIX value,
 would look as follows
@@ -455,7 +457,7 @@
 \noindent
 For convenience, we shall employ the following notations: the regular expression we
 start with is $r_0$, and the given string $s$ is composed of characters $c_0 c_1
-\ldots c_n$. In  the first phase, we build the derivatives $r_1$, $r_2$, \ldots  according to
+\ldots c_{n-1}$. In  the first phase, we build the derivatives $r_1$, $r_2$, \ldots  according to
 the characters $c_0$, $c_1$,\ldots  until we exhaust the string and
 obtain at the derivative $r_n$. We test whether this derivative is
 $\textit{nullable}$ or not. If not, we know the string does not match
@@ -487,7 +489,7 @@
 
 After this, we inject back the characters one by one in order to build
 the parse tree $v_i$ for how the regex $r_i$ matches the string $s_i$
-($s_i = c_i \ldots c_n$ ) from the previous parse tree $v_{i+1}$. After
+($s_i = c_i \ldots c_{n-1}$ ) from the previous parse tree $v_{i+1}$. After
 injecting back $n$ characters, we get the parse tree for how $r_0$
 matches $s$. For this Sulzmann and Lu defined a function that reverses
 the ``chopping off'' of characters during the derivative phase. The
@@ -495,7 +497,7 @@
 arguments: the first one is a regular expression ${r_{i-1}}$, before the
 character is chopped off, the second is a character ${c_{i-1}}$, the
 character we want to inject and the third argument is the value
-$textit{v_i}$, into which one wants to inject the character (it
+${v_i}$, into which one wants to inject the character (it
 corresponds to the regular expression after the character has been
 chopped off). The result of this function is a new value. The definition
 of $\textit{inj}$ is as follows: 
@@ -516,11 +518,12 @@
 expressions and values. To understands this definition better consider
 the situation when we build the derivative on regular expression $r_{i-1}$.
 For this we chop off a character from $r_{i-1}$ to form $r_i$. This leaves a
-``hole'' in $r_i$. In order to calculate a value for $r_{i-1}$, we need to
-locate where that hole is. We can find this location information by
+``hole'' in $r_i$ and its corresponding value $v_i$
+. To calculate $v_{i-1}$, we need to
+locate where that hole is and fill it. 
+We can find this location by
 comparing $r_{i-1}$ and $v_i$. For instance, if $r_{i-1}$ is of shape
-$r_a \cdot r_b$, and $v_i$ is of shape $\textit{Left}(\textit{Seq}(v_a,
-v_b))$, we know immediately that 
+$r_a \cdot r_b$, and $v_i$ is of shape $\Left(Seq(v_1,v_2))$, we know immediately that 
 %
 \[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b \,+\, r_b\backslash c,\]
 
@@ -529,12 +532,16 @@
 \[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b,\]
 
 \noindent
-the value $v_i$ should be of shape $\Seq(\ldots)$, contradicting the fact that
-$v_i$ is actually $\Left(\ldots)$. Furthermore, since $v_i$ is of shape
+the value $v_i$ should be  $\Seq(\ldots)$, contradicting the fact that
+$v_i$ is actually of shape $\Left(\ldots)$. Furthermore, since $v_i$ is of shape
 $\Left(\ldots)$ instead of $\Right(\ldots)$, we know that the left
-branch is taken instead of the right one. We have therefore found out 
+branch of \[ (r_a \cdot r_b)\backslash c =
+\bold{\underline{ (r_a\backslash c) \cdot r_b} }\,+\, r_b\backslash c,\](underlined)
+ is taken instead of the right one. This means $c$ is chopped off 
+from $r_a$ rather than $r_b$.
+We have therefore found out 
 that the hole will be on $r_a$. So we recursively call $\inj\, 
-r_a\,c\,v_a$ in order to fill that hole in $v_a$. After injection, the value 
+r_a\,c\,v_a$ to fill that hole in $v_a$. After injection, the value 
 $v_i$ for $r_i = r_a \cdot r_b$ should be $\Seq\,(\inj\,r_a\,c\,v_a)\,v_b$.
 Other clauses can be understood in a similar way.
 
@@ -542,11 +549,12 @@
 regular expression $((((a+b)+ab)+c)+abc)^*$ and want to match it against
 the string $abc$ (when $abc$ is written as a regular expression, the most
 standard way of expressing it should be $a \cdot (b \cdot c)$. We omit
-the parentheses and dots here for readability). By POSIX rules the lexer
+the parentheses and dots here for readability). By POSIX rules the algorithm
 should go for the longest matching, i.e. it should match the string
 $abc$ in one star iteration, using the longest string $abc$ in the
 sub-expression $a+b+ab+c+abc$ (we use $r$ to denote this sub-expression
-for conciseness). Here is how the lexer achieves a parse tree for this
+for conciseness). 
+Here is how the lexer achieves a parse tree for this
 matching. First, build successive derivatives until we exhaust the
 string, as illustrated here (we simplified some regular expressions like
 $0 \cdot b$ to $0$ for conciseness; we also omit some parentheses if