cst_tests: comparison ninems/ninems.tex

equal deleted inserted replaced

-:d3c22f809dde
+:afd0d702a4fe
 \begin{abstract}
 Brzozowski introduced in 1964 a beautifully simple algorithm for
 regular expression matching based on the notion of derivatives of
 regular expressions. In 2014, Sulzmann and Lu extended this
 algorithm to not just give a YES/NO answer for whether or not a
-regular expression matches a string, but in case it matches also
+regular expression matches a string, but in case it does also
 answers with \emph{how} it matches the string.  This is important for
 applications such as lexing (tokenising a string). The problem is to
 make the algorithm by Sulzmann and Lu fast on all inputs without
 breaking its correctness. We have already developed some
 simplification rules for this, but have not yet proved that they
 fast?
 Unfortunately these preconceptions are not supported by evidence: Take
 for example the regular expression $(a^*)^*\,b$ and ask whether
 strings of the form $aa..a$ match this regular
-expression. Obviously they do not match---the expected $b$ in the last
+expression. Obviously not---the expected $b$ in the last
 position is missing. One would expect that modern regular expression
 matching engines can find this out very quickly. Alas, if one tries
 this example in JavaScript, Python or Java 8 with strings like 28
 $a$'s, one discovers that this decision takes around 30 seconds and
 takes considerably longer when adding a few more $a$'s, as the graphs
 of the form $\underbrace{aa..a}_{n}$.}
 \end{tabular}
 \end{center}
 \noindent These are clearly abysmal and possibly surprising results. One
-would expect these systems doing much better than that---after all,
+would expect these systems to do  much better than that---after all,
 given a DFA and a string, deciding whether a string is matched by this
 DFA should be linear.
 Admittedly, the regular expression $(a^*)^*\,b$ is carefully chosen to
-exhibit this ``exponential behaviour''.  Unfortunately, such regular
+exhibit this exponential behaviour.  Unfortunately, such regular
-expressions are not just a few ``outliers'', but actually they are
+expressions are not just a few outliers. They are actually
-frequent enough that a separate name has been created for
+frequent enough to have a separate name created for
 them---\emph{evil regular expressions}. In empiric work, Davis et al
 report that they have found thousands of such evil regular expressions
 in the JavaScript and Python ecosystems \cite{Davis18}.
 This exponential blowup in matching algorithms sometimes causes
 regular expression brought the webpage
 \href{http://stackexchange.com}{Stack Exchange} to its
 knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}}
 In this instance, a regular expression intended to just trim white
 spaces from the beginning and the end of a line actually consumed
-massive amounts of CPU-resources and because of this the web servers
+massive amounts of CPU-resources---causing web servers to
-ground to a halt. This happened when a post with 20,000 white spaces was
+grind to a halt. This happened when a post with 20,000 white spaces was
 submitted, but importantly the white spaces were neither at the
 beginning nor at the end. As a result, the regular expression matching
 engine needed to backtrack over many choices. The underlying problem is
 that many ``real life'' regular expression matching engines do not use
 DFAs for matching. This is because they support regular expressions that
 are not covered by the classical automata theory, and in this more
 general setting there are quite a few research questions still
 unanswered and fast algorithms still need to be developed (for example
 how to treat bounded repetitions, negation and  back-references
 efficiently).
+%question: dfa can have exponential states. isn't this the actual reason why they do not use dfas?
+%how do they avoid dfas exponential states if they use them for fast matching?
 There is also another under-researched problem to do with regular
 expressions and lexing, i.e.~the process of breaking up strings into
 sequences of tokens according to some regular expressions. In this
 setting one is not just interested in whether or not a regular
-expression matches a string, but if it matches also in \emph{how} it
+expression matches a string, but also in \emph{how}.  Consider for example a regular expression
-matches the string.  Consider for example a regular expression
 $r_{key}$ for recognising keywords such as \textit{if}, \textit{then}
 and so on; and a regular expression $r_{id}$ for recognising
 identifiers (say, a single character followed by characters or
 numbers). One can then form the compound regular expression
 $(r_{key} + r_{id})^*$ and use it to tokenise strings.  But then how
 single identifier.  Similarly, how should the string \textit{if} be
 tokenised? Both regular expressions, $r_{key}$ and $r_{id}$, would
 ``fire''---so is it an identifier or a keyword?  While in applications
 there is a well-known strategy to decide these questions, called POSIX
 matching, only relatively recently precise definitions of what POSIX
-matching actually means have been formalised
+matching actually means has been formalised
 \cite{AusafDyckhoffUrban2016,OkuiSuzuki2010,Vansummeren2006}.
 Such a definition has also been given by Sulzmann and  Lu \cite{Sulzmann2014}, but the
 corresponding correctness proof turned out to be  faulty \cite{AusafDyckhoffUrban2016}.
 Roughly, POSIX matching means matching the longest initial substring.
-In the case of a tie, the initial submatch is chosen according to some priorities attached to the
+In the case of a tie, the initial sub-match is chosen according to some priorities attached to the
 regular expressions (e.g.~keywords have a higher priority than
 identifiers). This sounds rather simple, but according to Grathwohl et
 al \cite[Page 36]{CrashCourse2014} this is not the case. They wrote:
 \begin{quote}
 \[
 match\;s\;r \;\dn\; nullable(r\backslash s)
 \]
 \noindent
-This algorithm can be illustrated graphically as follows
+This algorithm looks graphically as follows:
 \begin{equation}\label{graph:*}
 \begin{tikzcd}
 r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
 \end{tikzcd}
 \end{equation}
 \noindent
 where we start with  a regular expression  $r_0$, build successive
 derivatives until we exhaust the string and then use \textit{nullable}
 to test whether the result can match the empty string. It can  be
 relatively  easily shown that this matcher is correct  (that is given
-an $s$ and a $r$, it generates YES if and only if $s \in L(r)$).
+an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$).
 \section{Values and the Algorithm by Sulzmann and Lu}
 One limitation, however, of Brzozowski's algorithm is that it only
 		\end{tabular}
 	\end{tabular}
 \end{center}
 \noindent
-There is no value  corresponding to $\ZERO$; $\Empty$ corresponds to
+No value  corresponds to $\ZERO$; $\Empty$ corresponds to
 $\ONE$; $\Char$ to the character regular expression; $\Seq$ to the
 sequence regular expression and so on. The idea of values is to encode
-parse trees. To see this, suppose a \emph{flatten} operation, written
+parse trees for how the sub-parts of a regular expression matches
+the sub-parts of a string. To see this, suppose a \emph{flatten} operation, written
 $|v|$. We use this function to extract the underlying string of a value
 $v$. For example, $|\mathit{Seq} \, (\textit{Char x}) \, (\textit{Char
 y})|$ is the string $xy$.  Using flatten, we can describe how values
-encode parse trees: $\Seq\,v_1\, v_2$ encodes how the string $|v_1| @
+encode parse trees: $\Seq\,v_1\, v_2$ encodes a tree with 2 children nodes
+that tells how the string $|v_1| @
 |v_2|$ matches the regex $r_1 \cdot r_2$ whereby $r_1$ matches the
 substring $|v_1|$ and, respectively, $r_2$ matches the substring
 $|v_2|$. Exactly how these two are matched is contained in the
-sub-structure of $v_1$ and $v_2$.
+children nodes $v_1$ and $v_2$ of parent $\textit{Seq}$ .
 To give a concrete example of how value works, consider the string $xy$
 and the regular expression $(x + (y + xy))^*$. We can view this regular
 expression as a tree and if the string $xy$ is matched by two Star
 ``iterations'', then the $x$ is matched by the left-most alternative in
 \begin{center}
 $\Stars\,[\Left\,(\Char\,x), \Right(\Left(\Char\,y))]$
 \end{center}
 \noindent
-where $\Stars$ records how many
+where $\Stars [\ldots]$ records all the
-iterations were used; and $\Left$, respectively $\Right$, which
+iterations; and $\Left$, respectively $\Right$, which
 alternative is used. The value for
 matching $xy$ in a single ``iteration'', i.e.~the POSIX value,
 would look as follows
 \begin{center}
 \end{equation}
 \noindent
 For convenience, we shall employ the following notations: the regular expression we
 start with is $r_0$, and the given string $s$ is composed of characters $c_0 c_1
-\ldots c_n$. In  the first phase, we build the derivatives $r_1$, $r_2$, \ldots  according to
+\ldots c_{n-1}$. In  the first phase, we build the derivatives $r_1$, $r_2$, \ldots  according to
 the characters $c_0$, $c_1$,\ldots  until we exhaust the string and
 obtain at the derivative $r_n$. We test whether this derivative is
 $\textit{nullable}$ or not. If not, we know the string does not match
 $r$ and no value needs to be generated. If yes, we start building the
 parse tree incrementally by \emph{injecting} back the characters into
 regular expression on the left-hand side. This will become
 important later on.
 After this, we inject back the characters one by one in order to build
 the parse tree $v_i$ for how the regex $r_i$ matches the string $s_i$
-($s_i = c_i \ldots c_n$ ) from the previous parse tree $v_{i+1}$. After
+($s_i = c_i \ldots c_{n-1}$ ) from the previous parse tree $v_{i+1}$. After
 injecting back $n$ characters, we get the parse tree for how $r_0$
 matches $s$. For this Sulzmann and Lu defined a function that reverses
 the ``chopping off'' of characters during the derivative phase. The
 corresponding function is called $\textit{inj}$; it takes three
 arguments: the first one is a regular expression ${r_{i-1}}$, before the
 character is chopped off, the second is a character ${c_{i-1}}$, the
 character we want to inject and the third argument is the value
-$textit{v_i}$, into which one wants to inject the character (it
+${v_i}$, into which one wants to inject the character (it
 corresponds to the regular expression after the character has been
 chopped off). The result of this function is a new value. The definition
 of $\textit{inj}$ is as follows:
 \begin{center}
 \noindent This definition is by recursion on the ``shape'' of regular
 expressions and values. To understands this definition better consider
 the situation when we build the derivative on regular expression $r_{i-1}$.
 For this we chop off a character from $r_{i-1}$ to form $r_i$. This leaves a
-``hole'' in $r_i$. In order to calculate a value for $r_{i-1}$, we need to
+``hole'' in $r_i$ and its corresponding value $v_i$
-locate where that hole is. We can find this location information by
+. To calculate $v_{i-1}$, we need to
+locate where that hole is and fill it.
+We can find this location by
 comparing $r_{i-1}$ and $v_i$. For instance, if $r_{i-1}$ is of shape
-$r_a \cdot r_b$, and $v_i$ is of shape $\textit{Left}(\textit{Seq}(v_a,
+$r_a \cdot r_b$, and $v_i$ is of shape $\Left(Seq(v_1,v_2))$, we know immediately that
-v_b))$, we know immediately that
 %
 \[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b \,+\, r_b\backslash c,\]
 \noindent
 otherwise if $r_a$ is not nullable,
 \[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b,\]
 \noindent
-the value $v_i$ should be of shape $\Seq(\ldots)$, contradicting the fact that
+the value $v_i$ should be  $\Seq(\ldots)$, contradicting the fact that
-$v_i$ is actually $\Left(\ldots)$. Furthermore, since $v_i$ is of shape
+$v_i$ is actually of shape $\Left(\ldots)$. Furthermore, since $v_i$ is of shape
 $\Left(\ldots)$ instead of $\Right(\ldots)$, we know that the left
-branch is taken instead of the right one. We have therefore found out
+branch of \[ (r_a \cdot r_b)\backslash c =
+\bold{\underline{ (r_a\backslash c) \cdot r_b} }\,+\, r_b\backslash c,\](underlined)
+is taken instead of the right one. This means $c$ is chopped off
+from $r_a$ rather than $r_b$.
+We have therefore found out
 that the hole will be on $r_a$. So we recursively call $\inj\,
-r_a\,c\,v_a$ in order to fill that hole in $v_a$. After injection, the value
+r_a\,c\,v_a$ to fill that hole in $v_a$. After injection, the value
 $v_i$ for $r_i = r_a \cdot r_b$ should be $\Seq\,(\inj\,r_a\,c\,v_a)\,v_b$.
 Other clauses can be understood in a similar way.
 Let us do some further examples involving $\inj$: Suppose we have the
 regular expression $((((a+b)+ab)+c)+abc)^*$ and want to match it against
 the string $abc$ (when $abc$ is written as a regular expression, the most
 standard way of expressing it should be $a \cdot (b \cdot c)$. We omit
-the parentheses and dots here for readability). By POSIX rules the lexer
+the parentheses and dots here for readability). By POSIX rules the algorithm
 should go for the longest matching, i.e. it should match the string
 $abc$ in one star iteration, using the longest string $abc$ in the
 sub-expression $a+b+ab+c+abc$ (we use $r$ to denote this sub-expression
-for conciseness). Here is how the lexer achieves a parse tree for this
+for conciseness).
+Here is how the lexer achieves a parse tree for this
 matching. First, build successive derivatives until we exhaust the
 string, as illustrated here (we simplified some regular expressions like
 $0 \cdot b$ to $0$ for conciseness; we also omit some parentheses if
 they are clear from the context):
 %Similarly, we allow

changeset 64	afd0d702a4fe
parent 63	d3c22f809dde
child 65	df2e0faccb23