Brzozowski introduced in 1964 a beautifully simple algorithm for
regular expression matching based on the notion of derivatives of
regular expressions. In 2014, Sulzmann and Lu extended this
algorithm to not just give a YES/NO answer for whether or not a
regular expression matches a string, but in case it matches also
answers with \emph{how} it matches the string. This is important for
applications such as lexing (tokenising a string). The problem is to
make the algorithm by Sulzmann and Lu fast on all inputs without
breaking its correctness. We have already developed some
simplification rules for this, but have not yet proved that they
preserve the correctness of the algorithm. We also have not yet
lexing. Given the maturity of this topic, the reader might wonder:
Surely, regular expressions must have already been studied to death?
What could possibly be \emph{not} known in this area? And surely all
implemented algorithms for regular expression matching are blindingly
fast?

Unfortunately these preconceptions are not supported by evidence: Take
for example the regular expression $(a^*)^*\,b$ and ask whether
strings of the form $aa..a$ match this regular
expression. Obviously they do not match---the expected $b$ in the last
\multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings
of the form $\underbrace{aa..a}_{n}$.}
\end{tabular}
\end{center}

\noindent These are clearly abysmal and possibly surprising results. One
would expect these systems to do much better than that---after all,
given a DFA and a string, deciding whether the string is matched by this
DFA should take time linear in the length of the string.
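\par\medskip\noindent
To get a feel for this ``exponential behaviour'', the blow-up can be
reproduced with a few lines of Scala that drive Java's backtracking
\texttt{java.util.regex} engine (a small sketch, not part of the
measurements above; the names are ours and the timings will of course
vary between machines):

\begin{verbatim}
import java.util.regex.Pattern

object EvilRegexDemo {
  // the "evil" regular expression (a*)*b from above
  val evil = Pattern.compile("(a*)*b")

  def main(args: Array[String]): Unit = {
    // strings of the form aa..a (without a final b): the match must
    // fail, but a backtracking engine explores exponentially many ways
    // of splitting the a's between the two stars before giving up, so
    // the time roughly doubles with each additional a
    for (n <- 20 to 28) {
      val s = "a" * n
      val start = System.nanoTime()
      evil.matcher(s).matches()
      val secs = (System.nanoTime() - start) / 1e9
      println(f"n = $n%2d   time = $secs%.3f s")
    }
  }
}
\end{verbatim}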
Admittedly, the regular expression $(a^*)^*\,b$ is carefully chosen to
exhibit this ``exponential behaviour''. Unfortunately, such regular
expressions are not just a few ``outliers'', but actually they are
frequent enough that a separate name has been created for
them---\emph{evil regular expressions}. In empirical work, Davis et al.\
report that they have found thousands of such evil regular expressions
in the JavaScript and Python ecosystems \cite{Davis18}.

This exponential blowup in matching algorithms sometimes causes
considerable grief in real life: for example on 20 July 2016 one evil
regular expression brought the webpage
\href{http://stackexchange.com}{Stack Exchange} to its
knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}}
In this instance, a regular expression intended to just trim white
spaces from the beginning and the end of a line actually consumed
massive amounts of CPU resources and because of this the web servers
ground to a halt. This happened when a post with 20,000 white spaces was
submitted, but importantly the white spaces were neither at the
beginning nor at the end. As a result, the regular expression matching
engine needed to backtrack over many choices. The underlying problem is
that many ``real life'' regular expression matching engines do not use
DFAs for matching. This is because they support regular expressions that
are not covered by classical automata theory, and in this more
general setting there are quite a few research questions still
unanswered and fast algorithms still need to be developed (for example
how to include bounded repetitions, negation and back-references).

There is also another under-researched problem to do with regular
expressions and lexing, i.e.~the process of breaking up strings into
sequences of tokens according to some regular expressions. In this
setting one is not just interested in whether or not a regular
The intended meaning of the constructors is as follows: $\ZERO$
cannot match any string, $\ONE$ can match the empty string, the
character regular expression $c$ can match the character $c$, and so
on.
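\par\medskip\noindent
As a concrete (and purely illustrative) data representation, the
constructors can be rendered as a small Scala datatype; the names below
are our own and merely mirror the grammar above:

\begin{verbatim}
// a sketch of the regular expression datatype
sealed abstract class Rexp
case object ZERO extends Rexp                    // matches no string
case object ONE extends Rexp                     // matches the empty string
case class CHAR(c: Char) extends Rexp            // matches the character c
case class ALT(r1: Rexp, r2: Rexp) extends Rexp  // alternative   r1 + r2
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp  // sequence      r1 . r2
case class STAR(r: Rexp) extends Rexp            // star          r*
\end{verbatim}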
The ingenious contribution by Brzozowski is the notion of
\emph{derivatives} of regular expressions. The idea behind this
notion is as follows: suppose a regular expression $r$ can match a
string of the form $c\!::\! s$ (that is a list of characters starting
with $c$), what does the regular expression look like that can match
just $s$? Brzozowski gave a neat answer to this question. He started
\noindent
where we start with a regular expression $r_0$, build successive
derivatives until we exhaust the string and then use \textit{nullable}
to test whether the result can match the empty string. It can be
relatively easily shown that this matcher is correct (that is, given
an $s$ and an $r$, it generates YES if and only if $s \in L(r)$).
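\par\medskip\noindent
For illustration, the matcher just described can be sketched in Scala as
follows, building on the hypothetical \texttt{Rexp} datatype from the
earlier sketch (the clauses for \textit{nullable} and the derivative
function are the standard ones by Brzozowski):

\begin{verbatim}
// nullable(r): can r match the empty string?
def nullable(r: Rexp): Boolean = r match {
  case ZERO        => false
  case ONE         => true
  case CHAR(_)     => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_)     => true
}

// der(c, r): the derivative of r with respect to the character c
def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO        => ZERO
  case ONE         => ZERO
  case CHAR(d)     => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1)    => SEQ(der(c, r1), STAR(r1))
}

// ders(s, r): the derivative with respect to a whole string
def ders(s: List[Char], r: Rexp): Rexp = s match {
  case Nil     => r
  case c :: cs => ders(cs, der(c, r))
}

// the matcher: r matches s iff the derivative of r by s is nullable
def matches(r: Rexp, s: String): Boolean = nullable(ders(s.toList, r))
\end{verbatim}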
\section{Values and the Algorithm by Sulzmann and Lu}

One limitation, however, of Brzozowski's algorithm is that it only
\end{tabular}
\end{tabular}
\end{center}

\noindent
There is no value corresponding to $\ZERO$; $\Empty$ corresponds to
$\ONE$; $\Seq$ to the sequence regular expression and so on. The idea of
values is to encode parse trees. To see this, suppose there is a
\emph{flatten} operation, written $|v|$, which we can use to extract the
underlying string of a value $v$. For example, $|\mathit{Seq} \,
(\textit{Char x}) \, (\textit{Char y})|$ is the string $xy$. We omit the
straightforward definition of flatten. Using flatten, we can describe how
values encode parse trees: $\Seq\,v_1\, v_2$ encodes how the string
$|v_1| @ |v_2|$ matches the regex $r_1 \cdot r_2$: $r_1$ matches the
substring $|v_1|$ and, respectively, $r_2$ matches the substring $|v_2|$.
Exactly how these two are matched is contained in the sub-structure of
$v_1$ and $v_2$.
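\par\medskip\noindent
Again purely as an illustration, the values and the flatten operation
could be rendered in Scala as follows (the constructor names are our own;
\texttt{Chr} and \texttt{Sequ} are used instead of \texttt{Char} and
\texttt{Seq} only to avoid clashes with Scala built-ins):

\begin{verbatim}
// a sketch of the values datatype
sealed abstract class Val
case object Empty extends Val                    // for ONE
case class Chr(c: Char) extends Val              // for CHAR(c)
case class Sequ(v1: Val, v2: Val) extends Val    // for SEQ: Seq v1 v2
case class Left(v: Val) extends Val              // left alternative taken
case class Right(v: Val) extends Val             // right alternative taken
case class Stars(vs: List[Val]) extends Val      // the star iterations

// flatten, written |v| above: the underlying string of a value
def flatten(v: Val): String = v match {
  case Empty        => ""
  case Chr(c)       => c.toString
  case Sequ(v1, v2) => flatten(v1) + flatten(v2)
  case Left(v1)     => flatten(v1)
  case Right(v1)    => flatten(v1)
  case Stars(vs)    => vs.map(flatten).mkString
}
\end{verbatim}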
To give a concrete example of how values work, consider the string $xy$
and the regular expression $(x + (y + xy))^*$. We can view this regular
expression as a tree and if the string $xy$ is matched by two Star
``iterations'', then the $x$ is matched by the left-most alternative in
this tree and the $y$ by the left alternative of the right alternative.
This suggests recording this matching as
v_0 & v_1 \arrow[l,"inj_{r_0} c_0"] & v_2 \arrow[l, "inj_{r_1} c_1"] & v_n \arrow[l, dashed]
\end{tikzcd}
\end{center}

\noindent
We shall briefly explain this algorithm. For the convenience of
explanation, we use the following notation: the regular expression we
start with is $r_0$, and the given string $s$ is composed of characters
$c_0 c_1 \ldots c_n$. First, we build the derivatives $r_1$, $r_2$,
\ldots, using the characters $c_0$, $c_1$, \ldots, until we exhaust the
string and arrive at the derivative $r_n$. We test whether this
derivative is $\textit{nullable}$ or not. If not, we know the string
does not match $r_0$ and no value needs to be generated. If yes, we
start building the parse tree incrementally by \emph{injecting} back the
characters into the values $v_n, \ldots, v_0$. For this we first call
the function $\textit{mkeps}$, which builds the parse tree for how the
empty string is matched by the (nullable) regular expression $r_n$. This
function is defined as

\begin{center}
\begin{tabular}{lcl}
$\mkeps(\ONE)$ & $\dn$ & $\Empty$ \\
the parse tree $v_i$ for how the regex $r_i$ matches the string
$s_i$ ($s_i = c_i \ldots c_n$) from the previous parse tree $v_{i+1}$.
After injecting back $n$ characters, we get the parse tree for how $r_0$
matches $s$, exactly as we wanted. A correctness proof using induction
can be routinely established.
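\par\medskip\noindent
To make the injection phase concrete, here is a sketch of
$\textit{mkeps}$, $\textit{inj}$ and the resulting lexer in Scala,
following the standard presentation of Sulzmann and Lu's algorithm and
reusing the hypothetical \texttt{Rexp} and \texttt{Val} datatypes and the
\texttt{der} function from the earlier sketches:

\begin{verbatim}
// mkeps: a value for how a nullable regex matches the empty string
// (only nullable regexes are passed in, so ZERO/CHAR need no clause)
def mkeps(r: Rexp): Val = r match {
  case ONE         => Empty
  case ALT(r1, r2) =>
    if (nullable(r1)) Left(mkeps(r1)) else Right(mkeps(r2))
  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
  case STAR(_)     => Stars(Nil)
}

// inj(r, c, v): turn a value v for der(c, r) into a value for r by
// injecting the character c back into the parse tree
def inj(r: Rexp, c: Char, v: Val): Val = (r, v) match {
  case (CHAR(_), Empty) => Chr(c)
  case (ALT(r1, _), Left(v1)) => Left(inj(r1, c, v1))
  case (ALT(_, r2), Right(v2)) => Right(inj(r2, c, v2))
  case (SEQ(r1, _), Sequ(v1, v2)) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, _), Left(Sequ(v1, v2))) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, r2), Right(v2)) => Sequ(mkeps(r1), inj(r2, c, v2))
  case (STAR(r1), Sequ(v1, Stars(vs))) => Stars(inj(r1, c, v1) :: vs)
}

// the lexer: build derivatives, then inject the characters back
def lex(r: Rexp, s: List[Char]): Val = s match {
  case Nil =>
    if (nullable(r)) mkeps(r) else sys.error("string not matched")
  case c :: cs => inj(r, c, lex(der(c, r), cs))
}
\end{verbatim}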
It is instructive to see how this algorithm works by way of a little
example. Suppose we have the regular expression $(a+b+ab+c+abc)^*$ and we
want to match it against the string $abc$ (when $abc$ is written as a
regular expression, the most standard way of expressing it would be
$a \cdot (b \cdot c)$; we omit the parentheses and dots here for
readability). By the POSIX rules the lexer should go for the longest
match, i.e.~it should match the string $abc$ in one star iteration, using
the longest string $abc$ in the sub-expression $a+b+ab+c+abc$ (we use $r$
to denote this sub-expression for conciseness). Here is how the lexer
obtains a parse tree for this matching. First, we build successive
derivatives until we exhaust the string, as illustrated below (we
simplify some regular expressions, such as $0 \cdot b$, to $0$ for
conciseness; similarly, we allow $\textit{ALT}$ to take a list of regular
expressions as arguments instead of just two operands in order to reduce
the nesting depth of $\textit{ALT}$):

\[ r^* \xrightarrow{\backslash a} r_1 = (1+0+1 \cdot b + 0 + 1 \cdot b \cdot c) \cdot r^* \xrightarrow{\backslash b}\]
\[ r_2 = (0+0+1 \cdot 1 + 0 + 1 \cdot 1 \cdot c) \cdot r^* + (0+1+0+0+0) \cdot r^* \xrightarrow{\backslash c}\]
\[ r_3 = ((0+0+0+0+1 \cdot 1 \cdot 1) \cdot r^* + (0+0+0+1+0) \cdot r^*) + ((0+1+0+0+0) \cdot r^* + (0+0+0+1+0) \cdot r^*)
\]
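\par\medskip\noindent
As a sanity check, this running example can also be replayed with the
hypothetical Scala sketches from above (the constructors and the
\texttt{lex} function are the ones from our earlier sketches, not part of
the formal development):

\begin{verbatim}
// the sub-expression r = a + b + ab + c + abc (nested binary ALTs)
val a = CHAR('a'); val b = CHAR('b'); val c = CHAR('c')
val r = ALT(a, ALT(b, ALT(SEQ(a, b), ALT(c, SEQ(a, SEQ(b, c))))))

// lex (a + b + ab + c + abc)* against the string abc
val v = lex(STAR(r), "abc".toList)
println(flatten(v))   // prints "abc": the whole string is matched
\end{verbatim}

\noindent
The resulting value encodes a single star iteration that uses the $abc$
alternative, in line with the POSIX longest-match rule described above.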
\noindent
We would like to settle the correctness claim. It is relatively
straightforward to establish that after one simplification step, the part
of the derivative that corresponds to a POSIX value remains intact and
can still be collected; in other words,
$\textit{bmkeps}\; r \;=\; \textit{bmkeps}\; (\textit{simp}(r))$,
as this basically comes down to proving that actions like removing the
duplicate $r$ in $r+r$ do not delete important POSIX information in a
regular expression. The hard core of the problem is to prove that
$\textit{bmkeps}\; (\textit{bders}\; r) \;=\; \textit{bmkeps}\; (\textit{bders}\; (\textit{simp}(r)))$.
That is, if we build derivatives of a regular expression $r$ and of its
simplified version $\textit{simp}(r)$, they still give the same POSIX
value, if there is one. This is not as straightforward as the previous
proposition, as the two regular expressions $r$ and $\textit{simp}(r)$
might become very different regular expressions after repeated
applications of $\textit{simp}$ and derivatives. The crucial point is to
find the ``gene'' of a regular expression and how it is kept intact
during simplification. To aid this, we use the helper function
$\textit{retrieve}$ described by Sulzmann and Lu:
\\definition of retrieve\\
This function assembles the bitcode that corresponds to a parse tree for
how the current derivative matches the suffix of the string (the
characters that have not yet been processed, but are stored in the
value). Sulzmann and Lu used this to connect the bit-coded algorithm to
the older algorithm by the following equation:\\
$\textit{inj} \;a\; c \; v = \textit{decode} \; (\textit{retrieve}\; ((\textit{internalise}\; r)\backslash_{simp} c) \; v)$\\
A little fact that needs to be stated to aid comprehension:\\
$r^\uparrow = a$ (here $a$ stands for \emph{annotated}).\\
Fahad and Christian also used this fact to prove the correctness of the
bit-coded algorithm without simplification. Our purpose in using this,
however, is to try to establish\\
$\textit{retrieve} \; a \; v \;=\; \textit{retrieve} \; \textit{simp}(a) \; v'.$\\
The idea is that, using $v'$, a simplified version of $v$ that possibly
has gone through the same simplification steps as $\textit{simp}(a)$, we
are still able to extract the bit-sequence that gives the same parsing
information as the unsimplified one. After establishing this, we might be
able to finally bridge the gap and prove\\
$\textit{retrieve} \; (r \backslash s) \; v \;=\; \textit{retrieve} \; (\textit{simp}(r) \backslash s) \; v'$\\
and subsequently\\
$\textit{retrieve} \; (r \backslash s) \; v \;=\; \textit{retrieve} \; (r \backslash_{simp} s) \; v'.$\\
This would show that our simplified version of the regular expression
still contains all the bitcodes needed.

The second task is to speed up the more aggressive simplification.
Currently it is slower than the naive simplification (the naive version
as implemented in ADU can, of course, explode in some cases). So it needs
to be explored how to make it faster. One possibility would be to explore
again the connection to DFAs. This is very much work in progress.
\section{Conclusion}

In this PhD-project we are interested in fast algorithms for regular