cst_tests: comparison etnms/etnms.tex

equal deleted inserted replaced

-:dfcf3fa58d7f
+:c8ef391dd6f7
 \begin{document}
 \maketitle
 \begin{abstract}
-Brzozowski introduced in 1964 a beautifully simple algorithm for
+Regular expressions, a useful concept in computer science
-regular expression matching based on the notion of derivatives of
+that was initially designed to match string patterns,
-regular expressions. In 2014, Sulzmann and Lu extended this
+need fast matching algorithms.
-algorithm to not just give a YES/NO answer for whether or not a
+If we do matching by naively converting
-regular expression matches a string, but in case it does also
+the regular expression to a
-answers with \emph{how} it matches the string.  This is important for
+Nondeterministic-Finite-Automaton(NFA) and simulating it,
-applications such as lexing (tokenising a string). The problem is to
+they can run into trouble on certain evil regular expression and string
-make the algorithm by Sulzmann and Lu fast on all inputs without
+pairs.
-breaking its correctness. Being fast depends on a complete set of
+This necessitates the introudction of a
-simplification rules, some of which
+method called derivatives of regular expressions\cite{Brzozowski1964}.
-have been put forward by Sulzmann and Lu. We have extended their
+With some variation introduced by Sulzmann and Lu\cite{Sulzmann2014},
-rules in order to obtain a tight bound on the size of regular expressions.
+it circumvents the problem of
-We have tested these extended rules, but have not
+catastrphoic backtracking elegantly when
-formally established their correctness. We have also not yet looked
+compared to backtracking NFA regex engines.
-at extended regular expressions, such as bounded repetitions,
+We believe that such a method can help to address
-negation and back-references.
+the status-quo where badly written code for
+regular expression matching can cause a software grief.
+We have come up with a set of simplification rules
+that makes the space and time requirements below a tight bound.
+We have proved certain properties of the algorithm
+and simplification.
+We have not yet worked out a proof for the correctness
+although test data suggested this.
 \end{abstract}
 \section{Introduction}
-%Regular expressions' derivatives, which have received
+The following brief introduction gives the background for
-%renewed interest in the new millenium, is a beautiful....
+derivatives--how they are originally defined and the
-While we believe derivatives of regular expressions, written
+variety\cite{Sulzmann2014} that is more suitable for lexing.
-$r\backslash s$, are a beautiful concept (in terms of ease of
+Regular expressions can be formalized as follows:\\
-implementing them in functional programming languages and in terms of
+\begin{center}
-reasoning about them formally), they have one major drawback: every
+	$r ::= \ZERO | \ONE | c | r_1 +r_2 | r_1 \cdot r_2 | r^*$
-derivative step can make regular expressions grow drastically in
+\end{center}
-size. This in turn has negative effect on the runtime of the
+Derivatives of regular expression can be
-corresponding lexing algorithms. Consider for example the regular
+seen as the regular expression that corresponds to
-expression $(a+aa)^*$ and the short string $aaaaaaaaaaaa$. The
+the language derived by computing
-corresponding derivative contains already 8668 nodes where we assume
+the left quotient of the original regular language
-the derivative is given as a tree. The reason for the poor runtime of
+w.r.t a character.
-the derivative-based lexing algorithms is that they need to traverse
+Derivative of regular expression is also a
-such trees over and over again. The solution is to find a complete set
+regular expression because
-of simplification rules that keep the sizes of derivatives uniformly
+the left quotient of a regular language is also regular.
-small.
+With this we know such a function exist without having
+to worry that it might not be defined.
-This has been partially addressed by the function $\blexer_{simp}$,
+In fact, regular expressions derivatives
-which after  the simplification the $(a+aa)^*$ example's 8000 nodes will be
+are not just defined but also simple and elegant.
-reduced to just 6 and stays constant in each derivative step.
+They are composed of two recursive functions,
-The part that still needs more work is the correctness proof of this
+and can be easily explained by intuition.
-function, namely,
+The function
-\begin{equation}\label{mainthm}
+$\nullable$ defined below,
-\blexers \; r \; s = \blexer \;r\;s
+tests whether the empty string belongs
-\end{equation}
+to the regular expression being checked.
-\noindent
-and this is what this report is mainly about. A condensed
-version of the last report will be provided in the next section
-to help the reader understand the report, and the attempts
-on the problem will follow.
-\section{Recapitulation of Concepts From the Last Report}
-\subsection*{Regular Expressions and Derivatives}
-Suppose (basic) regular expressions are given by the following grammar:
-\[			r ::=   \ZERO \mid  \ONE
-			 \mid  c
-			 \mid  r_1 \cdot r_2
-			 \mid  r_1 + r_2
-			 \mid r^*
-\]
-\noindent
-The ingenious contribution of Brzozowski is the notion of \emph{derivatives} of
-regular expressions, written~$\_ \backslash \_$. It uses the auxiliary notion of
-$\nullable$ defined below.
 \begin{center}
 		\begin{tabular}{lcl}
 			$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\
 			$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
 			$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
 			$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
 			$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
 			$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
 		\end{tabular}
-	\end{center}
+\end{center}
+\noindent
+The empty set does not contain any string and
+therefore not the empty string, the empty string
+regular expression contains the empty string
+by definition, the character regular expression
+is the singleton that contains character only,
+and therefore does not contain the empty string,
+the alternative regular expression(or "or" expression)
+might have one of its children regular expressions
+being nullable and any one of its children being nullable
+would suffice. The sequence regular expression
+would require both children to have the empty string
+to compose an empty string and the Kleene star
+operation naturally introduced the empty string.
+All nice and easy.
+This definition is used in the derivative operation
+as a condition:
 \begin{center}
 \begin{tabular}{lcl}
 		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\
 		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
 	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
 	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
 \end{tabular}
 \end{center}
 \noindent
-And defines how a regular expression evolves into
+\noindent
+The function derivative, written $\backslash c$,
+defines how a regular expression evolves into
 a new regular expression after all the string it contains
 is chopped off a certain head character $c$.
+The most involved cases are the sequence
-The main property of the derivative operation is that
+and star case.
+The sequence case says that if the first regular expression
+contains an empty string then second component of the sequence
+might be chosen as the target regular expression to be chopped
+off its head character.
+The star regular expression unwraps the iteration of
+regular expression and attack the star regular expression
+to its back again to make sure there are 0 or more iterations
+following this unfolded iteration.
+The main property of the derivative operation
+that enables us to reason about the correctness of
+an algorithm using derivatives is
 \begin{center}
 $c\!::\!s \in L(r)$ holds
 if and only if $s \in L(r\backslash c)$.
 \end{center}
 derivatives until we exhaust the string and then use \textit{nullable}
 to test whether the result can match the empty string. It can  be
 relatively  easily shown that this matcher is correct  (that is given
 an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$).
+Beautiful and simple definition.
+If we implement the above algorithm naively, however,
+the algorithm can be excruciatingly slow. For example, when starting with the regular
+expression $(a + aa)^*$ and building 12 successive derivatives
+w.r.t.~the character $a$, one obtains a derivative regular expression
+with more than 8000 nodes (when viewed as a tree). Operations like
+$\backslash$ and $\nullable$ need to traverse such trees and
+consequently the bigger the size of the derivative the slower the
+algorithm.
+Brzozowski was quick in finding that during this process a lot useless
+$\ONE$s and $\ZERO$s are generated and therefore not optimal.
+He also introduced some "similarity rules" such
+as $P+(Q+R) = (P+Q)+R$ to merge syntactically
+different but language-equivalent sub-regexes to further decrease the size
+of the intermediate regexes.
+More simplifications are possible, such as deleting duplicates
+and opening up nested alternatives to trigger even more simplifications.
+And suppose we apply simplification after each derivative step, and compose
+these two operations together as an atomic one: $a \backslash_{simp}\,c \dn
+\textit{simp}(a \backslash c)$. Then we can build
+a matcher without having  cumbersome regular expressions.
+If we want the size of derivatives in the algorithm to
+stay even lower, we would need more aggressive simplifications.
+Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
+deleting duplicates whenever possible. For example, the parentheses in
+$(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b
+\cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another
+example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
+$a^*+a+\ONE$. Adding these more aggressive simplification rules help us
+to achieve a very tight size bound, namely,
+the same size bound as that of the \emph{partial derivatives}.
+Building derivatives and then simplify them.
+So far so good. But what if we want to
+do lexing instead of just a YES/NO answer?
+This requires us to go back again to the world
+without simplification first for a moment.
+Sulzmann and Lu~\cite{Sulzmann2014} first came up with a nice and
+elegant(arguably as beautiful as the original
+derivatives definition) solution for this.
 \subsection*{Values and the Lexing Algorithm by Sulzmann and Lu}
-One limitation of Brzozowski's algorithm is that it only produces a
-YES/NO answer for whether a string is being matched by a regular
+They first defined the datatypes for storing the
-expression.  Sulzmann and Lu~\cite{Sulzmann2014} extended this algorithm
+lexing information called a \emph{value} or
-to allow generation of an actual matching, called a \emph{value} or
 sometimes also \emph{lexical value}.  These values and regular
 expressions correspond to each other as illustrated in the following
 table:
 \begin{center}
 	\begin{tabular}{c@{\hspace{20mm}}c}
 		\begin{tabular}{@{}rrl@{}}
 			\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
 		\end{tabular}
 	\end{tabular}
 \end{center}
 \noindent
+One regular expression can have multiple lexical values. For example
+for the regular expression $(a+b)^*$, it has a infinite list of
+values corresponding to it: $\Stars\,[]$, $\Stars\,[\Left(Char(a))]$,
+$\Stars\,[\Right(Char(b))]$, $\Stars\,[\Left(Char(a),\,\Right(Char(b))]$,
+$\ldots$, and vice versa.
+Even for the regular expression matching a certain string, there could
+still be more than one value corresponding to it.
+Take the example where $r= (a^*\cdot a^*)^*$ and the string
+$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
+The number of different ways of matching
+without allowing any value under a star to be flattened
+to an empty string can be given by the following formula:
+\begin{center}
+	$C_n = (n+1)+n C_1+\ldots + 2 C_{n-1}$
+\end{center}
+and a closed form formula can be calculated to be
+\begin{equation}
+	C_n =\frac{(2+\sqrt{2})^n - (2-\sqrt{2})^n}{4\sqrt{2}}
+\end{equation}
+which is clearly in exponential order.
+A lexer aimed at getting all the possible values has an exponential
+worst case runtime. Therefore it is impractical to try to generate
+all possible matches in a run. In practice, we are usually
+interested about POSIX values, which by intuition always
+match the leftmost regular expression when there is a choice
+and always match a sub part as much as possible before proceeding
+to the next token. For example, the above example has the POSIX value
+$ \Stars\,[\Seq(Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], Stars\,[])]$.
+The output of an algorithm we want would be a POSIX matching
+encoded as a value.
 The contribution of Sulzmann and Lu is an extension of Brzozowski's
 algorithm by a second phase (the first phase being building successive
 derivatives---see \eqref{graph:*}). In this second phase, a POSIX value
 is generated in case the regular expression matches  the string.
 Pictorially, the Sulzmann and Lu algorithm is as follows:
 $\textit{nullable}$ or not. If not, we know the string does not match
 $r$ and no value needs to be generated. If yes, we start building the
 values incrementally by \emph{injecting} back the characters into the
 earlier values $v_n, \ldots, v_0$. This is the second phase of the
 algorithm from the right to left. For the first value $v_n$, we call the
-function $\textit{mkeps}$, which builds the lexical value
+function $\textit{mkeps}$, which builds a POSIX lexical value
 for how the empty string has been matched by the (nullable) regular
 expression $r_n$. This function is defined as
 	\begin{center}
 		\begin{tabular}{lcl}
 \noindent
 After the $\mkeps$-call, we inject back the characters one by one in order to build
 the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
 ($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
 After injecting back $n$ characters, we get the lexical value for how $r_0$
-matches $s$. For this Sulzmann and Lu defined a function that reverses
+matches $s$. The POSIX value is maintained throught out the process.
+For this Sulzmann and Lu defined a function that reverses
 the ``chopping off'' of characters during the derivative phase. The
 corresponding function is called \emph{injection}, written
 $\textit{inj}$; it takes three arguments: the first one is a regular
 expression ${r_{i-1}}$, before the character is chopped off, the second
 is a character ${c_{i-1}}$, the character we want to inject and the
 \end{tabular}
 \end{center}
 \noindent This definition is by recursion on the ``shape'' of regular
 expressions and values.
+The clauses basically do one thing--identifying the ``holes'' on
+value to inject the character back into.
-\subsection*{Simplification of Regular Expressions}
+For instance, in the last clause for injecting back to a value
+that would turn into a new star value that corresponds to a star,
-The main drawback of building successive derivatives according
+we know it must be a sequence value. And we know that the first
-to Brzozowski's definition is that they can grow very quickly in size.
+value of that sequence corresponds to the child regex of the star
-This is mainly due to the fact that the derivative operation generates
+with the first character being chopped off--an iteration of the star
-often ``useless'' $\ZERO$s and $\ONE$s in derivatives.  As a result, if
+that had just been unfolded. This value is followed by the already
-implemented naively both algorithms by Brzozowski and by Sulzmann and Lu
+matched star iterations we collected before. So we inject the character
-are excruciatingly slow. For example when starting with the regular
+back to the first value and form a new value with this new iteration
-expression $(a + aa)^*$ and building 12 successive derivatives
+being added to the previous list of iterations, all under the $Stars$
-w.r.t.~the character $a$, one obtains a derivative regular expression
+top level.
-with more than 8000 nodes (when viewed as a tree). Operations like
-$\textit{der}$ and $\nullable$ need to traverse such trees and
+We have mentioned before that derivatives without simplification
-consequently the bigger the size of the derivative the slower the
+can get clumsy, and this is true for values as well--they reflect
-algorithm.
+the regular expressions size by definition.
-Fortunately, one can simplify regular expressions after each derivative
+One can introduce simplification on the regex and values, but have to
-step.
+be careful in not breaking the correctness as the injection
-Various simplifications of regular expressions are possible, such
+function heavily relies on the structure of the regexes and values
-as the simplification of $\ZERO + r$, $r + \ZERO$, $\ONE\cdot r$, $r
+being correct and match each other.
-\cdot \ONE$, and $r + r$ to just $r$.
+It can be achieved by recording some extra rectification functions
-Suppose we apply simplification after each derivative step, and compose
+during the derivatives step, and applying these rectifications in
-these two operations together as an atomic one: $a \backslash_{simp}\,c \dn
+each run during the injection phase.
-\textit{simp}(a \backslash c)$. Then we can build values without having
+And we can prove that the POSIX value of how
-a cumbersome regular expression, and  fortunately if we are careful
-enough in making some extra rectifications, the POSIX value of how
 regular expressions match strings will not be affected---although is much harder
 to establish. Some initial results in this regard have been
 obtained in \cite{AusafDyckhoffUrban2016}.
+%Brzozowski, after giving the derivatives and simplification,
-If we want the size of derivatives in Sulzmann and Lu's algorithm to
+%did not explore lexing with simplification or he may well be
-stay even lower, we would need more aggressive simplifications.
+%stuck on an efficient simplificaiton with a proof.
-Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
+%He went on to explore the use of derivatives together with
-deleting duplicates whenever possible. For example, the parentheses in
+%automaton, and did not try lexing using derivatives.
-$(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b
-\cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another
+We want to get rid of complex and fragile rectification of values.
-example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
+Can we not create those intermediate values $v_1,\ldots v_n$,
-$a^*+a+\ONE$. Adding these more aggressive simplification rules help us
+and get the lexing information that should be already there while
-to achieve a very tight size bound, namely,
+doing derivatives in one pass, without a second phase of injection?
-the same size bound as that of the \emph{partial derivatives}.
+In the meantime, can we make sure that simplifications
-And we want to get rid of complex and fragile rectification of values.
+are easily handled without breaking the correctness of the algorithm?
+Sulzmann and Lu solved this problem by
-In order to implement the idea of ``spilling out alternatives'' and to
+introducing additional informtaion to the
-make them compatible with the $\textit{inj}$-mechanism, we use
+regular expressions called \emph{bitcodes}.
-\emph{bitcodes}. They were first introduced by Sulzmann and Lu.
-Here bits and bitcodes (lists of bits) are defined as:
+\subsection*{Bit-coded Algorithm}
+Bits and bitcodes (lists of bits) are defined as:
 \begin{center}
 		$b ::=   1 \mid  0 \qquad
 bs ::= [] \mid b::bs
 $
 \noindent
 In this definition $\_\backslash s$ is the  generalisation  of the derivative
 operation from characters to strings (just like the derivatives for un-annotated
 regular expressions).
+Remember tha one of the important reasons we introduced bitcodes
+is that they can make simplification more structured and therefore guaranteeing
+the correctness.
 \subsection*{Our Simplification Rules}
-The main point of the bitcodes and annotated regular expressions is that
+In this section we introduce aggressive (in terms of size) simplification rules
-we can apply rather aggressive (in terms of size) simplification rules
+on annotated regular expressions
-in order to keep derivatives small. We have developed such
+in order to keep derivatives small. Such simplifications are promising
-``aggressive'' simplification rules and generated test data that show
+as we have
-that the expected bound can be achieved. Obviously we could only
+generated test data that show
+that a good tight bound can be achieved. Obviously we could only
 partially cover  the search space as there are infinitely many regular
 expressions and strings.
 One modification we introduced is to allow a list of annotated regular
 expressions in the $\sum$ constructor. This allows us to not just
 regular expression that is simplified at the final moment.
 We are almost there, but a last step is needed to make the proof work.
 Hopefully in the next few weeks we will be able to find one.
 This would be an important milestone for my dissertation.
+\section{Future Plans}
+Before the proof comes out, a plan is needed to make
+sure some progress is happening.
+We plan to get
+a practical implementation of our current
+method, which would involve  extending the set of allowed syntax
+to include common regular expression constructs
+such as bounded repitions, negation(complement)
+and character range or even Boolean functions.
+Adding these concise way of expressing regular expressions
+while keeping the correctness of the simplification
+makes the work more practical.
+For example, using the added constructs we can
+demonstrate that our implementation can actually
+help the cloudflare web-application firewall
+to run smoothly even if it had been fed with
+evil regular expression
+$.*(?:.*=.*)))$
+and strings like $x=xxxxxxxxxx....$.
+This could help prove that our algorithm is not just
+some nice theory work but is actually competent in reality
+as this regular expression and string pair made the cloudflare
+servers to grind
+to a halt on July 2nd, 2019.
+After all, simply the "basic" regular expressions
+we gave in the introduction is really so basic, that
+even Brzozowski's 1964 paper has included more
+varieties than that(Boolean functions instead of
+just alternatives)\cite{Brzozowski1964}.
+A function only implementing the basic ones
+would lack practical interests.
+Making the algorithm practical also means to make it fast.
+Right now the Scala implementation
+is slower than the naive algorithm without bitcodes, even with those
+drastic simplification measures, which is not suggested by theory.
+We might need to implement it in other languages such as C
+to see if the problem is language specific.
+This is about 3 months' work.
+We believe that there
+is still potential for more simplification as there are a lot
+of rules for regular expression similarity, however,
+this needs to be explored and checked.
+Looking for maximal simplification and
+best size bound and proving them would take around 3 months to complete.
+Context-free languages are a bit harder to deal with by derivatives
+when compared with regular expressions.
+The main reason is that the child node of a context grammar
+can be the grammar itself--this left-recursion makes
+the computation procedure an impossibility when trying to solve directly.
+An example would be the following\cite{might2011cfgder}:
+\begin{center}
+$L = L \cdot x | \ONE$
+\end{center}
+then a derivative would give us the following:
+\begin{center}
+$L\backslash x = (L \backslash x)\cdot x | \ONE$
+\end{center}
+which is essentially what we have previously.
+This recursion can go on forever if we do not use some
+other methods such as introducing methods
+or manually decide that the two equations
+are actually equivalent and $L$ and $L\backslash x$
+denote the same language. And then give a mathematical proof
+that this is actually the case.
+This seems well beyond any mechanical algorithm one can
+expect to be able to do.....
+People have tried various ways to deal with the problems
+of looping when computing derivatives of context-free grammars
+such as memoization and laziness, with some degree of success\cite{might2011cfgder}.
+However this field seems largely unexplored and further
+optimizations seem possible. It would be great
+if we could find a simple augmentation to the current
+derivative method so that the subset of context free language
+of interest can be handled.
+This may take longer than the two previous tasks, but we still
+give it 3 months time and see how it goes.
 \bibliographystyle{plain}
 \bibliography{root}
 \end{document}

changeset 148	c8ef391dd6f7
parent 145	a7c063981fa5