etnms/etnms.tex
changeset 148 c8ef391dd6f7
parent 145 a7c063981fa5
--- a/etnms/etnms.tex	Wed Mar 04 13:25:52 2020 +0000
+++ b/etnms/etnms.tex	Fri Apr 10 11:58:11 2020 +0100
@@ -91,75 +91,59 @@
 \maketitle
 
 \begin{abstract}
-  Brzozowski introduced in 1964 a beautifully simple algorithm for
-  regular expression matching based on the notion of derivatives of
-  regular expressions. In 2014, Sulzmann and Lu extended this
-  algorithm to not just give a YES/NO answer for whether or not a
-  regular expression matches a string, but in case it does also
-  answers with \emph{how} it matches the string.  This is important for
-  applications such as lexing (tokenising a string). The problem is to
-  make the algorithm by Sulzmann and Lu fast on all inputs without
-  breaking its correctness. Being fast depends on a complete set of 
-  simplification rules, some of which 
-  have been put forward by Sulzmann and Lu. We have extended their
-  rules in order to obtain a tight bound on the size of regular expressions.
-  We have tested these extended rules, but have not 
-  formally established their correctness. We have also not yet looked 
-  at extended regular expressions, such as bounded repetitions,
-  negation and back-references.
+Regular expressions are a concise way of specifying string patterns
+and are in need of fast matching algorithms.
+Matching by naively converting a regular expression into a
+Nondeterministic Finite Automaton (NFA) and simulating it
+can run into trouble on certain ``evil'' regular expression and string
+pairs.
+This motivates an alternative method based on
+derivatives of regular expressions\cite{Brzozowski1964}.
+With the variation introduced by Sulzmann and Lu\cite{Sulzmann2014},
+this method elegantly circumvents the problem of
+catastrophic backtracking that plagues backtracking NFA regex engines.
+We believe that such a method can help to address
+the status quo where badly written regular expressions
+can cause software much grief.
+We have come up with a set of simplification rules
+that keeps the space and time requirements of the algorithm
+below a tight bound.
+We have proved certain properties of the algorithm
+and of the simplification.
+We have not yet worked out a proof of correctness,
+although test data suggest that it holds.
 \end{abstract}
 
 
+
 \section{Introduction}
-%Regular expressions' derivatives, which have received 
-%renewed interest in the new millenium, is a beautiful....
-While we believe derivatives of regular expressions, written
-$r\backslash s$, are a beautiful concept (in terms of ease of
-implementing them in functional programming languages and in terms of
-reasoning about them formally), they have one major drawback: every
-derivative step can make regular expressions grow drastically in
-size. This in turn has negative effect on the runtime of the
-corresponding lexing algorithms. Consider for example the regular
-expression $(a+aa)^*$ and the short string $aaaaaaaaaaaa$. The
-corresponding derivative contains already 8668 nodes where we assume
-the derivative is given as a tree. The reason for the poor runtime of
-the derivative-based lexing algorithms is that they need to traverse
-such trees over and over again. The solution is to find a complete set
-of simplification rules that keep the sizes of derivatives uniformly
-small.
-
-This has been partially addressed by the function $\blexer_{simp}$,
-which after  the simplification the $(a+aa)^*$ example's 8000 nodes will be
-reduced to just 6 and stays constant in each derivative step.
-The part that still needs more work is the correctness proof of this 
-function, namely,
-\begin{equation}\label{mainthm}
-\blexers \; r \; s = \blexer \;r\;s
-\end{equation}
-
-\noindent
-and this is what this report is mainly about. A condensed 
-version of the last report will be provided in the next section
-to help the reader understand the report, and the attempts 
-on the problem will follow.
-
-
-\section{Recapitulation of Concepts From the Last Report}
-
-\subsection*{Regular Expressions and Derivatives}
-Suppose (basic) regular expressions are given by the following grammar:
-
-\[			r ::=   \ZERO \mid  \ONE
-			 \mid  c  
-			 \mid  r_1 \cdot r_2
-			 \mid  r_1 + r_2   
-			 \mid r^*         
-\]
-
-\noindent
-The ingenious contribution of Brzozowski is the notion of \emph{derivatives} of
-regular expressions, written~$\_ \backslash \_$. It uses the auxiliary notion of
-$\nullable$ defined below.
+The following brief introduction gives the background for
+derivatives: how they were originally defined and the
+variant\cite{Sulzmann2014} that is more suitable for lexing.
+Regular expressions can be formalized by the following grammar:
+\begin{center}
+	$r ::= \ZERO \mid \ONE \mid c \mid r_1 + r_2 \mid r_1 \cdot r_2 \mid r^*$
+\end{center}
+The derivative of a regular expression can be
+seen as the regular expression that corresponds to
+the language obtained by taking
+the left quotient of the original regular language
+w.r.t.~a character.
+The derivative of a regular expression is again a
+regular expression, because
+the left quotient of a regular language is also regular.
+This tells us that such a function exists, without us having
+to worry that it might not be well defined.
+In fact, derivatives of regular expressions
+are not just well defined, but also simple and elegant to compute.
+They are given by two recursive functions
+that can be easily explained by intuition.
+The function
+$\nullable$, defined below,
+tests whether the empty string belongs
+to the language of the regular expression being checked.
 
 \begin{center}
 		\begin{tabular}{lcl}
@@ -170,7 +154,24 @@
 			$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
 			$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
 		\end{tabular}
-	\end{center}
+\end{center}
+\noindent
+The empty set does not contain any string and
+therefore does not contain the empty string.
+The empty string regular expression contains the empty string
+by definition. The character regular expression
+is the singleton language containing just that character
+and therefore does not contain the empty string.
+For the alternative regular expression (the ``or'' expression)
+it suffices that one of its children is nullable.
+The sequence regular expression
+requires both children to contain the empty string
+in order to compose the empty string, and the Kleene star
+naturally contains the empty string (zero iterations).
+All nice and easy.
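+For example, unfolding these clauses for the small regular expression
+$(a + \ONE)\cdot b^*$ gives
+\begin{center}
+$\nullable((a + \ONE)\cdot b^*)
+ = (\nullable(a) \vee \nullable(\ONE)) \wedge \nullable(b^*)
+ = (\mathit{false} \vee \mathit{true}) \wedge \mathit{true}
+ = \mathit{true}$.
+\end{center}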
+This definition is used in the derivative operation
+as a condition:
 
 \begin{center}
 \begin{tabular}{lcl}
@@ -186,11 +187,26 @@
 \end{tabular}
 \end{center}
 \noindent
-And defines how a regular expression evolves into
+The derivative function, written $\backslash c$,
+defines how a regular expression evolves into
+a new regular expression that matches the original strings
+with the head character $c$ chopped off.
+The most involved cases are the sequence
+and the star case.
+The sequence case says that if the first regular expression
+can match the empty string, then the second component of the sequence
+may also be chosen as the regular expression from which the head
+character is chopped off.
+The star case unwraps one iteration of
+the regular expression and attaches the star regular expression
+back to it, to make sure there are 0 or more iterations
+following this unfolded iteration.
 
-The main property of the derivative operation is that
+
+The main property of the derivative operation
+that enables us to reason about the correctness of
+an algorithm using derivatives is 
 
 \begin{center}
 $c\!::\!s \in L(r)$ holds
@@ -232,18 +248,61 @@
 relatively  easily shown that this matcher is correct  (that is given
 an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$).
 
- 
+A beautiful and simple definition.
+
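+To make these definitions concrete, here is a minimal Scala sketch of
+the resulting matcher. The constructor and function names are our own
+choice for illustration and not fixed by the definitions above.
+
+\begin{verbatim}
+abstract class Rexp
+case object ZERO extends Rexp                    // matches nothing
+case object ONE extends Rexp                     // matches the empty string
+case class CHAR(c: Char) extends Rexp            // matches the character c
+case class ALT(r1: Rexp, r2: Rexp) extends Rexp  // r1 + r2
+case class SEQ(r1: Rexp, r2: Rexp) extends Rexp  // r1 . r2
+case class STAR(r: Rexp) extends Rexp            // r*
+
+def nullable(r: Rexp): Boolean = r match {
+  case ZERO => false
+  case ONE => true
+  case CHAR(_) => false
+  case ALT(r1, r2) => nullable(r1) || nullable(r2)
+  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
+  case STAR(_) => true
+}
+
+def der(c: Char, r: Rexp): Rexp = r match {
+  case ZERO => ZERO
+  case ONE => ZERO
+  case CHAR(d) => if (c == d) ONE else ZERO
+  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
+  case SEQ(r1, r2) =>
+    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
+    else SEQ(der(c, r1), r2)
+  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
+}
+
+// derivative w.r.t. a string, and the matcher itself
+def ders(s: List[Char], r: Rexp): Rexp = s match {
+  case Nil => r
+  case c :: cs => ders(cs, der(c, r))
+}
+
+def matcher(r: Rexp, s: String): Boolean = nullable(ders(s.toList, r))
+\end{verbatim}
+
+\noindent
+For instance, \texttt{matcher(SEQ(CHAR('a'), STAR(CHAR('b'))), "abb")}
+evaluates to \texttt{true}.
+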
+If we implement the above algorithm naively, however,
+it can be excruciatingly slow. For example, when starting with the regular
+expression $(a + aa)^*$ and building 12 successive derivatives
+w.r.t.~the character $a$, one obtains a derivative regular expression
+with more than 8000 nodes (when viewed as a tree). Operations like
+$\backslash$ and $\nullable$ need to traverse such trees and
+consequently the bigger the size of the derivative the slower the
+algorithm. 
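+
+To see where this growth comes from, consider just the first two
+derivative steps for $(a + aa)^*$ w.r.t.~the character $a$, without any
+simplification:
+\begin{center}
+\begin{tabular}{lcl}
+$(a + aa)^* \backslash a$ & $=$ & $(\ONE + \ONE\cdot a)\cdot(a + aa)^*$\\
+$((\ONE + \ONE\cdot a)\cdot(a + aa)^*) \backslash a$ & $=$ &
+  $(\ZERO + (\ZERO\cdot a + \ONE))\cdot(a + aa)^* + (\ONE + \ONE\cdot a)\cdot(a + aa)^*$\\
+\end{tabular}
+\end{center}
+\noindent
+Already after two steps useless $\ZERO$s and $\ONE$s accumulate and whole
+subterms are duplicated; this blow-up repeats with every further step.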
+
+Brzozowski was quick to observe that during this process a lot of
+useless $\ONE$s and $\ZERO$s are generated, making the derivatives
+unnecessarily large.
+He therefore also introduced some ``similarity rules'', such
+as $P+(Q+R) = (P+Q)+R$, to merge syntactically
+different but language-equivalent sub-regexes and thus further decrease the size
+of the intermediate regexes.
+
+More simplifications are possible, such as deleting duplicates
+and opening up nested alternatives to trigger even more simplifications.
+Suppose we apply simplification after each derivative step and compose
+these two operations together into an atomic one: $a \backslash_{simp}\,c \dn
+\textit{simp}(a \backslash c)$. Then we can build
+a matcher without having to deal with cumbersome regular expressions.
+
+
+If we want the size of derivatives in the algorithm to
+stay even lower, we would need more aggressive simplifications.
+Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
+deleting duplicates whenever possible. For example, the parentheses in
+$(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b
+\cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another
+example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
+$a^*+a+\ONE$. Adding these more aggressive simplification rules helps us
+to achieve a very tight size bound, namely,
+ the same size bound as that of the \emph{partial derivatives}. 
+
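+The following is a rough Scala sketch of what such simplifications can
+look like on the plain regular expressions from the sketch above; in our
+actual algorithm they are phrased on annotated regular expressions, so
+the details differ.
+
+\begin{verbatim}
+// open up nested alternatives into a flat list, dropping ZEROs
+def flatten(r: Rexp): List[Rexp] = r match {
+  case ALT(r1, r2) => flatten(r1) ::: flatten(r2)
+  case ZERO => Nil
+  case _ => List(r)
+}
+
+def simp(r: Rexp): Rexp = r match {
+  case SEQ(r1, r2) => (simp(r1), simp(r2)) match {
+    case (ZERO, _) => ZERO
+    case (_, ZERO) => ZERO
+    case (ONE, s2) => s2
+    case (s1, ONE) => s1
+    case (s1, s2) => SEQ(s1, s2)
+  }
+  case ALT(r1, r2) =>
+    // flatten, remove duplicates, then rebuild the alternative
+    flatten(ALT(simp(r1), simp(r2))).distinct match {
+      case Nil => ZERO
+      case s :: ss => ss.foldLeft(s)((acc, q) => ALT(acc, q))
+    }
+  case _ => r
+}
+
+// composing one derivative step with simplification
+def der_simp(c: Char, r: Rexp): Rexp = simp(der(c, r))
+
+def ders_simp(s: List[Char], r: Rexp): Rexp = s match {
+  case Nil => r
+  case c :: cs => ders_simp(cs, der_simp(c, r))
+}
+\end{verbatim}
+
+\noindent
+These rules are already enough to keep the derivatives of the
+$(a + aa)^*$ example above small instead of letting them grow into
+thousands of nodes.
+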
+Building derivatives and then simplifying them:
+so far so good. But what if we want to
+do lexing instead of just getting a YES/NO answer?
+This requires us to go back again, for a moment, to the world
+without simplification.
+Sulzmann and Lu~\cite{Sulzmann2014} came up with a nice and
+elegant (arguably as beautiful as the original
+derivatives definition) solution for this.
+
 \subsection*{Values and the Lexing Algorithm by Sulzmann and Lu}
 
-One limitation of Brzozowski's algorithm is that it only produces a
-YES/NO answer for whether a string is being matched by a regular
-expression.  Sulzmann and Lu~\cite{Sulzmann2014} extended this algorithm
-to allow generation of an actual matching, called a \emph{value} or
+
+They first defined the datatypes for storing the
+lexing information, called a \emph{value} or
 sometimes also \emph{lexical value}.  These values and regular
 expressions correspond to each other as illustrated in the following
 table:
 
-
 \begin{center}
 	\begin{tabular}{c@{\hspace{20mm}}c}
 		\begin{tabular}{@{}rrl@{}}
@@ -271,6 +330,36 @@
 \end{center}
 
 \noindent
+One regular expression can have multiple lexical values. For example
+the regular expression $(a+b)^*$ has an infinite list of
+values corresponding to it: $\Stars\,[]$, $\Stars\,[\Left(\Char(a))]$,
+$\Stars\,[\Right(\Char(b))]$, $\Stars\,[\Left(\Char(a)),\,\Right(\Char(b))]$,
+$\ldots$, and so on.
+Even with respect to a fixed string that the regular expression matches,
+there can still be more than one value corresponding to it.
+Take the example where $r= (a^*\cdot a^*)^*$ and the string
+$s=\underbrace{aa\ldots a}_\text{$n$ \textit{a}s}$.
+The number of different ways of matching,
+without allowing any value under a star to be flattened
+to the empty string, can be given by the following formula:
+\begin{center}
+	$C_n = (n+1)+n C_1+\ldots + 2 C_{n-1}$
+\end{center}
+and a closed-form formula can be calculated to be
+\begin{equation}
+	C_n =\frac{(2+\sqrt{2})^n - (2-\sqrt{2})^n}{4\sqrt{2}}
+\end{equation}
+which is clearly exponential.
+A lexer aimed at enumerating all possible values therefore has an
+exponential worst-case runtime, so it is impractical to try to generate
+all possible matches in a run. In practice, we are usually
+interested in POSIX values, which intuitively always
+match the leftmost alternative when there is a choice
+and always match a sub-part as much as possible before proceeding
+to the next token. For example, the above example has the POSIX value
+$ \Stars\,[\Seq(\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{$n$ iterations}], \Stars\,[])]$.
+The output we want from a lexing algorithm is such a POSIX match
+encoded as a value.
 The contribution of Sulzmann and Lu is an extension of Brzozowski's
 algorithm by a second phase (the first phase being building successive
 derivatives---see \eqref{graph:*}). In this second phase, a POSIX value 
@@ -298,7 +387,7 @@
 values incrementally by \emph{injecting} back the characters into the
 earlier values $v_n, \ldots, v_0$. This is the second phase of the
 algorithm from the right to left. For the first value $v_n$, we call the
-function $\textit{mkeps}$, which builds the lexical value
+function $\textit{mkeps}$, which builds a POSIX lexical value
 for how the empty string has been matched by the (nullable) regular
 expression $r_n$. This function is defined as
 
@@ -320,7 +409,8 @@
 the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
 ($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
 After injecting back $n$ characters, we get the lexical value for how $r_0$
-matches $s$. For this Sulzmann and Lu defined a function that reverses
+matches $s$. The POSIX value is maintained throughout the process.
+For this Sulzmann and Lu defined a function that reverses
 the ``chopping off'' of characters during the derivative phase. The
 corresponding function is called \emph{injection}, written
 $\textit{inj}$; it takes three arguments: the first one is a regular
@@ -345,55 +435,54 @@
 
 \noindent This definition is by recursion on the ``shape'' of regular
 expressions and values. 
-
-
-\subsection*{Simplification of Regular Expressions}
+The clauses basically do one thing: they identify the ``hole'' in the
+value into which the character is injected back.
+For instance, consider the last clause, which injects a character back
+into a value for a star regular expression.
+We know the value being injected into must be a sequence value, because
+the derivative of a star is a sequence: its first component corresponds
+to the child regex of the star with the head character chopped off, that
+is, the iteration of the star that has just been unfolded, while its
+second component holds the already matched star iterations collected
+before.
+So we inject the character back into the first value and form a new
+value in which this repaired iteration is added to the front of the
+previous list of iterations, all under a $\Stars$ at the top level.
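+
+To make this last clause concrete, here is a Scala sketch of values,
+\textit{mkeps} and \textit{inj}, building on the regular expression
+datatype from the earlier sketch (again, the constructor names are our
+own choice):
+
+\begin{verbatim}
+abstract class Val
+case object Empty extends Val                 // for ONE
+case class Chr(c: Char) extends Val           // for CHAR
+case class Sequ(v1: Val, v2: Val) extends Val // for SEQ
+case class Left(v: Val) extends Val           // left branch of ALT
+case class Right(v: Val) extends Val          // right branch of ALT
+case class Stars(vs: List[Val]) extends Val   // for STAR
+
+// how a nullable regex matches the empty string
+def mkeps(r: Rexp): Val = r match {
+  case ONE => Empty
+  case ALT(r1, r2) =>
+    if (nullable(r1)) Left(mkeps(r1)) else Right(mkeps(r2))
+  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
+  case STAR(_) => Stars(Nil)
+}
+
+// reversing one derivative step on the value
+def inj(r: Rexp, c: Char, v: Val): Val = (r, v) match {
+  case (CHAR(_), Empty) => Chr(c)
+  case (ALT(r1, _), Left(v1)) => Left(inj(r1, c, v1))
+  case (ALT(_, r2), Right(v2)) => Right(inj(r2, c, v2))
+  case (SEQ(r1, _), Sequ(v1, v2)) => Sequ(inj(r1, c, v1), v2)
+  case (SEQ(r1, _), Left(Sequ(v1, v2))) => Sequ(inj(r1, c, v1), v2)
+  case (SEQ(r1, r2), Right(v2)) => Sequ(mkeps(r1), inj(r2, c, v2))
+  case (STAR(r1), Sequ(v1, Stars(vs))) =>
+    // the star clause described above: repair the unfolded iteration
+    // and cons it onto the previously matched iterations
+    Stars(inj(r1, c, v1) :: vs)
+}
+\end{verbatim}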
 
-The main drawback of building successive derivatives according
-to Brzozowski's definition is that they can grow very quickly in size.
-This is mainly due to the fact that the derivative operation generates
-often ``useless'' $\ZERO$s and $\ONE$s in derivatives.  As a result, if
-implemented naively both algorithms by Brzozowski and by Sulzmann and Lu
-are excruciatingly slow. For example when starting with the regular
-expression $(a + aa)^*$ and building 12 successive derivatives
-w.r.t.~the character $a$, one obtains a derivative regular expression
-with more than 8000 nodes (when viewed as a tree). Operations like
-$\textit{der}$ and $\nullable$ need to traverse such trees and
-consequently the bigger the size of the derivative the slower the
-algorithm. 
+We have mentioned before that derivatives without simplification 
+can get clumsy, and this is true for values as well: they reflect
+the size of the regular expressions by definition.
 
-Fortunately, one can simplify regular expressions after each derivative
-step. 
-Various simplifications of regular expressions are possible, such
-as the simplification of $\ZERO + r$, $r + \ZERO$, $\ONE\cdot r$, $r
-\cdot \ONE$, and $r + r$ to just $r$.
-Suppose we apply simplification after each derivative step, and compose
-these two operations together as an atomic one: $a \backslash_{simp}\,c \dn
-\textit{simp}(a \backslash c)$. Then we can build values without having
-a cumbersome regular expression, and  fortunately if we are careful
-enough in making some extra rectifications, the POSIX value of how
+One can introduce simplification on the regexes and values, but one has
+to be careful not to break the correctness, as the injection
+function heavily relies on the structure of the regexes and values
+being ``correct'' and matching each other.
+This can be achieved by recording some extra rectification functions
+during the derivative steps, and applying these rectifications
+in each step of the injection phase.
+One can then prove that the POSIX value of how
+regular expressions match strings will not be affected, although this is much harder
 to establish. Some initial results in this regard have been
 obtained in \cite{AusafDyckhoffUrban2016}. 
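+
+A rough Scala sketch of this idea, loosely following
+\cite{AusafDyckhoffUrban2016} and building on the earlier sketches, is
+given below; the names and the exact set of rules are our own, and only
+the alternative and sequence cases are shown.
+
+\begin{verbatim}
+// simplification that also returns a rectification function, which
+// repairs a value for the simplified regex into a value for the
+// original regex
+def simpRect(r: Rexp): (Rexp, Val => Val) = r match {
+  case ALT(r1, r2) =>
+    val (s1, f1) = simpRect(r1)
+    val (s2, f2) = simpRect(r2)
+    (s1, s2) match {
+      case (ZERO, _) => (s2, v => Right(f2(v)))
+      case (_, ZERO) => (s1, v => Left(f1(v)))
+      case _ =>
+        if (s1 == s2) (s1, v => Left(f1(v)))
+        else {
+          val f: Val => Val = {
+            case Left(v) => Left(f1(v))
+            case Right(v) => Right(f2(v))
+          }
+          (ALT(s1, s2), f)
+        }
+    }
+  case SEQ(r1, r2) =>
+    val (s1, f1) = simpRect(r1)
+    val (s2, f2) = simpRect(r2)
+    (s1, s2) match {
+      case (ONE, _) => (s2, v => Sequ(f1(Empty), f2(v)))
+      case (_, ONE) => (s1, v => Sequ(f1(v), f2(Empty)))
+      case _ =>
+        val f: Val => Val = { case Sequ(v1, v2) => Sequ(f1(v1), f2(v2)) }
+        (SEQ(s1, s2), f)
+    }
+  case r => (r, v => v)
+}
+
+// the lexer rectifies the value before injecting the character back
+def lexSimp(r: Rexp, s: List[Char]): Val = s match {
+  case Nil =>
+    if (nullable(r)) mkeps(r) else throw new Exception("lexing error")
+  case c :: cs =>
+    val (rs, f) = simpRect(der(c, r))
+    inj(r, c, f(lexSimp(rs, cs)))
+}
+\end{verbatim}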
 
+%Brzozowski, after giving the derivatives and simplification,
+%did not explore lexing with simplification or he may well be 
+%stuck on an efficient simplificaiton with a proof.
+%He went on to explore the use of derivatives together with 
+%automaton, and did not try lexing using derivatives.
 
-If we want the size of derivatives in Sulzmann and Lu's algorithm to
-stay even lower, we would need more aggressive simplifications.
-Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
-deleting duplicates whenever possible. For example, the parentheses in
-$(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b
-\cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another
-example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
-$a^*+a+\ONE$. Adding these more aggressive simplification rules help us
-to achieve a very tight size bound, namely,
- the same size bound as that of the \emph{partial derivatives}. 
-And we want to get rid of complex and fragile rectification of values.
+We want to get rid of the complex and fragile rectification of values.
+Can we avoid creating those intermediate values $v_1,\ldots, v_n$
+altogether, and obtain the lexing information, which should already be
+there, in one pass while doing the derivatives, without a second
+injection phase?
+At the same time, can we make sure that simplifications
+are easily handled without breaking the correctness of the algorithm?
 
+Sulzmann and Lu solved this problem by
+attaching additional information to the
+regular expressions, the so-called \emph{bitcodes}.
 
-In order to implement the idea of ``spilling out alternatives'' and to
-make them compatible with the $\textit{inj}$-mechanism, we use
-\emph{bitcodes}. They were first introduced by Sulzmann and Lu.
-Here bits and bitcodes (lists of bits) are defined as:
+\subsection*{Bit-coded Algorithm}
+Bits and bitcodes (lists of bits) are defined as:
 
 \begin{center}
 		$b ::=   1 \mid  0 \qquad
@@ -674,14 +763,18 @@
 operation from characters to strings (just like the derivatives for un-annotated
 regular expressions).
 
+Remember that one of the important reasons we introduced bitcodes
+is that they can make simplification more structured and therefore
+help to guarantee its correctness.
 
 \subsection*{Our Simplification Rules}
 
-The main point of the bitcodes and annotated regular expressions is that
-we can apply rather aggressive (in terms of size) simplification rules
-in order to keep derivatives small. We have developed such
-``aggressive'' simplification rules and generated test data that show
-that the expected bound can be achieved. Obviously we could only
+In this section we introduce aggressive (in terms of size) simplification rules
+on annotated regular expressions
+in order to keep derivatives small. Such simplifications are promising,
+as we have
+generated test data suggesting
+that a tight size bound can be achieved. Obviously we could only
 partially cover  the search space as there are infinitely many regular
 expressions and strings. 
 
@@ -1756,6 +1849,92 @@
 Hopefully in the next few weeks we will be able to find one.
 This would be an important milestone for my dissertation.
 
+
+\section{Future Plans}
+
+While the proof is still being worked out, a plan is needed to make
+sure that progress keeps happening.
+We plan to work towards
+a practical implementation of our current
+method, which would involve extending the set of allowed syntax
+to include common regular expression constructs
+such as bounded repetitions, negation (complement)
+and character ranges, or even Boolean functions.
+Adding these concise ways of expressing regular expressions,
+while keeping the correctness of the simplification,
+makes the work more practical.
+For example, using the added constructs we can
+demonstrate that our implementation can actually
+help the Cloudflare web-application firewall
+to run smoothly even when fed with the
+evil regular expression
+$.*(?:.*=.*)))$
+and strings like $x=xxxxxxxxxx....$.
+This could help to show that our algorithm is not just
+nice theory but actually competent in practice,
+as this regular expression and string pair brought the Cloudflare
+servers to a grinding
+halt on July 2nd, 2019.
+After all, the ``basic'' regular expressions
+we gave in the introduction are really so basic that
+even Brzozowski's 1964 paper included more
+constructs than that (Boolean functions instead of
+just alternatives)\cite{Brzozowski1964}.
+An implementation covering only the basic ones
+would be of limited practical interest.
+Making the algorithm practical also means making it fast.
+Right now the Scala implementation
+is slower than the naive algorithm without bitcodes, even with the
+drastic simplification measures, which is not what the theory suggests.
+We might need to implement it in other languages such as C
+to see if the problem is language-specific.
+This is about 3 months' work.
+
+We believe that there
+is still potential for more simplification, as there are a lot
+of rules for regular expression similarity; however,
+this needs to be explored and checked.
+Looking for maximal simplification and the
+best size bound, and proving them, would take around 3 months to complete.
+
+Context-free languages are a bit harder to deal with by derivatives
+than regular expressions.
+The main reason is that a child node of a context-free grammar
+can be the grammar itself; this left recursion makes
+a direct computation of the derivative impossible.
+An example would be the following\cite{might2011cfgder}:
+\begin{center}
+$L = L \cdot x \mid \ONE$
+\end{center}
+Then taking the derivative w.r.t.~$x$ gives us:
+\begin{center}
+$L\backslash x = (L \backslash x)\cdot x \mid \ONE$
+\end{center}
+which has essentially the same self-referential shape as before.
+This recursion can go on forever if we do not use some
+other method, for example manually deciding that the two equations
+are actually equivalent, that is, that $L$ and $L\backslash x$
+denote the same language, and then giving a mathematical proof
+that this is indeed the case.
+This seems well beyond what any mechanical algorithm can
+be expected to do.
+People have tried various ways, such as memoization and laziness, to deal
+with the problem of looping when computing derivatives of context-free
+grammars, with some degree of success\cite{might2011cfgder}.
+However, this field seems largely unexplored and further
+optimizations seem possible. It would be great
+if we could find a simple augmentation of the current
+derivative method so that the subset of context-free languages
+of interest can be handled.
+This may take longer than the two previous tasks, but we still
+give it 3 months' time and see how it goes.
+
 \bibliographystyle{plain}
 \bibliography{root}