--- a/etnms/etnms.tex Wed Mar 04 13:25:52 2020 +0000
+++ b/etnms/etnms.tex Fri Apr 10 11:58:11 2020 +0100
@@ -91,75 +91,59 @@
\maketitle
\begin{abstract}
- Brzozowski introduced in 1964 a beautifully simple algorithm for
- regular expression matching based on the notion of derivatives of
- regular expressions. In 2014, Sulzmann and Lu extended this
- algorithm to not just give a YES/NO answer for whether or not a
- regular expression matches a string, but in case it does also
- answers with \emph{how} it matches the string. This is important for
- applications such as lexing (tokenising a string). The problem is to
- make the algorithm by Sulzmann and Lu fast on all inputs without
- breaking its correctness. Being fast depends on a complete set of
- simplification rules, some of which
- have been put forward by Sulzmann and Lu. We have extended their
- rules in order to obtain a tight bound on the size of regular expressions.
- We have tested these extended rules, but have not
- formally established their correctness. We have also not yet looked
- at extended regular expressions, such as bounded repetitions,
- negation and back-references.
+Regular expressions are a useful concept in computer science,
+originally designed for matching string patterns,
+and they need fast matching algorithms.
+If matching is done by naively converting
+the regular expression into a
+Nondeterministic Finite Automaton (NFA) and simulating it,
+the matcher can run into trouble on certain evil regular expression and string
+pairs.
+This motivates a
+method based on derivatives of regular expressions~\cite{Brzozowski1964}.
+With a variation introduced by Sulzmann and Lu~\cite{Sulzmann2014},
+it elegantly circumvents the problem of
+catastrophic backtracking that plagues
+backtracking NFA regex engines.
+We believe that such a method can help address
+the status quo where a badly written
+regular expression can bring a piece of software to grief.
+We have come up with a set of simplification rules
+that keeps the space and time requirements within a tight bound.
+We have proved certain properties of the algorithm
+and of the simplification.
+We have not yet worked out a full correctness proof,
+although test data suggest that it holds.
\end{abstract}
+
\section{Introduction}
-%Regular expressions' derivatives, which have received
-%renewed interest in the new millenium, is a beautiful....
-While we believe derivatives of regular expressions, written
-$r\backslash s$, are a beautiful concept (in terms of ease of
-implementing them in functional programming languages and in terms of
-reasoning about them formally), they have one major drawback: every
-derivative step can make regular expressions grow drastically in
-size. This in turn has negative effect on the runtime of the
-corresponding lexing algorithms. Consider for example the regular
-expression $(a+aa)^*$ and the short string $aaaaaaaaaaaa$. The
-corresponding derivative contains already 8668 nodes where we assume
-the derivative is given as a tree. The reason for the poor runtime of
-the derivative-based lexing algorithms is that they need to traverse
-such trees over and over again. The solution is to find a complete set
-of simplification rules that keep the sizes of derivatives uniformly
-small.
-
-This has been partially addressed by the function $\blexer_{simp}$,
-which after the simplification the $(a+aa)^*$ example's 8000 nodes will be
-reduced to just 6 and stays constant in each derivative step.
-The part that still needs more work is the correctness proof of this
-function, namely,
-\begin{equation}\label{mainthm}
-\blexers \; r \; s = \blexer \;r\;s
-\end{equation}
-
-\noindent
-and this is what this report is mainly about. A condensed
-version of the last report will be provided in the next section
-to help the reader understand the report, and the attempts
-on the problem will follow.
-
-
-\section{Recapitulation of Concepts From the Last Report}
-
-\subsection*{Regular Expressions and Derivatives}
-Suppose (basic) regular expressions are given by the following grammar:
-
-\[ r ::= \ZERO \mid \ONE
- \mid c
- \mid r_1 \cdot r_2
- \mid r_1 + r_2
- \mid r^*
-\]
-
-\noindent
-The ingenious contribution of Brzozowski is the notion of \emph{derivatives} of
-regular expressions, written~$\_ \backslash \_$. It uses the auxiliary notion of
-$\nullable$ defined below.
+The following brief introduction gives the background for
+derivatives: how they were originally defined and the
+variant~\cite{Sulzmann2014} that is more suitable for lexing.
+Regular expressions can be formalized as follows:
+\begin{center}
+	$r ::= \ZERO \mid \ONE \mid c \mid r_1 + r_2 \mid r_1 \cdot r_2 \mid r^*$
+\end{center}
+The derivative of a regular expression can be
+seen as the regular expression that corresponds to
+the language obtained by computing
+the left quotient of the original regular language
+w.r.t.~a character.
+The derivative of a regular expression is again a
+regular expression, because
+the left quotient of a regular language is also regular.
+From this we know that such a function exists, without having
+to worry that it might not be defined.
+In fact, derivatives of regular expressions
+are not just well-defined but also simple and elegant:
+they are computed by two recursive functions
+that can be easily explained by intuition.
+The function
+$\nullable$, defined below,
+tests whether the empty string belongs
+to the language of the regular expression being checked.
\begin{center}
\begin{tabular}{lcl}
@@ -170,7 +154,24 @@
$\nullable(r_1\cdot r_2)$ & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
$\nullable(r^*)$ & $\dn$ & $\mathit{true}$ \\
\end{tabular}
- \end{center}
+\end{center}
+\noindent
+The empty set does not contain any string and
+therefore not the empty string; the empty string
+regular expression contains the empty string
+by definition; the character regular expression
+denotes the singleton language containing only that character
+and therefore does not contain the empty string;
+for the alternative regular expression (or ``or'' expression)
+it suffices that one of its children is nullable;
+the sequence regular expression
+requires both children to contain the empty string
+in order to compose the empty string; and the Kleene star
+operation naturally introduces the empty string
+(zero iterations).
+This definition is used in the derivative operation
+as a condition:
\begin{center}
\begin{tabular}{lcl}
@@ -186,11 +187,26 @@
\end{tabular}
\end{center}
\noindent
-And defines how a regular expression evolves into
+The derivative function, written $\backslash c$,
+defines how a regular expression evolves into
a new regular expression after all the string it contains
is chopped off a certain head character $c$.
+The most involved cases are the sequence
+and the star case.
+The sequence case says that if the first regular expression
+contains the empty string, then the second component of the sequence
+may also be chosen as the target regular expression from which
+the head character is chopped off.
+The star case unwraps one iteration of the
+regular expression and attaches the star regular expression
+to the back of it again, to ensure that zero or more iterations
+can follow this unfolded iteration.
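+
+As a small worked example of these two cases, taking the derivative of
+$(a+aa)^*$ w.r.t.~the character $a$ unfolds one iteration and gives
+\begin{center}
+	$(a+aa)^*\backslash a \;=\; (\ONE + \ONE\cdot a)\cdot(a+aa)^*$
+\end{center}
+where the unfolded iteration $(a+aa)\backslash a = \ONE + \ONE\cdot a$ is
+followed by the original star so that further iterations may follow.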
-The main property of the derivative operation is that
+
+The main property of the derivative operation
+that enables us to reason about the correctness of
+an algorithm using derivatives is
\begin{center}
$c\!::\!s \in L(r)$ holds
@@ -232,18 +248,61 @@
relatively easily shown that this matcher is correct (that is given
an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$).
-
+This is a beautiful and simple definition.
+
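+For concreteness, the following is a minimal sketch in Scala of such a
+derivative-based matcher (the datatype and function names are our own
+choices for illustration, not those of any particular implementation):
+{\small
+\begin{verbatim}
+abstract class Rexp
+case object ZERO extends Rexp
+case object ONE extends Rexp
+case class CHAR(c: Char) extends Rexp
+case class ALT(r1: Rexp, r2: Rexp) extends Rexp
+case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
+case class STAR(r: Rexp) extends Rexp
+
+def nullable(r: Rexp): Boolean = r match {
+  case ZERO => false
+  case ONE => true
+  case CHAR(_) => false
+  case ALT(r1, r2) => nullable(r1) || nullable(r2)
+  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
+  case STAR(_) => true
+}
+
+def der(c: Char, r: Rexp): Rexp = r match {
+  case ZERO => ZERO
+  case ONE => ZERO
+  case CHAR(d) => if (c == d) ONE else ZERO
+  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
+  case SEQ(r1, r2) =>
+    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
+    else SEQ(der(c, r1), r2)
+  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
+}
+
+// extend the derivative from characters to strings
+def ders(s: List[Char], r: Rexp): Rexp = s match {
+  case Nil => r
+  case c :: cs => ders(cs, der(c, r))
+}
+
+// s is matched by r iff the derivative w.r.t. s is nullable
+def matcher(r: Rexp, s: String): Boolean =
+  nullable(ders(s.toList, r))
+\end{verbatim}}
+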
+If we implement the above algorithm naively, however,
+the algorithm can be excruciatingly slow. For example, when starting with the regular
+expression $(a + aa)^*$ and building 12 successive derivatives
+w.r.t.~the character $a$, one obtains a derivative regular expression
+with more than 8000 nodes (when viewed as a tree). Operations like
+$\backslash$ and $\nullable$ need to traverse such trees and
+consequently the bigger the size of the derivative the slower the
+algorithm.
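+
+One way to observe this blow-up with the sketch above is to count the
+nodes of the regular expression tree; the helper below is one possible
+way of doing so (the exact figure depends on how nodes are counted):
+{\small
+\begin{verbatim}
+def size(r: Rexp): Int = r match {
+  case ZERO | ONE | CHAR(_) => 1
+  case ALT(r1, r2) => 1 + size(r1) + size(r2)
+  case SEQ(r1, r2) => 1 + size(r1) + size(r2)
+  case STAR(r1) => 1 + size(r1)
+}
+
+// the regular expression (a + aa)*
+val r = STAR(ALT(CHAR('a'), SEQ(CHAR('a'), CHAR('a'))))
+// node count of the 12th derivative w.r.t. 'a': in the thousands
+val blowup = size(ders(("a" * 12).toList, r))
+\end{verbatim}}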
+
+Brzozowski was quick to notice that during this process a lot of useless
+$\ONE$s and $\ZERO$s are generated, making the derivatives far from optimal.
+He also introduced some ``similarity rules'', such
+as $P+(Q+R) = (P+Q)+R$, to merge syntactically
+different but language-equivalent sub-regexes and further decrease the size
+of the intermediate regexes.
+
+More simplifications are possible, such as deleting duplicates
+and opening up nested alternatives in order to trigger further simplifications.
+Suppose we apply simplification after each derivative step and compose
+these two operations together into an atomic one: $a \backslash_{simp}\,c \dn
+\textit{simp}(a \backslash c)$. Then we can build
+a matcher whose intermediate regular expressions do not become cumbersome.
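+To make this composed operation concrete, a minimal sketch in Scala
+(reusing the Rexp datatype from the sketch above and covering only a few
+of the possible rules) could look as follows:
+{\small
+\begin{verbatim}
+// remove useless ZEROs and ONEs and duplicate alternatives
+def simp(r: Rexp): Rexp = r match {
+  case ALT(r1, r2) => (simp(r1), simp(r2)) match {
+    case (ZERO, s2) => s2
+    case (s1, ZERO) => s1
+    case (s1, s2) => if (s1 == s2) s1 else ALT(s1, s2)
+  }
+  case SEQ(r1, r2) => (simp(r1), simp(r2)) match {
+    case (ZERO, _) => ZERO
+    case (_, ZERO) => ZERO
+    case (ONE, s2) => s2
+    case (s1, ONE) => s1
+    case (s1, s2) => SEQ(s1, s2)
+  }
+  case r => r
+}
+
+// derivative and simplification composed into one step
+def der_simp(c: Char, r: Rexp): Rexp = simp(der(c, r))
+
+def ders_simp(s: List[Char], r: Rexp): Rexp = s match {
+  case Nil => r
+  case c :: cs => ders_simp(cs, der_simp(c, r))
+}
+
+def matcher_simp(r: Rexp, s: String): Boolean =
+  nullable(ders_simp(s.toList, r))
+\end{verbatim}}
+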
+
+
+If we want the size of derivatives in the algorithm to
+stay even lower, we would need more aggressive simplifications.
+Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
+deleting duplicates whenever possible. For example, the parentheses in
+$(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b
+\cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another
+example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
+$a^*+a+\ONE$. Adding these more aggressive simplification rules helps us
+to achieve a very tight size bound, namely,
+ the same size bound as that of the \emph{partial derivatives}.
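+
+One way to realise the ``opening up'' and deduplication of alternatives
+is sketched below (again reusing the Rexp datatype from above; this
+covers only the alternative case, not the full rule set we use). It
+simplifies, for instance, $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to
+$a^*+a+\ONE$:
+{\small
+\begin{verbatim}
+// flatten nested alternatives into a list
+def flts(r: Rexp): List[Rexp] = r match {
+  case ALT(r1, r2) => flts(r1) ::: flts(r2)
+  case ZERO => Nil
+  case r => List(r)
+}
+
+// open up nested alternatives, remove duplicates, rebuild
+def simp_alts(r: Rexp): Rexp = flts(r).distinct match {
+  case Nil => ZERO
+  case r :: Nil => r
+  case rs => rs.reduceRight(ALT(_, _))
+}
+\end{verbatim}}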
+
+Building derivatives and then simplifying them: so far so good.
+But what if we want to
+do lexing instead of just getting a YES/NO answer?
+This requires us to go back, for a moment, to the world
+without simplification.
+Sulzmann and Lu~\cite{Sulzmann2014} came up with a nice and
+elegant (arguably as beautiful as the original
+derivatives definition) solution for this.
+
\subsection*{Values and the Lexing Algorithm by Sulzmann and Lu}
-One limitation of Brzozowski's algorithm is that it only produces a
-YES/NO answer for whether a string is being matched by a regular
-expression. Sulzmann and Lu~\cite{Sulzmann2014} extended this algorithm
-to allow generation of an actual matching, called a \emph{value} or
+
+They first defined a datatype for storing the
+lexing information, called a \emph{value} or
sometimes also \emph{lexical value}. These values and regular
expressions correspond to each other as illustrated in the following
table:
-
\begin{center}
\begin{tabular}{c@{\hspace{20mm}}c}
\begin{tabular}{@{}rrl@{}}
@@ -271,6 +330,36 @@
\end{center}
\noindent
+One regular expression can have multiple lexical values. For example,
+the regular expression $(a+b)^*$ has an infinite list of
+values corresponding to it: $\Stars\,[]$, $\Stars\,[\Left(\Char(a))]$,
+$\Stars\,[\Right(\Char(b))]$, $\Stars\,[\Left(\Char(a)),\,\Right(\Char(b))]$,
+$\ldots$, and so on.
+Even for a regular expression matching a particular string, there can
+still be more than one value corresponding to the match.
+Take the example where $r= (a^*\cdot a^*)^*$ and the string
+$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
+The number of different ways of matching
+without allowing any value under a star to be flattened
+to an empty string can be given by the following formula:
+\begin{center}
+	$C_n = (n+1)+n C_1+\ldots + 2 C_{n-1}$
+\end{center}
+and a closed form can be calculated to be
+\begin{equation}
+	C_n =\frac{(2+\sqrt{2})^n - (2-\sqrt{2})^n}{4\sqrt{2}}
+\end{equation}
+which clearly grows exponentially.
+A lexer aimed at enumerating all possible values therefore has an exponential
+worst-case runtime, and it is impractical to try to generate
+all possible matches in a run. In practice, we are usually
+interested in POSIX values, which intuitively always
+match the leftmost alternative when there is a choice
+and always match a subpart as much as possible before proceeding
+to the next token. For example, the expression and string above have the POSIX value
+$ \Stars\,[\Seq(\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], \Stars\,[])]$.
+The output we want from a lexing algorithm is such a POSIX match,
+encoded as a value.
The contribution of Sulzmann and Lu is an extension of Brzozowski's
algorithm by a second phase (the first phase being building successive
derivatives---see \eqref{graph:*}). In this second phase, a POSIX value
@@ -298,7 +387,7 @@
values incrementally by \emph{injecting} back the characters into the
earlier values $v_n, \ldots, v_0$. This is the second phase of the
algorithm from the right to left. For the first value $v_n$, we call the
-function $\textit{mkeps}$, which builds the lexical value
+function $\textit{mkeps}$, which builds a POSIX lexical value
for how the empty string has been matched by the (nullable) regular
expression $r_n$. This function is defined as
@@ -320,7 +409,8 @@
the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
After injecting back $n$ characters, we get the lexical value for how $r_0$
-matches $s$. For this Sulzmann and Lu defined a function that reverses
+matches $s$. The POSIX value is maintained throughout the process.
+For this Sulzmann and Lu defined a function that reverses
the ``chopping off'' of characters during the derivative phase. The
corresponding function is called \emph{injection}, written
$\textit{inj}$; it takes three arguments: the first one is a regular
@@ -345,55 +435,54 @@
\noindent This definition is by recursion on the ``shape'' of regular
expressions and values.
-
-
-\subsection*{Simplification of Regular Expressions}
+The clauses basically do one thing: they identify the ``hole'' in a
+value into which the character is injected back.
+For instance, consider the last clause, where the character is injected
+into a value that will be turned into a new star value corresponding to
+a star regular expression. We know the value being injected into must be
+a sequence value, whose first component corresponds to the child regex
+of the star with the first character chopped off, namely the iteration
+of the star that has just been unfolded. This value is followed by the
+already matched star iterations collected before. So we inject the
+character back into the first value and form a new value with this new
+iteration added to the front of the previous list of iterations, all
+under the $\Stars$ top level.
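+
+Putting the two phases together, a compact sketch in Scala (reusing
+\texttt{Rexp}, \texttt{nullable} and \texttt{der} from the earlier
+sketches; the constructor names for values are again our own choices)
+might look as follows:
+{\small
+\begin{verbatim}
+abstract class Val
+case object Empty extends Val
+case class Chr(c: Char) extends Val
+case class Sequ(v1: Val, v2: Val) extends Val
+case class Left(v: Val) extends Val
+case class Right(v: Val) extends Val
+case class Stars(vs: List[Val]) extends Val
+
+// a value for how a nullable regex matches the empty string
+def mkeps(r: Rexp): Val = r match {
+  case ONE => Empty
+  case ALT(r1, r2) =>
+    if (nullable(r1)) Left(mkeps(r1)) else Right(mkeps(r2))
+  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
+  case STAR(_) => Stars(Nil)
+}
+
+// reverse one derivative step on the value level
+def inj(r: Rexp, c: Char, v: Val): Val = (r, v) match {
+  case (STAR(r1), Sequ(v1, Stars(vs))) => Stars(inj(r1, c, v1) :: vs)
+  case (SEQ(r1, r2), Sequ(v1, v2)) => Sequ(inj(r1, c, v1), v2)
+  case (SEQ(r1, r2), Left(Sequ(v1, v2))) => Sequ(inj(r1, c, v1), v2)
+  case (SEQ(r1, r2), Right(v2)) => Sequ(mkeps(r1), inj(r2, c, v2))
+  case (ALT(r1, r2), Left(v1)) => Left(inj(r1, c, v1))
+  case (ALT(r1, r2), Right(v2)) => Right(inj(r2, c, v2))
+  case (CHAR(d), Empty) => Chr(d)
+}
+
+// derivatives going forward, injections coming back
+def lex(r: Rexp, s: List[Char]): Val = s match {
+  case Nil => if (nullable(r)) mkeps(r) else
+    throw new Exception("lexing error")
+  case c :: cs => inj(r, c, lex(der(c, r), cs))
+}
+\end{verbatim}}
+Note that this sketch performs no simplification at all; adding it is
+exactly what requires the rectification functions discussed below.
+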
-The main drawback of building successive derivatives according
-to Brzozowski's definition is that they can grow very quickly in size.
-This is mainly due to the fact that the derivative operation generates
-often ``useless'' $\ZERO$s and $\ONE$s in derivatives. As a result, if
-implemented naively both algorithms by Brzozowski and by Sulzmann and Lu
-are excruciatingly slow. For example when starting with the regular
-expression $(a + aa)^*$ and building 12 successive derivatives
-w.r.t.~the character $a$, one obtains a derivative regular expression
-with more than 8000 nodes (when viewed as a tree). Operations like
-$\textit{der}$ and $\nullable$ need to traverse such trees and
-consequently the bigger the size of the derivative the slower the
-algorithm.
+We have mentioned before that derivatives without simplification
+can get clumsy, and this is true for values as well: they reflect
+the size of the regular expressions by definition.
-Fortunately, one can simplify regular expressions after each derivative
-step.
-Various simplifications of regular expressions are possible, such
-as the simplification of $\ZERO + r$, $r + \ZERO$, $\ONE\cdot r$, $r
-\cdot \ONE$, and $r + r$ to just $r$.
-Suppose we apply simplification after each derivative step, and compose
-these two operations together as an atomic one: $a \backslash_{simp}\,c \dn
-\textit{simp}(a \backslash c)$. Then we can build values without having
-a cumbersome regular expression, and fortunately if we are careful
-enough in making some extra rectifications, the POSIX value of how
+One can introduce simplification on the regexes and values, but one has to
+be careful not to break the correctness, as the injection
+function heavily relies on the structure of the regexes and values
+being correct and matching each other.
+This can be achieved by recording some extra rectification functions
+during the derivative steps, and applying these rectifications
+during the injection phase.
+With this one can prove that the POSIX value of how
regular expressions match strings will not be affected---although is much harder
to establish. Some initial results in this regard have been
obtained in \cite{AusafDyckhoffUrban2016}.
+%Brzozowski, after giving the derivatives and simplification,
+%did not explore lexing with simplification or he may well be
+%stuck on an efficient simplificaiton with a proof.
+%He went on to explore the use of derivatives together with
+%automaton, and did not try lexing using derivatives.
-If we want the size of derivatives in Sulzmann and Lu's algorithm to
-stay even lower, we would need more aggressive simplifications.
-Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
-deleting duplicates whenever possible. For example, the parentheses in
-$(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b
-\cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another
-example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
-$a^*+a+\ONE$. Adding these more aggressive simplification rules help us
-to achieve a very tight size bound, namely,
- the same size bound as that of the \emph{partial derivatives}.
-And we want to get rid of complex and fragile rectification of values.
+We want to get rid of the complex and fragile rectification of values.
+Can we avoid creating the intermediate values $v_1,\ldots, v_n$ altogether,
+and obtain the lexing information that is already there while
+doing derivatives, in one pass and without a second injection phase?
+At the same time, can we make sure that simplifications
+are easily handled without breaking the correctness of the algorithm?
+Sulzmann and Lu solved this problem by
+annotating regular expressions with additional information
+called \emph{bitcodes}.
-In order to implement the idea of ``spilling out alternatives'' and to
-make them compatible with the $\textit{inj}$-mechanism, we use
-\emph{bitcodes}. They were first introduced by Sulzmann and Lu.
-Here bits and bitcodes (lists of bits) are defined as:
+\subsection*{Bit-coded Algorithm}
+Bits and bitcodes (lists of bits) are defined as:
\begin{center}
$b ::= 1 \mid 0 \qquad
@@ -674,14 +763,18 @@
operation from characters to strings (just like the derivatives for un-annotated
regular expressions).
+Remember that one of the important reasons we introduced bitcodes
+is that they make simplification more structured and therefore help in
+guaranteeing correctness.
\subsection*{Our Simplification Rules}
-The main point of the bitcodes and annotated regular expressions is that
-we can apply rather aggressive (in terms of size) simplification rules
-in order to keep derivatives small. We have developed such
-``aggressive'' simplification rules and generated test data that show
-that the expected bound can be achieved. Obviously we could only
+In this section we introduce aggressive (in terms of size) simplification rules
+on annotated regular expressions
+in order to keep derivatives small. Such simplifications are promising,
+as we have
+generated test data showing
+that a tight size bound can be achieved. Obviously we could only
partially cover the search space as there are infinitely many regular
expressions and strings.
@@ -1756,6 +1849,92 @@
Hopefully in the next few weeks we will be able to find one.
This would be an important milestone for my dissertation.
+
+\section{Future Plans}
+
+Until the proof is complete, a plan is needed to make
+sure progress keeps happening.
+We plan to produce
+a practical implementation of our current
+method, which would involve extending the set of allowed syntax
+to include common regular expression constructs
+such as bounded repetitions, negation (complement)
+and character ranges, or even Boolean functions.
+Adding these concise ways of expressing regular expressions
+while keeping the correctness of the simplification
+makes the work more practical.
+For example, using the added constructs we could
+demonstrate that our implementation can
+help the Cloudflare web application firewall
+to run smoothly even when fed with the
+evil regular expression
+$.*(?:.*=.*)))$
+and strings like $x=xxxxxxxxxx....$.
+This could help show that our algorithm is not just
+nice theory but also works well in practice,
+as this regular expression and string pair brought the Cloudflare
+servers to a grinding
+halt on July 2nd, 2019.
+After all, the ``basic'' regular expressions
+we gave in the introduction are really so basic that
+even Brzozowski's 1964 paper included more
+varieties than that (Boolean functions instead of
+just alternatives)~\cite{Brzozowski1964}.
+A function implementing only the basic ones
+would be of limited practical interest.
+Making the algorithm practical also means making it fast.
+Right now the Scala implementation
+is slower than the naive algorithm without bitcodes, even with the
+drastic simplification measures, which is not what the theory suggests.
+We might need to implement it in other languages such as C
+to see whether the problem is language specific.
+This is about 3 months' work.
+
+We believe that there
+is still potential for more simplification, as there are many
+rules for regular expression similarity; however,
+this needs to be explored and checked.
+Looking for maximal simplification and
+the best size bound, and proving them, would take around 3 months to complete.
+
+Context-free languages are a bit harder to deal with via derivatives
+than regular expressions.
+The main reason is that a child node of a context-free grammar
+can be the grammar itself; this left-recursion makes
+a direct computation of the derivative loop forever.
+An example would be the following~\cite{might2011cfgder}:
+\begin{center}
+$L = L \cdot x \mid \ONE$
+\end{center}
+Taking the derivative then gives us:
+\begin{center}
+$L\backslash x = (L \backslash x)\cdot x \mid \ONE$
+\end{center}
+which has essentially the same shape as what we started with.
+This recursion can go on forever if we do not use some
+other method, or manually decide that the two equations
+are actually equivalent, that is, that $L$ and $L\backslash x$
+denote the same language, and then give a mathematical proof
+that this is actually the case.
+This seems well beyond what any mechanical algorithm can
+be expected to do.
+People have tried various ways of dealing with the looping problem
+when computing derivatives of context-free grammars,
+such as memoization and laziness, with some degree of success~\cite{might2011cfgder}.
+However this field seems largely unexplored and further
+optimizations seem possible. It would be great
+if we could find a simple augmentation to the current
+derivative method so that the subset of context-free languages
+of interest can be handled.
+This may take longer than the two previous tasks, but we still
+give it 3 months' time and see how it goes.
+
+
\bibliographystyle{plain}
\bibliography{root}