diff -r dfcf3fa58d7f -r c8ef391dd6f7 etnms/etnms.tex
--- a/etnms/etnms.tex	Wed Mar 04 13:25:52 2020 +0000
+++ b/etnms/etnms.tex	Fri Apr 10 11:58:11 2020 +0100
@@ -91,75 +91,59 @@
 \maketitle
 \begin{abstract}
- Brzozowski introduced in 1964 a beautifully simple algorithm for
- regular expression matching based on the notion of derivatives of
- regular expressions. In 2014, Sulzmann and Lu extended this
- algorithm to not just give a YES/NO answer for whether or not a
- regular expression matches a string, but in case it does also
- answers with \emph{how} it matches the string. This is important for
- applications such as lexing (tokenising a string). The problem is to
- make the algorithm by Sulzmann and Lu fast on all inputs without
- breaking its correctness. Being fast depends on a complete set of
- simplification rules, some of which
- have been put forward by Sulzmann and Lu. We have extended their
- rules in order to obtain a tight bound on the size of regular expressions.
- We have tested these extended rules, but have not
- formally established their correctness. We have also not yet looked
- at extended regular expressions, such as bounded repetitions,
- negation and back-references.
+Regular expressions, originally introduced to describe string
+patterns, need fast matching algorithms. If we match by naively
+converting the regular expression into a Nondeterministic Finite
+Automaton (NFA) and simulating it by backtracking, the matcher can run
+into trouble on certain evil regular expression and string pairs.
+This motivates the method of derivatives of regular
+expressions~\cite{Brzozowski1964}. With the variation introduced by
+Sulzmann and Lu~\cite{Sulzmann2014}, it elegantly circumvents the
+catastrophic backtracking that plagues backtracking regex engines.
+We believe that such a method can help to address the status quo where
+badly written regular expression code can cause software grief.
+We have come up with a set of simplification rules that keeps the
+space and time requirements within a tight bound. We have proved
+certain properties of the algorithm and of the simplification.
+We have not yet worked out a correctness proof, although test data
+suggest that the result holds.
 \end{abstract}
+
 \section{Introduction}
-%Regular expressions' derivatives, which have received
-%renewed interest in the new millenium, is a beautiful....
-While we believe derivatives of regular expressions, written
-$r\backslash s$, are a beautiful concept (in terms of ease of
-implementing them in functional programming languages and in terms of
-reasoning about them formally), they have one major drawback: every
-derivative step can make regular expressions grow drastically in
-size. This in turn has negative effect on the runtime of the
-corresponding lexing algorithms. Consider for example the regular
-expression $(a+aa)^*$ and the short string $aaaaaaaaaaaa$. The
-corresponding derivative contains already 8668 nodes where we assume
-the derivative is given as a tree. The reason for the poor runtime of
-the derivative-based lexing algorithms is that they need to traverse
-such trees over and over again. The solution is to find a complete set
-of simplification rules that keep the sizes of derivatives uniformly
-small.
-
-This has been partially addressed by the function $\blexer_{simp}$,
-which after the simplification the $(a+aa)^*$ example's 8000 nodes will be
-reduced to just 6 and stays constant in each derivative step.
-The part that still needs more work is the correctness proof of this
-function, namely,
-\begin{equation}\label{mainthm}
-\blexers \; r \; s = \blexer \;r\;s
-\end{equation}
-
-\noindent
-and this is what this report is mainly about. A condensed
-version of the last report will be provided in the next section
-to help the reader understand the report, and the attempts
-on the problem will follow.
-
-
-\section{Recapitulation of Concepts From the Last Report}
-
-\subsection*{Regular Expressions and Derivatives}
-Suppose (basic) regular expressions are given by the following grammar:
-
-\[ r ::= \ZERO \mid \ONE
-   \mid c
-   \mid r_1 \cdot r_2
-   \mid r_1 + r_2
-   \mid r^*
-\]
-
-\noindent
-The ingenious contribution of Brzozowski is the notion of \emph{derivatives} of
-regular expressions, written~$\_ \backslash \_$. It uses the auxiliary notion of
-$\nullable$ defined below.
+The following brief introduction gives the background for derivatives:
+how they were originally defined and the variant~\cite{Sulzmann2014}
+that is more suitable for lexing. Regular expressions can be
+formalised as follows:
+\begin{center}
+  $r ::= \ZERO \mid \ONE \mid c \mid r_1 + r_2 \mid r_1 \cdot r_2 \mid r^*$
+\end{center}
+The derivative of a regular expression can be seen as the regular
+expression corresponding to the left quotient of the original language
+with respect to a character. The derivative of a regular expression is
+itself a regular expression, because the left quotient of a regular
+language is again regular. Hence such a function exists; we do not
+need to worry that it might be undefined. In fact, derivatives of
+regular expressions are not just well-defined, they are also simple
+and elegant: they are given by two recursive functions that can be
+explained intuitively. The function $\nullable$, defined below, tests
+whether a regular expression can match the empty string.
 \begin{center}
 \begin{tabular}{lcl}
@@ -170,7 +154,24 @@
 $\nullable(r_1\cdot r_2)$ & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
 $\nullable(r^*)$ & $\dn$ & $\mathit{true}$ \\
 \end{tabular}
- \end{center}
+\end{center}
+\noindent
+The empty set $\ZERO$ does not contain any string, and in particular
+not the empty string. The empty-string regular expression $\ONE$
+contains the empty string by definition. A character regular
+expression $c$ denotes the singleton set containing just that
+character and therefore does not contain the empty string. An
+alternative (or ``or'') regular expression is nullable if at least one
+of its children is nullable. A sequence is nullable only if both
+children can contribute an empty string, and the Kleene star is always
+nullable because it allows zero iterations. This definition is used in
+the derivative operation as a condition:
 \begin{center}
 \begin{tabular}{lcl}
@@ -186,11 +187,26 @@
 \end{tabular}
 \end{center}
 \noindent
-And defines how a regular expression evolves into
+The derivative operation, written $\backslash c$,
+defines how a regular expression evolves into
 a new regular expression after all the string it contains
 is chopped off a certain head character $c$.
+The most involved cases are the sequence and star cases.
+The sequence case says that if the first component is nullable, then
+the second component may also be the one whose head character $c$ is
+chopped off; otherwise only the first component is derived.
+The star case unwraps one iteration: it takes the derivative of the
+star's body and attaches the star itself behind it, so that zero or
+more further iterations can follow the unfolded one.
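To make these definitions concrete, here is a minimal sketch of the two
functions in Scala (the datatype and the names \texttt{Rexp},
\texttt{nullable} and \texttt{der} are illustrative choices, not fixed
by this report):

\begin{verbatim}
// basic regular expressions as an algebraic datatype
abstract class Rexp
case object ZERO extends Rexp                    // matches nothing
case object ONE extends Rexp                     // matches only the empty string
case class CHAR(c: Char) extends Rexp            // matches exactly the character c
case class ALT(r1: Rexp, r2: Rexp) extends Rexp  // r1 + r2
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp  // r1 . r2
case class STAR(r: Rexp) extends Rexp            // r*

// can r match the empty string?
def nullable(r: Rexp): Boolean = r match {
  case ZERO        => false
  case ONE         => true
  case CHAR(_)     => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_)     => true
}

// the derivative of r with respect to the character c
def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO        => ZERO
  case ONE         => ZERO
  case CHAR(d)     => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1)    => SEQ(der(c, r1), STAR(r1))
}
\end{verbatim}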
-The main property of the derivative operation is that
+
+The main property of the derivative operation, and the one that
+enables us to reason about the correctness of derivative-based
+algorithms, is that
 \begin{center}
 $c\!::\!s \in L(r)$ holds
@@ -232,18 +248,61 @@
 relatively easily shown that this matcher is correct (that is given an
 $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s
 \in L(r_0)$).
-
+A beautifully simple definition.
+
+If we implement the above algorithm naively, however, it can be
+excruciatingly slow. For example, when starting with the regular
+expression $(a + aa)^*$ and building 12 successive derivatives
+w.r.t.~the character $a$, one obtains a derivative regular expression
+with more than 8000 nodes (when viewed as a tree). Operations like
+$\backslash$ and $\nullable$ need to traverse such trees and
+consequently the bigger the size of the derivative the slower the
+algorithm.
+
+Brzozowski already observed that this process generates many useless
+$\ONE$s and $\ZERO$s and is therefore not optimal. He introduced
+``similarity rules'' such as $P+(Q+R) = (P+Q)+R$ to merge
+syntactically different but language-equivalent sub-regexes and
+further decrease the size of the intermediate regexes.
+
+More simplifications are possible, such as deleting duplicates and
+opening up nested alternatives to trigger even more simplifications.
+Suppose we apply simplification after each derivative step and compose
+the two operations into a single atomic one: $a \backslash_{simp}\,c
+\dn \textit{simp}(a \backslash c)$. Then we can build a matcher whose
+intermediate regular expressions never become cumbersome.
+
+If we want the size of the derivatives in the algorithm to stay even
+lower, we need more aggressive simplifications. Essentially we need to
+delete useless $\ZERO$s and $\ONE$s, as well as delete duplicates
+whenever possible. For example, the parentheses in
+$(a+b) \cdot c + b\cdot c$ can be opened up to get
+$a\cdot c + b \cdot c + b \cdot c$, which then simplifies to just
+$a \cdot c + b \cdot c$. Another example is simplifying
+$(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just $a^*+a+\ONE$. Adding these
+more aggressive simplification rules helps us to achieve a very tight
+size bound, namely the same size bound as that of the
+\emph{partial derivatives}.
+
+Building derivatives and then simplifying them: so far so good. But
+what if we want to do lexing instead of just obtaining a YES/NO
+answer? For this we first need to return, for a moment, to the world
+without simplification. Sulzmann and Lu~\cite{Sulzmann2014} came up
+with a nice and elegant (arguably as beautiful as the original
+definition of derivatives) solution for this.
+
 \subsection*{Values and the Lexing Algorithm by Sulzmann and Lu}
-One limitation of Brzozowski's algorithm is that it only produces a
-YES/NO answer for whether a string is being matched by a regular
-expression. Sulzmann and Lu~\cite{Sulzmann2014} extended this algorithm
-to allow generation of an actual matching, called a \emph{value} or
+
+They first defined a datatype for storing the lexing information,
+called a \emph{value} or
 sometimes also \emph{lexical value}.
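One possible Scala rendering of such a value datatype is sketched below
(the constructor names mirror the table that follows; they are an
illustrative choice, not prescribed by this report):

\begin{verbatim}
// lexical values, mirroring the shape of regular expressions
abstract class Val
case object Empty extends Val                  // the value for ONE
case class Chr(c: Char) extends Val            // the value for CHAR(c)
case class Sequ(v1: Val, v2: Val) extends Val  // how SEQ(r1, r2) matched
case class Left(v: Val) extends Val            // ALT(r1, r2) matched via r1
case class Right(v: Val) extends Val           // ALT(r1, r2) matched via r2
case class Stars(vs: List[Val]) extends Val    // the iterations of a STAR
\end{verbatim}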
 These values and regular expressions correspond to each other as
 illustrated in the following table:
-
 \begin{center}
 \begin{tabular}{c@{\hspace{20mm}}c}
 \begin{tabular}{@{}rrl@{}}
@@ -271,6 +330,36 @@
 \end{center}
 \noindent
+One regular expression can have multiple lexical values. For example,
+the regular expression $(a+b)^*$ has an infinite list of values
+corresponding to it: $\Stars\,[]$, $\Stars\,[\Left(\Char(a))]$,
+$\Stars\,[\Right(\Char(b))]$, $\Stars\,[\Left(\Char(a)),\,\Right(\Char(b))]$,
+$\ldots$
+Even for a regular expression matching one particular string, there
+can be more than one value describing the match. Take the example
+where $r= (a^*\cdot a^*)^*$ and the string
+$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
+The number of different ways of matching, if no value under a star is
+allowed to be flattened to the empty string, is given by the
+recurrence
+\begin{center}
+  $C_n = (n+1)+n C_1+\ldots + 2 C_{n-1}$
+\end{center}
+whose closed form
+\begin{equation}
+  C_n =\frac{(2+\sqrt{2})^n - (2-\sqrt{2})^n}{4\sqrt{2}}
+\end{equation}
+grows exponentially. A lexer that aims to enumerate all possible
+values therefore has an exponential worst-case runtime, so it is
+impractical to generate all possible matches in one run. In practice
+we are usually interested in POSIX values, which, intuitively, always
+pick the leftmost alternative when there is a choice and always match
+as much as possible before proceeding to the next token. For example,
+the example above has the POSIX value
+$\Stars\,[\Seq(\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], \Stars\,[])]$.
+The algorithm we want should output such a POSIX match, encoded as a
+value.
 The contribution of Sulzmann and Lu is an extension of Brzozowski's
 algorithm by a second phase (the first phase being building successive
 derivatives---see \eqref{graph:*}). In this second phase, a POSIX value
@@ -298,7 +387,7 @@
 values incrementally by \emph{injecting} back the characters into the
 earlier values $v_n, \ldots, v_0$. This is the second phase of the
 algorithm from the right to left. For the first value $v_n$, we call the
-function $\textit{mkeps}$, which builds the lexical value
+function $\textit{mkeps}$, which builds a POSIX lexical value
 for how the empty string has been matched by the (nullable) regular
 expression $r_n$. This function is defined as
@@ -320,7 +409,8 @@
 the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
 ($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
 After injecting back $n$ characters, we get the lexical value for how $r_0$
-matches $s$. For this Sulzmann and Lu defined a function that reverses
+matches $s$. The property of being a POSIX value is maintained
+throughout the process.
+For this Sulzmann and Lu defined a function that reverses
 the ``chopping off'' of characters during the derivative phase. The
 corresponding function is called \emph{injection}, written
 $\textit{inj}$; it takes three arguments: the first one is a regular
@@ -345,55 +435,54 @@
 \noindent
 This definition is by recursion on the ``shape'' of regular
 expressions and values.
-
-
-\subsection*{Simplification of Regular Expressions}
+The clauses all do essentially one thing: they identify the ``hole''
+in the value into which the character is injected back.
+For instance, consider the last clause, which injects a character back
+into a value corresponding to a star. The value being injected into
+must be a sequence value: its first component corresponds to the
+star's body with the head character chopped off---the iteration that
+has just been unfolded---and its second component holds the star
+iterations collected in the earlier injection steps. So we inject the
+character back into the first component and form a new star value in
+which this repaired iteration is added in front of the previous list
+of iterations, all under the $\Stars$ constructor.
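Putting the two phases together, a sketch of this algorithm in Scala
(building on the \texttt{Rexp} and \texttt{Val} sketches above; the
function names and the wrapper \texttt{lex} are my own choices, not
taken from the report) may help to see how mkeps and inj cooperate:

\begin{verbatim}
// the value for how a nullable regex matches the empty string
def mkeps(r: Rexp): Val = r match {
  case ONE         => Empty
  case ALT(r1, r2) => if (nullable(r1)) Left(mkeps(r1)) else Right(mkeps(r2))
  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
  case STAR(_)     => Stars(Nil)
}

// inject the character c back into a value for der(c, r),
// obtaining a value for r itself
def inj(r: Rexp, c: Char, v: Val): Val = (r, v) match {
  case (STAR(r1), Sequ(v1, Stars(vs)))   => Stars(inj(r1, c, v1) :: vs)
  case (SEQ(r1, r2), Sequ(v1, v2))       => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, r2), Left(Sequ(v1, v2))) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, r2), Right(v2))          => Sequ(mkeps(r1), inj(r2, c, v2))
  case (ALT(r1, _), Left(v1))            => Left(inj(r1, c, v1))
  case (ALT(_, r2), Right(v2))           => Right(inj(r2, c, v2))
  case (CHAR(d), Empty)                  => Chr(d)
}

// first phase: derivatives; second phase: mkeps followed by injections
def lex(r: Rexp, s: List[Char]): Val = s match {
  case Nil     => if (nullable(r)) mkeps(r) else sys.error("lexing error")
  case c :: cs => inj(r, c, lex(der(c, r), cs))
}
\end{verbatim}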
-The main drawback of building successive derivatives according
-to Brzozowski's definition is that they can grow very quickly in size.
-This is mainly due to the fact that the derivative operation generates
-often ``useless'' $\ZERO$s and $\ONE$s in derivatives. As a result, if
-implemented naively both algorithms by Brzozowski and by Sulzmann and Lu
-are excruciatingly slow. For example when starting with the regular
-expression $(a + aa)^*$ and building 12 successive derivatives
-w.r.t.~the character $a$, one obtains a derivative regular expression
-with more than 8000 nodes (when viewed as a tree). Operations like
-$\textit{der}$ and $\nullable$ need to traverse such trees and
-consequently the bigger the size of the derivative the slower the
-algorithm.
+We mentioned before that derivatives without simplification can become
+unwieldy, and this is true for values as well: by definition they
+mirror the size of the regular expressions.
-Fortunately, one can simplify regular expressions after each derivative
-step.
-Various simplifications of regular expressions are possible, such
-as the simplification of $\ZERO + r$, $r + \ZERO$, $\ONE\cdot r$, $r
-\cdot \ONE$, and $r + r$ to just $r$.
-Suppose we apply simplification after each derivative step, and compose
-these two operations together as an atomic one: $a \backslash_{simp}\,c \dn
-\textit{simp}(a \backslash c)$. Then we can build values without having
-a cumbersome regular expression, and fortunately if we are careful
-enough in making some extra rectifications, the POSIX value of how
+One can introduce simplification on regular expressions and values,
+but one has to be careful not to break the correctness, because the
+injection function relies heavily on the structure of the regular
+expressions and values matching each other. This can be achieved by
+recording some extra rectification functions during the derivative
+steps and applying these rectifications during the injection phase.
+One can then prove that the POSIX value of how
 regular expressions match strings will not be affected---although this
 is much harder to establish. Some initial results in this regard have
 been obtained in \cite{AusafDyckhoffUrban2016}.
+%Brzozowski, after giving the derivatives and simplification,
+%did not explore lexing with simplification or he may well be
+%stuck on an efficient simplification with a proof.
+%He went on to explore the use of derivatives together with
+%automata, and did not try lexing using derivatives.
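To make the kind of simplification discussed above concrete, a minimal
regex-level simplifier might look as follows. This sketch only covers
the basic rules mentioned earlier ($\ZERO + r$, $r + \ZERO$,
$\ONE\cdot r$, $r \cdot \ONE$ and $r + r$ rewritten to just $r$) and
deliberately omits the rectification functions that would be needed to
keep the values in sync:

\begin{verbatim}
// basic simplification of regular expressions (no value rectification)
def simp(r: Rexp): Rexp = r match {
  case ALT(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, s2)           => s2        // ZERO + r  ~>  r
    case (s1, ZERO)           => s1        // r + ZERO  ~>  r
    case (s1, s2) if s1 == s2 => s1        // r + r     ~>  r
    case (s1, s2)             => ALT(s1, s2)
  }
  case SEQ(r1, r2) => (simp(r1), simp(r2)) match {
    case (ONE, s2) => s2                   // ONE . r   ~>  r
    case (s1, ONE) => s1                   // r . ONE   ~>  r
    case (s1, s2)  => SEQ(s1, s2)
  }
  case _ => r
}

// matching with simplification applied after every derivative step
def matches(r: Rexp, s: String): Boolean =
  nullable(s.foldLeft(r)((r1, c) => simp(der(c, r1))))
\end{verbatim}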
-If we want the size of derivatives in Sulzmann and Lu's algorithm to
-stay even lower, we would need more aggressive simplifications.
-Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
-deleting duplicates whenever possible. For example, the parentheses in
-$(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b
-\cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another
-example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
-$a^*+a+\ONE$. Adding these more aggressive simplification rules help us
-to achieve a very tight size bound, namely,
- the same size bound as that of the \emph{partial derivatives}.
-And we want to get rid of complex and fragile rectification of values.
+We want to get rid of the complex and fragile rectification of values.
+Can we avoid creating the intermediate values $v_1,\ldots,v_n$
+altogether and instead collect the lexing information in a single
+pass, while building the derivatives, without a second injection
+phase? At the same time, can we make sure that simplification is
+handled without breaking the correctness of the algorithm?
+Sulzmann and Lu solved this problem by annotating regular expressions
+with additional information called \emph{bitcodes}.
-In order to implement the idea of ``spilling out alternatives'' and to
-make them compatible with the $\textit{inj}$-mechanism, we use
-\emph{bitcodes}. They were first introduced by Sulzmann and Lu.
-Here bits and bitcodes (lists of bits) are defined as:
+\subsection*{Bit-coded Algorithm}
+Bits and bitcodes (lists of bits) are defined as:
 \begin{center}
 	$b ::= 1 \mid 0 \qquad
@@ -674,14 +763,18 @@
 operation from characters to strings (just like the derivatives for
 un-annotated regular expressions).
+Recall that one of the important reasons for introducing bitcodes is
+that they make simplification more structured and therefore easier to
+prove correct.
 \subsection*{Our Simplification Rules}
-The main point of the bitcodes and annotated regular expressions is that
-we can apply rather aggressive (in terms of size) simplification rules
-in order to keep derivatives small. We have developed such
-``aggressive'' simplification rules and generated test data that show
-that the expected bound can be achieved. Obviously we could only
+In this section we introduce aggressive (in terms of size)
+simplification rules on annotated regular expressions in order to keep
+derivatives small. These simplifications are promising: the test data
+we have generated suggest that a tight size bound can be achieved.
+Obviously we could only
 partially cover the search space as there are infinitely many regular
 expressions and strings.
@@ -1756,6 +1849,92 @@
 Hopefully in the next few weeks we will be able to find one. This would
 be an important milestone for my dissertation.
+
+\section{Future Plans}
+
+While the correctness proof is still outstanding, we need a plan that
+keeps the work progressing. We plan to produce a practical
+implementation of our current method, which involves extending the
+supported syntax with common regular expression constructs such as
+bounded repetitions, negation (complement), character ranges, or even
+Boolean combinations. Adding these concise ways of expressing regular
+expressions, while preserving the correctness of the simplification,
+makes the work more practical. For example, using the added constructs
+we could demonstrate that our implementation would have kept the
+Cloudflare web-application firewall running smoothly even when fed the
+evil regular expression
+$.*(?:.*=.*)))$
+and strings like $x=xxxxxxxxxx....$.
+This could help show that our algorithm is not just nice theory but is
+actually competitive in practice, since this regular expression and
+string pair made the Cloudflare servers grind to a halt on July 2nd,
+2019.
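To illustrate the kind of blow-up involved, the sketch below runs a
textbook evil pattern through \texttt{java.util.regex}, the
backtracking engine underneath Scala's own Regex class. The pattern
$(a|aa)^*c$ is a stand-in of my own choosing, not the actual
Cloudflare rule, and the exact timings will vary from machine to
machine:

\begin{verbatim}
// Catastrophic backtracking in a conventional backtracking engine.
// On a string of n 'a's with no trailing 'c', the engine explores
// roughly Fibonacci(n) ways of splitting the 'a's before giving up,
// so the runtime grows exponentially in n.
import java.util.regex.Pattern

object BacktrackingDemo {
  def main(args: Array[String]): Unit = {
    val evil = Pattern.compile("(a|aa)*c")
    for (n <- 24 to 40 by 4) {
      val input = "a" * n
      val start = System.nanoTime()
      val matched = evil.matcher(input).matches()  // always false, but slowly
      val secs = (System.nanoTime() - start) / 1e9
      println(f"n = $n%2d   time = $secs%.3f s   matched = $matched")
    }
  }
}
\end{verbatim}

A derivative-based matcher with simplification, by contrast, stays
fast on such inputs, because the simplified derivatives of this kind
of pattern remain small.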
+After all, the ``basic'' regular expressions we gave in the
+introduction are really so basic that even Brzozowski's 1964 paper
+included more variety than that (Boolean functions instead of just
+alternatives)~\cite{Brzozowski1964}. An implementation supporting only
+the basic constructs would be of limited practical interest.
+Making the algorithm practical also means making it fast. Right now
+the Scala implementation is slower than the naive algorithm without
+bitcodes, even with the drastic simplification measures in place,
+which is not what the theory suggests. We might need to implement it
+in other languages such as C to see whether the problem is language
+specific. This is about 3 months' work.
+
+We believe that there is still potential for more simplification, as
+there are many more similarity rules for regular expressions; however,
+this needs to be explored and checked. Looking for maximal
+simplification and the best size bound, and proving them, would take
+around 3 months to complete.
+
+Context-free languages are harder to deal with by derivatives than
+regular expressions. The main reason is that a child node of a
+context-free grammar can be the grammar itself---this left recursion
+means the derivative cannot be computed by simply unfolding the
+definition. An example would be the following~\cite{might2011cfgder}:
+\begin{center}
+$L = L \cdot x \mid \ONE$
+\end{center}
+A derivative then gives us
+\begin{center}
+$L\backslash x = (L \backslash x)\cdot x \mid \ONE$
+\end{center}
+which has essentially the same shape as before. This recursion can go
+on forever unless we intervene in some other way, for instance by
+deciding manually that the two equations are equivalent, i.e.\ that
+$L$ and $L\backslash x$ denote the same language, and then giving a
+mathematical proof that this is actually the case. Such reasoning
+seems well beyond what one can expect from a purely mechanical
+algorithm. People have tried various ways to deal with this looping
+problem when computing derivatives of context-free grammars, such as
+memoization and laziness, with some degree of
+success~\cite{might2011cfgder}. However, this field seems largely
+unexplored and further optimizations seem possible. It would be great
+if we could find a simple augmentation of the current derivative
+method so that the subset of context-free languages of interest can be
+handled. This may take longer than the two previous tasks, but we
+still allocate 3 months to it and will see how it goes.
+
 \bibliographystyle{plain}
 \bibliography{root}