--- a/ninems/ninems.tex Sun Jul 07 21:38:35 2019 +0100
+++ b/ninems/ninems.tex Sun Jul 07 22:09:24 2019 +0100
@@ -651,7 +651,7 @@
Which one of these alternatives is chosen later does not affect their relative position because our algorithm
does not change this order. If this parsing information can be determined and does
not change because of later derivatives,
-there is no point in traversing this information twice. This leads to an optimization---if we store the information for parse trees in the corresponding regular expression pieces, update this information when we do derivative operation on them, and collect the information when finished with derivatives and call $mkeps$ for deciding which branch is POSIX, we can generate the parse tree in one pass, instead of doing an $n$-step backward transformation.This leads to Sulzmann and Lu's novel idea of using bit-codes on derivatives.
+there is no point in traversing this information twice. This leads to an optimisation: if we store the parse-tree information inside the regular expression, update it with each derivative step, and collect it once we have finished taking derivatives by calling $mkeps$ to decide which branch is POSIX, then we can generate the parse tree in one pass, instead of performing the remaining $n$ injections. This leads to Sulzmann and Lu's novel idea of using bit-codes in derivatives.
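+
+To give a concrete flavour of this idea, the following sketch shows one
+possible way of annotating regular expressions with bit-codes (in
+Scala; the names are ours and this datatype is only one rendering of
+Sulzmann and Lu's annotated regular expressions):
+
+\begin{verbatim}
+abstract class Bit
+case object Z extends Bit                          // the zero bit
+case object S extends Bit                          // the one bit
+
+// annotated regular expressions: each node carries the
+// bits recording the parsing decisions made so far
+abstract class ARexp
+case object AZERO extends ARexp
+case class AONE(bs: List[Bit]) extends ARexp
+case class ACHAR(bs: List[Bit], c: Char) extends ARexp
+case class AALTS(bs: List[Bit], rs: List[ARexp]) extends ARexp
+case class ASEQ(bs: List[Bit], r1: ARexp, r2: ARexp) extends ARexp
+case class ASTAR(bs: List[Bit], r: ARexp) extends ARexp
+\end{verbatim}
+
+With such annotations the derivative operation only needs to push bits
+into the sub-expressions it rewrites; at the end $mkeps$ collects the
+bits of the nullable branch, which can be decoded into a parse tree,
+instead of rebuilding the tree by $n$ injections.
+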
In the next section, we shall focus on the bit-coded algorithm and the
process of simplification of regular expressions. This is needed in
@@ -665,34 +665,40 @@
Using bit-codes to guide parsing is not a novel idea. It was applied to
context-free grammars and then adapted by Henglein and Nielsen for
-efficient regular expression parsing \cite{nielson11bcre}. Sulzmann and
+efficient regular expression parsing using DFAs \cite{nielson11bcre}. Sulzmann and
Lu went a step further by integrating bit-codes into derivatives.
+The reason we use bit-codes in this project is that they allow us to
+introduce aggressive simplifications.
-The argument for complicating the data structures from basic regular
-expressions to those with bitcodes is that we can introduce
-simplification without making the algorithm crash or overly complex to
-reason about. The reason why we need simplification is due to the
-shortcoming of the previous algorithm. The main drawback of building
+The main drawback of building
successive derivatives according to Brzozowski's definition is that they
can grow very quickly in size. This is mainly due to the fact that the
derivative operation often generates ``useless'' $\ZERO$s and $\ONE$s in
derivatives. As a result, if implemented naively both algorithms by
-Brzozowski and by Sulzmann and Lu are excruciatingly slow. For example
-when starting with the regular expression $(a + aa)^*$ and building 12
+Brzozowski and by Sulzmann and Lu are excruciatingly slow.
+For example, when starting with the regular expression $(a + aa)^*$ and building 12
successive derivatives w.r.t.~the character $a$, one obtains a
derivative regular expression with more than 8000 nodes (when viewed as
a tree). Operations like derivative and $\nullable$ need to traverse
such trees and consequently the bigger the size of the derivative the
-slower the algorithm. Fortunately, one can simplify regular expressions
+slower the algorithm.
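+
+This blow-up is easy to observe. The following is a minimal sketch of
+Brzozowski derivatives for plain regular expressions in Scala (the
+names are ours), together with a function measuring the size of a
+regular expression when viewed as a tree:
+
+\begin{verbatim}
+abstract class Rexp
+case object ZERO extends Rexp
+case object ONE extends Rexp
+case class CHAR(c: Char) extends Rexp
+case class ALT(r1: Rexp, r2: Rexp) extends Rexp
+case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
+case class STAR(r: Rexp) extends Rexp
+
+def nullable(r: Rexp): Boolean = r match {
+  case ZERO => false
+  case ONE => true
+  case CHAR(_) => false
+  case ALT(r1, r2) => nullable(r1) || nullable(r2)
+  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
+  case STAR(_) => true
+}
+
+def der(c: Char, r: Rexp): Rexp = r match {
+  case ZERO => ZERO
+  case ONE => ZERO
+  case CHAR(d) => if (c == d) ONE else ZERO
+  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
+  case SEQ(r1, r2) =>
+    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
+    else SEQ(der(c, r1), r2)
+  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
+}
+
+// number of nodes of a regular expression viewed as a tree
+def size(r: Rexp): Int = r match {
+  case ZERO | ONE | CHAR(_) => 1
+  case ALT(r1, r2) => 1 + size(r1) + size(r2)
+  case SEQ(r1, r2) => 1 + size(r1) + size(r2)
+  case STAR(r1) => 1 + size(r1)
+}
+
+// 12 derivatives of (a + aa)* w.r.t. 'a'; without simplification
+// this yields more than 8000 nodes, as reported above
+val r = STAR(ALT(CHAR('a'), SEQ(CHAR('a'), CHAR('a'))))
+val r12 = (1 to 12).foldLeft(r)((acc, _) => der('a', acc))
+println(size(r12))
+\end{verbatim}
+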
+Fortunately, one can simplify regular expressions
after each derivative step. Various simplifications of regular
expressions are possible, such as the simplifications of $\ZERO + r$, $r
+ \ZERO$, $\ONE\cdot r$, $r \cdot \ONE$, and $r + r$ to just $r$. These
simplifications do not affect the answer for whether a regular
expression matches a string or not, but fortunately also do not affect
the POSIX strategy of how regular expressions match strings---although
-the latter is much harder to establish. Some initial results in this
+the latter is much harder to establish.
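+
+These rules can be implemented as a single bottom-up pass. The sketch
+below works on the plain Rexp datatype from the earlier sketch and also
+includes the analogous $\ZERO$ rules for sequences; our actual
+simplification operates on annotated regular expressions and
+additionally has to fuse the bit-codes of the nodes it rewrites:
+
+\begin{verbatim}
+def simp(r: Rexp): Rexp = r match {
+  case SEQ(r1, r2) => (simp(r1), simp(r2)) match {
+    case (ZERO, _) => ZERO                     // ZERO . r  =>  ZERO
+    case (_, ZERO) => ZERO                     // r . ZERO  =>  ZERO
+    case (ONE, r2s) => r2s                     // ONE . r   =>  r
+    case (r1s, ONE) => r1s                     // r . ONE   =>  r
+    case (r1s, r2s) => SEQ(r1s, r2s)
+  }
+  case ALT(r1, r2) => (simp(r1), simp(r2)) match {
+    case (ZERO, r2s) => r2s                    // ZERO + r  =>  r
+    case (r1s, ZERO) => r1s                    // r + ZERO  =>  r
+    case (r1s, r2s) =>
+      if (r1s == r2s) r1s else ALT(r1s, r2s)   // r + r     =>  r
+  }
+  case r => r
+}
+\end{verbatim}
+
+Note that in the $r + r$ case the left copy is kept, which matches the
+priority the POSIX strategy gives to the leftmost alternative.
+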
+The argument for complicating the data structures from basic regular
+expressions to those with bit-codes is that simplification can then be
+introduced without breaking the algorithm or making it overly complex
+to reason about. The latter is crucial for a correctness proof.
+Some initial results in this
regard have been obtained in \cite{AusafDyckhoffUrban2016}. However,
-what has not been achieved yet is a very tight bound for the size. Such
+what has not yet been achieved is a correctness proof for the bit-coded
+algorithm with simplifications, together with a very tight bound for the size. Such
a tight bound is suggested by the work of Antimirov, who proved that the
number of (partial) derivatives can be bounded by the number of characters
contained in the initial regular expression \cite{Antimirov95}.
@@ -709,8 +715,6 @@
$\textit{pder} \; c \; r^*$ & $\dn$ & $ \{ r' \cdot r^* \mid r' \in \textit{pder} \; c \; r \} $ \\
\end{tabular}
\end{center}
-
-
it is essentially a set of regular expressions that come from the
sub-structure of the original regular expression.
Antimirov proved a nice bound on the size of partial derivatives:
roughly speaking, their size will not exceed the fourth power of the
number of nodes in the original regular expression. Interestingly, we
observed in experiments that after the simplification step, the
derivatives in our algorithm are of the same size as, or smaller than,
the corresponding partial derivatives. This allows us to prove a tight
bound on the size of the regular expressions during the running of the
algorithm, provided we can establish the connection between our
simplification rules and partial derivatives.