--- a/ninems/ninems.tex Sun Jul 07 21:38:35 2019 +0100
+++ b/ninems/ninems.tex Sun Jul 07 22:09:24 2019 +0100
@@ -651,7 +651,7 @@
Which one of these alternatives is chosen later does not affect their relative position because our algorithm
does not change this order. If this parsing information can be determined and does
not change because of later derivatives,
-there is no point in traversing this information twice. This leads to an optimization---if we store the information for parse trees in the corresponding regular expression pieces, update this information when we do derivative operation on them, and collect the information when finished with derivatives and call $mkeps$ for deciding which branch is POSIX, we can generate the parse tree in one pass, instead of doing an $n$-step backward transformation.This leads to Sulzmann and Lu's novel idea of using bit-codes on derivatives.
+there is no point in traversing this information twice. This leads to an optimisation: if we store the parse-tree information inside the regular expression, update it with each derivative step, and collect it once we have finished taking derivatives by calling $mkeps$ to decide which branch is POSIX, then we can generate the parse tree in one pass, instead of performing the remaining $n$ injections. This leads to Sulzmann and Lu's novel idea of using bit-codes in derivatives.
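+
+To give a concrete flavour of this idea, the following sketch shows one
+possible way of annotating regular expressions with bit-codes (in
+Scala; the names are ours and this datatype is only one rendering of
+Sulzmann and Lu's annotated regular expressions):
+
+\begin{verbatim}
+abstract class Bit
+case object Z extends Bit                          // the zero bit
+case object S extends Bit                          // the one bit
+
+// annotated regular expressions: each node carries the
+// bits recording the parsing decisions made so far
+abstract class ARexp
+case object AZERO extends ARexp
+case class AONE(bs: List[Bit]) extends ARexp
+case class ACHAR(bs: List[Bit], c: Char) extends ARexp
+case class AALTS(bs: List[Bit], rs: List[ARexp]) extends ARexp
+case class ASEQ(bs: List[Bit], r1: ARexp, r2: ARexp) extends ARexp
+case class ASTAR(bs: List[Bit], r: ARexp) extends ARexp
+\end{verbatim}
+
+With such annotations the derivative operation only needs to push bits
+into the sub-expressions it rewrites; at the end $mkeps$ collects the
+bits of the nullable branch, which can be decoded into a parse tree,
+instead of rebuilding the tree by $n$ injections.
+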
In the next section, we shall focus on the bit-coded algorithm and the
process of simplification of regular expressions. This is needed in
@@ -665,34 +665,40 @@
Using bit-codes to guide parsing is not a novel idea. It was applied to
context-free grammars and then adapted by Henglein and Nielsen for
-efficient regular expression parsing \cite{nielson11bcre}. Sulzmann and
+efficient regular expression parsing using DFAs \cite{nielson11bcre}. Sulzmann and
Lu went a step further by integrating bit-codes into derivatives.
+The reason we use bit-codes in this project is that they allow us to
+introduce aggressive simplifications.
-The argument for complicating the data structures from basic regular
-expressions to those with bitcodes is that we can introduce
-simplification without making the algorithm crash or overly complex to
-reason about. The reason why we need simplification is due to the
-shortcoming of the previous algorithm. The main drawback of building
+The main drawback of building
successive derivatives according to Brzozowski's definition is that they
can grow very quickly in size. This is mainly due to the fact that the
derivative operation often generates ``useless'' $\ZERO$s and $\ONE$s in
derivatives. As a result, if implemented naively both algorithms by
-Brzozowski and by Sulzmann and Lu are excruciatingly slow. For example
-when starting with the regular expression $(a + aa)^*$ and building 12
+Brzozowski and by Sulzmann and Lu are excruciatingly slow.
+For example, when starting with the regular expression $(a + aa)^*$ and building 12
successive derivatives w.r.t.~the character $a$, one obtains a
derivative regular expression with more than 8000 nodes (when viewed as
a tree). Operations like derivative and $\nullable$ need to traverse
such trees and consequently the bigger the size of the derivative the
-slower the algorithm. Fortunately, one can simplify regular expressions
+slower the algorithm.
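+
+This blow-up is easy to observe. The following is a minimal sketch of
+Brzozowski derivatives for plain regular expressions in Scala (the
+names are ours), together with a function measuring the size of a
+regular expression when viewed as a tree:
+
+\begin{verbatim}
+abstract class Rexp
+case object ZERO extends Rexp
+case object ONE extends Rexp
+case class CHAR(c: Char) extends Rexp
+case class ALT(r1: Rexp, r2: Rexp) extends Rexp
+case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
+case class STAR(r: Rexp) extends Rexp
+
+def nullable(r: Rexp): Boolean = r match {
+  case ZERO => false
+  case ONE => true
+  case CHAR(_) => false
+  case ALT(r1, r2) => nullable(r1) || nullable(r2)
+  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
+  case STAR(_) => true
+}
+
+def der(c: Char, r: Rexp): Rexp = r match {
+  case ZERO => ZERO
+  case ONE => ZERO
+  case CHAR(d) => if (c == d) ONE else ZERO
+  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
+  case SEQ(r1, r2) =>
+    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
+    else SEQ(der(c, r1), r2)
+  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
+}
+
+// number of nodes of a regular expression viewed as a tree
+def size(r: Rexp): Int = r match {
+  case ZERO | ONE | CHAR(_) => 1
+  case ALT(r1, r2) => 1 + size(r1) + size(r2)
+  case SEQ(r1, r2) => 1 + size(r1) + size(r2)
+  case STAR(r1) => 1 + size(r1)
+}
+
+// 12 derivatives of (a + aa)* w.r.t. 'a'; without simplification
+// this yields more than 8000 nodes, as reported above
+val r = STAR(ALT(CHAR('a'), SEQ(CHAR('a'), CHAR('a'))))
+val r12 = (1 to 12).foldLeft(r)((acc, _) => der('a', acc))
+println(size(r12))
+\end{verbatim}
+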
+Fortunately, one can simplify regular expressions
after each derivative step. Various simplifications of regular
expressions are possible, such as the simplifications of $\ZERO + r$, $r
+ \ZERO$, $\ONE\cdot r$, $r \cdot \ONE$, and $r + r$ to just $r$. These
simplifications do not affect the answer for whether a regular
expression matches a string or not, but fortunately also do not affect
the POSIX strategy of how regular expressions match strings---although
-the latter is much harder to establish. Some initial results in this
+the latter is much harder to establish.
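+
+These rules can be implemented as a single bottom-up pass. The sketch
+below works on the plain Rexp datatype from the earlier sketch and also
+includes the analogous $\ZERO$ rules for sequences; our actual
+simplification operates on annotated regular expressions and
+additionally has to fuse the bit-codes of the nodes it rewrites:
+
+\begin{verbatim}
+def simp(r: Rexp): Rexp = r match {
+  case SEQ(r1, r2) => (simp(r1), simp(r2)) match {
+    case (ZERO, _) => ZERO                     // ZERO . r  =>  ZERO
+    case (_, ZERO) => ZERO                     // r . ZERO  =>  ZERO
+    case (ONE, r2s) => r2s                     // ONE . r   =>  r
+    case (r1s, ONE) => r1s                     // r . ONE   =>  r
+    case (r1s, r2s) => SEQ(r1s, r2s)
+  }
+  case ALT(r1, r2) => (simp(r1), simp(r2)) match {
+    case (ZERO, r2s) => r2s                    // ZERO + r  =>  r
+    case (r1s, ZERO) => r1s                    // r + ZERO  =>  r
+    case (r1s, r2s) =>
+      if (r1s == r2s) r1s else ALT(r1s, r2s)   // r + r     =>  r
+  }
+  case r => r
+}
+\end{verbatim}
+
+Note that in the $r + r$ case the left copy is kept, which matches the
+priority the POSIX strategy gives to the leftmost alternative.
+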
+The argument for complicating the data structures from basic regular
+expressions to those with bit-codes is that simplification can then be
+introduced without breaking the algorithm or making it overly complex
+to reason about. The latter is crucial for a correctness proof.
+Some initial results in this
regard have been obtained in \cite{AusafDyckhoffUrban2016}. However,
-what has not been achieved yet is a very tight bound for the size. Such
+what has not yet been achieved is a correctness proof for the bit-coded
+algorithm with simplifications, together with a very tight bound for the size. Such
a tight bound is suggested by the work of Antimirov, who proved that the
number of (partial) derivatives can be bounded by the number of characters
contained in the initial regular expression \cite{Antimirov95}.
@@ -709,8 +715,6 @@
$\textit{pder} \; c \; r^*$ & $\dn$ & $ \{ r' \cdot r^* \mid r' \in \textit{pder} \; c \; r \} $ \\
\end{tabular}
\end{center}
-
-
it is essentially a set of regular expressions that come from the
sub-structure of the original regular expression.
Antimirov proved a nice bound on the size of partial derivatives:
roughly speaking, their size will not exceed the fourth power of the
number of nodes in the original regular expression. Interestingly, we
observed in experiments that after the simplification step, the
derivatives in our algorithm are of the same size as, or smaller than,
the corresponding partial derivatives. This allows us to prove a tight
bound on the size of the regular expressions during the running of the
algorithm, provided we can establish the connection between our
simplification rules and partial derivatives.