In order to differentiate between these choices,
we just need to remember their positions: $a$ is on the left, $ab$ is in the middle, and $abc$ is on the right.
Which one of these alternatives is chosen later does not affect their relative position because our algorithm
does not change this order. If this parsing information can be determined and does
not change under later derivatives,
there is no point in traversing this information twice. This leads to an optimisation---if we store the information for parse trees inside the regular expression, update it with each derivative step, and collect it once all derivatives have been taken, using $mkeps$ to decide which branch is POSIX, then we can generate the parse tree in one pass, instead of performing the remaining $n$ injections. This is the basis of Sulzmann and Lu's novel idea of using bit-codes in derivatives.
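
To give a first impression of this idea, here is a minimal sketch in Scala of what such bit-coded (annotated) regular expressions could look like. The datatype and function names (\texttt{ARexp}, \texttt{bmkeps}, the bit markers \texttt{Z} and \texttt{S}) are chosen purely for illustration and are not necessarily the definitions adopted later in this work. The derivative operation (not shown) attaches bits to the annotations whenever it resolves an alternative or unfolds a star; at the end, \texttt{bmkeps} only has to walk down one nullable branch and collect the accumulated bits.

\begin{verbatim}
abstract class Bit
case object Z extends Bit     // e.g. left alternative / end of a star
case object S extends Bit     // e.g. right alternative / one more iteration

abstract class ARexp          // regular expressions annotated with bits
case object AZERO extends ARexp
case class AONE(bs: List[Bit]) extends ARexp
case class ACHAR(bs: List[Bit], c: Char) extends ARexp
case class AALT(bs: List[Bit], r1: ARexp, r2: ARexp) extends ARexp
case class ASEQ(bs: List[Bit], r1: ARexp, r2: ARexp) extends ARexp
case class ASTAR(bs: List[Bit], r: ARexp) extends ARexp

def bnullable(r: ARexp): Boolean = r match {
  case AZERO           => false
  case AONE(_)         => true
  case ACHAR(_, _)     => false
  case AALT(_, r1, r2) => bnullable(r1) || bnullable(r2)
  case ASEQ(_, r1, r2) => bnullable(r1) && bnullable(r2)
  case ASTAR(_, _)     => true
}

// after the last derivative: collect the bits along one nullable branch,
// preferring the left alternative as POSIX requires
// (only ever called on bnullable expressions)
def bmkeps(r: ARexp): List[Bit] = r match {
  case AONE(bs)         => bs
  case AALT(bs, r1, r2) => if (bnullable(r1)) bs ++ bmkeps(r1)
                           else bs ++ bmkeps(r2)
  case ASEQ(bs, r1, r2) => bs ++ bmkeps(r1) ++ bmkeps(r2)
  case ASTAR(bs, _)     => bs ++ List(Z)   // marker: no further iteration
}
\end{verbatim}

A separate \texttt{decode} function (not shown) would then rebuild the parse tree from the collected bits and the original regular expression in a single pass.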

In the next section, we shall focus on the bit-coded algorithm and the
process of simplification of regular expressions. This is needed in
order to obtain \emph{fast} versions of Brzozowski's and of Sulzmann
and Lu's algorithms.  This is where the PhD-project aims to advance the

\section{Simplification of Regular Expressions}

Using bit-codes to guide parsing is not a novel idea. It was applied to
context-free grammars and then adapted by Henglein and Nielson for
efficient regular expression parsing using DFAs \cite{nielson11bcre}. Sulzmann and
Lu went a step further by integrating bitcodes into derivatives.
The reason we use bitcodes in
this project is that we want to introduce aggressive
simplifications.

The main drawback of building
successive derivatives according to Brzozowski's definition is that they
can grow very quickly in size. This is mainly due to the fact that the
derivative operation often generates ``useless'' $\ZERO$s and $\ONE$s in
derivatives.  As a result, if implemented naively, both algorithms by
Brzozowski and by Sulzmann and Lu are excruciatingly slow.
For example, when starting with the regular expression $(a + aa)^*$ and building 12
successive derivatives w.r.t.~the character $a$, one obtains a
derivative regular expression with more than 8000 nodes (when viewed as
a tree). Operations like derivative and $\nullable$ need to traverse
such trees and consequently the bigger the size of the derivative, the
slower the algorithm.
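
To make this blow-up concrete, the following small Scala sketch (written only for illustration, e.g.~to be pasted into the Scala REPL; the datatype and function names are not necessarily those used later in this work) builds the first twelve derivatives of $(a + aa)^*$ w.r.t.~$a$ without any simplification and records the number of tree nodes after each step.

\begin{verbatim}
abstract class Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

def nullable(r: Rexp): Boolean = r match {
  case ZERO        => false
  case ONE         => true
  case CHAR(_)     => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_)     => true
}

// Brzozowski's derivative, with no simplification whatsoever
def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO        => ZERO
  case ONE         => ZERO
  case CHAR(d)     => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1)    => SEQ(der(c, r1), STAR(r1))
}

// number of nodes when the regular expression is viewed as a tree
def size(r: Rexp): Int = r match {
  case ZERO | ONE | CHAR(_) => 1
  case ALT(r1, r2)          => 1 + size(r1) + size(r2)
  case SEQ(r1, r2)          => 1 + size(r1) + size(r2)
  case STAR(r1)             => 1 + size(r1)
}

val r = STAR(ALT(CHAR('a'), SEQ(CHAR('a'), CHAR('a'))))      // (a + aa)*
val sizes = (1 to 12).scanLeft(r)((ri, _) => der('a', ri)).map(size)
println(sizes.toList)   // the node counts grow rapidly with each step
\end{verbatim}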
       
Fortunately, one can simplify regular expressions
after each derivative step. Various simplifications of regular
expressions are possible, such as rewriting $\ZERO + r$, $r
+ \ZERO$, $\ONE\cdot r$, $r \cdot \ONE$, and $r + r$ to just $r$. These
simplifications do not affect the answer for whether a regular
expression matches a string or not, but fortunately also do not affect
the POSIX strategy of how regular expressions match strings---although
the latter is much harder to establish.
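
A single bottom-up pass suffices to apply these rewrites. The sketch below (again only an illustration, reusing the \texttt{Rexp} datatype and \texttt{der} from the sketch above) implements exactly the rules just listed and interleaves them with the derivative.

\begin{verbatim}
// apply 0 + r, r + 0, 1.r, r.1 and r + r  =>  r, bottom-up;
// leaves and stars are returned unchanged
def simp(r: Rexp): Rexp = r match {
  case ALT(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, s2) => s2                        // 0 + r  =>  r
    case (s1, ZERO) => s1                        // r + 0  =>  r
    case (s1, s2)   => if (s1 == s2) s1          // r + r  =>  r
                       else ALT(s1, s2)
  }
  case SEQ(r1, r2) => (simp(r1), simp(r2)) match {
    case (ONE, s2)  => s2                        // 1 . r  =>  r
    case (s1, ONE)  => s1                        // r . 1  =>  r
    case (s1, s2)   => SEQ(s1, s2)
  }
  case r => r
}

// derivative w.r.t. a whole string, simplifying after each step
def ders_simp(s: List[Char], r: Rexp): Rexp =
  s.foldLeft(r)((acc, c) => simp(der(c, acc)))
\end{verbatim}

These purely local rewrites are only a starting point: on their own they cannot, for instance, remove duplicates that are hidden inside nested alternatives, which is one reason why the aggressive simplification mentioned earlier is needed.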
       
The argument for complicating the data structures from basic regular
expressions to those with bitcodes is that we can introduce
simplification without making the algorithm crash or overly complex to
reason about. The latter is crucial for a correctness proof.
Some initial results in this
regard have been obtained in \cite{AusafDyckhoffUrban2016}. However,
what has not been achieved yet is a correctness proof for the bit-coded algorithm
with simplifications, and a very tight bound for the size. Such
a tight bound is suggested by work of Antimirov who proved that
(partial) derivatives can be bounded by the number of characters contained
in the initial regular expression \cite{Antimirov95}.

Antimirov defined the \emph{partial derivatives} of regular expressions as follows:
\begin{center}
\begin{tabular}{lcl}
  $\textit{pder} \; c \; \ZERO$ & $\dn$ & $\emptyset$ \\
  $\textit{pder} \; c \; \ONE$ & $\dn$ & $\emptyset$ \\
  $\textit{pder} \; c \; d$ & $\dn$ & $\textit{if} \; c = d \; \textit{then} \; \{ \ONE \} \; \textit{else} \; \emptyset$ \\
  $\textit{pder} \; c \; r_1+r_2$ & $\dn$ & $\textit{pder} \; c \; r_1 \cup \textit{pder} \; c \; r_2$ \\
  $\textit{pder} \; c \; r_1 \cdot r_2$ & $\dn$ & $\textit{if} \; \nullable \; r_1 \; \textit{then} \; \{ r \cdot r_2 \mid r \in \textit{pder} \; c \; r_1 \} \cup \textit{pder} \; c \; r_2 \; \textit{else} \; \{ r \cdot r_2 \mid r \in \textit{pder} \; c \; r_1 \}$ \\
  $\textit{pder} \; c \; r^*$ & $\dn$ & $\{ r' \cdot r^* \mid r' \in \textit{pder} \; c \; r \}$ \\
\end{tabular}
\end{center}
It is essentially a set of regular expressions that come from the sub-structure of the original regular expression.
Antimirov proved a nice bound on the size of partial derivatives: roughly speaking, the size will not exceed the fourth power of the number of nodes in the initial regular expression.  Interestingly, we observed in experiments that after the simplification step our regular expressions are of the same size as, or smaller than, the partial derivatives. This allows us to prove a tight bound on the size of the regular expressions during the running of the algorithm, provided we can establish the connection between our simplification rules and partial derivatives.
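
As a small worked example (our own, purely for illustration), consider again the regular expression $(a + aa)^*$ from above. Using the clauses of $\textit{pder}$ we obtain
\begin{center}
$\textit{pder} \; a \; (a + aa)^* \;=\; \{ r' \cdot (a + aa)^* \mid r' \in \textit{pder} \; a \; (a + aa) \} \;=\; \{ \ONE \cdot (a + aa)^*,\; (\ONE \cdot a) \cdot (a + aa)^* \}$
\end{center}
a set with just two small elements; taking further partial derivatives w.r.t.~$a$ produces no regular expressions outside this set. This is in stark contrast to the unsimplified Brzozowski derivatives of the same expression, which after twelve steps already contain more than 8000 nodes.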

%We believe, and have generated test
%data, that a similar bound can be obtained for the derivatives in