649 In order to differentiate between these choices, |
649 In order to differentiate between these choices, |
650 we just need to remember their positions--$a$ is on the left, $ab$ is in the middle , and $abc$ is on the right. |
650 we just need to remember their positions--$a$ is on the left, $ab$ is in the middle , and $abc$ is on the right. |
651 Which one of these alternatives is chosen later does not affect their relative position because our algorithm |
651 Which one of these alternatives is chosen later does not affect their relative position because our algorithm |
652 does not change this order. If this parsing information can be determined and does |
652 does not change this order. If this parsing information can be determined and does |
653 not change because of later derivatives, |
653 not change because of later derivatives, |
654 there is no point in traversing this information twice. This leads to an optimization---if we store the information for parse trees in the corresponding regular expression pieces, update this information when we do derivative operation on them, and collect the information when finished with derivatives and call $mkeps$ for deciding which branch is POSIX, we can generate the parse tree in one pass, instead of doing an $n$-step backward transformation.This leads to Sulzmann and Lu's novel idea of using bit-codes on derivatives. |
654 there is no point in traversing this information twice. This leads to an optimisation---if we store the information for parse trees inside the regular expression, update it when we do derivative on them, and collect the information when finished with derivatives and call $mkeps$ for deciding which branch is POSIX, we can generate the parse tree in one pass, instead of doing the rest $n$ injections. This leads to Sulzmann and Lu's novel idea of using bit-codes in derivatives. |
655 |
655 |
656 In the next section, we shall focus on the bit-coded algorithm and the |
656 In the next section, we shall focus on the bit-coded algorithm and the |
657 process of simplification of regular expressions. This is needed in |
657 process of simplification of regular expressions. This is needed in |
658 order to obtain \emph{fast} versions of the Brzozowski's, and Sulzmann |
658 order to obtain \emph{fast} versions of the Brzozowski's, and Sulzmann |
659 and Lu's algorithms. This is where the PhD-project aims to advance the |
659 and Lu's algorithms. This is where the PhD-project aims to advance the |
663 \section{Simplification of Regular Expressions} |
663 \section{Simplification of Regular Expressions} |
664 |
664 |
665 |
665 |
666 Using bit-codes to guide parsing is not a novel idea. It was applied to |
666 Using bit-codes to guide parsing is not a novel idea. It was applied to |
667 context free grammars and then adapted by Henglein and Nielson for |
667 context free grammars and then adapted by Henglein and Nielson for |
668 efficient regular expression parsing \cite{nielson11bcre}. Sulzmann and |
668 efficient regular expression parsing using DFAs \cite{nielson11bcre}. Sulzmann and |
669 Lu took a step further by integrating bitcodes into derivatives. |
669 Lu took a step further by integrating bitcodes into derivatives. |
670 |
670 The reason why we want to use bitcodes in |
671 The argument for complicating the data structures from basic regular |
671 this project is that we want to introduce aggressive |
672 expressions to those with bitcodes is that we can introduce |
672 simplifications. |
673 simplification without making the algorithm crash or overly complex to |
673 |
674 reason about. The reason why we need simplification is due to the |
674 The main drawback of building |
675 shortcoming of the previous algorithm. The main drawback of building |
|
676 successive derivatives according to Brzozowski's definition is that they |
675 successive derivatives according to Brzozowski's definition is that they |
677 can grow very quickly in size. This is mainly due to the fact that the |
676 can grow very quickly in size. This is mainly due to the fact that the |
678 derivative operation generates often ``useless'' $\ZERO$s and $\ONE$s in |
677 derivative operation generates often ``useless'' $\ZERO$s and $\ONE$s in |
679 derivatives. As a result, if implemented naively both algorithms by |
678 derivatives. As a result, if implemented naively both algorithms by |
680 Brzozowski and by Sulzmann and Lu are excruciatingly slow. For example |
679 Brzozowski and by Sulzmann and Lu are excruciatingly slow. |
681 when starting with the regular expression $(a + aa)^*$ and building 12 |
680 For example when starting with the regular expression $(a + aa)^*$ and building 12 |
682 successive derivatives w.r.t.~the character $a$, one obtains a |
681 successive derivatives w.r.t.~the character $a$, one obtains a |
683 derivative regular expression with more than 8000 nodes (when viewed as |
682 derivative regular expression with more than 8000 nodes (when viewed as |
684 a tree). Operations like derivative and $\nullable$ need to traverse |
683 a tree). Operations like derivative and $\nullable$ need to traverse |
685 such trees and consequently the bigger the size of the derivative the |
684 such trees and consequently the bigger the size of the derivative the |
686 slower the algorithm. Fortunately, one can simplify regular expressions |
685 slower the algorithm. |
|
686 Fortunately, one can simplify regular expressions |
687 after each derivative step. Various simplifications of regular |
687 after each derivative step. Various simplifications of regular |
688 expressions are possible, such as the simplifications of $\ZERO + r$, $r |
688 expressions are possible, such as the simplifications of $\ZERO + r$, $r |
689 + \ZERO$, $\ONE\cdot r$, $r \cdot \ONE$, and $r + r$ to just $r$. These |
689 + \ZERO$, $\ONE\cdot r$, $r \cdot \ONE$, and $r + r$ to just $r$. These |
690 simplifications do not affect the answer for whether a regular |
690 simplifications do not affect the answer for whether a regular |
691 expression matches a string or not, but fortunately also do not affect |
691 expression matches a string or not, but fortunately also do not affect |
692 the POSIX strategy of how regular expressions match strings---although |
692 the POSIX strategy of how regular expressions match strings---although |
693 the latter is much harder to establish. Some initial results in this |
693 the latter is much harder to establish. |
|
694 The argument for complicating the data structures from basic regular |
|
695 expressions to those with bitcodes is that we can introduce |
|
696 simplification without making the algorithm crash or overly complex to |
|
697 reason about. The latter is crucial for a correctness proof. |
|
698 Some initial results in this |
694 regard have been obtained in \cite{AusafDyckhoffUrban2016}. However, |
699 regard have been obtained in \cite{AusafDyckhoffUrban2016}. However, |
695 what has not been achieved yet is a very tight bound for the size. Such |
700 what has not been achieved yet is correctness for the bit-coded algorithm |
|
701 that involves simplifications and a very tight bound for the size. Such |
696 a tight bound is suggested by work of Antimirov who proved that |
702 a tight bound is suggested by work of Antimirov who proved that |
697 (partial) derivatives can be bound by the number of characters contained |
703 (partial) derivatives can be bound by the number of characters contained |
698 in the initial regular expression \cite{Antimirov95}. |
704 in the initial regular expression \cite{Antimirov95}. |
699 |
705 |
700 Antimirov defined the \emph{partial derivatives} of regular expressions to be this: |
706 Antimirov defined the \emph{partial derivatives} of regular expressions to be this: |
707 $\textit{pder} \; c \; r_1+r_2$ & $\dn$ & $pder \; c \; r_1 \cup pder \; c \; r_2$ \\ |
713 $\textit{pder} \; c \; r_1+r_2$ & $\dn$ & $pder \; c \; r_1 \cup pder \; c \; r_2$ \\ |
708 $\textit{pder} \; c \; r_1 \cdot r_2$ & $\dn$ & $\textit{if} \; nullable \; r_1 \; \{ r \cdot r_2 \mid r \in pder \; c \; r_1 \} \cup pder \; c \; r_2 \; \textit{else} \; \{ r \cdot r_2 \mid r \in pder \; c \; r_1 \} $ \\ |
714 $\textit{pder} \; c \; r_1 \cdot r_2$ & $\dn$ & $\textit{if} \; nullable \; r_1 \; \{ r \cdot r_2 \mid r \in pder \; c \; r_1 \} \cup pder \; c \; r_2 \; \textit{else} \; \{ r \cdot r_2 \mid r \in pder \; c \; r_1 \} $ \\ |
709 $\textit{pder} \; c \; r^*$ & $\dn$ & $ \{ r' \cdot r^* \mid r' \in pder \; c \; r \} $ \\ |
715 $\textit{pder} \; c \; r^*$ & $\dn$ & $ \{ r' \cdot r^* \mid r' \in pder \; c \; r \} $ \\ |
710 \end{tabular} |
716 \end{tabular} |
711 \end{center} |
717 \end{center} |
712 |
|
713 |
|
714 it is essentially a set of regular expressions that come from the sub-structure of the original regular expression. |
718 it is essentially a set of regular expressions that come from the sub-structure of the original regular expression. |
715 Antimirov has proved a nice size bound of the size of partial derivatives. Roughly speaking the size will not exceed the fourth power of the number of nodes in that regular expression. Interestingly, we observed from experiment that after the simplification step, our regular expression has the same size or is smaller than the partial derivatives. This allows us to prove a tight bound on the size of regular expression during the running time of the algorithm if we can establish the connection between our simplification rules and partial derivatives. |
719 Antimirov has proved a nice size bound of the size of partial derivatives. Roughly speaking the size will not exceed the fourth power of the number of nodes in that regular expression. Interestingly, we observed from experiment that after the simplification step, our regular expression has the same size or is smaller than the partial derivatives. This allows us to prove a tight bound on the size of regular expression during the running time of the algorithm if we can establish the connection between our simplification rules and partial derivatives. |
716 |
720 |
717 %We believe, and have generated test |
721 %We believe, and have generated test |
718 %data, that a similar bound can be obtained for the derivatives in |
722 %data, that a similar bound can be obtained for the derivatives in |