547 by the number of characters contained in the initial regular |
547 by the number of characters contained in the initial regular |
548 expression \cite{Antimirov95}. |
548 expression \cite{Antimirov95}. |
549 |
549 |
550 Antimirov defined the "partial derivatives" of regular expressions to be this: |
550 Antimirov defined the "partial derivatives" of regular expressions to be this: |
551 %TODO definition of partial derivatives |
551 %TODO definition of partial derivatives |
552 \[ pder c 0 = emtpy set\] |
552 \begin{center} |
553 \[pder c 1 = emptry set\] |
553 \begin{tabular}{lcl} |
554 \[pder c c = if c is equal to d then set containing lambda only else empty set\] |
554 $\textit{pder} \; c \; 0$ & $\dn$ & $\emptyset$\\ |
555 \[pder c r_1+r_2 = pder c r_1 \cup pder c r_2\] |
555 $\textit{pder} \; c \; 1$ & $\dn$ & $\emptyset$ \\ |
556 \[pder c r_1 \cdot r_2 = if r_1 nullable then set of r \cdot r_2' where r is in pder c r_1 \cup pder c r_2\] |
556 $\textit{pder} \; c \; d$ & $\dn$ & $\textit{if} \; c \,=\, d \; \{ 1 \} \; \textit{else} \; \emptyset$ \\ |
557 \[else set of r \cdot r_2' where r is in pder c r_1\] |
557 $\textit{pder} \; c \; r_1+r_2$ & $\dn$ & $pder \; c \; r_1 \cup pder \; c \; r_2$ \\ |
558 \[pder c r^* = set of r \cdot r^* where r \in pder c r\] |
558 $\textit{pder} \; c \; r_1 \cdot r_2$ & $\dn$ & $\textit{if} \; nullable \; r_1 \; \{ r \cdot r_2 \mid r \in pder \; c \; r_1 \} \cup pder \; c \; r_2 \; \textit{else} \; \{ r \cdot r_2 \mid r \in pder \; c \; r_1 \} $ \\ |
559 |
559 $\textit{pder} \; c \; r^*$ & $\dn$ & $ \{ r' \cdot r^* \mid r' \in pder \; c \; r \} $ \\ |
|
560 \end{tabular} |
|
561 \end{center} |
|
562 |
560 |
563 |
561 it is essentially a set of regular expressions that come from the sub-structure of the original regular expression. |
564 it is essentially a set of regular expressions that come from the sub-structure of the original regular expression. |
562 Antimirov has proved a nice size bound of the size of partial derivatives. Roughly speaking the size will not exceed the fourth power of the number of nodes in that regular expression. Interestingly, we observed from experiment that after the simplification step, our regular expression has the same size or is smaller than the partial derivatives. This allows us to prove a tight bound on the size of regular expression during the running time of the algorithm if we can establish the connection between our simplification rules and partial derivatives. |
565 Antimirov has proved a nice size bound of the size of partial derivatives. Roughly speaking the size will not exceed the fourth power of the number of nodes in that regular expression. Interestingly, we observed from experiment that after the simplification step, our regular expression has the same size or is smaller than the partial derivatives. This allows us to prove a tight bound on the size of regular expression during the running time of the algorithm if we can establish the connection between our simplification rules and partial derivatives. |
563 |
566 |
564 %We believe, and have generated test |
567 %We believe, and have generated test |
753 inspired by Antimirov's work on partial derivatives. They maintain the |
756 inspired by Antimirov's work on partial derivatives. They maintain the |
754 idea that only the first ``copy'' of a regular expression in an |
757 idea that only the first ``copy'' of a regular expression in an |
755 alternative contributes to the calculation of a POSIX value. All |
758 alternative contributes to the calculation of a POSIX value. All |
756 subsequent copies can be pruned from the regular expression. |
759 subsequent copies can be pruned from the regular expression. |
757 |
760 |
758 A psuedocode version of our algorithm is given below:\\ |
761 A recursive definition of simplification function that looks similar to scala code is given below:\\ |
759 \[simp r \dn r if r = ONE bs or CHAR bs c or STAR bs r |
762 \begin{center} |
760 simp SEQ bs r_1 r_2 \dn \\ |
763 \begin{tabular}{@{}lcl@{}} |
761 case (simp(r_1), simp(r_2) ) of (0, _) => 0 |
764 $\textit{simp} \; a$ & $\dn$ & $\textit{a} \; \textit{if} \; a = (\textit{ONE} \; bs) \; or\; (\textit{CHAR} \, bs \; c) \; or\; (\textit{STAR}\; bs\; a_1)$\\ |
762 (_,0) => 0 |
765 |
763 (1, r) => fuse bs r |
766 $\textit{simp} \; \textit{SEQ}\;bs\,a_1\,a_2$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp} \; a_2) \; \textit{match} $ \\ |
764 (r,1) => fuse bs r |
767 &&$\textit{case} \; (0, \_) \Rightarrow 0$ \\ |
765 (r_1, r_2) => SEQ bs r_1 r_2 |
768 &&$ \textit{case} \; (\_, 0) \Rightarrow 0$ \\ |
766 simp ALT bs rs = distinct(flatten( map simp rs)) match |
769 &&$ \textit{case} \; (1, a_2') \Rightarrow \textit{fuse} \; bs \; a_2'$ \\ |
767 case Nil => ZERO |
770 &&$ \textit{case} \; (a_1', 1) \Rightarrow \textit{fuse} \; bs \; a_1'$ \\ |
768 case r::Nil => fuse bs r |
771 &&$ \textit{case} \; (a_1', a_2') \Rightarrow \textit{SEQ} \; bs \; a_1' \; a_2'$ \\ |
769 case rs => ALT bs rs |
772 |
770 \] |
773 $\textit{simp} \; \textit{ALT}\;bs\,as$ & $\dn$ & $\textit{ distinct}( \textit{flatten} ( \textit{map simp as})) \; \textit{match} $ \\ |
|
774 &&$\textit{case} \; [] \Rightarrow 0$ \\ |
|
775 &&$ \textit{case} \; a :: [] \Rightarrow \textit{fuse bs a}$ \\ |
|
776 &&$ \textit{case} \; as' \Rightarrow \textit{ALT bs as'}$ |
|
777 \end{tabular} |
|
778 \end{center} |
771 |
779 |
772 The simplification does a pattern matching on the regular expression. When it detected that |
780 The simplification does a pattern matching on the regular expression. When it detected that |
773 the regular expression is an alternative or sequence, it will try to simplify its children regular expressions |
781 the regular expression is an alternative or sequence, it will try to simplify its children regular expressions |
774 recursively and then see if one of the children turn into 0 or 1, which might trigger further simplification |
782 recursively and then see if one of the children turn into 0 or 1, which might trigger further simplification |
775 at the current level. The most involved part is the ALTS clause, where we use two auxiliary functions |
783 at the current level. The most involved part is the ALTS clause, where we use two auxiliary functions |