550 that the hole will be on $r_a$. So we recursively call $\inj\, |
550 that the hole will be on $r_a$. So we recursively call $\inj\, |
551 r_a\,c\,v_a$ to fill that hole in $v_a$. After injection, the value |
551 r_a\,c\,v_a$ to fill that hole in $v_a$. After injection, the value |
552 $v_i$ for $r_i = r_a \cdot r_b$ should be $\Seq\,(\inj\,r_a\,c\,v_a)\,v_b$. |
552 $v_i$ for $r_i = r_a \cdot r_b$ should be $\Seq\,(\inj\,r_a\,c\,v_a)\,v_b$. |
553 Other clauses can be understood in a similar way. |
553 Other clauses can be understood in a similar way. |
554 |
554 |
555 The following example gives a \comment{Other word: insight?}taste of $\textit{inj}$'s effect |
555 %\comment{Other word: insight?} |
|
556 The following example gives an insight of $\textit{inj}$'s effect |
556 and how Sulzmann and Lu's algorithm works as a whole. |
557 and how Sulzmann and Lu's algorithm works as a whole. |
557 Suppose we have a |
558 Suppose we have a |
558 regular expression $((((a+b)+ab)+c)+abc)^*$, and want to match it against |
559 regular expression $((((a+b)+ab)+c)+abc)^*$, and want to match it against |
559 the string $abc$ (when $abc$ is written as a regular expression, the most |
560 the string $abc$ (when $abc$ is written as a regular expression, the most |
560 standard way of expressing it should be $a \cdot (b \cdot c)$. We omit |
561 standard way of expressing it should be $a \cdot (b \cdot c)$. We omit |
672 store the information for parse trees inside the regular expression, |
673 store the information for parse trees inside the regular expression, |
673 update it when we do derivative on them, and collect the information |
674 update it when we do derivative on them, and collect the information |
674 when finished with derivatives and call $mkeps$ for deciding which |
675 when finished with derivatives and call $mkeps$ for deciding which |
675 branch is POSIX, we can generate the parse tree in one pass, instead of |
676 branch is POSIX, we can generate the parse tree in one pass, instead of |
676 doing the rest $n$ injections. This leads to Sulzmann and Lu's novel |
677 doing the rest $n$ injections. This leads to Sulzmann and Lu's novel |
677 idea of using bit-codes in derivatives. |
678 idea of using bitcodes in derivatives. |
678 |
679 |
679 In the next section, we shall focus on the bit-coded algorithm and the |
680 In the next section, we shall focus on the bit-coded algorithm and the |
680 process of simplification of regular expressions. This is needed in |
681 process of simplification of regular expressions. This is needed in |
681 order to obtain \emph{fast} versions of the Brzozowski's, and Sulzmann |
682 order to obtain \emph{fast} versions of the Brzozowski's, and Sulzmann |
682 and Lu's algorithms. This is where the PhD-project aims to advance the |
683 and Lu's algorithms. This is where the PhD-project aims to advance the |
710 as the simplifications of $\ZERO + r$, $r + \ZERO$, $\ONE\cdot r$, $r |
711 as the simplifications of $\ZERO + r$, $r + \ZERO$, $\ONE\cdot r$, $r |
711 \cdot \ONE$, and $r + r$ to just $r$. These simplifications do not |
712 \cdot \ONE$, and $r + r$ to just $r$. These simplifications do not |
712 affect the answer for whether a regular expression matches a string or |
713 affect the answer for whether a regular expression matches a string or |
713 not, but fortunately also do not affect the POSIX strategy of how |
714 not, but fortunately also do not affect the POSIX strategy of how |
714 regular expressions match strings---although the latter is much harder |
715 regular expressions match strings---although the latter is much harder |
715 to establish. \comment{Does not make sense.} The argument for |
716 to establish. Some initial results in this regard have been |
716 complicating the data structures from basic regular expressions to those |
|
717 with bitcodes is that we can introduce simplification without making the |
|
718 algorithm crash or overly complex to reason about. The latter is crucial |
|
719 for a correctness proof. Some initial results in this regard have been |
|
720 obtained in \cite{AusafDyckhoffUrban2016}. |
717 obtained in \cite{AusafDyckhoffUrban2016}. |
721 |
718 |
722 Unfortunately, the simplification rules outlined above are not |
719 Unfortunately, the simplification rules outlined above are not |
723 sufficient to prevent an explosion for all regular expression. We |
720 sufficient to prevent an explosion for all regular expression. We |
724 believe a tighter bound can be achieved that prevents an explosion in |
721 believe a tighter bound can be achieved that prevents an explosion in |
743 |
740 |
744 \noindent |
741 \noindent |
745 A partial derivative of a regular expression $r$ is essentially a set of |
742 A partial derivative of a regular expression $r$ is essentially a set of |
746 regular expressions that are either $r$'s children expressions or a |
743 regular expressions that are either $r$'s children expressions or a |
747 concatenation of them. Antimirov has proved a tight bound of the size of |
744 concatenation of them. Antimirov has proved a tight bound of the size of |
748 partial derivatives. \comment{That looks too preliminary to me.} Roughly |
745 partial derivatives. Roughly |
749 speaking the size will not exceed the fourth power of the number of |
746 speaking the size will not exceed $t^2$, t is the number of characters |
750 nodes in that regular expression. \comment{Improve: which |
747 appearing in that regular expression. If we want the size of derivatives |
751 simplifications?}Interestingly, we observed from experiment that after |
748 to stay below this bound, we would need more aggressive simplifications |
752 the simplification step, our regular expression has the same size or is |
749 such as opening up alternatives to achieve the maximum level of duplicates |
753 smaller than the partial derivatives. This allows us to prove a tight |
750 cancellation. |
754 bound on the size of regular expression during the running time of the |
751 For example, the parentheses in $(a+b) \cdot c + bc$ can be opened up to |
755 algorithm if we can establish the connection between our simplification |
752 get $a\cdot c +b \cdot c |
756 rules and partial derivatives. |
753 + b \cdot c$, and then simplified to $a \cdot c+b \cdot c$. Another example is from |
|
754 $(a^*+a) + (a^*+ 1) + (a +1)$ to $a^*+a+1$. |
|
755 Adding these more aggressive simplification rules |
|
756 helped us to achieve the same size bound as that of the partial derivatives. |
|
757 To introduce these "spilling out alternatives" simplifications |
|
758 and make the correctness proof easier, |
|
759 we used bitcodes. |
|
760 |
|
761 %This allows us to prove a tight |
|
762 %bound on the size of regular expression during the running time of the |
|
763 %algorithm if we can establish the connection between our simplification |
|
764 %rules and partial derivatives. |
757 |
765 |
758 %We believe, and have generated test |
766 %We believe, and have generated test |
759 %data, that a similar bound can be obtained for the derivatives in |
767 %data, that a similar bound can be obtained for the derivatives in |
760 %Sulzmann and Lu's algorithm. Let us give some details about this next. |
768 %Sulzmann and Lu's algorithm. Let us give some details about this next. |
761 |
769 |
762 Bit-codes look like this: |
770 Bitcodes look like this: |
763 \begin{center} |
771 \begin{center} |
764 $b ::= S \mid Z \; \;\; |
772 $b ::= S \mid Z \; \;\; |
765 bs ::= [] \mid b:bs |
773 bs ::= [] \mid b:bs |
766 $ |
774 $ |
767 \end{center} |
775 \end{center} |
768 They are just a string of bits, the names $S$ and $Z$ here are quite arbitrary, we can use 0 and 1 or any other set of binary symbols to substitute them. Bit-codes(or bit-sequences) are a compact form of parse trees. |
776 They are just a string of bits, |
769 Here is how values and bit-codes are related: |
777 the names $S$ and $Z$ here are quite arbitrary, we can use 0 and 1 |
|
778 or any other set of binary symbols to substitute them. |
|
779 Bitcodes(or bit-sequences) are a compact form of parse trees. |
770 Bitcodes are essentially incomplete values. |
780 Bitcodes are essentially incomplete values. |
771 This can be straightforwardly seen in the following transformation: |
781 This can be straightforwardly seen in the following transformation: |
772 \begin{center} |
782 \begin{center} |
773 \begin{tabular}{lcl} |
783 \begin{tabular}{lcl} |
774 $\textit{code}(\Empty)$ & $\dn$ & $[]$\\ |
784 $\textit{code}(\Empty)$ & $\dn$ & $[]$\\ |
823 \textit{else}\;\textit{None}$ |
833 \textit{else}\;\textit{None}$ |
824 \end{tabular} |
834 \end{tabular} |
825 \end{center} |
835 \end{center} |
826 %\end{definition} |
836 %\end{definition} |
827 |
837 |
828 |
838 Sulzmann and Lu's integrated the bitcodes into regular |
829 Sulzmann and Lu's integrated the bitcodes into annotated regular |
839 expressions to create annotated regular expressions. |
830 expressions by attaching them to the head of every substructure of a |
840 It is by attaching them to the head of every substructure of a |
831 regular expression\cite{Sulzmann2014}. They are defined by the following |
841 regular expression\cite{Sulzmann2014}. Annotated regular expressions |
|
842 are defined by the following |
832 grammar: |
843 grammar: |
833 |
844 |
834 \begin{center} |
845 \begin{center} |
835 \begin{tabular}{lcl} |
846 \begin{tabular}{lcl} |
836 $\textit{a}$ & $::=$ & $\textit{ZERO}$\\ |
847 $\textit{a}$ & $::=$ & $\textit{ZERO}$\\ |
837 & $\mid$ & $\textit{ONE}\;\;bs$\\ |
848 & $\mid$ & $\textit{ONE}\;\;bs$\\ |
838 & $\mid$ & $\textit{CHAR}\;\;bs\,c$\\ |
849 & $\mid$ & $\textit{CHAR}\;\;bs\,c$\\ |
839 & $\mid$ & $\textit{ALTS}\;\;bs\,as$\\ |
850 & $\mid$ & $\textit{ALT}\;\;bs\,as$\\ |
840 & $\mid$ & $\textit{SEQ}\;\;bs\,a_1\,a_2$\\ |
851 & $\mid$ & $\textit{SEQ}\;\;bs\,a_1\,a_2$\\ |
841 & $\mid$ & $\textit{STAR}\;\;bs\,a$ |
852 & $\mid$ & $\textit{STAR}\;\;bs\,a$ |
842 \end{tabular} |
853 \end{tabular} |
843 \end{center} |
854 \end{center} |
844 |
855 |
845 \noindent |
856 \noindent |
846 where $bs$ stands for bitsequences, and $as$ (in \textit{ALTS}) for a |
857 where $bs$ stands for bit-sequences, and $as$ (in \textit{ALTS}) for a |
847 list of annotated regular expressions. These bitsequences encode |
858 list of annotated regular expressions. These bit-sequences encode |
848 information about the (POSIX) value that should be generated by the |
859 information about the (POSIX) value that should be generated by the |
849 Sulzmann and Lu algorithm. |
860 Sulzmann and Lu algorithm. |
850 |
861 |
851 To do lexing using annotated regular expressions, we shall first |
862 To do lexing using annotated regular expressions, we shall first |
852 transform the usual (un-annotated) regular expressions into annotated |
863 transform the usual (un-annotated) regular expressions into annotated |
1072 have to work out all proof details.'' |
1083 have to work out all proof details.'' |
1073 \end{quote} |
1084 \end{quote} |
1074 |
1085 |
1075 \noindent |
1086 \noindent |
1076 We would settle the correctness claim. |
1087 We would settle the correctness claim. |
1077 It is relatively straightforward to establish that after one simplification step, the part of derivative that corresponds to a POSIX value remains intact and can still be collected, in other words, |
1088 It is relatively straightforward to establish that after one simplification step, |
1078 \comment{that only holds when r is nullable}bmkeps r = bmkeps simp r |
1089 the part of a nullable derivative that corresponds to a POSIX value |
1079 as this basically comes down to proving actions like removing the additional $r$ in $r+r$ does not delete important POSIX information in a regular expression. |
1090 remains intact and can still be collected, |
|
1091 in other words, |
|
1092 \begin{center} |
|
1093 $\textit{bmkeps} \; r = \textit{bmkeps} \; \textit{simp} \; r\;( r\; \textit{nullable})$ |
|
1094 \end{center} |
|
1095 as this basically comes down to proving actions like removing |
|
1096 the additional $r$ in $r+r$ does not delete important POSIX information |
|
1097 in a regular expression. |
1080 The hardcore of this problem is to prove that |
1098 The hardcore of this problem is to prove that |
1081 bmkeps bders r = bmkeps bders simp r |
1099 \begin{center} |
1082 That is, if we do derivative on regular expression r and the simplified version for, they can still prove the same POSIX value if there is one . This is not as straightforward as the previous proposition, as the two regular expression r and simp r might become very different regular expressions after repeated application ofd simp and derivative. |
1100 $\textit{bmkeps} \; \textit{blexer}\_{simp} \; r = \textit{bmkeps} \; \textit{blexer} \; \textit{simp} \; r$ |
1083 The crucial point is to find the \comment{What?}"gene" of a regular expression and how it is kept intact during simplification. |
1101 \end{center} |
1084 To aid this, we are use the helping function retrieve described by Sulzmann and Lu: |
1102 That is, if we do derivative on regular expression r and the simplified version, |
|
1103 they can still provide the same POSIX value if there is one . |
|
1104 This is not as straightforward as the previous proposition, as the two regular expressions $r$ and $\textit{simp}\; r$ |
|
1105 might become very different regular expressions after repeated application of $\textit{simp}$ and derivative. |
|
1106 The crucial point is to find the indispensable information of |
|
1107 a regular expression and how it is kept intact during simplification so that it performs |
|
1108 as good as a regular expression that has not been simplified in the subsequent derivative operations. |
|
1109 To aid this, we use the helping function retrieve described by Sulzmann and Lu: |
1085 \\definition of retrieve\\ |
1110 \\definition of retrieve\\ |
1086 This function assembled the bitcode that corresponds to a parse tree for how the current derivative matches the suffix of the string(the characters that have not yet appeared, but is stored in the value). |
1111 This function assembles the bitcode that corresponds to a parse tree for how the current |
|
1112 derivative matches the suffix of the string(the characters that have not yet appeared, but is stored in the value). |
1087 Sulzmann and Lu used this to connect the bit-coded algorithm to the older algorithm by the following equation:\\ |
1113 Sulzmann and Lu used this to connect the bit-coded algorithm to the older algorithm by the following equation:\\ |
1088 $inj \;a\; c \; v = \textit{decode} \; (\textit{retrieve}\; ((\textit{internalise}\; r)\backslash_{simp} c) v)$\\ |
1114 $inj \;a\; c \; v = \textit{decode} \; (\textit{retrieve}\; ((\textit{internalise}\; r)\backslash_{simp} c) v)$\\ |
1089 A little fact that needs to be stated to help comprehension:\\ |
1115 A little fact that needs to be stated to help comprehension:\\ |
1090 $r^\uparrow = a$($a$ stands for $annotated$).\\ |
1116 $r^\uparrow = a$($a$ stands for $annotated$).\\ |
1091 Fahad and Christian also used this fact to prove the correctness of bit-coded algorithm without simplification. |
1117 Fahad and Christian also used this fact to prove the correctness of bit-coded algorithm without simplification. |
1092 Our purpose of using this, however, is try to establish \\ |
1118 Our purpose of using this, however, is try to establish \\ |
1093 $ \textit{retrieve} \; a \; v \;=\; \textit{retrieve} \; \textit{simp}(a) \; v'.$\\ |
1119 $ \textit{retrieve} \; a \; v \;=\; \textit{retrieve} \; \textit{simp}(a) \; v'.$\\ |
1094 The idea is that using $v'$, |
1120 The idea is that using $v'$, |
1095 a simplified version of $v$ that possibly had gone through the same simplification step as $\textit{simp}(a)$ we are still able to extract the bitsequence that gives the same parsing information as the unsimplified one. |
1121 a simplified version of $v$ that possibly had gone through the same simplification step as $\textit{simp}(a)$ we are still able to extract the bit-sequence that gives the same parsing information as the unsimplified one. |
1096 After establishing this, we might be able to finally bridge the gap of proving\\ |
1122 After establishing this, we might be able to finally bridge the gap of proving\\ |
1097 $\textit{retrieve} \; r \backslash s \; v = \;\textit{retrieve} \; \textit{simp}(r) \backslash s \; v'$\\ |
1123 $\textit{retrieve} \; r \backslash s \; v = \;\textit{retrieve} \; \textit{simp}(r) \backslash s \; v'$\\ |
1098 and subsequently\\ |
1124 and subsequently\\ |
1099 $\textit{retrieve} \; r \backslash s \; v\; = \; \textit{retrieve} \; r \backslash_{simp} s \; v'$.\\ |
1125 $\textit{retrieve} \; r \backslash s \; v\; = \; \textit{retrieve} \; r \backslash_{simp} s \; v'$.\\ |
1100 This proves that our simplified version of regular expression still contains all the bitcodes needed. |
1126 This proves that our simplified version of regular expression still contains all the bitcodes needed. |