744 \cite{Antimirov95}. He defined the \emph{partial derivatives} of regular |
744 \cite{Antimirov95}. He defined the \emph{partial derivatives} of regular |
745 expressions as follows: |
745 expressions as follows: |
746 |
746 |
747 \begin{center} |
747 \begin{center} |
748 \begin{tabular}{lcl} |
748 \begin{tabular}{lcl} |
749 $\textit{pder} \; c \; 0$ & $\dn$ & $\emptyset$\\ |
749 $\textit{pder} \; c \; \ZERO$ & $\dn$ & $\emptyset$\\ |
750 $\textit{pder} \; c \; 1$ & $\dn$ & $\emptyset$ \\ |
750 $\textit{pder} \; c \; \ONE$ & $\dn$ & $\emptyset$ \\ |
751 $\textit{pder} \; c \; d$ & $\dn$ & $\textit{if} \; c \,=\, d \; \{ 1 \} \; \textit{else} \; \emptyset$ \\ |
751 $\textit{pder} \; c \; d$ & $\dn$ & $\textit{if} \; c \,=\, d \; \{ \ONE \} \; \textit{else} \; \emptyset$ \\ |
752 $\textit{pder} \; c \; r_1+r_2$ & $\dn$ & $pder \; c \; r_1 \cup pder \; c \; r_2$ \\ |
752 $\textit{pder} \; c \; r_1+r_2$ & $\dn$ & $pder \; c \; r_1 \cup pder \; c \; r_2$ \\ |
753 $\textit{pder} \; c \; r_1 \cdot r_2$ & $\dn$ & $\textit{if} \; nullable \; r_1 $\\ |
753 $\textit{pder} \; c \; r_1 \cdot r_2$ & $\dn$ & $\textit{if} \; nullable \; r_1 $\\ |
754 & & $\textit{then} \; \{ r \cdot r_2 \mid r \in pder \; c \; r_1 \} \cup pder \; c \; r_2 \;$\\ |
754 & & $\textit{then} \; \{ r \cdot r_2 \mid r \in pder \; c \; r_1 \} \cup pder \; c \; r_2 \;$\\ |
755 & & $\textit{else} \; \{ r \cdot r_2 \mid r \in pder \; c \; r_1 \} $ \\ |
755 & & $\textit{else} \; \{ r \cdot r_2 \mid r \in pder \; c \; r_1 \} $ \\ |
756 $\textit{pder} \; c \; r^*$ & $\dn$ & $ \{ r' \cdot r^* \mid r' \in pder \; c \; r \} $ \\ |
756 $\textit{pder} \; c \; r^*$ & $\dn$ & $ \{ r' \cdot r^* \mid r' \in pder \; c \; r \} $ \\ |
1037 |
1038 |
1038 \begin{center} |
1039 \begin{center} |
1039 \begin{tabular}{@{}lcl@{}} |
1040 \begin{tabular}{@{}lcl@{}} |
1040 |
1041 |
1041 $\textit{simp} \; (\textit{SEQ}\;bs\,a_1\,a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp} \; a_2) \; \textit{match} $ \\ |
1042 $\textit{simp} \; (\textit{SEQ}\;bs\,a_1\,a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp} \; a_2) \; \textit{match} $ \\ |
1042 &&$\quad\textit{case} \; (0, \_) \Rightarrow 0$ \\ |
1043 &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow \ZERO$ \\ |
1043 &&$\quad\textit{case} \; (\_, 0) \Rightarrow 0$ \\ |
1044 &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow \ZERO$ \\ |
1044 &&$\quad\textit{case} \; (1, a_2') \Rightarrow \textit{fuse} \; bs \; a_2'$ \\ |
1045 &&$\quad\textit{case} \; (\ONE, a_2') \Rightarrow \textit{fuse} \; bs \; a_2'$ \\ |
1045 &&$\quad\textit{case} \; (a_1', 1) \Rightarrow \textit{fuse} \; bs \; a_1'$ \\ |
1046 &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow \textit{fuse} \; bs \; a_1'$ \\ |
1046 &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow \textit{SEQ} \; bs \; a_1' \; a_2'$ \\ |
1047 &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow \textit{SEQ} \; bs \; a_1' \; a_2'$ \\ |
1047 |
1048 |
1048 $\textit{simp} \; (\textit{ALTS}\;bs\,as)$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{map simp as})) \; \textit{match} $ \\ |
1049 $\textit{simp} \; (\textit{ALTS}\;bs\,as)$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{map simp as})) \; \textit{match} $ \\ |
1049 &&$\quad\textit{case} \; [] \Rightarrow 0$ \\ |
1050 &&$\quad\textit{case} \; [] \Rightarrow \ZERO$ \\ |
1050 &&$\quad\textit{case} \; a :: [] \Rightarrow \textit{fuse bs a}$ \\ |
1051 &&$\quad\textit{case} \; a :: [] \Rightarrow \textit{fuse bs a}$ \\ |
1051 &&$\quad\textit{case} \; as' \Rightarrow \textit{ALT}\;bs\;as'$\\ |
1052 &&$\quad\textit{case} \; as' \Rightarrow \textit{ALTS}\;bs\;as'$\\ |
1052 |
1053 |
1053 $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$ |
1054 $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$ |
1054 \end{tabular} |
1055 \end{tabular} |
1055 \end{center} |
1056 \end{center} |
1056 |
1057 |
1062 $\ONE$, which might trigger further simplification at the current level. |
1063 $\ONE$, which might trigger further simplification at the current level. |
1063 The most involved part is the $\textit{ALTS}$ clause, where we use two |
1064 The most involved part is the $\textit{ALTS}$ clause, where we use two |
1064 auxiliary functions flatten and distinct to open up nested |
1065 auxiliary functions flatten and distinct to open up nested |
1065 $\textit{ALTS}$ and reduce as many duplicates as possible. Function |
1066 $\textit{ALTS}$ and reduce as many duplicates as possible. Function |
1066 distinct keeps the first occurring copy only and remove all later ones |
1067 distinct keeps the first occurring copy only and remove all later ones |
1067 when detected duplicates. Function flatten opens up nested \textit{ALT}. |
1068 when detected duplicates. Function flatten opens up nested \textit{ALTS}. |
1068 Its recursive definition is given below: |
1069 Its recursive definition is given below: |
1069 |
1070 |
1070 \begin{center} |
1071 \begin{center} |
1071 \begin{tabular}{@{}lcl@{}} |
1072 \begin{tabular}{@{}lcl@{}} |
1072 $\textit{flatten} \; (\textit{ALT}\;bs\,as) :: as'$ & $\dn$ & $(\textit{map} \; |
1073 $\textit{flatten} \; (\textit{ALTS}\;bs\,as) :: as'$ & $\dn$ & $(\textit{map} \; |
1073 (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\ |
1074 (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\ |
1074 $\textit{flatten} \; \textit{ZERO} :: as'$ & $\dn$ & $ \textit{flatten} \; as' $ \\ |
1075 $\textit{flatten} \; \textit{ZERO} :: as'$ & $\dn$ & $ \textit{flatten} \; as' $ \\ |
1075 $\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; as'$ \quad(otherwise) |
1076 $\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; as'$ \quad(otherwise) |
1076 \end{tabular} |
1077 \end{tabular} |
1077 \end{center} |
1078 \end{center} |
1140 |
1141 |
1141 \noindent We would settle this correctness claim. It is relatively |
1142 \noindent We would settle this correctness claim. It is relatively |
1142 straightforward to establish that after one simplification step, the part of a |
1143 straightforward to establish that after one simplification step, the part of a |
1143 nullable derivative that corresponds to a POSIX value remains intact and can |
1144 nullable derivative that corresponds to a POSIX value remains intact and can |
1144 still be collected, in other words, we can show that\comment{Double-check....I |
1145 still be collected, in other words, we can show that\comment{Double-check....I |
1145 think this is not the case} |
1146 think this is not the case}\comment{If i remember correctly, you have proved this lemma. |
1146 |
1147 I feel this is indeed not true because you might place arbitrary |
1147 \begin{center} |
1148 bits on the regex r, however if this is the case, did i remember wrongly that |
1148 $\textit{bmkeps} \; r = \textit{bmkeps} \; \textit{simp} \; r\;( r\; \textit{nullable})$ |
1149 you proved something like simplification does not affect $\textit{bmkeps}$ results? |
|
1150 Anyway, i have amended this a little bit so it does not allow arbitrary bits attached |
|
1151 to a regex. Maybe it works now.} |
|
1152 |
|
1153 \begin{center} |
|
1154 $\textit{bmkeps} \; a = \textit{bmkeps} \; \textit{bsimp} \; a\;( a\; \textit{bnullable} and \textit{decode}(r, \textit{bmkeps}(a)) is a \textit{POSIX} value)$ |
1149 \end{center} |
1155 \end{center} |
1150 |
1156 |
1151 \noindent |
1157 \noindent |
1152 as this basically comes down to proving actions like removing the |
1158 as this basically comes down to proving actions like removing the |
1153 additional $r$ in $r+r$ does not delete important POSIX information in |
1159 additional $r$ in $r+r$ does not delete important POSIX information in |
1157 $\textit{bmkeps} \; \textit{blexer}\_{simp}(s, \; r) = \textit{bmkeps} \; \textit{blexer} \; \textit{simp}(s, \; r)$ |
1163 $\textit{bmkeps} \; \textit{blexer}\_{simp}(s, \; r) = \textit{bmkeps} \; \textit{blexer} \; \textit{simp}(s, \; r)$ |
1158 \end{center} |
1164 \end{center} |
1159 |
1165 |
1160 \noindent That is, if we do derivative on regular expression $r$ and then |
1166 \noindent That is, if we do derivative on regular expression $r$ and then |
1161 simplify it, and repeat this process until we exhaust the string, we get a |
1167 simplify it, and repeat this process until we exhaust the string, we get a |
1162 regular expression $r''$ that provides the POSIX matching as the result $r'$ of |
1168 regular expression $r''$($\textit{LHS}$) that provides the POSIX matching |
1163 the normal derivative algorithm which only does derivative operation repeatedly |
1169 information, which is exactly the same as the result $r'$($\textit{RHS}$ of the |
1164 and has no simplification at all. This might seem at first glance very |
1170 normal derivative algorithm that only does derivative repeatedly and has no |
1165 unintuitive, as the $r'$ is exponentially larger than $r''$. But this can be |
1171 simplification at all. This might seem at first glance very unintuitive, as |
1166 explained in the following way: we are pruning away the possible matches that |
1172 the $r'$ is exponentially larger than $r''$, but can be explained in the |
1167 are not POSIX. Since there are exponentially non-POSIX matchings and only 1 |
1173 following way: we are pruning away the possible matches that are not POSIX. |
1168 POSIX matching, it is understandable that our $r''$ can be a lot smaller. we |
1174 Since there are exponentially non-POSIX matchings and only 1 POSIX matching, it |
1169 can still provide the same POSIX value if there is one. This is not as |
1175 is understandable that our $r''$ can be a lot smaller. we can still provide |
1170 straightforward as the previous proposition, as the two regular expressions $r$ |
1176 the same POSIX value if there is one. This is not as straightforward as the |
1171 and $\textit{simp}\; r$ might become very different regular expressions after |
1177 previous proposition, as the two regular expressions $r'$ and $r''$ might have |
1172 repeated application of $\textit{simp}$ and derivative. The crucial point is |
1178 become very different regular expressions. The crucial point is to find the |
1173 to find the indispensable information of a regular expression and how it is |
1179 $\textit{POSIX}$ information of a regular expression and how it is modified, |
1174 kept intact during simplification so that it performs as good as a regular |
1180 augmented and propagated |
1175 expression that has not been simplified in the subsequent derivative |
1181 during simplification in parallel with the regularr expression that |
1176 operations. To aid this, we use the helping function retrieve described by |
1182 has not been simplified in the subsequent derivative operations. To aid this, |
1177 Sulzmann and Lu: \\definition of retrieve\\ |
1183 we use the helping function retrieve described by Sulzmann and Lu: \\definition |
1178 |
1184 of retrieve TODO\\ |
1179 This function assembled the bitcode that corresponds to a parse tree for |
1185 This function assembles the bitcode that corresponds to a parse tree for how |
1180 how the current derivative matches the suffix of the string(the |
1186 the current derivative matches the suffix of the string(the characters that |
1181 characters that have not yet appeared, but is stored in the value). |
1187 have not yet appeared, but will appear as the successive derivatives go on, |
1182 Sulzmann and Lu used this to connect the bitcoded algorithm to the older |
1188 how do we get this "future" information? By the value $v$, which is |
1183 algorithm by the following equation: |
1189 computed by a pass of the algorithm that uses |
|
1190 $inj$ as described in the previous section). Sulzmann and Lu used this |
|
1191 to connect the bitcoded algorithm to the older algorithm by the following |
|
1192 equation: |
1184 |
1193 |
1185 \begin{center} $inj \;a\; c \; v = \textit{decode} \; (\textit{retrieve}\; |
1194 \begin{center} $inj \;a\; c \; v = \textit{decode} \; (\textit{retrieve}\; |
1186 ((\textit{internalise}\; r)\backslash_{simp} c) v)$ \end{center} A |
1195 ((\textit{internalise}\; r)\backslash_{simp} c) v)$ |
1187 little fact that needs to be stated to help comprehension: |
1196 \end{center} |
1188 \begin{center} $r^\uparrow = a$($a$ stands for $\textit{annotated}).$ |
1197 A little fact that needs to be stated to help comprehension: |
1189 \end{center} Ausaf and Urban also used this fact to prove the |
1198 \begin{center} |
1190 correctness of bitcoded algorithm without simplification. Our purpose |
1199 $r^\uparrow = a$($a$ stands for $\textit{annotated}).$ |
1191 of using this, however, is try to establish \\ $ \textit{retrieve} \; |
1200 \end{center} |
1192 a \; v \;=\; \textit{retrieve} \; \textit{simp}(a) \; v'.$\\ The idea |
1201 Ausaf and Urban also used this fact to prove the |
1193 is that using $v'$, a simplified version of $v$ that possibly had gone |
1202 correctness of bitcoded algorithm without simplification. Our purpose |
1194 through the same simplification step as $\textit{simp}(a)$ we are |
1203 of using this, however, is to establish |
1195 still able to extract the bit-sequence that gives the same parsing |
1204 \begin{center} |
1196 information as the unsimplified one. After establishing this, we |
1205 $ \textit{retrieve} \; |
1197 might be able to finally bridge the gap of proving\\ |
1206 a \; v \;=\; \textit{retrieve} \; \textit{simp}(a) \; v'.$ |
1198 $\textit{retrieve} \; r \backslash s \; v = \;\textit{retrieve} \; |
1207 \end{center} |
1199 \textit{simp}(r) \backslash s \; v'$\\ and subsequently\\ |
1208 The idea |
1200 $\textit{retrieve} \; r \backslash s \; v\; = \; \textit{retrieve} \; |
1209 is that using $v'$, a simplified version of $v$ that possibly had gone |
1201 r \backslash_{simp} s \; v'$.\\ This proves that our simplified |
1210 through the same simplification step as $\textit{simp}(a)$ we are |
1202 version of regular expression still contains all the bitcodes needed. |
1211 able to extract the bit-sequence that gives the same parsing |
|
1212 information as the unsimplified one. After establishing this, we |
|
1213 might be able to finally bridge the gap of proving |
|
1214 \begin{center} |
|
1215 $\textit{retrieve} \; r \backslash s \; v = \;\textit{retrieve} \; |
|
1216 \textit{simp}(r) \backslash s \; v'$ |
|
1217 \end{center} |
|
1218 and subsequently |
|
1219 \begin{center} |
|
1220 $\textit{retrieve} \; r \backslash s \; v\; = \; \textit{retrieve} \; |
|
1221 r \backslash_{simp} s \; v'$. |
|
1222 \end{center} |
|
1223 This proves that our simplified |
|
1224 version of regular expression still contains all the bitcodes needed. |
1203 |
1225 |
1204 |
1226 |
1205 The second task is to speed up the more aggressive simplification. |
1227 The second task is to speed up the more aggressive simplification. |
1206 Currently it is slower than a naive simplification(the naive version as |
1228 Currently it is slower than a naive simplification(the naive version as |
1207 implemented in ADU of course can explode in some cases). So it needs to |
1229 implemented in ADU of course can explode in some cases). So it needs to |