170 \begin{quote}\it |
171 \begin{quote}\it |
171 ``Correctness Claim: We further claim that the incremental parsing |
172 ``Correctness Claim: We further claim that the incremental parsing |
172 method [..] in combination with the simplification steps [..] |
173 method [..] in combination with the simplification steps [..] |
173 yields POSIX parse trees. We have tested this claim |
174 yields POSIX parse trees. We have tested this claim |
174 extensively [..] but yet |
175 extensively [..] but yet |
175 have to work out all proof details.'' |
176 have to work out all proof details.'' \cite[Page 14]{Sulzmann2014} |
176 \end{quote} |
177 \end{quote} |
177 |
178 |
178 \noindent{}\textbf{Contributions:} We have implemented in Isabelle/HOL |
179 \noindent{}\textbf{Contributions:} We have implemented in Isabelle/HOL |
179 the derivative-based lexing algorithm of Sulzmann and Lu |
180 the derivative-based lexing algorithm of Sulzmann and Lu |
180 \cite{Sulzmann2014} where regular expressions are annotated with |
181 \cite{Sulzmann2014} where regular expressions are annotated with |
181 bitsequences. We define the crucial simplification function as a |
182 bitsequences. We define the crucial simplification function as a |
182 recursive function, instead of a fix-point operation. One objective of |
183 recursive function, without the need of a fix-point operation. One objective of |
183 the simplification function is to remove duplicates of regular |
184 the simplification function is to remove duplicates of regular |
184 expressions. For this Sulzmann and Lu use in their paper the standard |
185 expressions. For this Sulzmann and Lu use in their paper the standard |
185 @{text nub} function from Haskell's list library, but this function |
186 @{text nub} function from Haskell's list library, but this function |
186 does not achieve the intended objective with bitcoded regular expressions. The |
187 does not achieve the intended objective with bitcoded regular expressions. The |
187 reason is that in the bitcoded setting, each copy generally has a |
188 reason is that in the bitcoded setting, each copy generally has a |
188 different bitcode annotation---so @{text nub} would never ``fire''. |
189 different bitcode annotation---so @{text nub} would never ``fire''. |
189 Inspired by Scala's library for lists, we shall instead use a @{text |
190 Inspired by Scala's library for lists, we shall instead use a @{text |
190 distinctBy} function that finds duplicates under an erasing function |
191 distinctBy} function that finds duplicates under an erasing function |
191 that deletes bitcodes. |
192 which deletes bitcodes. |
192 We shall also introduce our own argument and definitions for |
193 We shall also introduce our own argument and definitions for |
193 establishing the correctness of the bitcoded algorithm when |
194 establishing the correctness of the bitcoded algorithm when |
194 simplifications are included.\medskip |
195 simplifications are included.\medskip |
195 |
196 |
196 \noindent In this paper, we shall first briefly introduce the basic notions |
197 \noindent In this paper, we shall first briefly introduce the basic notions |
967 \underbrace{\mbox{$(\ONE + \ONE{}a) \cdot (a + aa)^*$}}_{r} \;+\; |
968 \underbrace{\mbox{$(\ONE + \ONE{}a) \cdot (a + aa)^*$}}_{r} \;+\; |
968 ((a + aa)^* + \underbrace{\mbox{$(\ONE + \ONE{}a) \cdot (a + aa)^*$}}_{r}) |
969 ((a + aa)^* + \underbrace{\mbox{$(\ONE + \ONE{}a) \cdot (a + aa)^*$}}_{r}) |
969 \end{equation} |
970 \end{equation} |
970 |
971 |
971 \noindent This is a simpler derivative, but unfortunately we |
972 \noindent This is a simpler derivative, but unfortunately we |
972 cannot make further simplifications. This is a problem because |
973 cannot make any further simplifications. This is a problem because |
973 the outermost alternatives contains two copies of the same |
974 the outermost alternatives contains two copies of the same |
974 regular expression (underlined with $r$). The copies will |
975 regular expression (underlined with $r$). These copies will |
975 spawn new copies in later steps and they in turn further copies. This |
976 spawn new copies in later derivative steps and they in turn even more copies. This |
976 destroys an hope of taming the size of the derivatives. But the |
977 destroys any hope of taming the size of the derivatives. But the |
977 second copy of $r$ in \eqref{derivex} will never contribute to a |
978 second copy of $r$ in \eqref{derivex} will never contribute to a |
978 value, because POSIX lexing will always prefer matching a string |
979 value, because POSIX lexing will always prefer matching a string |
979 with the first copy. So in principle it could be removed. |
980 with the first copy. So it could be safely removed without affecting the correctness of the algorithm. |
980 The dilemma with the simple-minded |
981 The dilemma with the simple-minded |
981 simplification rules above is that the rule $r + r \Rightarrow r$ |
982 simplification rules above is that the rule $r + r \Rightarrow r$ |
982 will never be applicable because as can be seen in this example the |
983 will never be applicable because as can be seen in this example the |
983 regular expressions are separated by another sub-regular expression. |
984 regular expressions are not next to each other but separated by another regular expression. |
984 |
985 |
985 But here is where Sulzmann and Lu's representation of generalised |
986 But here is where Sulzmann and Lu's representation of generalised |
986 alternatives in the bitcoded algorithm shines: in @{term |
987 alternatives in the bitcoded algorithm shines: in @{term |
987 "ALTs bs rs"} we can define a more aggressive simplification by |
988 "ALTs bs rs"} we can define a more aggressive simplification by |
988 recursively simplifying all regular expressions in @{text rs} and |
989 recursively simplifying all regular expressions in @{text rs} and |
989 then analyse the resulting list and remove any duplicates. |
990 then analyse the resulting list and remove any duplicates. |
990 Another advantage is that the bitsequences in bitcoded regular |
991 Another advantage with the bitsequences in bitcoded regular |
991 expressions can be easily modified such that simplification does not |
992 expressions is that they can be easily modified such that simplification does not |
992 interfere with the value constructions. For example we can ``flatten'', or |
993 interfere with the value constructions. For example we can ``flatten'', or |
993 de-nest, @{text ALTs} as follows |
994 de-nest, @{text ALTs} as follows |
994 % |
995 % |
995 \[ |
996 \[ |
996 @{term "ALTs bs\<^sub>1 ((ALTs bs\<^sub>2 rs\<^sub>2) # rs\<^sub>1)"} |
997 @{term "ALTs bs\<^sub>1 ((ALTs bs\<^sub>2 rs\<^sub>2) # rs\<^sub>1)"} |
1035 @{thm (lhs) distinctBy.simps(2)} & $\dn$ & @{thm (rhs) distinctBy.simps(2)} |
1036 @{thm (lhs) distinctBy.simps(2)} & $\dn$ & @{thm (rhs) distinctBy.simps(2)} |
1036 \end{tabular} |
1037 \end{tabular} |
1037 \end{center} |
1038 \end{center} |
1038 |
1039 |
1039 \noindent where we scan the list from left to right (because we |
1040 \noindent where we scan the list from left to right (because we |
1040 have to remove later copies). In this function, @{text f} is a |
1041 have to remove later copies). In @{text distinctBy}, @{text f} is a |
1041 function and @{text acc} is an accumulator for regular |
1042 function and @{text acc} is an accumulator for regular |
1042 expressions---essentially a set of elements we have already seen |
1043 expressions---essentially a set of regular expression that we have already seen |
1043 while scanning the list. Therefore we delete an element, say @{text x}, |
1044 while scanning the list. Therefore we delete an element, say @{text x}, |
1044 from the list provided @{text "f x"} is already in the accumulator; |
1045 from the list provided @{text "f x"} is already in the accumulator; |
1045 otherwise we keep @{text x} and scan the rest of the list but now |
1046 otherwise we keep @{text x} and scan the rest of the list but |
1046 add @{text "f x"} as another element to @{text acc}. We will use |
1047 add @{text "f x"} as another ``seen'' element to @{text acc}. We will use |
1047 @{term distinctBy} where @{text f} is our erase functions, @{term "erase (DUMMY)"}, |
1048 @{term distinctBy} where @{text f} is the erase functions, @{term "erase (DUMMY)"}, |
1048 that deletes bitsequences from bitcoded regular expressions. |
1049 that deletes bitsequences from bitcoded regular expressions. |
1049 This is clearly a computationally more expensive operation, than @{text nub}, |
1050 This is clearly a computationally more expensive operation, than @{text nub}, |
1050 but is needed in order to make the removal of unnecessary copies |
1051 but is needed in order to make the removal of unnecessary copies |
1051 to work. |
1052 to work properly. |
1052 |
1053 |
1053 Our simplification function depends on three helper functions, one is called |
1054 Our simplification function depends on three helper functions, one is called |
1054 @{text flts} and defined as follows: |
1055 @{text flts} and analyses lists of regular expressions coming from alternatives. |
|
1056 It is defined as follows: |
1055 |
1057 |
1056 \begin{center} |
1058 \begin{center} |
1057 \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l} |
1059 \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l} |
1058 @{thm (lhs) flts.simps(1)} & $\dn$ & @{thm (rhs) flts.simps(1)}\\ |
1060 @{thm (lhs) flts.simps(1)} & $\dn$ & @{thm (rhs) flts.simps(1)}\\ |
1059 @{thm (lhs) flts.simps(2)} & $\dn$ & @{thm (rhs) flts.simps(2)}\\ |
1061 @{thm (lhs) flts.simps(2)} & $\dn$ & @{thm (rhs) flts.simps(2)}\\ |
1060 @{thm (lhs) flts.simps(3)[of "bs'" "rs'"]} & $\dn$ & @{thm (rhs) flts.simps(3)[of "bs'" "rs'"]}\\ |
1062 @{thm (lhs) flts.simps(3)[of "bs'" "rs'"]} & $\dn$ & @{thm (rhs) flts.simps(3)[of "bs'" "rs'"]}\\ |
1061 \end{tabular} |
1063 \end{tabular} |
1062 \end{center} |
1064 \end{center} |
1063 |
1065 |
1064 \noindent |
1066 \noindent |
1065 The second clause removes all instances of @{text ZERO} in alternatives and |
1067 The second clause of @{text flts} removes all instances of @{text ZERO} in alternatives and |
1066 the second ``spills'' out nested alternatives (but retaining the |
1068 the second ``spills'' out nested alternatives (but retaining the |
1067 bitsequence @{text "bs'"} accumulated in the inner alternative). There are |
1069 bitsequence @{text "bs'"} accumulated in the inner alternative). There are |
1068 some corner cases to be considered when the resulting list inside an alternative is |
1070 some corner cases to be considered when the resulting list inside an alternative is |
1069 empty or a singleton list. We take care of those cases in the |
1071 empty or a singleton list. We take care of those cases in the |
1070 @{text "bsimpALTs"} function; similarly we define a helper function that simplifies |
1072 @{text "bsimpALTs"} function; similarly we define a helper function that simplifies |
1101 \end{tabular} |
1103 \end{tabular} |
1102 \end{center} |
1104 \end{center} |
1103 |
1105 |
1104 \noindent |
1106 \noindent |
1105 As far as we can see, our recursive function @{term bsimp} simplifies regular |
1107 As far as we can see, our recursive function @{term bsimp} simplifies regular |
1106 expressions as intended by Sulzmann and Lu. There is no point to apply the |
1108 expressions as intended by Sulzmann and Lu. There is no point in applying the |
1107 @{text bsimp} |
1109 @{text bsimp} function repeatedly (like the simplification in their paper which needs to be |
1108 function repeatedly (like the simplification in their paper which is applied |
1110 applied until a fixpoint is reached) because we can show that @{term bsimp} is idempotent, |
1109 until a fixpoint is reached), because we can show that it is idempotent, that is |
1111 that is |
1110 |
1112 |
1111 \begin{proposition} |
1113 \begin{proposition} |
1112 ??? |
1114 @{term "bsimp (bsimp r) = bsimp r"} |
1113 \end{proposition} |
1115 \end{proposition} |
1114 |
1116 |
|
1117 \noindent |
|
1118 This can be proved by induction on @{text r} but requires a detailed analysis |
|
1119 that the de-nesting of alternatives always results in a flat list of regular |
|
1120 expressions. We omit the details since it does not concern the correctness proof. |
|
1121 |
|
1122 Next we can include simplification after each derivative step leading to the |
|
1123 following notion of bitcoded derivatives: |
|
1124 |
|
1125 \begin{center} |
|
1126 \begin{tabular}{cc} |
|
1127 \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l} |
|
1128 @{thm (lhs) bders_simp.simps(1)} & $\dn$ & @{thm (rhs) bders_simp.simps(1)} |
|
1129 \end{tabular} |
|
1130 & |
|
1131 \begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l} |
|
1132 @{thm (lhs) bders_simp.simps(2)} & $\dn$ & @{thm (rhs) bders_simp.simps(2)} |
|
1133 \end{tabular} |
|
1134 \end{tabular} |
|
1135 \end{center} |
|
1136 |
|
1137 \noindent |
|
1138 and use it in the improved lexing algorithm defined as |
|
1139 |
|
1140 \begin{center} |
|
1141 \begin{tabular}{lcl} |
|
1142 $\textit{blexer}^+\;r\,s$ & $\dn$ & |
|
1143 $\textit{let}\;r_{der} = (r^\uparrow)\backslash_{bsimp}\, s\;\textit{in}$\\ |
|
1144 & & $\;\;\;\;\textit{if}\; \textit{bnullable}(r_{der}) \;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,r_{der})\,r |
|
1145 \;\;\textit{else}\;\textit{None}$ |
|
1146 \end{tabular} |
|
1147 \end{center} |
|
1148 |
|
1149 \noindent The remaining task is to show that @{term blexer} and |
|
1150 @{term "blexer_simp"} generate the same answers. |
|
1151 |
|
1152 When we first |
|
1153 attempted this proof we encountered a problem with the idea |
|
1154 in Sulzmann and Lu's paper where the argument seems to be to appeal |
|
1155 again to the @{text retrieve}-function defined for the unsimplified version |
|
1156 of the algorithm. But |
|
1157 this does not work, because desirable properties such as |
|
1158 % |
|
1159 \[ |
|
1160 @{text "retrieve r v = retrieve (bsimp r) v"} |
|
1161 \] |
|
1162 |
|
1163 \noindent do not hold under simplification---this property |
|
1164 essentially purports that we can retrieve the same value from a |
|
1165 simplified version of the regular expression. To start with @{text retrieve} |
|
1166 depends on the fact that the value @{text v} correspond to the |
|
1167 structure of the regular expressions---but the whole point of simplification |
|
1168 is to ``destroy'' this structure by making the regular expression simpler. |
|
1169 To see this consider the regular expression @{text "r = r' + 0"} and a corresponding |
|
1170 value @{text "v = Left v'"}. If we annotate bitcodes to @{text "r"}, then |
|
1171 we can use @{text retrieve} and @{text v} in order to extract a corresponding |
|
1172 bitsequence. The reason that this works is that @{text r} is an alternative |
|
1173 regular expression and @{text v} a corresponding value. However, if we simplify |
|
1174 @{text r}, then @{text v} does not correspond to the shape of the regular |
|
1175 expression anymore. So unless one can somehow |
|
1176 synchronise the change in the simplified regular expressions with |
|
1177 the original POSIX value, there is no hope of appealing to @{text retrieve} in the |
|
1178 correctness argument for @{term blexer_simp}. |
|
1179 |
|
1180 We found it more helpful to introduce the rewriting systems shown in |
|
1181 Figure~\ref{SimpRewrites}. The idea is to generate |
|
1182 simplified regular expressions in small steps (unlike the @{text bsimp}-function which |
|
1183 does the same in a big step), and show that each of |
|
1184 the small steps preserves the bitcodes that lead to the final POSIX value. |
|
1185 The rewrite system is organised such that $\leadsto$ is for bitcode regular |
|
1186 expressions and $\stackrel{s}{\leadsto}$ for lists of bitcoded regular |
|
1187 expressions. The former essentially implements the simplifications of |
|
1188 @{text "bsimpSEQ"} and @{text flts}; while the latter implements the |
|
1189 simplifications in @{text "bsimpALTs"}. We can show that any bitcoded |
|
1190 regular expression reduces in one or more steps to the simplified |
|
1191 regular expression generated by @{text bsimp}: |
|
1192 |
|
1193 \begin{lemma}\label{lemone} |
|
1194 @{thm[mode=IfThen] rewrites_to_bsimp} |
|
1195 \end{lemma} |
|
1196 |
|
1197 \begin{proof} |
|
1198 By induction on @{text r}. For this we can use the properties |
|
1199 @{thm fltsfrewrites} and @{thm ss6_stronger}. The latter uses |
|
1200 repeated applications of the $LD$ rule which allows the removal |
|
1201 of duplicates that can recognise the same strings. |
|
1202 \end{proof} |
|
1203 |
|
1204 \noindent |
|
1205 We can show that this rewrite system preserves @{term bnullable}, that |
|
1206 is simplification, essentially, does not affect nullability: |
1115 |
1207 |
1116 \begin{lemma} |
1208 \begin{lemma} |
1117 @{thm[mode=IfThen] bnullable0(1)[of "r\<^sub>1" "r\<^sub>2"]} |
1209 @{thm[mode=IfThen] bnullable0(1)[of "r\<^sub>1" "r\<^sub>2"]} |
1118 \end{lemma} |
1210 \end{lemma} |
1119 |
1211 |
1120 |
1212 \begin{proof} |
1121 \begin{lemma} |
1213 Straightforward mutual induction on the definition of $\leadsto$ and $\stackrel{s}{\leadsto}$. |
|
1214 The only interesting case is the rule $LD$ where the property holds since by the side-conditions of that rule the empty string will |
|
1215 be in both @{text "L (rs\<^sub>a @ [r\<^sub>1] @ rs\<^sub>b @ [r\<^sub>2] @ rs\<^sub>c)"} and |
|
1216 @{text "L (rs\<^sub>a @ [r\<^sub>1] @ rs\<^sub>b @ rs\<^sub>c)"}. |
|
1217 \end{proof} |
|
1218 |
|
1219 \noindent |
|
1220 From this, we can show that @{text bmkeps} will produce the same bitsequence |
|
1221 as long as one of the bitcoded regular expressions in $\leadsto$ is nullable (this lemma |
|
1222 establishes the missing fact we were not able to establish using @{text retrieve}, as suggested |
|
1223 in the paper by Sulzmannn and Lu). |
|
1224 |
|
1225 |
|
1226 \begin{lemma}\label{lemthree} |
1122 @{thm[mode=IfThen] rewrite_bmkeps_aux(1)[of "r\<^sub>1" "r\<^sub>2"]} |
1227 @{thm[mode=IfThen] rewrite_bmkeps_aux(1)[of "r\<^sub>1" "r\<^sub>2"]} |
1123 \end{lemma} |
1228 \end{lemma} |
1124 |
1229 |
1125 \begin{lemma} |
1230 \begin{proof} |
1126 @{thm[mode=IfThen] rewrites_to_bsimp} |
1231 By straightforward mutual induction on the definition of $\leadsto$ and $\stackrel{s}{\leadsto}$. |
1127 \end{lemma} |
1232 Again the only interesting case is the rule $LD$ where we need to ensure that |
|
1233 \[ |
|
1234 @{text "bmkeps (rs\<^sub>a @ [r\<^sub>1] @ rs\<^sub>b @ [r\<^sub>2] @ rs\<^sub>c) = |
|
1235 bmkeps (rs\<^sub>a @ [r\<^sub>1] @ rs\<^sub>b @ rs\<^sub>c)"} |
|
1236 \] |
|
1237 |
|
1238 \noindent holds. This is indeed the case because according to the POSIX rules the |
|
1239 generated bitsequence is determined by the first alternative that can match the |
|
1240 string (in this case being nullable). |
|
1241 \end{proof} |
|
1242 |
|
1243 \noindent |
|
1244 Crucial is also the fact that derivative steps and simplification steps can be interleaved, |
|
1245 which is shown by the fact that $\leadsto$ is preserved under derivatives. |
1128 |
1246 |
1129 \begin{lemma} |
1247 \begin{lemma} |
1130 @{thm[mode=IfThen] rewrite_preserves_bder(1)[of "r\<^sub>1" "r\<^sub>2"]} |
1248 @{thm[mode=IfThen] rewrite_preserves_bder(1)[of "r\<^sub>1" "r\<^sub>2"]} |
1131 \end{lemma} |
1249 \end{lemma} |
1132 |
1250 |
1133 \begin{lemma} |
1251 \begin{proof} |
|
1252 By straightforward mutual induction on the definition of $\leadsto$ and $\stackrel{s}{\leadsto}$. |
|
1253 The case for $LD$ holds because @{term "L (erase (bder c r\<^sub>2)) \<subseteq> L (erase (bder c r\<^sub>1))"} |
|
1254 if and only if @{term "L (erase (r\<^sub>2)) \<subseteq> L (erase (r\<^sub>1))"}. |
|
1255 \end{proof} |
|
1256 |
|
1257 |
|
1258 \noindent |
|
1259 Using this fact together with Lemma~\ref{lemone} allows us to prove the central lemma |
|
1260 that the unsimplified |
|
1261 derivative (with a string @{term s}) reduces to the simplified derivative (with the same string). |
|
1262 |
|
1263 \begin{lemma}\label{lemtwo} |
1134 @{thm[mode=IfThen] central} |
1264 @{thm[mode=IfThen] central} |
1135 \end{lemma} |
1265 \end{lemma} |
1136 |
1266 |
|
1267 \begin{proof} |
|
1268 By reverse induction on @{term s} generalising over @{text r}. |
|
1269 \end{proof} |
|
1270 |
|
1271 \noindent |
|
1272 With these lemmas in place we can finally establish that @{term "blexer_simp"} and @{term "blexer"} |
|
1273 generate the same value, and using Theorem~\ref{thmone} from the previous section that this value |
|
1274 is indeed the POSIX value. |
|
1275 |
1137 \begin{theorem} |
1276 \begin{theorem} |
1138 @{thm[mode=IfThen] main_blexer_simp} |
1277 @{thm[mode=IfThen] main_blexer_simp} |
1139 \end{theorem} |
1278 \end{theorem} |
1140 |
1279 |
1141 Sulzmann \& Lu apply simplification via a fixpoint operation |
1280 \begin{proof} |
1142 |
1281 By unfolding the definitions and using Lemmas~\ref{lemtwo} and \ref{lemthree}. |
1143 ; also does not use erase to filter out duplicates. |
1282 \end{proof} |
1144 |
1283 |
1145 not direct correspondence with PDERs, because of example |
1284 \noindent |
1146 problem with retrieve |
1285 This completes the correctness proof for the second POSIX lexing algorithm by Sulzmann and Lu. |
1147 |
1286 The interesting point of this algorithm is that the sizes of derivatives do not grow arbitrarily, which |
1148 correctness |
1287 we shall show next. |
1149 |
|
1150 |
|
1151 |
|
1152 |
|
1153 |
|
1154 |
1288 |
1155 \begin{figure}[t] |
1289 \begin{figure}[t] |
1156 \begin{center} |
1290 \begin{center} |
1157 \begin{tabular}{c} |
1291 \begin{tabular}{c} |
1158 @{thm[mode=Axiom] bs1[of _ "r\<^sub>2"]}\qquad |
1292 @{thm[mode=Axiom] bs1[of _ "r\<^sub>2"]}$S\ZERO{}_l$\qquad |
1159 @{thm[mode=Axiom] bs2[of _ "r\<^sub>1"]}\qquad |
1293 @{thm[mode=Axiom] bs2[of _ "r\<^sub>1"]}$S\ZERO{}_r$\\ |
1160 @{thm[mode=Axiom] bs3[of "bs\<^sub>1" "bs\<^sub>2"]}\\ |
1294 @{thm[mode=Axiom] bs3[of "bs\<^sub>1" "bs\<^sub>2"]}$S\ONE$\\ |
1161 @{thm[mode=Rule] bs4[of "r\<^sub>1" "r\<^sub>2" _ "r\<^sub>3"]}\qquad |
1295 @{thm[mode=Rule] bs4[of "r\<^sub>1" "r\<^sub>2" _ "r\<^sub>3"]}SL\qquad |
1162 @{thm[mode=Rule] bs5[of "r\<^sub>3" "r\<^sub>4" _ "r\<^sub>1"]}\\ |
1296 @{thm[mode=Rule] bs5[of "r\<^sub>3" "r\<^sub>4" _ "r\<^sub>1"]}SR\\ |
1163 @{thm[mode=Axiom] bs6}\qquad |
1297 @{thm[mode=Axiom] bs6}$A0$\qquad |
1164 @{thm[mode=Axiom] bs7}\\ |
1298 @{thm[mode=Axiom] bs7}$A1$\\ |
1165 @{thm[mode=Rule] bs8[of "rs\<^sub>1" "rs\<^sub>2"]}\\ |
1299 @{thm[mode=Rule] bs8[of "rs\<^sub>1" "rs\<^sub>2"]}$AL$\\ |
1166 %@ { t hm[mode=Axiom] ss1}\qquad |
1300 @{thm[mode=Rule] ss2[of "rs\<^sub>1" "rs\<^sub>2"]}$LH$\qquad |
1167 @{thm[mode=Rule] ss2[of "rs\<^sub>1" "rs\<^sub>2"]}\qquad |
1301 @{thm[mode=Rule] ss3[of "r\<^sub>1" "r\<^sub>2"]}$LT$\\ |
1168 @{thm[mode=Rule] ss3[of "r\<^sub>1" "r\<^sub>2"]}\\ |
1302 @{thm[mode=Axiom] ss4}$L\ZERO$\qquad |
1169 @{thm[mode=Axiom] ss4}\qquad |
1303 @{thm[mode=Axiom] ss5[of "bs" "rs\<^sub>1" "rs\<^sub>2"]}$LS$\medskip\\ |
1170 @{thm[mode=Axiom] ss5[of "bs" "rs\<^sub>1" "rs\<^sub>2"]}\\ |
1304 @{thm[mode=Rule] ss6[of "r\<^sub>2" "r\<^sub>1" "rs\<^sub>1" "rs\<^sub>2" "rs\<^sub>3"]}$LD$\\ |
1171 @{thm[mode=Rule] ss6[of "r\<^sub>2" "r\<^sub>1" "rs\<^sub>1" "rs\<^sub>2" "rs\<^sub>3"]}\\ |
|
1172 \end{tabular} |
1305 \end{tabular} |
1173 \end{center} |
1306 \end{center} |
1174 \caption{???}\label{SimpRewrites} |
1307 \caption{The rewrite rules that generate simplified regular expressions |
|
1308 in small steps: @{term "rrewrite r\<^sub>1 r\<^sub>2"} is for bitcoded regular |
|
1309 expressions and @{term "rrewrites rs\<^sub>1 rs\<^sub>2"} for \emph{lists} of bitcoded |
|
1310 regular expressions. Interesting is the $LD$ rule that allows copies of regular |
|
1311 expressions be removed provided a regular expression earlier in the list can |
|
1312 match the same strings.}\label{SimpRewrites} |
1175 \end{figure} |
1313 \end{figure} |
1176 *} |
1314 *} |
1177 |
1315 |
1178 section {* Bound - NO *} |
1316 section {* Finiteness of Derivatives *} |
|
1317 |
|
1318 text {* |
|
1319 |
|
1320 In this section let us sketch our argument on why the size of the simplified |
|
1321 derivatives with the aggressive simplification function is finite. Suppose |
|
1322 we have a size functions for bitcoded regular expressions, written |
|
1323 @{text "|r|"}, which counts the number of nodes if we regard $r$ as a tree |
|
1324 (we omit the precise definition). For this we show that for every $r$ |
|
1325 there exists a bound $N$ |
|
1326 such that |
|
1327 |
|
1328 \begin{center} |
|
1329 $\forall s. \; |@{term "bders_simp r s"}| < N$ |
|
1330 \end{center} |
|
1331 |
|
1332 \noindent |
|
1333 We prove this by induction on $r$. The base cases for @{term AZERO}, |
|
1334 @{term "AONE bs"} and @{term "ACHAR bs c"} are straightforward. The interesting case is |
|
1335 for sequences of the form @{term "ASEQ bs r\<^sub>1 r\<^sub>2"}. In this case our induction |
|
1336 hypotheses state $\forall s. \; |@{term "bders_simp r\<^sub>1 s"}| < N_1$ and |
|
1337 $\forall s. \; |@{term "bders_simp r\<^sub>2 s"}| < N_2$. We can reason as follows |
|
1338 |
|
1339 \begin{center} |
|
1340 \begin{tabular}{lcll} |
|
1341 & & $ |@{term "bders_simp (ASEQ bs r\<^sub>1 r\<^sub>2) s"}|$\\ |
|
1342 & $ = $ & $|bsimp(ALTs\;bs\;((@{term "bders_simp r\<^sub>1 s"}) \cdot r_2) :: |
|
1343 [@{term "bders_simp r\<^sub>2 s'"} \;|\; s' \in Suf\!fix(s)])| $ & (1) \\ |
|
1344 & $\leq$ & |
|
1345 $|distinctBy\,(flts\,((@{term "bders_simp r\<^sub>1 s "}) \cdot r_2) :: |
|
1346 [@{term "bders_simp r\<^sub>2 s'"} \;|\; s' \in Suf\!fix(s)])| + 1 $ & (2) \\ |
|
1347 & $\leq$ & $|(@{term "bders_simp r\<^sub>1 s"}) \cdot r_2| + |
|
1348 |distinctBy\,(flts\, [@{term "bders_simp r\<^sub>2 s'"} \;|\; s' \in Suf\!fix(s)])| + 1 $ & (3) \\ |
|
1349 & $\leq$ & $N_1 + |r_2| + 2 + |distinctBy\,(flts\, [@{term "bders_simp r\<^sub>2 s'"} \;|\; s' \in Suf\!fix(s)])|$ & (4)\\ |
|
1350 & $\leq$ & $N_1 + |r_2| + 2 + l_{N_{2}} * N_{2}$ & (5) |
|
1351 \end{tabular} |
|
1352 \end{center} |
|
1353 |
|
1354 % tell Chengsong about Indian paper of closed forms of derivatives |
|
1355 |
|
1356 \noindent |
|
1357 where in (1) the $Suf\!fix(s')$ are the suffixes where @{term "bders_simp r\<^sub>1 s''"} is nullable for |
|
1358 @{text "s = s'' @ s'"}. In (3) we know that $|(@{term "bders_simp r\<^sub>1 s"}) \cdot r_2|$ is |
|
1359 bounded by $N_1 + |r_2|$. In (5) we know the list comprehension contains only regular expressions of size smaller |
|
1360 than $N_2$. The list length after @{text distinctBy} is bounded by a number, which we call $l_{N_2}$. It stands |
|
1361 for the number of distinct regular expressions with a maximum size $N_2$ (there can only be finitely many of them). |
|
1362 We reason similarly in the @{text Star}-case.\medskip |
|
1363 |
|
1364 \noindent |
|
1365 Clearly we give in this finiteness argument (Step (5)) a very loose bound that is |
|
1366 far from the actual bound we can expect. We can do better than this, but this does not improve |
|
1367 the finiteness property we are proving. If we are interested in a polynomial bound, |
|
1368 one would hope to obtain a similar tight bound as for partial |
|
1369 derivatives introduced by Antimirov \cite{Antimirov95}. After all the idea with |
|
1370 @{text distinctBy} is to maintain a ``set'' of alternatives (like the sets in |
|
1371 partial derivatives). Unfortunately to obtain the exact same bound would mean |
|
1372 we need to introduce simplifications such as |
|
1373 % |
|
1374 \[ (r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3) |
|
1375 \] |
|
1376 |
|
1377 \noindent |
|
1378 which exist for partial derivatives. However, if we introduce them in our |
|
1379 setting we would lose the POSIX property of our calculated values. We leave better |
|
1380 bounds for future work. |
|
1381 |
|
1382 *} |
1179 |
1383 |
1180 |
1384 |
1181 section {* Conclusion *} |
1385 section {* Conclusion *} |
1182 |
1386 |
1183 text {* |
1387 text {* |
1184 |
1388 |
|
1389 We set out in this work to prove in Isabelle/HOL the correctness of |
|
1390 the second POSIX lexing algorithm by Sulzmann and Lu |
|
1391 \cite{Sulzmann2014}. This follows earlier work where we established |
|
1392 the correctness of the first algorithm |
|
1393 \cite{AusafDyckhoffUrban2016}. In the earlier work we needed to |
|
1394 introduce our own specification about what POSIX values are, |
|
1395 because the informal definition given by Sulzmann and Lu did not |
|
1396 stand up to a formal proof. Also for the second algorithm we needed |
|
1397 to introduce our own definitions and proof ideas in order to establish the |
|
1398 correctness. Our interest in the second algorithm |
|
1399 lies in the fact that by using bitcoded regular expressions and an aggressive |
|
1400 simplification method there is a chance that the the derivatives |
|
1401 can be kept universally small (we established in this paper that |
|
1402 they can be kept finite for any string). This is important if one is after |
|
1403 an efficient POSIX lexing algorithm. |
|
1404 |
|
1405 Having proved the correctness of the POSIX lexing algorithm, which |
|
1406 lessons have we learned? Well, we feel this is a very good example |
|
1407 where formal proofs give further insight into the matter at |
|
1408 hand. For example it is very hard to see a problem with @{text nub} |
|
1409 vs @{text distinctBy} with only experimental data---one would still |
|
1410 see the correct result but find that simplification does not simplify in well-chosen, but not |
|
1411 obscure, examples. We found that from an implementation |
|
1412 point-of-view it is really important to have the formal proofs of |
|
1413 the corresponding properties at hand. We have also developed a |
|
1414 healthy suspicion when experimental data is used to back up |
|
1415 efficiency claims. For example Sulzmann and Lu write about their |
|
1416 equivalent of @{term blexer_simp} ``...we can incrementally compute |
|
1417 bitcoded parse trees in linear time in the size of the input'' |
|
1418 \cite[Page 14]{Sulzmann2014}. |
|
1419 Given the growth of the |
|
1420 derivatives in some cases even after aggressive simplification, this |
|
1421 is a hard to believe fact. A similar claim of about a theoretical runtime |
|
1422 of @{text "O(n\<^sup>2)"} is made for the Verbatim lexer, which calculates POSIX matches and is based on |
|
1423 derivatives \cite{verbatim}. In this case derivatives are not simplified. |
|
1424 Clearly our result of having finite |
|
1425 derivatives is rather weak in this context but we think such effeciency claims |
|
1426 require further scrutiny.\medskip |
|
1427 |
|
1428 \noindent |
|
1429 Our Isabelle/HOL code is available under \url{https://github.com/urbanchr/posix}. |
1185 |
1430 |
1186 |
1431 |
1187 %%\bibliographystyle{plain} |
1432 %%\bibliographystyle{plain} |
1188 \bibliography{root} |
1433 \bibliography{root} |
1189 *} |
1434 *} |