184 identifier. For @{text "if"} we obtain by the Priority Rule a keyword |
185 identifier. For @{text "if"} we obtain by the Priority Rule a keyword |
185 token, not an identifier token---even if @{text "r\<^bsub>id\<^esub>"} |
186 token, not an identifier token---even if @{text "r\<^bsub>id\<^esub>"} |
186 matches also. By the Star Rule we know @{text "(r\<^bsub>key\<^esub> + |
187 matches also. By the Star Rule we know @{text "(r\<^bsub>key\<^esub> + |
187 r\<^bsub>id\<^esub>)\<^sup>\<star>"} matches @{text "iffoo"}, |
188 r\<^bsub>id\<^esub>)\<^sup>\<star>"} matches @{text "iffoo"}, |
188 respectively @{text "if"}, in exactly one `iteration' of the star. The |
189 respectively @{text "if"}, in exactly one `iteration' of the star. The |
189 Empty String Rule is for cases where @{text |
190 Empty String Rule is for cases where, for example, @{text |
190 "(a\<^sup>\<star>)\<^sup>\<star>"}, for example, matches against the |
191 "(a\<^sup>\<star>)\<^sup>\<star>"} matches against the |
191 string @{text "bc"}. Then the longest initial matched substring is the |
192 string @{text "bc"}. Then the longest initial matched substring is the |
192 empty string, which is matched by both the whole regular expression |
193 empty string, which is matched by both the whole regular expression |
193 and the parenthesised sub-expression. |
194 and the parenthesised subexpression. |
194 |
195 |
195 |
196 |
196 One limitation of Brzozowski's matcher is that it only generates a |
197 One limitation of Brzozowski's matcher is that it only generates a |
197 YES/NO answer for whether a string is being matched by a regular |
198 YES/NO answer for whether a string is being matched by a regular |
198 expression. Sulzmann and Lu~\cite{Sulzmann2014} extended this matcher |
199 expression. Sulzmann and Lu~\cite{Sulzmann2014} extended this matcher |
199 to allow generation not just of a YES/NO answer but of an actual |
200 to allow generation not just of a YES/NO answer but of an actual |
200 matching, called a [lexical] {\em value}. \marginpar{explain values; |
201 matching, called a [lexical] {\em value}. Assuming a regular |
201 who introduced them} They give a simple algorithm to calculate a value |
202 expression matches a string, values encode the information of |
202 that appears to be the value associated with POSIX matching. The |
203 \emph{how} the string is matched by the regular expression---that is, |
203 challenge then is to specify that value, in an algorithm-independent |
204 which part of the string is matched by which part of the regular |
204 fashion, and to show that Sulzmann and Lu's derivative-based algorithm |
205 expression. For this consider again the the string @{text "xy"} and |
205 does indeed calculate a value that is correct according to the |
206 the regular expression \mbox{@{text "(x + (y + |
206 specification. |
207 xy))\<^sup>\<star>"}}. The POSIX value, which corresponds to using the |
207 |
208 star in only one repetition, |
208 The answer given by Sulzmann and Lu \cite{Sulzmann2014} is to define a |
209 |
209 relation (called an ``order relation'') on the set of values of @{term |
210 |
210 r}, and to show that (once a string to be matched is chosen) there is |
211 \marginpar{explain values; who introduced them} |
211 a maximum element and that it is computed by their derivative-based |
212 |
212 algorithm. This proof idea is inspired by work of Frisch and Cardelli |
213 |
213 \cite{Frisch2004} on a GREEDY regular expression matching |
214 Sulzmann and Lu give a simple algorithm to calculate a value that |
214 algorithm. However, we were not able to establish transitivity and |
215 appears to be the value associated with POSIX matching. The challenge |
215 totality for the ``order relation'' by Sulzmann and Lu. |
216 then is to specify that value, in an algorithm-independent fashion, |
216 There are some inherent problems with their approach (of |
217 and to show that Sulzmann and Lu's derivative-based algorithm does |
217 which some of the proofs are not published in \cite{Sulzmann2014}); |
218 indeed calculate a value that is correct according to the |
218 perhaps more importantly, we give in this paper a simple inductive (and |
219 specification. The answer given by Sulzmann and Lu |
219 algorithm-independent) definition of what we call being a {\em POSIX |
220 \cite{Sulzmann2014} is to define a relation (called an ``order |
220 value} for a regular expression @{term r} and a string @{term s}; we |
221 relation'') on the set of values of @{term r}, and to show that (once |
221 show that the algorithm computes such a value and that such a value is |
222 a string to be matched is chosen) there is a maximum element and that |
222 unique. Our proofs are both done by hand and checked in Isabelle/HOL. The |
223 it is computed by their derivative-based algorithm. This proof idea is |
223 experience of doing our proofs has been that this mechanical checking |
224 inspired by work of Frisch and Cardelli \cite{Frisch2004} on a GREEDY |
224 was absolutely essential: this subject area has hidden snares. This |
225 regular expression matching algorithm. However, we were not able to |
225 was also noted by Kuklewicz \cite{Kuklewicz} who found that nearly all |
226 establish transitivity and totality for the ``order relation'' by |
226 POSIX matching implementations are ``buggy'' \cite[Page |
227 Sulzmann and Lu. There are some inherent problems with their approach |
227 203]{Sulzmann2014} and by Grathwohl et al \cite[Page 36]{CrashCourse2014} |
228 (of which some of the proofs are not published in |
228 who wrote: |
229 \cite{Sulzmann2014}); perhaps more importantly, we give in this paper |
|
230 a simple inductive (and algorithm-independent) definition of what we |
|
231 call being a {\em POSIX value} for a regular expression @{term r} and |
|
232 a string @{term s}; we show that the algorithm by Sulzmann and Lu |
|
233 computes such a value and that such a value is unique. Our proofs are |
|
234 both done by hand and checked in Isabelle/HOL. The experience of |
|
235 doing our proofs has been that this mechanical checking was absolutely |
|
236 essential: this subject area has hidden snares. This was also noted by |
|
237 Kuklewicz \cite{Kuklewicz} who found that nearly all POSIX matching |
|
238 implementations are ``buggy'' \cite[Page 203]{Sulzmann2014} and by |
|
239 Grathwohl et al \cite[Page 36]{CrashCourse2014} who wrote: |
229 |
240 |
230 \begin{quote} |
241 \begin{quote} |
231 \it{}``The POSIX strategy is more complicated than the greedy because of |
242 \it{}``The POSIX strategy is more complicated than the greedy because of |
232 the dependence on information about the length of matched strings in the |
243 the dependence on information about the length of matched strings in the |
233 various subexpressions.'' |
244 various subexpressions.'' |
770 Interesting is also the rule for sequence regular expressions (@{text |
781 Interesting is also the rule for sequence regular expressions (@{text |
771 "PS"}). The first two premises state that @{term "v\<^sub>1"} and @{term "v\<^sub>2"} |
782 "PS"}). The first two premises state that @{term "v\<^sub>1"} and @{term "v\<^sub>2"} |
772 are the POSIX values for @{term "(s\<^sub>1, r\<^sub>1)"} and @{term "(s\<^sub>2, r\<^sub>2)"} |
783 are the POSIX values for @{term "(s\<^sub>1, r\<^sub>1)"} and @{term "(s\<^sub>2, r\<^sub>2)"} |
773 respectively. Consider now the third premise and note that the POSIX value |
784 respectively. Consider now the third premise and note that the POSIX value |
774 of this rule should match the string \mbox{@{term "s\<^sub>1 @ s\<^sub>2"}}. According to the |
785 of this rule should match the string \mbox{@{term "s\<^sub>1 @ s\<^sub>2"}}. According to the |
775 longest match rule, we want that the @{term "s\<^sub>1"} is the longest initial |
786 Longest Match Rule, we want that the @{term "s\<^sub>1"} is the longest initial |
776 split of \mbox{@{term "s\<^sub>1 @ s\<^sub>2"}} such that @{term "s\<^sub>2"} is still recognised |
787 split of \mbox{@{term "s\<^sub>1 @ s\<^sub>2"}} such that @{term "s\<^sub>2"} is still recognised |
777 by @{term "r\<^sub>2"}. Let us assume, contrary to the third premise, that there |
788 by @{term "r\<^sub>2"}. Let us assume, contrary to the third premise, that there |
778 \emph{exist} an @{term "s\<^sub>3"} and @{term "s\<^sub>4"} such that @{term "s\<^sub>2"} |
789 \emph{exist} an @{term "s\<^sub>3"} and @{term "s\<^sub>4"} such that @{term "s\<^sub>2"} |
779 can be split up into a non-empty string @{term "s\<^sub>3"} and a possibly empty |
790 can be split up into a non-empty string @{term "s\<^sub>3"} and a possibly empty |
780 string @{term "s\<^sub>4"}. Moreover the longer string @{term "s\<^sub>1 @ s\<^sub>3"} can be |
791 string @{term "s\<^sub>4"}. Moreover the longer string @{term "s\<^sub>1 @ s\<^sub>3"} can be |
781 matched by @{text "r\<^sub>1"} and the shorter @{term "s\<^sub>4"} can still be |
792 matched by @{text "r\<^sub>1"} and the shorter @{term "s\<^sub>4"} can still be |
782 matched by @{term "r\<^sub>2"}. In this case @{term "s\<^sub>1"} would \emph{not} be the |
793 matched by @{term "r\<^sub>2"}. In this case @{term "s\<^sub>1"} would \emph{not} be the |
783 longest initial split of \mbox{@{term "s\<^sub>1 @ s\<^sub>2"}} and therefore @{term "Seq v\<^sub>1 |
794 longest initial split of \mbox{@{term "s\<^sub>1 @ s\<^sub>2"}} and therefore @{term "Seq v\<^sub>1 |
784 v\<^sub>2"} cannot be a POSIX value for @{term "(s\<^sub>1 @ s\<^sub>2, SEQ r\<^sub>1 r\<^sub>2)"}. |
795 v\<^sub>2"} cannot be a POSIX value for @{term "(s\<^sub>1 @ s\<^sub>2, SEQ r\<^sub>1 r\<^sub>2)"}. |
785 The main point is that our side-condition ensures the longest |
796 The main point is that our side-condition ensures the Longest |
786 match rule is satisfied. |
797 Match Rule is satisfied. |
787 |
798 |
788 A similar condition is imposed on the POSIX value in the @{text |
799 A similar condition is imposed on the POSIX value in the @{text |
789 "P\<star>"}-rule. Also there we want that @{term "s\<^sub>1"} is the longest initial |
800 "P\<star>"}-rule. Also there we want that @{term "s\<^sub>1"} is the longest initial |
790 split of @{term "s\<^sub>1 @ s\<^sub>2"} and furthermore the corresponding value |
801 split of @{term "s\<^sub>1 @ s\<^sub>2"} and furthermore the corresponding value |
791 @{term v} cannot be flattened to the empty string. In effect, we require |
802 @{term v} cannot be flattened to the empty string. In effect, we require |
792 that in each ``iteration'' of the star, some non-empty substring needs to |
803 that in each ``iteration'' of the star, some non-empty substring needs to |
793 be ``chipped'' away; only in case of the empty string we accept @{term |
804 be ``chipped'' away; only in case of the empty string we accept @{term |
794 "Stars []"} as the POSIX value. Indeed we can show that our POSIX value |
805 "Stars []"} as the POSIX value. Indeed we can show that our POSIX value |
795 is a lexical value which excludes those @{text Stars} containing values |
806 is a lexical value which excludes those @{text Stars} containing subvalues |
796 that flatten to the empty string. |
807 that flatten to the empty string. |
797 |
808 |
798 \begin{lemma} |
809 \begin{lemma}\label{LVposix} |
799 @{thm [mode=IfThen] Posix_LV} |
810 @{thm [mode=IfThen] Posix_LV} |
800 \end{lemma} |
811 \end{lemma} |
801 |
812 |
802 \begin{proof} |
813 \begin{proof} |
803 By routine induction on @{thm (prem 1) Posix_LV}.\qed |
814 By routine induction on @{thm (prem 1) Posix_LV}.\qed |
1025 @{thm pflat_len_def} |
1034 @{thm pflat_len_def} |
1026 \end{center} |
1035 \end{center} |
1027 |
1036 |
1028 \noindent where we take the length of the flattened value at |
1037 \noindent where we take the length of the flattened value at |
1029 position @{text p}, provided the position is inside @{text v}; if |
1038 position @{text p}, provided the position is inside @{text v}; if |
1030 not, then the norm is @{text "-1"}. The default is crucial |
1039 not, then the norm is @{text "-1"}. The default for outside |
1031 for the POSIX requirement of preferring a @{text Left}-value |
1040 positions is crucial for the POSIX requirement of preferring a |
1032 over a @{text Right}-value (if they can match the same string---see |
1041 @{text Left}-value over a @{text Right}-value (if they can match the |
1033 the Priority Rule from the Introduction). For this consider |
1042 same string---see the Priority Rule from the Introduction). For this |
|
1043 consider |
1034 |
1044 |
1035 \begin{center} |
1045 \begin{center} |
1036 @{term "v == Left (Char x)"} \qquad and \qquad @{term "w == Right (Char x)"} |
1046 @{term "v == Left (Char x)"} \qquad and \qquad @{term "w == Right (Char x)"} |
1037 \end{center} |
1047 \end{center} |
1038 |
1048 |
1039 \noindent Both values match @{text x}, but at position @{text "[0]"} |
1049 \noindent Both values match @{text x}, but at position @{text "[0]"} |
1040 the norm of @{term v} is @{text 1} (the subvalue matches @{text x}), but the |
1050 the norm of @{term v} is @{text 1} (the subvalue matches @{text x}), |
1041 norm of @{text w} is @{text "-1"} (the position is outside @{text w} |
1051 but the norm of @{text w} is @{text "-1"} (the position is outside |
1042 according to how we defined the `inside' positions of @{text Left}- |
1052 @{text w} according to how we defined the `inside' positions of |
1043 and @{text Right}-values). Of course at position @{text "[1]"}, the |
1053 @{text Left}- and @{text Right}-values). Of course at position |
1044 norms @{term "pflat_len v [1]"} and @{term "pflat_len w [1]"} are |
1054 @{text "[1]"}, the norms @{term "pflat_len v [1]"} and @{term |
1045 reversed, but the point is that subvalues will be analysed according to |
1055 "pflat_len w [1]"} are reversed, but the point is that subvalues |
1046 lexicographically orderd positions. This order, written @{term |
1056 will be analysed according to lexicographically ordered |
1047 "DUMMY \<sqsubset>lex DUMMY"}, can be conveniently formalised by |
1057 positions. According to this ordering, the position @{text "[0]"} |
1048 three inference rules |
1058 takes precedence. The lexicographic ordering of positions, written |
|
1059 @{term "DUMMY \<sqsubset>lex DUMMY"}, can be conveniently formalised |
|
1060 by three inference rules |
1049 |
1061 |
1050 \begin{center} |
1062 \begin{center} |
1051 \begin{tabular}{ccc} |
1063 \begin{tabular}{ccc} |
1052 @{thm [mode=Axiom] lex_list.intros(1)}\hspace{1cm} & |
1064 @{thm [mode=Axiom] lex_list.intros(1)}\hspace{1cm} & |
1053 @{thm [mode=Rule] lex_list.intros(3)[where ?p1.0="p\<^sub>1" and ?p2.0="p\<^sub>2" and |
1065 @{thm [mode=Rule] lex_list.intros(3)[where ?p1.0="p\<^sub>1" and ?p2.0="p\<^sub>2" and |
1054 ?ps1.0="ps\<^sub>1" and ?ps2.0="ps\<^sub>2"]}\hspace{1cm} & |
1066 ?ps1.0="ps\<^sub>1" and ?ps2.0="ps\<^sub>2"]}\hspace{1cm} & |
1055 @{thm [mode=Rule] lex_list.intros(2)[where ?ps1.0="ps\<^sub>1" and ?ps2.0="ps\<^sub>2"]} |
1067 @{thm [mode=Rule] lex_list.intros(2)[where ?ps1.0="ps\<^sub>1" and ?ps2.0="ps\<^sub>2"]} |
1056 \end{tabular} |
1068 \end{tabular} |
1057 \end{center} |
1069 \end{center} |
1058 |
1070 |
1059 With the norm and lexicographic order of positions in place, |
1071 With the norm and lexicographic order in place, |
1060 we can state the key definition of Okui and Suzuki |
1072 we can state the key definition of Okui and Suzuki |
1061 \cite{OkuiSuzuki2010}: a value @{term "v\<^sub>1"} is \emph{smaller} than |
1073 \cite{OkuiSuzuki2010}: a value @{term "v\<^sub>1"} is \emph{smaller} than |
1062 @{term "v\<^sub>2"} if and only if $(i)$ the norm at position @{text p} is |
1074 @{term "v\<^sub>2"} if and only if $(i)$ the norm at position @{text p} is |
1063 greater in @{term "v\<^sub>1"} (that is the string @{term "flat (at v\<^sub>1 p)"} is longer |
1075 greater in @{term "v\<^sub>1"} (that is the string @{term "flat (at v\<^sub>1 p)"} is longer |
1064 than @{term "flat (at v\<^sub>2 p)"}) and $(ii)$ all subvalues at |
1076 than @{term "flat (at v\<^sub>2 p)"}) and $(ii)$ all subvalues at |
1140 @{thm [mode=IfThen] PosOrd_shorterI[where ?v1.0="v\<^sub>1" and ?v2.0="v\<^sub>2"]} |
1152 @{thm [mode=IfThen] PosOrd_shorterI[where ?v1.0="v\<^sub>1" and ?v2.0="v\<^sub>2"]} |
1141 |
1153 |
1142 |
1154 |
1143 |
1155 |
1144 Next we establish how Okui and Suzuki's ordering relates to our |
1156 Next we establish how Okui and Suzuki's ordering relates to our |
1145 definition of POSIX values. Given a POSIX value @{text "v\<^sub>1"} |
1157 definition of POSIX values. Given a @{text POSIX} value @{text "v\<^sub>1"} |
1146 for @{text r} and @{text s}, then any other lexical value @{text |
1158 for @{text r} and @{text s}, then any other lexical value @{text |
1147 "v\<^sub>2"} in @{term "LV r s"} is greater or equal than @{text |
1159 "v\<^sub>2"} in @{term "LV r s"} is greater or equal than @{text |
1148 "v\<^sub>1"}: |
1160 "v\<^sub>1"}, namely: |
1149 |
1161 |
1150 |
1162 |
1151 \begin{theorem} |
1163 \begin{theorem}\label{orderone} |
1152 @{thm [mode=IfThen] Posix_PosOrd[where ?v1.0="v\<^sub>1" and ?v2.0="v\<^sub>2"]} |
1164 @{thm [mode=IfThen] Posix_PosOrd[where ?v1.0="v\<^sub>1" and ?v2.0="v\<^sub>2"]} |
1153 \end{theorem} |
1165 \end{theorem} |
1154 |
1166 |
1155 \begin{proof} By induction on our POSIX rules. By |
1167 \begin{proof} By induction on our POSIX rules. By |
1156 Thm~\ref{posixdeterm} and the definition of @{const LV}, it is clear |
1168 Theorem~\ref{posixdeterm} and the definition of @{const LV}, it is clear |
1157 that @{text "v\<^sub>1"} and @{text "v\<^sub>2"} have the same |
1169 that @{text "v\<^sub>1"} and @{text "v\<^sub>2"} have the same |
1158 underlying string @{term s}. The three base cases are |
1170 underlying string @{term s}. The three base cases are |
1159 straightforward: for example for @{term "v\<^sub>1 = Void"}, we have |
1171 straightforward: for example for @{term "v\<^sub>1 = Void"}, we have |
1160 that @{term "v\<^sub>2 \<in> LV ONE []"} must also be of the form |
1172 that @{term "v\<^sub>2 \<in> LV ONE []"} must also be of the form |
1161 \mbox{@{term "v\<^sub>2 = Void"}}. Therefore we have @{term |
1173 \mbox{@{term "v\<^sub>2 = Void"}}. Therefore we have @{term |
1162 "v\<^sub>1 :\<sqsubseteq>val v\<^sub>2"}. The inductive cases for |
1174 "v\<^sub>1 :\<sqsubseteq>val v\<^sub>2"}. The inductive cases for |
1163 @{term "ALT r\<^sub>1 r\<^sub>2"} and @{term "SEQ r\<^sub>1 |
1175 @{text r} being of the form @{term "ALT r\<^sub>1 r\<^sub>2"} and |
1164 r\<^sub>2"} are as follows: |
1176 @{term "SEQ r\<^sub>1 r\<^sub>2"} are as follows: |
1165 |
1177 |
1166 |
1178 |
1167 \begin{itemize} |
1179 \begin{itemize} |
1168 |
1180 |
1169 \item[$\bullet$] Case @{term "s \<in> (ALT r\<^sub>1 r\<^sub>2) |
1181 \item[$\bullet$] Case @{text "P+L"} with @{term "s \<in> (ALT r\<^sub>1 r\<^sub>2) |
1170 \<rightarrow> (Left w\<^sub>1)"}: In this case @{term "v\<^sub>1 = |
1182 \<rightarrow> (Left w\<^sub>1)"}: In this case @{term "v\<^sub>1 = |
1171 Left w\<^sub>1"} and the value @{term "v\<^sub>2"} is either of the |
1183 Left w\<^sub>1"} and the value @{term "v\<^sub>2"} is either of the |
1172 form @{term "Left w\<^sub>2"} or @{term "Right w\<^sub>2"}. In the |
1184 form @{term "Left w\<^sub>2"} or @{term "Right w\<^sub>2"}. In the |
1173 latter case we can immediately conclude with @{term "v\<^sub>1 |
1185 latter case we can immediately conclude with @{term "v\<^sub>1 |
1174 :\<sqsubseteq>val v\<^sub>2"} since a @{text Left}-value with the |
1186 :\<sqsubseteq>val v\<^sub>2"} since a @{text Left}-value with the |
1175 same underlying string @{text s} is always smaller or equal than a |
1187 same underlying string @{text s} is always smaller than a |
1176 @{text Right}-value. In the former case we have @{term "w\<^sub>2 |
1188 @{text Right}-value. In the former case we have @{term "w\<^sub>2 |
1177 \<in> LV r\<^sub>1 s"} and can use the induction hypothesis to infer |
1189 \<in> LV r\<^sub>1 s"} and can use the induction hypothesis to infer |
1178 @{term "w\<^sub>1 :\<sqsubseteq>val w\<^sub>2"}. Because @{term |
1190 @{term "w\<^sub>1 :\<sqsubseteq>val w\<^sub>2"}. Because @{term |
1179 "w\<^sub>1"} and @{term "w\<^sub>2"} have the same underlying string |
1191 "w\<^sub>1"} and @{term "w\<^sub>2"} have the same underlying string |
1180 @{text s}, we can conclude with @{term "Left w\<^sub>1 |
1192 @{text s}, we can conclude with @{term "Left w\<^sub>1 |
1181 :\<sqsubseteq>val Left w\<^sub>2"} by Prop ???.\smallskip |
1193 :\<sqsubseteq>val Left w\<^sub>2"} by Prop ???.\smallskip |
1182 |
1194 |
1183 \item[$\bullet$] Case @{term "s \<in> (ALT r\<^sub>1 r\<^sub>2) |
1195 \item[$\bullet$] Case @{text "P+R"} with @{term "s \<in> (ALT r\<^sub>1 r\<^sub>2) |
1184 \<rightarrow> (Right w\<^sub>1)"}: This case similar as the previous |
1196 \<rightarrow> (Right w\<^sub>1)"}: This case similar to the previous |
1185 case, except that we know that @{term "s \<notin> L |
1197 case, except that we additionally know @{term "s \<notin> L |
1186 r\<^sub>1"}. This is needed when @{term "v\<^sub>2"} is of the form |
1198 r\<^sub>1"}. This is needed when @{term "v\<^sub>2"} is of the form |
1187 @{term "Left w\<^sub>2"}: since \mbox{@{term "flat v\<^sub>2 = flat |
1199 @{term "Left w\<^sub>2"}. Since \mbox{@{term "flat v\<^sub>2 = flat |
1188 w\<^sub>2"} @{text "= s"}} and @{term "\<Turnstile> w\<^sub>2 : |
1200 w\<^sub>2"} @{text "= s"}} and @{term "\<Turnstile> w\<^sub>2 : |
1189 r\<^sub>1"}, we can derive a contradiction using |
1201 r\<^sub>1"}, we can derive a contradiction for \mbox{@{term "s \<notin> L |
1190 Prop.~\ref{inhabs}. So also in this case \mbox{@{term "v\<^sub>1 |
1202 r\<^sub>1"}} using |
|
1203 Proposition~\ref{inhabs}. So also in this case \mbox{@{term "v\<^sub>1 |
1191 :\<sqsubseteq>val v\<^sub>2"}}.\smallskip |
1204 :\<sqsubseteq>val v\<^sub>2"}}.\smallskip |
1192 |
1205 |
1193 \item[$\bullet$] Case @{term "(s\<^sub>1 @ s\<^sub>2) \<in> (SEQ |
1206 \item[$\bullet$] Case @{text "PS"} with @{term "(s\<^sub>1 @ s\<^sub>2) \<in> (SEQ |
1194 r\<^sub>1 r\<^sub>2) \<rightarrow> (Seq w\<^sub>1 w\<^sub>2)"}: We |
1207 r\<^sub>1 r\<^sub>2) \<rightarrow> (Seq w\<^sub>1 w\<^sub>2)"}: We |
1195 can assume @{term "v\<^sub>2 = Seq (u\<^sub>1) (u\<^sub>2)"} with |
1208 can assume @{term "v\<^sub>2 = Seq (u\<^sub>1) (u\<^sub>2)"} with |
1196 @{term "\<Turnstile> u\<^sub>1 : r\<^sub>1"} and \mbox{@{term |
1209 @{term "\<Turnstile> u\<^sub>1 : r\<^sub>1"} and \mbox{@{term |
1197 "\<Turnstile> u\<^sub>2 : r\<^sub>2"}}. We have @{term "s\<^sub>1 @ |
1210 "\<Turnstile> u\<^sub>2 : r\<^sub>2"}}. We have @{term "s\<^sub>1 @ |
1198 s\<^sub>2 = (flat u\<^sub>1) @ (flat u\<^sub>2)"}. By the |
1211 s\<^sub>2 = (flat u\<^sub>1) @ (flat u\<^sub>2)"}. By the |
1199 side-condition on out @{text PS}-rule we know that either @{term |
1212 side-condition of the @{text PS}-rule we know that either @{term |
1200 "s\<^sub>1 = flat u\<^sub>1"} or @{term "flat u\<^sub>1"} is a |
1213 "s\<^sub>1 = flat u\<^sub>1"} or that @{term "flat u\<^sub>1"} is a |
1201 strict prefix ??? of @{term "s\<^sub>1"}. In the latter case we can |
1214 strict prefix ??? of @{term "s\<^sub>1"}. In the latter case we can |
1202 infer @{term "w\<^sub>1 :\<sqsubset>val u\<^sub>1"} by ??? and from |
1215 infer @{term "w\<^sub>1 :\<sqsubset>val u\<^sub>1"} by ??? and from |
1203 this @{term "v\<^sub>1 :\<sqsubseteq>val v\<^sub>2"} by ???. In the |
1216 this @{term "v\<^sub>1 :\<sqsubseteq>val v\<^sub>2"} by ???. In the |
1204 former case we know @{term "u\<^sub>1 \<in> LV r\<^sub>1 s\<^sub>1"} |
1217 former case we know @{term "u\<^sub>1 \<in> LV r\<^sub>1 s\<^sub>1"} |
1205 and @{term "u\<^sub>2 \<in> LV r\<^sub>2 s\<^sub>2"}. With this we |
1218 and @{term "u\<^sub>2 \<in> LV r\<^sub>2 s\<^sub>2"}. With this we |
1206 can use the induction hypotheses to infer @{term "w\<^sub>1 |
1219 can use the induction hypotheses to infer @{term "w\<^sub>1 |
1207 :\<sqsubseteq>val u\<^sub>1"} and @{term "w\<^sub>2 |
1220 :\<sqsubseteq>val u\<^sub>1"} and @{term "w\<^sub>2 |
1208 :\<sqsubseteq>val u\<^sub>2"}. By ??? we can again infer @{term |
1221 :\<sqsubseteq>val u\<^sub>2"}. By ??? we can again infer @{term |
1209 "v\<^sub>1 :\<sqsubseteq>val v\<^sub>2"} and are done. |
1222 "v\<^sub>1 :\<sqsubseteq>val v\<^sub>2"}. |
1210 |
1223 |
1211 \end{itemize} |
1224 \end{itemize} |
1212 |
1225 |
1213 \noindent The case for @{text "P\<star>"} is similar to the @{text PS}-case.\qed |
1226 \noindent The case for @{text "P\<star>"} is similar to the @{text PS}-case and omitted.\qed |
1214 \end{proof} |
1227 \end{proof} |
1215 |
1228 |
1216 \noindent This theorem shows that our posix value @{text |
1229 \noindent This theorem shows that our @{text POSIX} value for a |
1217 "v\<^sub>1"} with @{term "s \<in> r \<rightarrow> v\<^sub>1"} is a |
1230 regular expression @{text r} and string @{term s} is in fact a |
1218 minimal element for the values in @{text "LV r s"}. By ??? we also |
1231 minimal element of the values in @{text "LV r s"}. By ??? we also |
1219 know that any value in @{text "LV s' r"}, with @{term "s'"} being a |
1232 know that any value in @{text "LV s' r"}, with @{term "s'"} being a |
1220 prefix, cannot be smaller than @{text "v\<^sub>1"}. The next theorem |
1233 prefix, cannot be smaller than @{text "v\<^sub>1"}. The next theorem |
1221 shows the opposite---namely any minimal element must be a @{text |
1234 shows the opposite---namely any minimal element in @{term "LV r s"} |
1222 POSIX}-value. Given a lexical value @{text "v\<^sub>1"}, say, in |
1235 must be a @{text POSIX} value. For this it helps that we proved in |
1223 @{term "LV r s"}, for which there does not exists any smaller value |
1236 the previous section that whenever a string @{term "s \<in> L r"} then |
1224 in @{term "LV r s"}, then this value is our @{text POSIX}-value: |
1237 a corresponding @{text POSIX} value exists and is a lexical value, |
|
1238 see Theorem ??? and Lemma ???. |
1225 |
1239 |
1226 \begin{theorem} |
1240 \begin{theorem} |
1227 @{thm [mode=IfThen] PosOrd_Posix[where ?v1.0="v\<^sub>1"]} |
1241 @{thm [mode=IfThen] PosOrd_Posix[where ?v1.0="v\<^sub>1"]} |
1228 \end{theorem} |
1242 \end{theorem} |
1229 |
1243 |
1230 \begin{proof} By induction on @{text r}. The tree base cases are |
1244 \begin{proof} |
1231 again straightforward. For example if @{term "v\<^sub>1 \<in> LV |
1245 If @{thm (prem 1) PosOrd_Posix[where ?v1.0="v\<^sub>1"]} then |
1232 ONE s"} then @{term "v\<^sub>1 = Void"} and @{term "s = []"}. We |
1246 @{term "s \<in> L r"} by Proposition~\ref{inhabs}. Hence by Theorem~\ref{lexercorrect}(2) |
1233 know that @{term "[] \<in> ONE \<rightarrow> Void"} holds. In the |
1247 there exists a |
1234 cases for @{term "ALT r\<^sub>1 r\<^sub>2"} and @{term "SEQ |
1248 @{text POSIX} value @{term "v\<^sub>P"} with @{term "s \<in> r \<rightarrow> v\<^sub>P"} |
1235 r\<^sub>1 r\<^sub>2"} we reason as follows: |
1249 and by Lemma~\ref{LVposix} we also have @{term "v\<^sub>P \<in> LV r s"}. |
1236 |
1250 By Theorem~\ref{orderone} we therefore have |
1237 \begin{itemize} |
1251 @{term "v\<^sub>P :\<sqsubseteq>val v\<^sub>1"}. If @{term "v\<^sub>P = v\<^sub>1"} then |
1238 |
1252 we are done. Otherwise we have @{term "v\<^sub>P :\<sqsubset>val v\<^sub>1"} which |
1239 \item[$\bullet$] Case @{term "v\<^sub>1 \<in> LV (ALT r\<^sub>1 |
1253 however contradicts the second assumption and we are done too.\qed |
1240 r\<^sub>2) s"} with @{term "v\<^sub>1 = Left w\<^sub>1"} and @{term |
|
1241 "w\<^sub>1 \<in> LV r\<^sub>1 s"}: In order to use the induction |
|
1242 hypothesis we need to infer |
|
1243 |
|
1244 \begin{center} |
|
1245 @{term "\<forall>v' |
|
1246 \<in> LV (ALT r\<^sub>1 r\<^sub>2) s. \<not> (v' :\<sqsubset>val |
|
1247 Left w\<^sub>1)"} |
|
1248 implies |
|
1249 @{term "\<forall>v' \<in> LV r\<^sub>1 |
|
1250 s. \<not> (v' :\<sqsubset>val w\<^sub>1)"} |
|
1251 \end{center} |
|
1252 |
|
1253 \noindent This can be done because of ?? we can infer that for every |
|
1254 @{text v'} in @{term "LV r\<^sub>1 s"} and @{term "v' |
|
1255 :\<sqsubset>val w\<^sub>1"} also @{term "Left v' :\<sqsubset>val |
|
1256 Left w\<^sub>1"} holds. This gives a contradiction. Consequently, we |
|
1257 can use the induction hypothesis to obtain @{term "s \<in> r\<^sub>1 |
|
1258 \<rightarrow> w\<^sub>1"} and then conclude this case with @{term "s |
|
1259 \<in> ALT r\<^sub>1 r\<^sub>2 \<rightarrow> v\<^sub>1"}.\smallskip |
|
1260 |
|
1261 \item[$\bullet$] Case @{term "v\<^sub>1 \<in> LV (ALT r\<^sub>1 |
|
1262 r\<^sub>2) s"} with @{term "v\<^sub>1 = Right w\<^sub>1"} and @{term |
|
1263 "w\<^sub>1 \<in> LV r\<^sub>2 s"}: Like above we can use the |
|
1264 induction hypothesis in order to infer @{term "s \<in> r\<^sub>2 |
|
1265 \<rightarrow> w\<^sub>1"}. In order to conclude we still need to |
|
1266 obtain @{term "s \<notin> L r\<^sub>1"}. Let us suppose the opposite. |
|
1267 Then there is a @{term v'} such that @{term "v' \<in> LV r\<^sub>1 s"} |
|
1268 by Prop ??? and hence @{term "Left v' \<in> LV (ALT r\<^sub>1 r\<^sub>2) s"}. |
|
1269 But then we can use the second assumption of the theorem to infer that |
|
1270 @{term "\<not>(Left v' :\<sqsubset>val Right w\<^sub>1)"}, which cannot be the case. |
|
1271 Therefore @{term "s \<notin> L r\<^sub>1"} must hold and we can also conclude in this |
|
1272 case. |
|
1273 |
|
1274 |
|
1275 |
|
1276 |
|
1277 \end{itemize} |
|
1278 |
|
1279 \end{proof} |
1254 \end{proof} |
1280 |
1255 |
1281 To sum up, we have shown that minimal elements according to the ordering |
1256 \begin{corollary} |
1282 by Okui and Suzuki are exactly the @{text POSIX}-values we defined inductively |
1257 @{thm [mode=IfThen] Least_existence1} |
1283 in Section ??? |
1258 \end{corollary} |
|
1259 |
|
1260 \noindent To sum up, we have shown that minimal elements according |
|
1261 to the ordering by Okui and Suzuki are exactly the @{text POSIX} |
|
1262 values we defined inductively in Section~\ref{posixsec} |
|
1263 |
|
1264 |
|
1265 IS THE minimal element unique? We have not shown totality. |
1284 *} |
1266 *} |
1285 |
1267 |
1286 |
1268 section {* Optimisations *} |
1287 section {* Extensions and Optimisations*} |
|
1288 |
1269 |
1289 text {* |
1270 text {* |
1290 |
|
1291 If we are interested in tokenising a string, then we need to not just |
|
1292 split up the string into tokens, but also ``classify'' the tokens (for |
|
1293 example whether it is a keyword or an identifier). This can be done with |
|
1294 only minor modifications to the algorithm by introducing \emph{record |
|
1295 regular expressions} and \emph{record values} (for example |
|
1296 \cite{Sulzmann2014b}): |
|
1297 |
|
1298 \begin{center} |
|
1299 @{text "r :="} |
|
1300 @{text "..."} $\mid$ |
|
1301 @{text "(l : r)"} \qquad\qquad |
|
1302 @{text "v :="} |
|
1303 @{text "..."} $\mid$ |
|
1304 @{text "(l : v)"} |
|
1305 \end{center} |
|
1306 |
|
1307 \noindent where @{text l} is a label, say a string, @{text r} a regular |
|
1308 expression and @{text v} a value. All functions can be smoothly extended |
|
1309 to these regular expressions and values. For example \mbox{@{text "(l : |
|
1310 r)"}} is nullable iff @{term r} is, and so on. The purpose of the record |
|
1311 regular expression is to mark certain parts of a regular expression and |
|
1312 then record in the calculated value which parts of the string were matched |
|
1313 by this part. The label can then serve as classification for the tokens. |
|
1314 For this recall the regular expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} for |
|
1315 keywords and identifiers from the Introduction. With the record regular |
|
1316 expression we can form \mbox{@{text "((key : r\<^bsub>key\<^esub>) + (id : r\<^bsub>id\<^esub>))\<^sup>\<star>"}} |
|
1317 and then traverse the calculated value and only collect the underlying |
|
1318 strings in record values. With this we obtain finite sequences of pairs of |
|
1319 labels and strings, for example |
|
1320 |
|
1321 \[@{text "(l\<^sub>1 : s\<^sub>1), ..., (l\<^sub>n : s\<^sub>n)"}\] |
|
1322 |
|
1323 \noindent from which tokens with classifications (keyword-token, |
|
1324 identifier-token and so on) can be extracted. |
|
1325 |
1271 |
1326 Derivatives as calculated by \Brz's method are usually more complex |
1272 Derivatives as calculated by \Brz's method are usually more complex |
1327 regular expressions than the initial one; the result is that the |
1273 regular expressions than the initial one; the result is that the |
1328 derivative-based matching and lexing algorithms are often abysmally slow. |
1274 derivative-based matching and lexing algorithms are often abysmally slow. |
1329 However, various optimisations are possible, such as the simplifications |
1275 However, various optimisations are possible, such as the simplifications |
1458 particular for @{term "r"} being the derivative @{term "der c |
1404 particular for @{term "r"} being the derivative @{term "der c |
1459 r"}). Let @{term "r\<^sub>s"} be the simplified derivative regular expression, that is @{term |
1405 r"}). Let @{term "r\<^sub>s"} be the simplified derivative regular expression, that is @{term |
1460 "fst (simp (der c r))"}, and @{term "f\<^sub>r"} be the rectification |
1406 "fst (simp (der c r))"}, and @{term "f\<^sub>r"} be the rectification |
1461 function, that is @{term "snd (simp (der c r))"}. We distinguish the cases |
1407 function, that is @{term "snd (simp (der c r))"}. We distinguish the cases |
1462 whether (*) @{term "s \<in> L (der c r)"} or not. In the first case we |
1408 whether (*) @{term "s \<in> L (der c r)"} or not. In the first case we |
1463 have by Thm.~\ref{lexercorrect}(2) a value @{term "v"} so that @{term |
1409 have by Theorem~\ref{lexercorrect}(2) a value @{term "v"} so that @{term |
1464 "lexer (der c r) s = Some v"} and @{term "s \<in> der c r \<rightarrow> v"} hold. |
1410 "lexer (der c r) s = Some v"} and @{term "s \<in> der c r \<rightarrow> v"} hold. |
1465 By Lem.~\ref{slexeraux}(1) we can also infer from~(*) that @{term "s |
1411 By Lemma~\ref{slexeraux}(1) we can also infer from~(*) that @{term "s |
1466 \<in> L r\<^sub>s"} holds. Hence we know by Thm.~\ref{lexercorrect}(2) that |
1412 \<in> L r\<^sub>s"} holds. Hence we know by Theorem~\ref{lexercorrect}(2) that |
1467 there exists a @{term "v'"} with @{term "lexer r\<^sub>s s = Some v'"} and |
1413 there exists a @{term "v'"} with @{term "lexer r\<^sub>s s = Some v'"} and |
1468 @{term "s \<in> r\<^sub>s \<rightarrow> v'"}. From the latter we know by |
1414 @{term "s \<in> r\<^sub>s \<rightarrow> v'"}. From the latter we know by |
1469 Lem.~\ref{slexeraux}(2) that @{term "s \<in> der c r \<rightarrow> (f\<^sub>r v')"} holds. |
1415 Lemma~\ref{slexeraux}(2) that @{term "s \<in> der c r \<rightarrow> (f\<^sub>r v')"} holds. |
1470 By the uniqueness of the POSIX relation (Thm.~\ref{posixdeterm}) we |
1416 By the uniqueness of the POSIX relation (Theorem~\ref{posixdeterm}) we |
1471 can infer that @{term v} is equal to @{term "f\<^sub>r v'"}---that is the |
1417 can infer that @{term v} is equal to @{term "f\<^sub>r v'"}---that is the |
1472 rectification function applied to @{term "v'"} |
1418 rectification function applied to @{term "v'"} |
1473 produces the original @{term "v"}. Now the case follows by the |
1419 produces the original @{term "v"}. Now the case follows by the |
1474 definitions of @{const lexer} and @{const slexer}. |
1420 definitions of @{const lexer} and @{const slexer}. |
1475 |
1421 |
1476 In the second case where @{term "s \<notin> L (der c r)"} we have that |
1422 In the second case where @{term "s \<notin> L (der c r)"} we have that |
1477 @{term "lexer (der c r) s = None"} by Thm.~\ref{lexercorrect}(1). We |
1423 @{term "lexer (der c r) s = None"} by Theorem~\ref{lexercorrect}(1). We |
1478 also know by Lem.~\ref{slexeraux}(1) that @{term "s \<notin> L r\<^sub>s"}. Hence |
1424 also know by Lemma~\ref{slexeraux}(1) that @{term "s \<notin> L r\<^sub>s"}. Hence |
1479 @{term "lexer r\<^sub>s s = None"} by Thm.~\ref{lexercorrect}(1) and |
1425 @{term "lexer r\<^sub>s s = None"} by Theorem~\ref{lexercorrect}(1) and |
1480 by IH then also @{term "slexer r\<^sub>s s = None"}. With this we can |
1426 by IH then also @{term "slexer r\<^sub>s s = None"}. With this we can |
1481 conclude in this case too.\qed |
1427 conclude in this case too.\qed |
1482 |
1428 |
1483 \end{proof} |
1429 \end{proof} |
|
1430 |
|
1431 *} |
|
1432 |
|
1433 section {* Extensions*} |
|
1434 |
|
1435 text {* |
|
1436 |
|
1437 A strong point in favour of |
|
1438 Sulzmann and Lu's algorithm is that it can be extended in various |
|
1439 ways. |
|
1440 |
|
1441 If we are interested in tokenising a string, then we need to not just |
|
1442 split up the string into tokens, but also ``classify'' the tokens (for |
|
1443 example whether it is a keyword or an identifier). This can be done with |
|
1444 only minor modifications to the algorithm by introducing \emph{record |
|
1445 regular expressions} and \emph{record values} (for example |
|
1446 \cite{Sulzmann2014b}): |
|
1447 |
|
1448 \begin{center} |
|
1449 @{text "r :="} |
|
1450 @{text "..."} $\mid$ |
|
1451 @{text "(l : r)"} \qquad\qquad |
|
1452 @{text "v :="} |
|
1453 @{text "..."} $\mid$ |
|
1454 @{text "(l : v)"} |
|
1455 \end{center} |
|
1456 |
|
1457 \noindent where @{text l} is a label, say a string, @{text r} a regular |
|
1458 expression and @{text v} a value. All functions can be smoothly extended |
|
1459 to these regular expressions and values. For example \mbox{@{text "(l : |
|
1460 r)"}} is nullable iff @{term r} is, and so on. The purpose of the record |
|
1461 regular expression is to mark certain parts of a regular expression and |
|
1462 then record in the calculated value which parts of the string were matched |
|
1463 by this part. The label can then serve as classification for the tokens. |
|
1464 For this recall the regular expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} for |
|
1465 keywords and identifiers from the Introduction. With the record regular |
|
1466 expression we can form \mbox{@{text "((key : r\<^bsub>key\<^esub>) + (id : r\<^bsub>id\<^esub>))\<^sup>\<star>"}} |
|
1467 and then traverse the calculated value and only collect the underlying |
|
1468 strings in record values. With this we obtain finite sequences of pairs of |
|
1469 labels and strings, for example |
|
1470 |
|
1471 \[@{text "(l\<^sub>1 : s\<^sub>1), ..., (l\<^sub>n : s\<^sub>n)"}\] |
|
1472 |
|
1473 \noindent from which tokens with classifications (keyword-token, |
|
1474 identifier-token and so on) can be extracted. |
|
1475 |
|
1476 |
1484 *} |
1477 *} |
1485 |
1478 |
1486 |
1479 |
1487 |
1480 |
1488 section {* The Correctness Argument by Sulzmann and Lu\label{argu} *} |
1481 section {* The Correctness Argument by Sulzmann and Lu\label{argu} *} |