194 |
193 |
195 \item[$\bullet$] \emph{Empty String Rule:} An empty string shall be considered to |
194 \item[$\bullet$] \emph{Empty String Rule:} An empty string shall be considered to |
196 be longer than no match at all. |
195 be longer than no match at all. |
197 \end{itemize} |
196 \end{itemize} |
198 |
197 |
199 \noindent Consider for example a regular expression @{text |
198 \noindent Consider for example a regular expression \<open>r\<^bsub>key\<^esub>\<close> for recognising keywords such as \<open>if\<close>, |
200 "r\<^bsub>key\<^esub>"} for recognising keywords such as @{text "if"}, |
199 \<open>then\<close> and so on; and \<open>r\<^bsub>id\<^esub>\<close> |
201 @{text "then"} and so on; and @{text "r\<^bsub>id\<^esub>"} |
|
202 recognising identifiers (say, a single character followed by |
200 recognising identifiers (say, a single character followed by |
203 characters or numbers). Then we can form the regular expression |
201 characters or numbers). Then we can form the regular expression |
204 @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} |
202 \<open>(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>\<close> |
205 and use POSIX matching to tokenise strings, say @{text "iffoo"} and |
203 and use POSIX matching to tokenise strings, say \<open>iffoo\<close> and |
206 @{text "if"}. For @{text "iffoo"} we obtain by the Longest Match Rule |
204 \<open>if\<close>. For \<open>iffoo\<close> we obtain by the Longest Match Rule |
207 a single identifier token, not a keyword followed by an |
205 a single identifier token, not a keyword followed by an |
208 identifier. For @{text "if"} we obtain by the Priority Rule a keyword |
206 identifier. For \<open>if\<close> we obtain by the Priority Rule a keyword |
209 token, not an identifier token---even if @{text "r\<^bsub>id\<^esub>"} |
207 token, not an identifier token---even if \<open>r\<^bsub>id\<^esub>\<close> |
210 matches also. By the Star Rule we know @{text "(r\<^bsub>key\<^esub> + |
208 matches also. By the Star Rule we know \<open>(r\<^bsub>key\<^esub> + |
211 r\<^bsub>id\<^esub>)\<^sup>\<star>"} matches @{text "iffoo"}, |
209 r\<^bsub>id\<^esub>)\<^sup>\<star>\<close> matches \<open>iffoo\<close>, |
212 respectively @{text "if"}, in exactly one `iteration' of the star. The |
210 respectively \<open>if\<close>, in exactly one `iteration' of the star. The |
213 Empty String Rule is for cases where, for example, the regular expression |
211 Empty String Rule is for cases where, for example, the regular expression |
214 @{text "(a\<^sup>\<star>)\<^sup>\<star>"} matches against the |
212 \<open>(a\<^sup>\<star>)\<^sup>\<star>\<close> matches against the |
215 string @{text "bc"}. Then the longest initial matched substring is the |
213 string \<open>bc\<close>. Then the longest initial matched substring is the |
216 empty string, which is matched by both the whole regular expression |
214 empty string, which is matched by both the whole regular expression |
217 and the parenthesised subexpression. |
215 and the parenthesised subexpression. |
218 |
216 |
219 |
217 |
220 One limitation of Brzozowski's matcher is that it only generates a |
218 One limitation of Brzozowski's matcher is that it only generates a |
223 to allow generation not just of a YES/NO answer but of an actual |
221 to allow generation not just of a YES/NO answer but of an actual |
224 matching, called a [lexical] {\em value}. Assuming a regular |
222 matching, called a [lexical] {\em value}. Assuming a regular |
225 expression matches a string, values encode the information of |
223 expression matches a string, values encode the information of |
226 \emph{how} the string is matched by the regular expression---that is, |
224 \emph{how} the string is matched by the regular expression---that is, |
227 which part of the string is matched by which part of the regular |
225 which part of the string is matched by which part of the regular |
228 expression. For this consider again the string @{text "xy"} and |
226 expression. For this consider again the string \<open>xy\<close> and |
229 the regular expression \mbox{@{text "(x + (y + xy))\<^sup>\<star>"}} |
227 the regular expression \mbox{\<open>(x + (y + xy))\<^sup>\<star>\<close>} |
230 (this time fully parenthesised). We can view this regular expression |
228 (this time fully parenthesised). We can view this regular expression |
231 as tree and if the string @{text xy} is matched by two Star |
229 as tree and if the string \<open>xy\<close> is matched by two Star |
232 `iterations', then the @{text x} is matched by the left-most |
230 `iterations', then the \<open>x\<close> is matched by the left-most |
233 alternative in this tree and the @{text y} by the right-left alternative. This |
231 alternative in this tree and the \<open>y\<close> by the right-left alternative. This |
234 suggests to record this matching as |
232 suggests to record this matching as |
235 |
233 |
236 \begin{center} |
234 \begin{center} |
237 @{term "Stars [Left(Char x), Right(Left(Char y))]"} |
235 @{term "Stars [Left(Char x), Right(Left(Char y))]"} |
238 \end{center} |
236 \end{center} |
239 |
237 |
240 \noindent where @{const Stars}, @{text Left}, @{text Right} and @{text |
238 \noindent where @{const Stars}, \<open>Left\<close>, \<open>Right\<close> and \<open>Char\<close> are constructors for values. \<open>Stars\<close> records how many |
241 Char} are constructors for values. @{text Stars} records how many |
239 iterations were used; \<open>Left\<close>, respectively \<open>Right\<close>, which |
242 iterations were used; @{text Left}, respectively @{text Right}, which |
|
243 alternative is used. This `tree view' leads naturally to the idea that |
240 alternative is used. This `tree view' leads naturally to the idea that |
244 regular expressions act as types and values as inhabiting those types |
241 regular expressions act as types and values as inhabiting those types |
245 (see, for example, \cite{HosoyaVouillonPierce2005}). The value for |
242 (see, for example, \cite{HosoyaVouillonPierce2005}). The value for |
246 matching @{text "xy"} in a single `iteration', i.e.~the POSIX value, |
243 matching \<open>xy\<close> in a single `iteration', i.e.~the POSIX value, |
247 would look as follows |
244 would look as follows |
248 |
245 |
249 \begin{center} |
246 \begin{center} |
250 @{term "Stars [Seq (Char x) (Char y)]"} |
247 @{term "Stars [Seq (Char x) (Char y)]"} |
251 \end{center} |
248 \end{center} |
571 @{term Stars}-rule above. For example using Sulzmann and Lu's |
568 @{term Stars}-rule above. For example using Sulzmann and Lu's |
572 less restrictive definition, @{term "LV (STAR ONE) []"} would contain |
569 less restrictive definition, @{term "LV (STAR ONE) []"} would contain |
573 infinitely many values, but according to our more restricted |
570 infinitely many values, but according to our more restricted |
574 definition only a single value, namely @{thm LV_STAR_ONE_empty}. |
571 definition only a single value, namely @{thm LV_STAR_ONE_empty}. |
575 |
572 |
576 If a regular expression @{text r} matches a string @{text s}, then |
573 If a regular expression \<open>r\<close> matches a string \<open>s\<close>, then |
577 generally the set @{term "LV r s"} is not just a singleton set. In |
574 generally the set @{term "LV r s"} is not just a singleton set. In |
578 case of POSIX matching the problem is to calculate the unique lexical value |
575 case of POSIX matching the problem is to calculate the unique lexical value |
579 that satisfies the (informal) POSIX rules from the Introduction. |
576 that satisfies the (informal) POSIX rules from the Introduction. |
580 Graphically the POSIX value calculation algorithm by Sulzmann and Lu |
577 Graphically the POSIX value calculation algorithm by Sulzmann and Lu |
581 can be illustrated by the picture in Figure~\ref{Sulz} where the |
578 can be illustrated by the picture in Figure~\ref{Sulz} where the |
582 path from the left to the right involving @{term |
579 path from the left to the right involving @{term |
583 derivatives}/@{const nullable} is the first phase of the algorithm |
580 derivatives}/@{const nullable} is the first phase of the algorithm |
584 (calculating successive \Brz's derivatives) and @{const |
581 (calculating successive \Brz's derivatives) and @{const |
585 mkeps}/@{text inj}, the path from right to left, the second |
582 mkeps}/\<open>inj\<close>, the path from right to left, the second |
586 phase. This picture shows the steps required when a regular |
583 phase. This picture shows the steps required when a regular |
587 expression, say @{text "r\<^sub>1"}, matches the string @{term |
584 expression, say \<open>r\<^sub>1\<close>, matches the string @{term |
588 "[a,b,c]"}. We first build the three derivatives (according to |
585 "[a,b,c]"}. We first build the three derivatives (according to |
589 @{term a}, @{term b} and @{term c}). We then use @{const nullable} |
586 @{term a}, @{term b} and @{term c}). We then use @{const nullable} |
590 to find out whether the resulting derivative regular expression |
587 to find out whether the resulting derivative regular expression |
591 @{term "r\<^sub>4"} can match the empty string. If yes, we call the |
588 @{term "r\<^sub>4"} can match the empty string. If yes, we call the |
592 function @{const mkeps} that produces a value @{term "v\<^sub>4"} |
589 function @{const mkeps} that produces a value @{term "v\<^sub>4"} |
688 |
684 |
689 \noindent To better understand what is going on in this definition it |
685 \noindent To better understand what is going on in this definition it |
690 might be instructive to look first at the three sequence cases (clauses |
686 might be instructive to look first at the three sequence cases (clauses |
691 \textit{(4)} -- \textit{(6)}). In each case we need to construct an ``injected value'' for |
687 \textit{(4)} -- \textit{(6)}). In each case we need to construct an ``injected value'' for |
692 @{term "SEQ r\<^sub>1 r\<^sub>2"}. This must be a value of the form @{term |
688 @{term "SEQ r\<^sub>1 r\<^sub>2"}. This must be a value of the form @{term |
693 "Seq DUMMY DUMMY"}\,. Recall the clause of the @{text derivative}-function |
689 "Seq DUMMY DUMMY"}\,. Recall the clause of the \<open>derivative\<close>-function |
694 for sequence regular expressions: |
690 for sequence regular expressions: |
695 |
691 |
696 \begin{center} |
692 \begin{center} |
697 @{thm (lhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]} $\dn$ @{thm (rhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]} |
693 @{thm (lhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]} $\dn$ @{thm (rhs) der.simps(5)[of c "r\<^sub>1" "r\<^sub>2"]} |
698 \end{center} |
694 \end{center} |
699 |
695 |
700 \noindent Consider first the @{text "else"}-branch where the derivative is @{term |
696 \noindent Consider first the \<open>else\<close>-branch where the derivative is @{term |
701 "SEQ (der c r\<^sub>1) r\<^sub>2"}. The corresponding value must therefore |
697 "SEQ (der c r\<^sub>1) r\<^sub>2"}. The corresponding value must therefore |
702 be of the form @{term "Seq v\<^sub>1 v\<^sub>2"}, which matches the left-hand |
698 be of the form @{term "Seq v\<^sub>1 v\<^sub>2"}, which matches the left-hand |
703 side in clause~\textit{(4)} of @{term inj}. In the @{text "if"}-branch the derivative is an |
699 side in clause~\textit{(4)} of @{term inj}. In the \<open>if\<close>-branch the derivative is an |
704 alternative, namely @{term "ALT (SEQ (der c r\<^sub>1) r\<^sub>2) (der c |
700 alternative, namely @{term "ALT (SEQ (der c r\<^sub>1) r\<^sub>2) (der c |
705 r\<^sub>2)"}. This means we either have to consider a @{text Left}- or |
701 r\<^sub>2)"}. This means we either have to consider a \<open>Left\<close>- or |
706 @{text Right}-value. In case of the @{text Left}-value we know further it |
702 \<open>Right\<close>-value. In case of the \<open>Left\<close>-value we know further it |
707 must be a value for a sequence regular expression. Therefore the pattern |
703 must be a value for a sequence regular expression. Therefore the pattern |
708 we match in the clause \textit{(5)} is @{term "Left (Seq v\<^sub>1 v\<^sub>2)"}, |
704 we match in the clause \textit{(5)} is @{term "Left (Seq v\<^sub>1 v\<^sub>2)"}, |
709 while in \textit{(6)} it is just @{term "Right v\<^sub>2"}. One more interesting |
705 while in \textit{(6)} it is just @{term "Right v\<^sub>2"}. One more interesting |
710 point is in the right-hand side of clause \textit{(6)}: since in this case the |
706 point is in the right-hand side of clause \textit{(6)}: since in this case the |
711 regular expression @{text "r\<^sub>1"} does not ``contribute'' to |
707 regular expression \<open>r\<^sub>1\<close> does not ``contribute'' to |
712 matching the string, that means it only matches the empty string, we need to |
708 matching the string, that means it only matches the empty string, we need to |
713 call @{const mkeps} in order to construct a value for how @{term "r\<^sub>1"} |
709 call @{const mkeps} in order to construct a value for how @{term "r\<^sub>1"} |
714 can match this empty string. A similar argument applies for why we can |
710 can match this empty string. A similar argument applies for why we can |
715 expect in the left-hand side of clause \textit{(7)} that the value is of the form |
711 expect in the left-hand side of clause \textit{(7)} that the value is of the form |
716 @{term "Seq v (Stars vs)"}---the derivative of a star is @{term "SEQ (der c r) |
712 @{term "Seq v (Stars vs)"}---the derivative of a star is @{term "SEQ (der c r) |
782 |
778 |
783 % |
779 % |
784 \begin{figure}[t] |
780 \begin{figure}[t] |
785 \begin{center} |
781 \begin{center} |
786 \begin{tabular}{c} |
782 \begin{tabular}{c} |
787 @{thm[mode=Axiom] Posix.intros(1)}@{text "P"}@{term "ONE"} \qquad |
783 @{thm[mode=Axiom] Posix.intros(1)}\<open>P\<close>@{term "ONE"} \qquad |
788 @{thm[mode=Axiom] Posix.intros(2)}@{text "P"}@{term "c"}\medskip\\ |
784 @{thm[mode=Axiom] Posix.intros(2)}\<open>P\<close>@{term "c"}\medskip\\ |
789 @{thm[mode=Rule] Posix.intros(3)[of "s" "r\<^sub>1" "v" "r\<^sub>2"]}@{text "P+L"}\qquad |
785 @{thm[mode=Rule] Posix.intros(3)[of "s" "r\<^sub>1" "v" "r\<^sub>2"]}\<open>P+L\<close>\qquad |
790 @{thm[mode=Rule] Posix.intros(4)[of "s" "r\<^sub>2" "v" "r\<^sub>1"]}@{text "P+R"}\medskip\\ |
786 @{thm[mode=Rule] Posix.intros(4)[of "s" "r\<^sub>2" "v" "r\<^sub>1"]}\<open>P+R\<close>\medskip\\ |
791 $\mprset{flushleft} |
787 $\mprset{flushleft} |
792 \inferrule |
788 \inferrule |
793 {@{thm (prem 1) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]} \qquad |
789 {@{thm (prem 1) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]} \qquad |
794 @{thm (prem 2) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]} \\\\ |
790 @{thm (prem 2) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]} \\\\ |
795 @{thm (prem 3) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}} |
791 @{thm (prem 3) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}} |
796 {@{thm (concl) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}}$@{text "PS"}\\ |
792 {@{thm (concl) Posix.intros(5)[of "s\<^sub>1" "r\<^sub>1" "v\<^sub>1" "s\<^sub>2" "r\<^sub>2" "v\<^sub>2"]}}$\<open>PS\<close>\\ |
797 @{thm[mode=Axiom] Posix.intros(7)}@{text "P[]"}\medskip\\ |
793 @{thm[mode=Axiom] Posix.intros(7)}\<open>P[]\<close>\medskip\\ |
798 $\mprset{flushleft} |
794 $\mprset{flushleft} |
799 \inferrule |
795 \inferrule |
800 {@{thm (prem 1) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \qquad |
796 {@{thm (prem 1) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \qquad |
801 @{thm (prem 2) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \qquad |
797 @{thm (prem 2) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \qquad |
802 @{thm (prem 3) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \\\\ |
798 @{thm (prem 3) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]} \\\\ |
803 @{thm (prem 4) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}} |
799 @{thm (prem 4) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}} |
804 {@{thm (concl) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}}$@{text "P\<star>"} |
800 {@{thm (concl) Posix.intros(6)[of "s\<^sub>1" "r" "v" "s\<^sub>2" "vs"]}}$\<open>P\<star>\<close> |
805 \end{tabular} |
801 \end{tabular} |
806 \end{center} |
802 \end{center} |
807 \caption{Our inductive definition of POSIX values.}\label{POSIXrules} |
803 \caption{Our inductive definition of POSIX values.}\label{POSIXrules} |
808 \end{figure} |
804 \end{figure} |
809 |
805 |
823 \end{proof} |
819 \end{proof} |
824 |
820 |
825 \noindent |
821 \noindent |
826 We claim that our @{term "s \<in> r \<rightarrow> v"} relation captures the idea behind the four |
822 We claim that our @{term "s \<in> r \<rightarrow> v"} relation captures the idea behind the four |
827 informal POSIX rules shown in the Introduction: Consider for example the |
823 informal POSIX rules shown in the Introduction: Consider for example the |
828 rules @{text "P+L"} and @{text "P+R"} where the POSIX value for a string |
824 rules \<open>P+L\<close> and \<open>P+R\<close> where the POSIX value for a string |
829 and an alternative regular expression, that is @{term "(s, ALT r\<^sub>1 r\<^sub>2)"}, |
825 and an alternative regular expression, that is @{term "(s, ALT r\<^sub>1 r\<^sub>2)"}, |
830 is specified---it is always a @{text "Left"}-value, \emph{except} when the |
826 is specified---it is always a \<open>Left\<close>-value, \emph{except} when the |
831 string to be matched is not in the language of @{term "r\<^sub>1"}; only then it |
827 string to be matched is not in the language of @{term "r\<^sub>1"}; only then it |
832 is a @{text Right}-value (see the side-condition in @{text "P+R"}). |
828 is a \<open>Right\<close>-value (see the side-condition in \<open>P+R\<close>). |
833 Interesting is also the rule for sequence regular expressions (@{text |
829 Interesting is also the rule for sequence regular expressions (\<open>PS\<close>). The first two premises state that @{term "v\<^sub>1"} and @{term "v\<^sub>2"} |
834 "PS"}). The first two premises state that @{term "v\<^sub>1"} and @{term "v\<^sub>2"} |
|
835 are the POSIX values for @{term "(s\<^sub>1, r\<^sub>1)"} and @{term "(s\<^sub>2, r\<^sub>2)"} |
830 are the POSIX values for @{term "(s\<^sub>1, r\<^sub>1)"} and @{term "(s\<^sub>2, r\<^sub>2)"} |
836 respectively. Consider now the third premise and note that the POSIX value |
831 respectively. Consider now the third premise and note that the POSIX value |
837 of this rule should match the string \mbox{@{term "s\<^sub>1 @ s\<^sub>2"}}. According to the |
832 of this rule should match the string \mbox{@{term "s\<^sub>1 @ s\<^sub>2"}}. According to the |
838 Longest Match Rule, we want that the @{term "s\<^sub>1"} is the longest initial |
833 Longest Match Rule, we want that the @{term "s\<^sub>1"} is the longest initial |
839 split of \mbox{@{term "s\<^sub>1 @ s\<^sub>2"}} such that @{term "s\<^sub>2"} is still recognised |
834 split of \mbox{@{term "s\<^sub>1 @ s\<^sub>2"}} such that @{term "s\<^sub>2"} is still recognised |
840 by @{term "r\<^sub>2"}. Let us assume, contrary to the third premise, that there |
835 by @{term "r\<^sub>2"}. Let us assume, contrary to the third premise, that there |
841 \emph{exist} an @{term "s\<^sub>3"} and @{term "s\<^sub>4"} such that @{term "s\<^sub>2"} |
836 \emph{exist} an @{term "s\<^sub>3"} and @{term "s\<^sub>4"} such that @{term "s\<^sub>2"} |
842 can be split up into a non-empty string @{term "s\<^sub>3"} and a possibly empty |
837 can be split up into a non-empty string @{term "s\<^sub>3"} and a possibly empty |
843 string @{term "s\<^sub>4"}. Moreover the longer string @{term "s\<^sub>1 @ s\<^sub>3"} can be |
838 string @{term "s\<^sub>4"}. Moreover the longer string @{term "s\<^sub>1 @ s\<^sub>3"} can be |
844 matched by @{text "r\<^sub>1"} and the shorter @{term "s\<^sub>4"} can still be |
839 matched by \<open>r\<^sub>1\<close> and the shorter @{term "s\<^sub>4"} can still be |
845 matched by @{term "r\<^sub>2"}. In this case @{term "s\<^sub>1"} would \emph{not} be the |
840 matched by @{term "r\<^sub>2"}. In this case @{term "s\<^sub>1"} would \emph{not} be the |
846 longest initial split of \mbox{@{term "s\<^sub>1 @ s\<^sub>2"}} and therefore @{term "Seq v\<^sub>1 |
841 longest initial split of \mbox{@{term "s\<^sub>1 @ s\<^sub>2"}} and therefore @{term "Seq v\<^sub>1 |
847 v\<^sub>2"} cannot be a POSIX value for @{term "(s\<^sub>1 @ s\<^sub>2, SEQ r\<^sub>1 r\<^sub>2)"}. |
842 v\<^sub>2"} cannot be a POSIX value for @{term "(s\<^sub>1 @ s\<^sub>2, SEQ r\<^sub>1 r\<^sub>2)"}. |
848 The main point is that our side-condition ensures the Longest |
843 The main point is that our side-condition ensures the Longest |
849 Match Rule is satisfied. |
844 Match Rule is satisfied. |
850 |
845 |
851 A similar condition is imposed on the POSIX value in the @{text |
846 A similar condition is imposed on the POSIX value in the \<open>P\<star>\<close>-rule. Also there we want that @{term "s\<^sub>1"} is the longest initial |
852 "P\<star>"}-rule. Also there we want that @{term "s\<^sub>1"} is the longest initial |
|
853 split of @{term "s\<^sub>1 @ s\<^sub>2"} and furthermore the corresponding value |
847 split of @{term "s\<^sub>1 @ s\<^sub>2"} and furthermore the corresponding value |
854 @{term v} cannot be flattened to the empty string. In effect, we require |
848 @{term v} cannot be flattened to the empty string. In effect, we require |
855 that in each ``iteration'' of the star, some non-empty substring needs to |
849 that in each ``iteration'' of the star, some non-empty substring needs to |
856 be ``chipped'' away; only in case of the empty string we accept @{term |
850 be ``chipped'' away; only in case of the empty string we accept @{term |
857 "Stars []"} as the POSIX value. Indeed we can show that our POSIX values |
851 "Stars []"} as the POSIX value. Indeed we can show that our POSIX values |
858 are lexical values which exclude those @{text Stars} that contain subvalues |
852 are lexical values which exclude those \<open>Stars\<close> that contain subvalues |
859 that flatten to the empty string. |
853 that flatten to the empty string. |
860 |
854 |
861 \begin{lemma}\label{LVposix} |
855 \begin{lemma}\label{LVposix} |
862 @{thm [mode=IfThen] Posix_LV} |
856 @{thm [mode=IfThen] Posix_LV} |
863 \end{lemma} |
857 \end{lemma} |
877 \begin{proof} |
871 \begin{proof} |
878 By routine induction on @{term r}.\qed |
872 By routine induction on @{term r}.\qed |
879 \end{proof} |
873 \end{proof} |
880 |
874 |
881 \noindent |
875 \noindent |
882 The central lemma for our POSIX relation is that the @{text inj}-function |
876 The central lemma for our POSIX relation is that the \<open>inj\<close>-function |
883 preserves POSIX values. |
877 preserves POSIX values. |
884 |
878 |
885 \begin{lemma}\label{Posix2} |
879 \begin{lemma}\label{Posix2} |
886 @{thm[mode=IfThen] Posix_injval} |
880 @{thm[mode=IfThen] Posix_injval} |
887 \end{lemma} |
881 \end{lemma} |
888 |
882 |
889 \begin{proof} |
883 \begin{proof} |
890 By induction on @{text r}. We explain two cases. |
884 By induction on \<open>r\<close>. We explain two cases. |
891 |
885 |
892 \begin{itemize} |
886 \begin{itemize} |
893 \item[$\bullet$] Case @{term "r = ALT r\<^sub>1 r\<^sub>2"}. There are |
887 \item[$\bullet$] Case @{term "r = ALT r\<^sub>1 r\<^sub>2"}. There are |
894 two subcases, namely @{text "(a)"} \mbox{@{term "v = Left v'"}} and @{term |
888 two subcases, namely \<open>(a)\<close> \mbox{@{term "v = Left v'"}} and @{term |
895 "s \<in> der c r\<^sub>1 \<rightarrow> v'"}; and @{text "(b)"} @{term "v = Right v'"}, @{term |
889 "s \<in> der c r\<^sub>1 \<rightarrow> v'"}; and \<open>(b)\<close> @{term "v = Right v'"}, @{term |
896 "s \<notin> L (der c r\<^sub>1)"} and @{term "s \<in> der c r\<^sub>2 \<rightarrow> v'"}. In @{text "(a)"} we |
890 "s \<notin> L (der c r\<^sub>1)"} and @{term "s \<in> der c r\<^sub>2 \<rightarrow> v'"}. In \<open>(a)\<close> we |
897 know @{term "s \<in> der c r\<^sub>1 \<rightarrow> v'"}, from which we can infer @{term "(c # s) |
891 know @{term "s \<in> der c r\<^sub>1 \<rightarrow> v'"}, from which we can infer @{term "(c # s) |
898 \<in> r\<^sub>1 \<rightarrow> injval r\<^sub>1 c v'"} by induction hypothesis and hence @{term "(c # |
892 \<in> r\<^sub>1 \<rightarrow> injval r\<^sub>1 c v'"} by induction hypothesis and hence @{term "(c # |
899 s) \<in> ALT r\<^sub>1 r\<^sub>2 \<rightarrow> injval (ALT r\<^sub>1 r\<^sub>2) c (Left v')"} as needed. Similarly |
893 s) \<in> ALT r\<^sub>1 r\<^sub>2 \<rightarrow> injval (ALT r\<^sub>1 r\<^sub>2) c (Left v')"} as needed. Similarly |
900 in subcase @{text "(b)"} where, however, in addition we have to use |
894 in subcase \<open>(b)\<close> where, however, in addition we have to use |
901 Proposition~\ref{derprop}(2) in order to infer @{term "c # s \<notin> L r\<^sub>1"} from @{term |
895 Proposition~\ref{derprop}(2) in order to infer @{term "c # s \<notin> L r\<^sub>1"} from @{term |
902 "s \<notin> L (der c r\<^sub>1)"}.\smallskip |
896 "s \<notin> L (der c r\<^sub>1)"}.\smallskip |
903 |
897 |
904 \item[$\bullet$] Case @{term "r = SEQ r\<^sub>1 r\<^sub>2"}. There are three subcases: |
898 \item[$\bullet$] Case @{term "r = SEQ r\<^sub>1 r\<^sub>2"}. There are three subcases: |
905 |
899 |
906 \begin{quote} |
900 \begin{quote} |
907 \begin{description} |
901 \begin{description} |
908 \item[@{text "(a)"}] @{term "v = Left (Seq v\<^sub>1 v\<^sub>2)"} and @{term "nullable r\<^sub>1"} |
902 \item[\<open>(a)\<close>] @{term "v = Left (Seq v\<^sub>1 v\<^sub>2)"} and @{term "nullable r\<^sub>1"} |
909 \item[@{text "(b)"}] @{term "v = Right v\<^sub>1"} and @{term "nullable r\<^sub>1"} |
903 \item[\<open>(b)\<close>] @{term "v = Right v\<^sub>1"} and @{term "nullable r\<^sub>1"} |
910 \item[@{text "(c)"}] @{term "v = Seq v\<^sub>1 v\<^sub>2"} and @{term "\<not> nullable r\<^sub>1"} |
904 \item[\<open>(c)\<close>] @{term "v = Seq v\<^sub>1 v\<^sub>2"} and @{term "\<not> nullable r\<^sub>1"} |
911 \end{description} |
905 \end{description} |
912 \end{quote} |
906 \end{quote} |
913 |
907 |
914 \noindent For @{text "(a)"} we know @{term "s\<^sub>1 \<in> der c r\<^sub>1 \<rightarrow> v\<^sub>1"} and |
908 \noindent For \<open>(a)\<close> we know @{term "s\<^sub>1 \<in> der c r\<^sub>1 \<rightarrow> v\<^sub>1"} and |
915 @{term "s\<^sub>2 \<in> r\<^sub>2 \<rightarrow> v\<^sub>2"} as well as |
909 @{term "s\<^sub>2 \<in> r\<^sub>2 \<rightarrow> v\<^sub>2"} as well as |
916 % |
910 % |
917 \[@{term "\<not> (\<exists>s\<^sub>3 s\<^sub>4. s\<^sub>3 \<noteq> [] \<and> s\<^sub>3 @ s\<^sub>4 = s\<^sub>2 \<and> s\<^sub>1 @ s\<^sub>3 \<in> L (der c r\<^sub>1) \<and> s\<^sub>4 \<in> L r\<^sub>2)"}\] |
911 \[@{term "\<not> (\<exists>s\<^sub>3 s\<^sub>4. s\<^sub>3 \<noteq> [] \<and> s\<^sub>3 @ s\<^sub>4 = s\<^sub>2 \<and> s\<^sub>1 @ s\<^sub>3 \<in> L (der c r\<^sub>1) \<and> s\<^sub>4 \<in> L r\<^sub>2)"}\] |
918 |
912 |
919 \noindent From the latter we can infer by Proposition~\ref{derprop}(2): |
913 \noindent From the latter we can infer by Proposition~\ref{derprop}(2): |
920 % |
914 % |
921 \[@{term "\<not> (\<exists>s\<^sub>3 s\<^sub>4. s\<^sub>3 \<noteq> [] \<and> s\<^sub>3 @ s\<^sub>4 = s\<^sub>2 \<and> (c # s\<^sub>1) @ s\<^sub>3 \<in> L r\<^sub>1 \<and> s\<^sub>4 \<in> L r\<^sub>2)"}\] |
915 \[@{term "\<not> (\<exists>s\<^sub>3 s\<^sub>4. s\<^sub>3 \<noteq> [] \<and> s\<^sub>3 @ s\<^sub>4 = s\<^sub>2 \<and> (c # s\<^sub>1) @ s\<^sub>3 \<in> L r\<^sub>1 \<and> s\<^sub>4 \<in> L r\<^sub>2)"}\] |
922 |
916 |
923 \noindent We can use the induction hypothesis for @{text "r\<^sub>1"} to obtain |
917 \noindent We can use the induction hypothesis for \<open>r\<^sub>1\<close> to obtain |
924 @{term "(c # s\<^sub>1) \<in> r\<^sub>1 \<rightarrow> injval r\<^sub>1 c v\<^sub>1"}. Putting this all together allows us to infer |
918 @{term "(c # s\<^sub>1) \<in> r\<^sub>1 \<rightarrow> injval r\<^sub>1 c v\<^sub>1"}. Putting this all together allows us to infer |
925 @{term "((c # s\<^sub>1) @ s\<^sub>2) \<in> SEQ r\<^sub>1 r\<^sub>2 \<rightarrow> Seq (injval r\<^sub>1 c v\<^sub>1) v\<^sub>2"}. The case @{text "(c)"} |
919 @{term "((c # s\<^sub>1) @ s\<^sub>2) \<in> SEQ r\<^sub>1 r\<^sub>2 \<rightarrow> Seq (injval r\<^sub>1 c v\<^sub>1) v\<^sub>2"}. The case \<open>(c)\<close> |
926 is similar. |
920 is similar. |
927 |
921 |
928 For @{text "(b)"} we know @{term "s \<in> der c r\<^sub>2 \<rightarrow> v\<^sub>1"} and |
922 For \<open>(b)\<close> we know @{term "s \<in> der c r\<^sub>2 \<rightarrow> v\<^sub>1"} and |
929 @{term "s\<^sub>1 @ s\<^sub>2 \<notin> L (SEQ (der c r\<^sub>1) r\<^sub>2)"}. From the former |
923 @{term "s\<^sub>1 @ s\<^sub>2 \<notin> L (SEQ (der c r\<^sub>1) r\<^sub>2)"}. From the former |
930 we have @{term "(c # s) \<in> r\<^sub>2 \<rightarrow> (injval r\<^sub>2 c v\<^sub>1)"} by induction hypothesis |
924 we have @{term "(c # s) \<in> r\<^sub>2 \<rightarrow> (injval r\<^sub>2 c v\<^sub>1)"} by induction hypothesis |
931 for @{term "r\<^sub>2"}. From the latter we can infer |
925 for @{term "r\<^sub>2"}. From the latter we can infer |
932 % |
926 % |
933 \[@{term "\<not> (\<exists>s\<^sub>3 s\<^sub>4. s\<^sub>3 \<noteq> [] \<and> s\<^sub>3 @ s\<^sub>4 = c # s \<and> s\<^sub>3 \<in> L r\<^sub>1 \<and> s\<^sub>4 \<in> L r\<^sub>2)"}\] |
927 \[@{term "\<not> (\<exists>s\<^sub>3 s\<^sub>4. s\<^sub>3 \<noteq> [] \<and> s\<^sub>3 @ s\<^sub>4 = c # s \<and> s\<^sub>3 \<in> L r\<^sub>1 \<and> s\<^sub>4 \<in> L r\<^sub>2)"}\] |
1012 \begin{center} |
1006 \begin{center} |
1013 @{term "v == Stars [Seq (Char x) (Char y), Char z]"} |
1007 @{term "v == Stars [Seq (Char x) (Char y), Char z]"} |
1014 \end{center} |
1008 \end{center} |
1015 |
1009 |
1016 \noindent |
1010 \noindent |
1017 At position @{text "[0,1]"} of this value is the |
1011 At position \<open>[0,1]\<close> of this value is the |
1018 subvalue @{text "Char y"} and at position @{text "[1]"} the |
1012 subvalue \<open>Char y\<close> and at position \<open>[1]\<close> the |
1019 subvalue @{term "Char z"}. At the `root' position, or empty list |
1013 subvalue @{term "Char z"}. At the `root' position, or empty list |
1020 @{term "[]"}, is the whole value @{term v}. Positions such as @{text |
1014 @{term "[]"}, is the whole value @{term v}. Positions such as \<open>[0,1,0]\<close> or \<open>[2]\<close> are outside of \<open>v\<close>. If it exists, the subvalue of @{term v} at a position \<open>p\<close>, written @{term "at v p"}, can be recursively defined by |
1021 "[0,1,0]"} or @{text "[2]"} are outside of @{text |
|
1022 v}. If it exists, the subvalue of @{term v} at a position @{text |
|
1023 p}, written @{term "at v p"}, can be recursively defined by |
|
1024 |
1015 |
1025 \begin{center} |
1016 \begin{center} |
1026 \begin{tabular}{r@ {\hspace{0mm}}lcl} |
1017 \begin{tabular}{r@ {\hspace{0mm}}lcl} |
1027 @{term v} & @{text "\<downharpoonleft>\<^bsub>[]\<^esub>"} & @{text "\<equiv>"}& @{thm (rhs) at.simps(1)}\\ |
1018 @{term v} & \<open>\<downharpoonleft>\<^bsub>[]\<^esub>\<close> & \<open>\<equiv>\<close>& @{thm (rhs) at.simps(1)}\\ |
1028 @{term "Left v"} & @{text "\<downharpoonleft>\<^bsub>0::ps\<^esub>"} & @{text "\<equiv>"}& @{thm (rhs) at.simps(2)}\\ |
1019 @{term "Left v"} & \<open>\<downharpoonleft>\<^bsub>0::ps\<^esub>\<close> & \<open>\<equiv>\<close>& @{thm (rhs) at.simps(2)}\\ |
1029 @{term "Right v"} & @{text "\<downharpoonleft>\<^bsub>1::ps\<^esub>"} & @{text "\<equiv>"} & |
1020 @{term "Right v"} & \<open>\<downharpoonleft>\<^bsub>1::ps\<^esub>\<close> & \<open>\<equiv>\<close> & |
1030 @{thm (rhs) at.simps(3)[simplified Suc_0_fold]}\\ |
1021 @{thm (rhs) at.simps(3)[simplified Suc_0_fold]}\\ |
1031 @{term "Seq v\<^sub>1 v\<^sub>2"} & @{text "\<downharpoonleft>\<^bsub>0::ps\<^esub>"} & @{text "\<equiv>"} & |
1022 @{term "Seq v\<^sub>1 v\<^sub>2"} & \<open>\<downharpoonleft>\<^bsub>0::ps\<^esub>\<close> & \<open>\<equiv>\<close> & |
1032 @{thm (rhs) at.simps(4)[where ?v1.0="v\<^sub>1" and ?v2.0="v\<^sub>2"]} \\ |
1023 @{thm (rhs) at.simps(4)[where ?v1.0="v\<^sub>1" and ?v2.0="v\<^sub>2"]} \\ |
1033 @{term "Seq v\<^sub>1 v\<^sub>2"} & @{text "\<downharpoonleft>\<^bsub>1::ps\<^esub>"} |
1024 @{term "Seq v\<^sub>1 v\<^sub>2"} & \<open>\<downharpoonleft>\<^bsub>1::ps\<^esub>\<close> |
1034 & @{text "\<equiv>"} & |
1025 & \<open>\<equiv>\<close> & |
1035 @{thm (rhs) at.simps(5)[where ?v1.0="v\<^sub>1" and ?v2.0="v\<^sub>2", simplified Suc_0_fold]} \\ |
1026 @{thm (rhs) at.simps(5)[where ?v1.0="v\<^sub>1" and ?v2.0="v\<^sub>2", simplified Suc_0_fold]} \\ |
1036 @{term "Stars vs"} & @{text "\<downharpoonleft>\<^bsub>n::ps\<^esub>"} & @{text "\<equiv>"}& @{thm (rhs) at.simps(6)}\\ |
1027 @{term "Stars vs"} & \<open>\<downharpoonleft>\<^bsub>n::ps\<^esub>\<close> & \<open>\<equiv>\<close>& @{thm (rhs) at.simps(6)}\\ |
1037 \end{tabular} |
1028 \end{tabular} |
1038 \end{center} |
1029 \end{center} |
1039 |
1030 |
1040 \noindent In the last clause we use Isabelle's notation @{term "vs ! n"} for the |
1031 \noindent In the last clause we use Isabelle's notation @{term "vs ! n"} for the |
1041 @{text n}th element in a list. The set of positions inside a value @{text v}, |
1032 \<open>n\<close>th element in a list. The set of positions inside a value \<open>v\<close>, |
1042 written @{term "Pos v"}, is given by |
1033 written @{term "Pos v"}, is given by |
1043 |
1034 |
1044 \begin{center} |
1035 \begin{center} |
1045 \begin{tabular}{lcl} |
1036 \begin{tabular}{lcl} |
1046 @{thm (lhs) Pos.simps(1)} & @{text "\<equiv>"} & @{thm (rhs) Pos.simps(1)}\\ |
1037 @{thm (lhs) Pos.simps(1)} & \<open>\<equiv>\<close> & @{thm (rhs) Pos.simps(1)}\\ |
1047 @{thm (lhs) Pos.simps(2)} & @{text "\<equiv>"} & @{thm (rhs) Pos.simps(2)}\\ |
1038 @{thm (lhs) Pos.simps(2)} & \<open>\<equiv>\<close> & @{thm (rhs) Pos.simps(2)}\\ |
1048 @{thm (lhs) Pos.simps(3)} & @{text "\<equiv>"} & @{thm (rhs) Pos.simps(3)}\\ |
1039 @{thm (lhs) Pos.simps(3)} & \<open>\<equiv>\<close> & @{thm (rhs) Pos.simps(3)}\\ |
1049 @{thm (lhs) Pos.simps(4)} & @{text "\<equiv>"} & @{thm (rhs) Pos.simps(4)}\\ |
1040 @{thm (lhs) Pos.simps(4)} & \<open>\<equiv>\<close> & @{thm (rhs) Pos.simps(4)}\\ |
1050 @{thm (lhs) Pos.simps(5)[where ?v1.0="v\<^sub>1" and ?v2.0="v\<^sub>2"]} |
1041 @{thm (lhs) Pos.simps(5)[where ?v1.0="v\<^sub>1" and ?v2.0="v\<^sub>2"]} |
1051 & @{text "\<equiv>"} |
1042 & \<open>\<equiv>\<close> |
1052 & @{thm (rhs) Pos.simps(5)[where ?v1.0="v\<^sub>1" and ?v2.0="v\<^sub>2"]}\\ |
1043 & @{thm (rhs) Pos.simps(5)[where ?v1.0="v\<^sub>1" and ?v2.0="v\<^sub>2"]}\\ |
1053 @{thm (lhs) Pos_stars} & @{text "\<equiv>"} & @{thm (rhs) Pos_stars}\\ |
1044 @{thm (lhs) Pos_stars} & \<open>\<equiv>\<close> & @{thm (rhs) Pos_stars}\\ |
1054 \end{tabular} |
1045 \end{tabular} |
1055 \end{center} |
1046 \end{center} |
1056 |
1047 |
1057 \noindent |
1048 \noindent |
1058 whereby @{text len} in the last clause stands for the length of a list. Clearly |
1049 whereby \<open>len\<close> in the last clause stands for the length of a list. Clearly |
1059 for every position inside a value there exists a subvalue at that position. |
1050 for every position inside a value there exists a subvalue at that position. |
1060 |
1051 |
1061 |
1052 |
1062 To help understanding the ordering of Okui and Suzuki, consider again |
1053 To help understanding the ordering of Okui and Suzuki, consider again |
1063 the earlier value |
1054 the earlier value |
1064 @{text v} and compare it with the following @{text w}: |
1055 \<open>v\<close> and compare it with the following \<open>w\<close>: |
1065 |
1056 |
1066 \begin{center} |
1057 \begin{center} |
1067 \begin{tabular}{l} |
1058 \begin{tabular}{l} |
1068 @{term "v == Stars [Seq (Char x) (Char y), Char z]"}\\ |
1059 @{term "v == Stars [Seq (Char x) (Char y), Char z]"}\\ |
1069 @{term "w == Stars [Char x, Char y, Char z]"} |
1060 @{term "w == Stars [Char x, Char y, Char z]"} |
1070 \end{tabular} |
1061 \end{tabular} |
1071 \end{center} |
1062 \end{center} |
1072 |
1063 |
1073 \noindent Both values match the string @{text "xyz"}, that means if |
1064 \noindent Both values match the string \<open>xyz\<close>, that means if |
1074 we flatten these values at their respective root position, we obtain |
1065 we flatten these values at their respective root position, we obtain |
1075 @{text "xyz"}. However, at position @{text "[0]"}, @{text v} matches |
1066 \<open>xyz\<close>. However, at position \<open>[0]\<close>, \<open>v\<close> matches |
1076 @{text xy} whereas @{text w} matches only the shorter @{text x}. So |
1067 \<open>xy\<close> whereas \<open>w\<close> matches only the shorter \<open>x\<close>. So |
1077 according to the Longest Match Rule, we should prefer @{text v}, |
1068 according to the Longest Match Rule, we should prefer \<open>v\<close>, |
1078 rather than @{text w} as POSIX value for string @{text xyz} (and |
1069 rather than \<open>w\<close> as POSIX value for string \<open>xyz\<close> (and |
1079 corresponding regular expression). In order to |
1070 corresponding regular expression). In order to |
1080 formalise this idea, Okui and Suzuki introduce a measure for |
1071 formalise this idea, Okui and Suzuki introduce a measure for |
1081 subvalues at position @{text p}, called the \emph{norm} of @{text v} |
1072 subvalues at position \<open>p\<close>, called the \emph{norm} of \<open>v\<close> |
1082 at position @{text p}. We can define this measure in Isabelle as an |
1073 at position \<open>p\<close>. We can define this measure in Isabelle as an |
1083 integer as follows |
1074 integer as follows |
1084 |
1075 |
1085 \begin{center} |
1076 \begin{center} |
1086 @{thm pflat_len_def} |
1077 @{thm pflat_len_def} |
1087 \end{center} |
1078 \end{center} |
1088 |
1079 |
1089 \noindent where we take the length of the flattened value at |
1080 \noindent where we take the length of the flattened value at |
1090 position @{text p}, provided the position is inside @{text v}; if |
1081 position \<open>p\<close>, provided the position is inside \<open>v\<close>; if |
1091 not, then the norm is @{text "-1"}. The default for outside |
1082 not, then the norm is \<open>-1\<close>. The default for outside |
1092 positions is crucial for the POSIX requirement of preferring a |
1083 positions is crucial for the POSIX requirement of preferring a |
1093 @{text Left}-value over a @{text Right}-value (if they can match the |
1084 \<open>Left\<close>-value over a \<open>Right\<close>-value (if they can match the |
1094 same string---see the Priority Rule from the Introduction). For this |
1085 same string---see the Priority Rule from the Introduction). For this |
1095 consider |
1086 consider |
1096 |
1087 |
1097 \begin{center} |
1088 \begin{center} |
1098 @{term "v == Left (Char x)"} \qquad and \qquad @{term "w == Right (Char x)"} |
1089 @{term "v == Left (Char x)"} \qquad and \qquad @{term "w == Right (Char x)"} |
1099 \end{center} |
1090 \end{center} |
1100 |
1091 |
1101 \noindent Both values match @{text x}. At position @{text "[0]"} |
1092 \noindent Both values match \<open>x\<close>. At position \<open>[0]\<close> |
1102 the norm of @{term v} is @{text 1} (the subvalue matches @{text x}), |
1093 the norm of @{term v} is \<open>1\<close> (the subvalue matches \<open>x\<close>), |
1103 but the norm of @{text w} is @{text "-1"} (the position is outside |
1094 but the norm of \<open>w\<close> is \<open>-1\<close> (the position is outside |
1104 @{text w} according to how we defined the `inside' positions of |
1095 \<open>w\<close> according to how we defined the `inside' positions of |
1105 @{text Left}- and @{text Right}-values). Of course at position |
1096 \<open>Left\<close>- and \<open>Right\<close>-values). Of course at position |
1106 @{text "[1]"}, the norms @{term "pflat_len v [1]"} and @{term |
1097 \<open>[1]\<close>, the norms @{term "pflat_len v [1]"} and @{term |
1107 "pflat_len w [1]"} are reversed, but the point is that subvalues |
1098 "pflat_len w [1]"} are reversed, but the point is that subvalues |
1108 will be analysed according to lexicographically ordered |
1099 will be analysed according to lexicographically ordered |
1109 positions. According to this ordering, the position @{text "[0]"} |
1100 positions. According to this ordering, the position \<open>[0]\<close> |
1110 takes precedence over @{text "[1]"} and thus also @{text v} will be |
1101 takes precedence over \<open>[1]\<close> and thus also \<open>v\<close> will be |
1111 preferred over @{text w}. The lexicographic ordering of positions, written |
1102 preferred over \<open>w\<close>. The lexicographic ordering of positions, written |
1112 @{term "DUMMY \<sqsubset>lex DUMMY"}, can be conveniently formalised |
1103 @{term "DUMMY \<sqsubset>lex DUMMY"}, can be conveniently formalised |
1113 by three inference rules |
1104 by three inference rules |
1114 |
1105 |
1115 \begin{center} |
1106 \begin{center} |
1116 \begin{tabular}{ccc} |
1107 \begin{tabular}{ccc} |
1121 \end{tabular} |
1112 \end{tabular} |
1122 \end{center} |
1113 \end{center} |
1123 |
1114 |
1124 With the norm and lexicographic order in place, |
1115 With the norm and lexicographic order in place, |
1125 we can state the key definition of Okui and Suzuki |
1116 we can state the key definition of Okui and Suzuki |
1126 \cite{OkuiSuzuki2010}: a value @{term "v\<^sub>1"} is \emph{smaller at position @{text p}} than |
1117 \cite{OkuiSuzuki2010}: a value @{term "v\<^sub>1"} is \emph{smaller at position \<open>p\<close>} than |
1127 @{term "v\<^sub>2"}, written @{term "v\<^sub>1 \<sqsubset>val p v\<^sub>2"}, |
1118 @{term "v\<^sub>2"}, written @{term "v\<^sub>1 \<sqsubset>val p v\<^sub>2"}, |
1128 if and only if $(i)$ the norm at position @{text p} is |
1119 if and only if $(i)$ the norm at position \<open>p\<close> is |
1129 greater in @{term "v\<^sub>1"} (that is the string @{term "flat (at v\<^sub>1 p)"} is longer |
1120 greater in @{term "v\<^sub>1"} (that is the string @{term "flat (at v\<^sub>1 p)"} is longer |
1130 than @{term "flat (at v\<^sub>2 p)"}) and $(ii)$ all subvalues at |
1121 than @{term "flat (at v\<^sub>2 p)"}) and $(ii)$ all subvalues at |
1131 positions that are inside @{term "v\<^sub>1"} or @{term "v\<^sub>2"} and that are |
1122 positions that are inside @{term "v\<^sub>1"} or @{term "v\<^sub>2"} and that are |
1132 lexicographically smaller than @{text p}, we have the same norm, namely |
1123 lexicographically smaller than \<open>p\<close>, we have the same norm, namely |
1133 |
1124 |
1134 \begin{center} |
1125 \begin{center} |
1135 \begin{tabular}{c} |
1126 \begin{tabular}{c} |
1136 @{thm (lhs) PosOrd_def[where ?v1.0="v\<^sub>1" and ?v2.0="v\<^sub>2"]} |
1127 @{thm (lhs) PosOrd_def[where ?v1.0="v\<^sub>1" and ?v2.0="v\<^sub>2"]} |
1137 @{text "\<equiv>"} |
1128 \<open>\<equiv>\<close> |
1138 $\begin{cases} |
1129 $\begin{cases} |
1139 (i) & @{term "pflat_len v\<^sub>1 p > pflat_len v\<^sub>2 p"} \quad\text{and}\smallskip \\ |
1130 (i) & @{term "pflat_len v\<^sub>1 p > pflat_len v\<^sub>2 p"} \quad\text{and}\smallskip \\ |
1140 (ii) & @{term "(\<forall>q \<in> Pos v\<^sub>1 \<union> Pos v\<^sub>2. q \<sqsubset>lex p --> pflat_len v\<^sub>1 q = pflat_len v\<^sub>2 q)"} |
1131 (ii) & @{term "(\<forall>q \<in> Pos v\<^sub>1 \<union> Pos v\<^sub>2. q \<sqsubset>lex p --> pflat_len v\<^sub>1 q = pflat_len v\<^sub>2 q)"} |
1141 \end{cases}$ |
1132 \end{cases}$ |
1142 \end{tabular} |
1133 \end{tabular} |
1143 \end{center} |
1134 \end{center} |
1144 |
1135 |
1145 \noindent The position @{text p} in this definition acts as the |
1136 \noindent The position \<open>p\<close> in this definition acts as the |
1146 \emph{first distinct position} of @{text "v\<^sub>1"} and @{text |
1137 \emph{first distinct position} of \<open>v\<^sub>1\<close> and \<open>v\<^sub>2\<close>, where both values match strings of different length |
1147 "v\<^sub>2"}, where both values match strings of different length |
1138 \cite{OkuiSuzuki2010}. Since at \<open>p\<close> the values \<open>v\<^sub>1\<close> and \<open>v\<^sub>2\<close> match different strings, the |
1148 \cite{OkuiSuzuki2010}. Since at @{text p} the values @{text |
|
1149 "v\<^sub>1"} and @{text "v\<^sub>2"} match different strings, the |
|
1150 ordering is irreflexive. Derived from the definition above |
1139 ordering is irreflexive. Derived from the definition above |
1151 are the following two orderings: |
1140 are the following two orderings: |
1152 |
1141 |
1153 \begin{center} |
1142 \begin{center} |
1154 \begin{tabular}{l} |
1143 \begin{tabular}{l} |
1182 with @{term "p' \<sqsubset>lex p"} that @{term "pflat_len v\<^sub>1 |
1168 with @{term "p' \<sqsubset>lex p"} that @{term "pflat_len v\<^sub>1 |
1183 p' = pflat_len v\<^sub>3 p'"} holds. Suppose @{term "p' \<in> Pos |
1169 p' = pflat_len v\<^sub>3 p'"} holds. Suppose @{term "p' \<in> Pos |
1184 v\<^sub>1"}, then we can infer from the first assumption that @{term |
1170 v\<^sub>1"}, then we can infer from the first assumption that @{term |
1185 "pflat_len v\<^sub>1 p' = pflat_len v\<^sub>2 p'"}. But this means |
1171 "pflat_len v\<^sub>1 p' = pflat_len v\<^sub>2 p'"}. But this means |
1186 that @{term "p'"} must be in @{term "Pos v\<^sub>2"} too (the norm |
1172 that @{term "p'"} must be in @{term "Pos v\<^sub>2"} too (the norm |
1187 cannot be @{text "-1"} given @{term "p' \<in> Pos v\<^sub>1"}). |
1173 cannot be \<open>-1\<close> given @{term "p' \<in> Pos v\<^sub>1"}). |
1188 Hence we can use the second assumption and |
1174 Hence we can use the second assumption and |
1189 infer @{term "pflat_len v\<^sub>2 p' = pflat_len v\<^sub>3 p'"}, |
1175 infer @{term "pflat_len v\<^sub>2 p' = pflat_len v\<^sub>3 p'"}, |
1190 which concludes this case with @{term "v\<^sub>1 :\<sqsubset>val |
1176 which concludes this case with @{term "v\<^sub>1 :\<sqsubset>val |
1191 v\<^sub>3"}. The reasoning in the other cases is similar.\qed |
1177 v\<^sub>3"}. The reasoning in the other cases is similar.\qed |
1192 \end{proof} |
1178 \end{proof} |
1193 |
1179 |
1194 \noindent |
1180 \noindent |
1195 The proof for $\preccurlyeq$ is similar and omitted. |
1181 The proof for $\preccurlyeq$ is similar and omitted. |
1196 It is also straightforward to show that @{text "\<prec>"} and |
1182 It is also straightforward to show that \<open>\<prec>\<close> and |
1197 $\preccurlyeq$ are partial orders. Okui and Suzuki furthermore show that they |
1183 $\preccurlyeq$ are partial orders. Okui and Suzuki furthermore show that they |
1198 are linear orderings for lexical values \cite{OkuiSuzuki2010} of a given |
1184 are linear orderings for lexical values \cite{OkuiSuzuki2010} of a given |
1199 regular expression and given string, but we have not formalised this in Isabelle. It is |
1185 regular expression and given string, but we have not formalised this in Isabelle. It is |
1200 not essential for our results. What we are going to show below is |
1186 not essential for our results. What we are going to show below is |
1201 that for a given @{text r} and @{text s}, the orderings have a unique |
1187 that for a given \<open>r\<close> and \<open>s\<close>, the orderings have a unique |
1202 minimal element on the set @{term "LV r s"}, which is the POSIX value |
1188 minimal element on the set @{term "LV r s"}, which is the POSIX value |
1203 we defined in the previous section. We start with two properties that |
1189 we defined in the previous section. We start with two properties that |
1204 show how the length of a flattened value relates to the @{text "\<prec>"}-ordering. |
1190 show how the length of a flattened value relates to the \<open>\<prec>\<close>-ordering. |
1205 |
1191 |
1206 \begin{proposition}\mbox{}\smallskip\\\label{ordlen} |
1192 \begin{proposition}\mbox{}\smallskip\\\label{ordlen} |
1207 \begin{tabular}{@ {}ll} |
1193 \begin{tabular}{@ {}ll} |
1208 (1) & |
1194 (1) & |
1209 @{thm [mode=IfThen] PosOrd_shorterE[where ?v1.0="v\<^sub>1" and ?v2.0="v\<^sub>2"]}\\ |
1195 @{thm [mode=IfThen] PosOrd_shorterE[where ?v1.0="v\<^sub>1" and ?v2.0="v\<^sub>2"]}\\ |
1257 \end{tabular} |
1243 \end{tabular} |
1258 \end{proposition} |
1244 \end{proposition} |
1259 |
1245 |
1260 \noindent One might prefer that statements \textit{(4)} and \textit{(5)} |
1246 \noindent One might prefer that statements \textit{(4)} and \textit{(5)} |
1261 (respectively \textit{(6)} and \textit{(7)}) |
1247 (respectively \textit{(6)} and \textit{(7)}) |
1262 are combined into a single \textit{iff}-statement (like the ones for @{text |
1248 are combined into a single \textit{iff}-statement (like the ones for \<open>Left\<close> and \<open>Right\<close>). Unfortunately this cannot be done easily: such |
1263 Left} and @{text Right}). Unfortunately this cannot be done easily: such |
|
1264 a single statement would require an additional assumption about the |
1249 a single statement would require an additional assumption about the |
1265 two values @{term "Seq v\<^sub>1 v\<^sub>2"} and @{term "Seq w\<^sub>1 w\<^sub>2"} |
1250 two values @{term "Seq v\<^sub>1 v\<^sub>2"} and @{term "Seq w\<^sub>1 w\<^sub>2"} |
1266 being inhabited by the same regular expression. The |
1251 being inhabited by the same regular expression. The |
1267 complexity of the proofs involved seems to not justify such a |
1252 complexity of the proofs involved seems to not justify such a |
1268 `cleaner' single statement. The statements given are just the properties that |
1253 `cleaner' single statement. The statements given are just the properties that |
1269 allow us to establish our theorems without any difficulty. The proofs |
1254 allow us to establish our theorems without any difficulty. The proofs |
1270 for Proposition~\ref{ordintros} are routine. |
1255 for Proposition~\ref{ordintros} are routine. |
1271 |
1256 |
1272 |
1257 |
1273 Next we establish how Okui and Suzuki's orderings relate to our |
1258 Next we establish how Okui and Suzuki's orderings relate to our |
1274 definition of POSIX values. Given a @{text POSIX} value @{text "v\<^sub>1"} |
1259 definition of POSIX values. Given a \<open>POSIX\<close> value \<open>v\<^sub>1\<close> |
1275 for @{text r} and @{text s}, then any other lexical value @{text |
1260 for \<open>r\<close> and \<open>s\<close>, then any other lexical value \<open>v\<^sub>2\<close> in @{term "LV r s"} is greater or equal than \<open>v\<^sub>1\<close>, namely: |
1276 "v\<^sub>2"} in @{term "LV r s"} is greater or equal than @{text |
|
1277 "v\<^sub>1"}, namely: |
|
1278 |
1261 |
1279 |
1262 |
1280 \begin{theorem}\label{orderone} |
1263 \begin{theorem}\label{orderone} |
1281 @{thm [mode=IfThen] Posix_PosOrd[where ?v1.0="v\<^sub>1" and ?v2.0="v\<^sub>2"]} |
1264 @{thm [mode=IfThen] Posix_PosOrd[where ?v1.0="v\<^sub>1" and ?v2.0="v\<^sub>2"]} |
1282 \end{theorem} |
1265 \end{theorem} |
1283 |
1266 |
1284 \begin{proof} By induction on our POSIX rules. By |
1267 \begin{proof} By induction on our POSIX rules. By |
1285 Theorem~\ref{posixdeterm} and the definition of @{const LV}, it is clear |
1268 Theorem~\ref{posixdeterm} and the definition of @{const LV}, it is clear |
1286 that @{text "v\<^sub>1"} and @{text "v\<^sub>2"} have the same |
1269 that \<open>v\<^sub>1\<close> and \<open>v\<^sub>2\<close> have the same |
1287 underlying string @{term s}. The three base cases are |
1270 underlying string @{term s}. The three base cases are |
1288 straightforward: for example for @{term "v\<^sub>1 = Void"}, we have |
1271 straightforward: for example for @{term "v\<^sub>1 = Void"}, we have |
1289 that @{term "v\<^sub>2 \<in> LV ONE []"} must also be of the form |
1272 that @{term "v\<^sub>2 \<in> LV ONE []"} must also be of the form |
1290 \mbox{@{term "v\<^sub>2 = Void"}}. Therefore we have @{term |
1273 \mbox{@{term "v\<^sub>2 = Void"}}. Therefore we have @{term |
1291 "v\<^sub>1 :\<sqsubseteq>val v\<^sub>2"}. The inductive cases for |
1274 "v\<^sub>1 :\<sqsubseteq>val v\<^sub>2"}. The inductive cases for |
1292 @{text r} being of the form @{term "ALT r\<^sub>1 r\<^sub>2"} and |
1275 \<open>r\<close> being of the form @{term "ALT r\<^sub>1 r\<^sub>2"} and |
1293 @{term "SEQ r\<^sub>1 r\<^sub>2"} are as follows: |
1276 @{term "SEQ r\<^sub>1 r\<^sub>2"} are as follows: |
1294 |
1277 |
1295 |
1278 |
1296 \begin{itemize} |
1279 \begin{itemize} |
1297 |
1280 |
1298 \item[$\bullet$] Case @{text "P+L"} with @{term "s \<in> (ALT r\<^sub>1 r\<^sub>2) |
1281 \item[$\bullet$] Case \<open>P+L\<close> with @{term "s \<in> (ALT r\<^sub>1 r\<^sub>2) |
1299 \<rightarrow> (Left w\<^sub>1)"}: In this case the value |
1282 \<rightarrow> (Left w\<^sub>1)"}: In this case the value |
1300 @{term "v\<^sub>2"} is either of the |
1283 @{term "v\<^sub>2"} is either of the |
1301 form @{term "Left w\<^sub>2"} or @{term "Right w\<^sub>2"}. In the |
1284 form @{term "Left w\<^sub>2"} or @{term "Right w\<^sub>2"}. In the |
1302 latter case we can immediately conclude with \mbox{@{term "v\<^sub>1 |
1285 latter case we can immediately conclude with \mbox{@{term "v\<^sub>1 |
1303 :\<sqsubseteq>val v\<^sub>2"}} since a @{text Left}-value with the |
1286 :\<sqsubseteq>val v\<^sub>2"}} since a \<open>Left\<close>-value with the |
1304 same underlying string @{text s} is always smaller than a |
1287 same underlying string \<open>s\<close> is always smaller than a |
1305 @{text Right}-value by Proposition~\ref{ordintros}\textit{(1)}. |
1288 \<open>Right\<close>-value by Proposition~\ref{ordintros}\textit{(1)}. |
1306 In the former case we have @{term "w\<^sub>2 |
1289 In the former case we have @{term "w\<^sub>2 |
1307 \<in> LV r\<^sub>1 s"} and can use the induction hypothesis to infer |
1290 \<in> LV r\<^sub>1 s"} and can use the induction hypothesis to infer |
1308 @{term "w\<^sub>1 :\<sqsubseteq>val w\<^sub>2"}. Because @{term |
1291 @{term "w\<^sub>1 :\<sqsubseteq>val w\<^sub>2"}. Because @{term |
1309 "w\<^sub>1"} and @{term "w\<^sub>2"} have the same underlying string |
1292 "w\<^sub>1"} and @{term "w\<^sub>2"} have the same underlying string |
1310 @{text s}, we can conclude with @{term "Left w\<^sub>1 |
1293 \<open>s\<close>, we can conclude with @{term "Left w\<^sub>1 |
1311 :\<sqsubseteq>val Left w\<^sub>2"} using |
1294 :\<sqsubseteq>val Left w\<^sub>2"} using |
1312 Proposition~\ref{ordintros}\textit{(2)}.\smallskip |
1295 Proposition~\ref{ordintros}\textit{(2)}.\smallskip |
1313 |
1296 |
1314 \item[$\bullet$] Case @{text "P+R"} with @{term "s \<in> (ALT r\<^sub>1 r\<^sub>2) |
1297 \item[$\bullet$] Case \<open>P+R\<close> with @{term "s \<in> (ALT r\<^sub>1 r\<^sub>2) |
1315 \<rightarrow> (Right w\<^sub>1)"}: This case similar to the previous |
1298 \<rightarrow> (Right w\<^sub>1)"}: This case similar to the previous |
1316 case, except that we additionally know @{term "s \<notin> L |
1299 case, except that we additionally know @{term "s \<notin> L |
1317 r\<^sub>1"}. This is needed when @{term "v\<^sub>2"} is of the form |
1300 r\<^sub>1"}. This is needed when @{term "v\<^sub>2"} is of the form |
1318 \mbox{@{term "Left w\<^sub>2"}}. Since \mbox{@{term "flat v\<^sub>2 = flat |
1301 \mbox{@{term "Left w\<^sub>2"}}. Since \mbox{@{term "flat v\<^sub>2 = flat |
1319 w\<^sub>2"} @{text "= s"}} and @{term "\<Turnstile> w\<^sub>2 : |
1302 w\<^sub>2"} \<open>= s\<close>} and @{term "\<Turnstile> w\<^sub>2 : |
1320 r\<^sub>1"}, we can derive a contradiction for \mbox{@{term "s \<notin> L |
1303 r\<^sub>1"}, we can derive a contradiction for \mbox{@{term "s \<notin> L |
1321 r\<^sub>1"}} using |
1304 r\<^sub>1"}} using |
1322 Proposition~\ref{inhabs}. So also in this case \mbox{@{term "v\<^sub>1 |
1305 Proposition~\ref{inhabs}. So also in this case \mbox{@{term "v\<^sub>1 |
1323 :\<sqsubseteq>val v\<^sub>2"}}.\smallskip |
1306 :\<sqsubseteq>val v\<^sub>2"}}.\smallskip |
1324 |
1307 |
1325 \item[$\bullet$] Case @{text "PS"} with @{term "(s\<^sub>1 @ |
1308 \item[$\bullet$] Case \<open>PS\<close> with @{term "(s\<^sub>1 @ |
1326 s\<^sub>2) \<in> (SEQ r\<^sub>1 r\<^sub>2) \<rightarrow> (Seq |
1309 s\<^sub>2) \<in> (SEQ r\<^sub>1 r\<^sub>2) \<rightarrow> (Seq |
1327 w\<^sub>1 w\<^sub>2)"}: We can assume @{term "v\<^sub>2 = Seq |
1310 w\<^sub>1 w\<^sub>2)"}: We can assume @{term "v\<^sub>2 = Seq |
1328 (u\<^sub>1) (u\<^sub>2)"} with @{term "\<Turnstile> u\<^sub>1 : |
1311 (u\<^sub>1) (u\<^sub>2)"} with @{term "\<Turnstile> u\<^sub>1 : |
1329 r\<^sub>1"} and \mbox{@{term "\<Turnstile> u\<^sub>2 : |
1312 r\<^sub>1"} and \mbox{@{term "\<Turnstile> u\<^sub>2 : |
1330 r\<^sub>2"}}. We have @{term "s\<^sub>1 @ s\<^sub>2 = (flat |
1313 r\<^sub>2"}}. We have @{term "s\<^sub>1 @ s\<^sub>2 = (flat |
1331 u\<^sub>1) @ (flat u\<^sub>2)"}. By the side-condition of the |
1314 u\<^sub>1) @ (flat u\<^sub>2)"}. By the side-condition of the |
1332 @{text PS}-rule we know that either @{term "s\<^sub>1 = flat |
1315 \<open>PS\<close>-rule we know that either @{term "s\<^sub>1 = flat |
1333 u\<^sub>1"} or that @{term "flat u\<^sub>1"} is a strict prefix of |
1316 u\<^sub>1"} or that @{term "flat u\<^sub>1"} is a strict prefix of |
1334 @{term "s\<^sub>1"}. In the latter case we can infer @{term |
1317 @{term "s\<^sub>1"}. In the latter case we can infer @{term |
1335 "w\<^sub>1 :\<sqsubset>val u\<^sub>1"} by |
1318 "w\<^sub>1 :\<sqsubset>val u\<^sub>1"} by |
1336 Proposition~\ref{ordlen}\textit{(2)} and from this @{term "v\<^sub>1 |
1319 Proposition~\ref{ordlen}\textit{(2)} and from this @{term "v\<^sub>1 |
1337 :\<sqsubseteq>val v\<^sub>2"} by Proposition~\ref{ordintros}\textit{(5)} |
1320 :\<sqsubseteq>val v\<^sub>2"} by Proposition~\ref{ordintros}\textit{(5)} |
1600 |
1582 |
1601 \begin{center} |
1583 \begin{center} |
1602 \begin{tabular}{lcl} |
1584 \begin{tabular}{lcl} |
1603 @{thm (lhs) slexer.simps(1)} & $\dn$ & @{thm (rhs) slexer.simps(1)}\\ |
1585 @{thm (lhs) slexer.simps(1)} & $\dn$ & @{thm (rhs) slexer.simps(1)}\\ |
1604 @{thm (lhs) slexer.simps(2)} & $\dn$ & |
1586 @{thm (lhs) slexer.simps(2)} & $\dn$ & |
1605 @{text "let (r\<^sub>s, f\<^sub>r) = simp (r "}$\backslash$@{text " c) in"}\\ |
1587 \<open>let (r\<^sub>s, f\<^sub>r) = simp (r \<close>$\backslash$\<open> c) in\<close>\\ |
1606 & & @{text "case"} @{term "slexer r\<^sub>s s"} @{text of}\\ |
1588 & & \<open>case\<close> @{term "slexer r\<^sub>s s"} \<open>of\<close>\\ |
1607 & & \phantom{$|$} @{term "None"} @{text "\<Rightarrow>"} @{term None}\\ |
1589 & & \phantom{$|$} @{term "None"} \<open>\<Rightarrow>\<close> @{term None}\\ |
1608 & & $|$ @{term "Some v"} @{text "\<Rightarrow>"} @{text "Some (inj r c (f\<^sub>r v))"} |
1590 & & $|$ @{term "Some v"} \<open>\<Rightarrow>\<close> \<open>Some (inj r c (f\<^sub>r v))\<close> |
1609 \end{tabular} |
1591 \end{tabular} |
1610 \end{center} |
1592 \end{center} |
1611 |
1593 |
1612 \noindent |
1594 \noindent |
1613 In the second clause we first calculate the derivative @{term "der c r"} |
1595 In the second clause we first calculate the derivative @{term "der c r"} |
1614 and then simplify the result. This gives us a simplified derivative |
1596 and then simplify the result. This gives us a simplified derivative |
1615 @{text "r\<^sub>s"} and a rectification function @{text "f\<^sub>r"}. The lexer |
1597 \<open>r\<^sub>s\<close> and a rectification function \<open>f\<^sub>r\<close>. The lexer |
1616 is then recursively called with the simplified derivative, but before |
1598 is then recursively called with the simplified derivative, but before |
1617 we inject the character @{term c} into the value @{term v}, we need to rectify |
1599 we inject the character @{term c} into the value @{term v}, we need to rectify |
1618 @{term v} (that is construct @{term "f\<^sub>r v"}). Before we can establish the correctness |
1600 @{term v} (that is construct @{term "f\<^sub>r v"}). Before we can establish the correctness |
1619 of @{term "slexer"}, we need to show that simplification preserves the language |
1601 of @{term "slexer"}, we need to show that simplification preserves the language |
1620 and simplification preserves our POSIX relation once the value is rectified |
1602 and simplification preserves our POSIX relation once the value is rectified |