thys/Journal/Paper.thy
changeset 272 f16019b11179
parent 271 f46ebc84408d
child 273 e85099ac4c6c
equal deleted inserted replaced
271:f46ebc84408d 272:f16019b11179
    50   val.Right ("Right _" [1000] 78) and
    50   val.Right ("Right _" [1000] 78) and
    51   val.Seq ("Seq _ _" [79,79] 78) and
    51   val.Seq ("Seq _ _" [79,79] 78) and
    52   val.Stars ("Stars _" [79] 78) and
    52   val.Stars ("Stars _" [79] 78) and
    53 
    53 
    54   L ("L'(_')" [10] 78) and
    54   L ("L'(_')" [10] 78) and
       
    55   LV ("LV _ _" [80,73] 78) and
    55   der_syn ("_\\_" [79, 1000] 76) and  
    56   der_syn ("_\\_" [79, 1000] 76) and  
    56   ders_syn ("_\\_" [79, 1000] 76) and
    57   ders_syn ("_\\_" [79, 1000] 76) and
    57   flat ("|_|" [75] 74) and
    58   flat ("|_|" [75] 74) and
    58   Sequ ("_ @ _" [78,77] 63) and
    59   Sequ ("_ @ _" [78,77] 63) and
    59   injval ("inj _ _ _" [79,77,79] 76) and 
    60   injval ("inj _ _ _" [79,77,79] 76) and 
    98 
    99 
    99 
   100 
   100 (*
   101 (*
   101 comments not implemented
   102 comments not implemented
   102 
   103 
   103 p9. The condtion "not exists s3 s4..." appears often enough (in particular in
   104 p9. The condition "not exists s3 s4..." appears often enough (in particular in
   104 the proof of Lemma 3) to warrant a definition.
   105 the proof of Lemma 3) to warrant a definition.
   105 
   106 
   106 *)
   107 *)
   107 
   108 
   108 (*>*)
   109 (*>*)
   184 identifier. For @{text "if"} we obtain by the Priority Rule a keyword
   185 identifier. For @{text "if"} we obtain by the Priority Rule a keyword
   185 token, not an identifier token---even if @{text "r\<^bsub>id\<^esub>"}
   186 token, not an identifier token---even if @{text "r\<^bsub>id\<^esub>"}
   186 matches also. By the Star Rule we know @{text "(r\<^bsub>key\<^esub> +
   187 matches also. By the Star Rule we know @{text "(r\<^bsub>key\<^esub> +
   187 r\<^bsub>id\<^esub>)\<^sup>\<star>"} matches @{text "iffoo"},
   188 r\<^bsub>id\<^esub>)\<^sup>\<star>"} matches @{text "iffoo"},
   188 respectively @{text "if"}, in exactly one `iteration' of the star. The
   189 respectively @{text "if"}, in exactly one `iteration' of the star. The
   189 Empty String Rule is for cases where @{text
   190 Empty String Rule is for cases where, for example, @{text
   190 "(a\<^sup>\<star>)\<^sup>\<star>"}, for example, matches against the
   191 "(a\<^sup>\<star>)\<^sup>\<star>"} matches against the
   191 string @{text "bc"}. Then the longest initial matched substring is the
   192 string @{text "bc"}. Then the longest initial matched substring is the
   192 empty string, which is matched by both the whole regular expression
   193 empty string, which is matched by both the whole regular expression
   193 and the parenthesised sub-expression.
   194 and the parenthesised subexpression.
   194 
   195 
   195 
   196 
   196 One limitation of Brzozowski's matcher is that it only generates a
   197 One limitation of Brzozowski's matcher is that it only generates a
   197 YES/NO answer for whether a string is being matched by a regular
   198 YES/NO answer for whether a string is being matched by a regular
   198 expression.  Sulzmann and Lu~\cite{Sulzmann2014} extended this matcher
   199 expression.  Sulzmann and Lu~\cite{Sulzmann2014} extended this matcher
   199 to allow generation not just of a YES/NO answer but of an actual
   200 to allow generation not just of a YES/NO answer but of an actual
   200 matching, called a [lexical] {\em value}. \marginpar{explain values;
   201 matching, called a [lexical] {\em value}. Assuming a regular
   201 who introduced them} They give a simple algorithm to calculate a value
   202 expression matches a string, values encode the information of
   202 that appears to be the value associated with POSIX matching.  The
   203 \emph{how} the string is matched by the regular expression---that is,
   203 challenge then is to specify that value, in an algorithm-independent
   204 which part of the string is matched by which part of the regular
   204 fashion, and to show that Sulzmann and Lu's derivative-based algorithm
   205 expression. For this consider again the the string @{text "xy"} and
   205 does indeed calculate a value that is correct according to the
   206 the regular expression \mbox{@{text "(x + (y +
   206 specification.
   207 xy))\<^sup>\<star>"}}. The POSIX value, which corresponds to using the
   207 
   208 star in only one repetition,
   208 The answer given by Sulzmann and Lu \cite{Sulzmann2014} is to define a
   209 
   209 relation (called an ``order relation'') on the set of values of @{term
   210 
   210 r}, and to show that (once a string to be matched is chosen) there is
   211 \marginpar{explain values; who introduced them} 
   211 a maximum element and that it is computed by their derivative-based
   212 
   212 algorithm. This proof idea is inspired by work of Frisch and Cardelli
   213 
   213 \cite{Frisch2004} on a GREEDY regular expression matching
   214 Sulzmann and Lu give a simple algorithm to calculate a value that
   214 algorithm. However, we were not able to establish transitivity and
   215 appears to be the value associated with POSIX matching.  The challenge
   215 totality for the ``order relation'' by Sulzmann and Lu. 
   216 then is to specify that value, in an algorithm-independent fashion,
   216 There are some inherent problems with their approach (of
   217 and to show that Sulzmann and Lu's derivative-based algorithm does
   217 which some of the proofs are not published in \cite{Sulzmann2014});
   218 indeed calculate a value that is correct according to the
   218 perhaps more importantly, we give in this paper a simple inductive (and
   219 specification.  The answer given by Sulzmann and Lu
   219 algorithm-independent) definition of what we call being a {\em POSIX
   220 \cite{Sulzmann2014} is to define a relation (called an ``order
   220 value} for a regular expression @{term r} and a string @{term s}; we
   221 relation'') on the set of values of @{term r}, and to show that (once
   221 show that the algorithm computes such a value and that such a value is
   222 a string to be matched is chosen) there is a maximum element and that
   222 unique. Our proofs are both done by hand and checked in Isabelle/HOL.  The
   223 it is computed by their derivative-based algorithm. This proof idea is
   223 experience of doing our proofs has been that this mechanical checking
   224 inspired by work of Frisch and Cardelli \cite{Frisch2004} on a GREEDY
   224 was absolutely essential: this subject area has hidden snares. This
   225 regular expression matching algorithm. However, we were not able to
   225 was also noted by Kuklewicz \cite{Kuklewicz} who found that nearly all
   226 establish transitivity and totality for the ``order relation'' by
   226 POSIX matching implementations are ``buggy'' \cite[Page
   227 Sulzmann and Lu.  There are some inherent problems with their approach
   227 203]{Sulzmann2014} and by Grathwohl et al \cite[Page 36]{CrashCourse2014}
   228 (of which some of the proofs are not published in
   228 who wrote:
   229 \cite{Sulzmann2014}); perhaps more importantly, we give in this paper
       
   230 a simple inductive (and algorithm-independent) definition of what we
       
   231 call being a {\em POSIX value} for a regular expression @{term r} and
       
   232 a string @{term s}; we show that the algorithm by Sulzmann and Lu
       
   233 computes such a value and that such a value is unique. Our proofs are
       
   234 both done by hand and checked in Isabelle/HOL.  The experience of
       
   235 doing our proofs has been that this mechanical checking was absolutely
       
   236 essential: this subject area has hidden snares. This was also noted by
       
   237 Kuklewicz \cite{Kuklewicz} who found that nearly all POSIX matching
       
   238 implementations are ``buggy'' \cite[Page 203]{Sulzmann2014} and by
       
   239 Grathwohl et al \cite[Page 36]{CrashCourse2014} who wrote:
   229 
   240 
   230 \begin{quote}
   241 \begin{quote}
   231 \it{}``The POSIX strategy is more complicated than the greedy because of 
   242 \it{}``The POSIX strategy is more complicated than the greedy because of 
   232 the dependence on information about the length of matched strings in the 
   243 the dependence on information about the length of matched strings in the 
   233 various subexpressions.''
   244 various subexpressions.''
   261 of values based on positions by Okui and Suzuki \cite{OkuiSuzuki2010}.
   272 of values based on positions by Okui and Suzuki \cite{OkuiSuzuki2010}.
   262 Derivatives as calculated by Brzozowski's method are usually more complex
   273 Derivatives as calculated by Brzozowski's method are usually more complex
   263 regular expressions than the initial one; various optimisations are
   274 regular expressions than the initial one; various optimisations are
   264 possible. We prove the correctness when simplifications of @{term "ALT ZERO
   275 possible. We prove the correctness when simplifications of @{term "ALT ZERO
   265 r"}, @{term "ALT r ZERO"}, @{term "SEQ ONE r"} and @{term "SEQ r ONE"} to
   276 r"}, @{term "ALT r ZERO"}, @{term "SEQ ONE r"} and @{term "SEQ r ONE"} to
   266 @{term r} are applied.
   277 @{term r} are applied. We extend our results to ???
   267 
   278 
   268 *}
   279 *}
   269 
   280 
   270 section {* Preliminaries *}
   281 section {* Preliminaries *}
   271 
   282 
   447   \end{center}
   458   \end{center}
   448 
   459 
   449   \noindent Sulzmann and Lu also define inductively an inhabitation relation
   460   \noindent Sulzmann and Lu also define inductively an inhabitation relation
   450   that associates values to regular expressions. We define this relation as 
   461   that associates values to regular expressions. We define this relation as 
   451   follows:\footnote{Note that the rule for @{term Stars} differs from our 
   462   follows:\footnote{Note that the rule for @{term Stars} differs from our 
   452   erlier paper \cite{AusafDyckhoffUrban2016}. There we used the original
   463   earlier paper \cite{AusafDyckhoffUrban2016}. There we used the original
   453   definition by Sulzmann and Lu which does not require that the values @{term "v \<in> set vs"}
   464   definition by Sulzmann and Lu which does not require that the values @{term "v \<in> set vs"}
   454   flatten to a non-empty string. The reason for introducing the 
   465   flatten to a non-empty string. The reason for introducing the 
   455   more restricted version of lexical values is convenience later 
   466   more restricted version of lexical values is convenience later 
   456   on when reasoning about 
   467   on when reasoning about 
   457   an ordering relation for values.} 
   468   an ordering relation for values.} 
   746 
   757 
   747    
   758    
   748 
   759 
   749   \begin{theorem}\mbox{}\smallskip\\\label{posixdeterm}
   760   \begin{theorem}\mbox{}\smallskip\\\label{posixdeterm}
   750   \begin{tabular}{ll}
   761   \begin{tabular}{ll}
   751   @{text "(1)"} & If @{thm (prem 1) Posix1(1)} then @{thm (concl)
   762   (1) & If @{thm (prem 1) Posix1(1)} then @{thm (concl)
   752   Posix1(1)} and @{thm (concl) Posix1(2)}.\\
   763   Posix1(1)} and @{thm (concl) Posix1(2)}.\\
   753   @{text "(2)"} & @{thm[mode=IfThen] Posix_determ(1)[of _ _ "v" "v'"]}
   764   (2) & @{thm[mode=IfThen] Posix_determ(1)[of _ _ "v" "v'"]}
   754   \end{tabular}
   765   \end{tabular}
   755   \end{theorem}
   766   \end{theorem}
   756 
   767 
   757   \begin{proof} Both by induction on the definition of @{term "s \<in> r \<rightarrow> v"}. 
   768   \begin{proof} Both by induction on the definition of @{term "s \<in> r \<rightarrow> v"}. 
   758   The second parts follows by a case analysis of @{term "s \<in> r \<rightarrow> v'"} and
   769   The second parts follows by a case analysis of @{term "s \<in> r \<rightarrow> v'"} and
   770   Interesting is also the rule for sequence regular expressions (@{text
   781   Interesting is also the rule for sequence regular expressions (@{text
   771   "PS"}). The first two premises state that @{term "v\<^sub>1"} and @{term "v\<^sub>2"}
   782   "PS"}). The first two premises state that @{term "v\<^sub>1"} and @{term "v\<^sub>2"}
   772   are the POSIX values for @{term "(s\<^sub>1, r\<^sub>1)"} and @{term "(s\<^sub>2, r\<^sub>2)"}
   783   are the POSIX values for @{term "(s\<^sub>1, r\<^sub>1)"} and @{term "(s\<^sub>2, r\<^sub>2)"}
   773   respectively. Consider now the third premise and note that the POSIX value
   784   respectively. Consider now the third premise and note that the POSIX value
   774   of this rule should match the string \mbox{@{term "s\<^sub>1 @ s\<^sub>2"}}. According to the
   785   of this rule should match the string \mbox{@{term "s\<^sub>1 @ s\<^sub>2"}}. According to the
   775   longest match rule, we want that the @{term "s\<^sub>1"} is the longest initial
   786   Longest Match Rule, we want that the @{term "s\<^sub>1"} is the longest initial
   776   split of \mbox{@{term "s\<^sub>1 @ s\<^sub>2"}} such that @{term "s\<^sub>2"} is still recognised
   787   split of \mbox{@{term "s\<^sub>1 @ s\<^sub>2"}} such that @{term "s\<^sub>2"} is still recognised
   777   by @{term "r\<^sub>2"}. Let us assume, contrary to the third premise, that there
   788   by @{term "r\<^sub>2"}. Let us assume, contrary to the third premise, that there
   778   \emph{exist} an @{term "s\<^sub>3"} and @{term "s\<^sub>4"} such that @{term "s\<^sub>2"}
   789   \emph{exist} an @{term "s\<^sub>3"} and @{term "s\<^sub>4"} such that @{term "s\<^sub>2"}
   779   can be split up into a non-empty string @{term "s\<^sub>3"} and a possibly empty
   790   can be split up into a non-empty string @{term "s\<^sub>3"} and a possibly empty
   780   string @{term "s\<^sub>4"}. Moreover the longer string @{term "s\<^sub>1 @ s\<^sub>3"} can be
   791   string @{term "s\<^sub>4"}. Moreover the longer string @{term "s\<^sub>1 @ s\<^sub>3"} can be
   781   matched by @{text "r\<^sub>1"} and the shorter @{term "s\<^sub>4"} can still be
   792   matched by @{text "r\<^sub>1"} and the shorter @{term "s\<^sub>4"} can still be
   782   matched by @{term "r\<^sub>2"}. In this case @{term "s\<^sub>1"} would \emph{not} be the
   793   matched by @{term "r\<^sub>2"}. In this case @{term "s\<^sub>1"} would \emph{not} be the
   783   longest initial split of \mbox{@{term "s\<^sub>1 @ s\<^sub>2"}} and therefore @{term "Seq v\<^sub>1
   794   longest initial split of \mbox{@{term "s\<^sub>1 @ s\<^sub>2"}} and therefore @{term "Seq v\<^sub>1
   784   v\<^sub>2"} cannot be a POSIX value for @{term "(s\<^sub>1 @ s\<^sub>2, SEQ r\<^sub>1 r\<^sub>2)"}. 
   795   v\<^sub>2"} cannot be a POSIX value for @{term "(s\<^sub>1 @ s\<^sub>2, SEQ r\<^sub>1 r\<^sub>2)"}. 
   785   The main point is that our side-condition ensures the longest 
   796   The main point is that our side-condition ensures the Longest 
   786   match rule is satisfied.
   797   Match Rule is satisfied.
   787 
   798 
   788   A similar condition is imposed on the POSIX value in the @{text
   799   A similar condition is imposed on the POSIX value in the @{text
   789   "P\<star>"}-rule. Also there we want that @{term "s\<^sub>1"} is the longest initial
   800   "P\<star>"}-rule. Also there we want that @{term "s\<^sub>1"} is the longest initial
   790   split of @{term "s\<^sub>1 @ s\<^sub>2"} and furthermore the corresponding value
   801   split of @{term "s\<^sub>1 @ s\<^sub>2"} and furthermore the corresponding value
   791   @{term v} cannot be flattened to the empty string. In effect, we require
   802   @{term v} cannot be flattened to the empty string. In effect, we require
   792   that in each ``iteration'' of the star, some non-empty substring needs to
   803   that in each ``iteration'' of the star, some non-empty substring needs to
   793   be ``chipped'' away; only in case of the empty string we accept @{term
   804   be ``chipped'' away; only in case of the empty string we accept @{term
   794   "Stars []"} as the POSIX value. Indeed we can show that our POSIX value
   805   "Stars []"} as the POSIX value. Indeed we can show that our POSIX value
   795   is a lexical value which excludes those @{text Stars} containing values 
   806   is a lexical value which excludes those @{text Stars} containing subvalues 
   796   that flatten to the empty string.
   807   that flatten to the empty string.
   797 
   808 
   798   \begin{lemma}
   809   \begin{lemma}\label{LVposix}
   799   @{thm [mode=IfThen] Posix_LV}
   810   @{thm [mode=IfThen] Posix_LV}
   800   \end{lemma}
   811   \end{lemma}
   801 
   812 
   802   \begin{proof}
   813   \begin{proof}
   803   By routine induction on @{thm (prem 1) Posix_LV}.\qed 
   814   By routine induction on @{thm (prem 1) Posix_LV}.\qed 
   833   "s \<notin> L (der c r\<^sub>1)"} and @{term "s \<in> der c r\<^sub>2 \<rightarrow> v'"}. In @{text "(a)"} we
   844   "s \<notin> L (der c r\<^sub>1)"} and @{term "s \<in> der c r\<^sub>2 \<rightarrow> v'"}. In @{text "(a)"} we
   834   know @{term "s \<in> der c r\<^sub>1 \<rightarrow> v'"}, from which we can infer @{term "(c # s)
   845   know @{term "s \<in> der c r\<^sub>1 \<rightarrow> v'"}, from which we can infer @{term "(c # s)
   835   \<in> r\<^sub>1 \<rightarrow> injval r\<^sub>1 c v'"} by induction hypothesis and hence @{term "(c #
   846   \<in> r\<^sub>1 \<rightarrow> injval r\<^sub>1 c v'"} by induction hypothesis and hence @{term "(c #
   836   s) \<in> ALT r\<^sub>1 r\<^sub>2 \<rightarrow> injval (ALT r\<^sub>1 r\<^sub>2) c (Left v')"} as needed. Similarly
   847   s) \<in> ALT r\<^sub>1 r\<^sub>2 \<rightarrow> injval (ALT r\<^sub>1 r\<^sub>2) c (Left v')"} as needed. Similarly
   837   in subcase @{text "(b)"} where, however, in addition we have to use
   848   in subcase @{text "(b)"} where, however, in addition we have to use
   838   Prop.~\ref{derprop}(2) in order to infer @{term "c # s \<notin> L r\<^sub>1"} from @{term
   849   Proposition~\ref{derprop}(2) in order to infer @{term "c # s \<notin> L r\<^sub>1"} from @{term
   839   "s \<notin> L (der c r\<^sub>1)"}.
   850   "s \<notin> L (der c r\<^sub>1)"}.\smallskip
   840 
   851 
   841   \item[$\bullet$] Case @{term "r = SEQ r\<^sub>1 r\<^sub>2"}. There are three subcases:
   852   \item[$\bullet$] Case @{term "r = SEQ r\<^sub>1 r\<^sub>2"}. There are three subcases:
   842   
   853   
   843   \begin{quote}
   854   \begin{quote}
   844   \begin{description}
   855   \begin{description}
   851   \noindent For @{text "(a)"} we know @{term "s\<^sub>1 \<in> der c r\<^sub>1 \<rightarrow> v\<^sub>1"} and
   862   \noindent For @{text "(a)"} we know @{term "s\<^sub>1 \<in> der c r\<^sub>1 \<rightarrow> v\<^sub>1"} and
   852   @{term "s\<^sub>2 \<in> r\<^sub>2 \<rightarrow> v\<^sub>2"} as well as
   863   @{term "s\<^sub>2 \<in> r\<^sub>2 \<rightarrow> v\<^sub>2"} as well as
   853   %
   864   %
   854   \[@{term "\<not> (\<exists>s\<^sub>3 s\<^sub>4. s\<^sub>3 \<noteq> [] \<and> s\<^sub>3 @ s\<^sub>4 = s\<^sub>2 \<and> s\<^sub>1 @ s\<^sub>3 \<in> L (der c r\<^sub>1) \<and> s\<^sub>4 \<in> L r\<^sub>2)"}\]
   865   \[@{term "\<not> (\<exists>s\<^sub>3 s\<^sub>4. s\<^sub>3 \<noteq> [] \<and> s\<^sub>3 @ s\<^sub>4 = s\<^sub>2 \<and> s\<^sub>1 @ s\<^sub>3 \<in> L (der c r\<^sub>1) \<and> s\<^sub>4 \<in> L r\<^sub>2)"}\]
   855 
   866 
   856   \noindent From the latter we can infer by Prop.~\ref{derprop}(2):
   867   \noindent From the latter we can infer by Proposition~\ref{derprop}(2):
   857   %
   868   %
   858   \[@{term "\<not> (\<exists>s\<^sub>3 s\<^sub>4. s\<^sub>3 \<noteq> [] \<and> s\<^sub>3 @ s\<^sub>4 = s\<^sub>2 \<and> (c # s\<^sub>1) @ s\<^sub>3 \<in> L r\<^sub>1 \<and> s\<^sub>4 \<in> L r\<^sub>2)"}\]
   869   \[@{term "\<not> (\<exists>s\<^sub>3 s\<^sub>4. s\<^sub>3 \<noteq> [] \<and> s\<^sub>3 @ s\<^sub>4 = s\<^sub>2 \<and> (c # s\<^sub>1) @ s\<^sub>3 \<in> L r\<^sub>1 \<and> s\<^sub>4 \<in> L r\<^sub>2)"}\]
   859 
   870 
   860   \noindent We can use the induction hypothesis for @{text "r\<^sub>1"} to obtain
   871   \noindent We can use the induction hypothesis for @{text "r\<^sub>1"} to obtain
   861   @{term "(c # s\<^sub>1) \<in> r\<^sub>1 \<rightarrow> injval r\<^sub>1 c v\<^sub>1"}. Putting this all together allows us to infer
   872   @{term "(c # s\<^sub>1) \<in> r\<^sub>1 \<rightarrow> injval r\<^sub>1 c v\<^sub>1"}. Putting this all together allows us to infer
   867   we have @{term "(c # s) \<in> r\<^sub>2 \<rightarrow> (injval r\<^sub>2 c v\<^sub>1)"} by induction hypothesis
   878   we have @{term "(c # s) \<in> r\<^sub>2 \<rightarrow> (injval r\<^sub>2 c v\<^sub>1)"} by induction hypothesis
   868   for @{term "r\<^sub>2"}. From the latter we can infer
   879   for @{term "r\<^sub>2"}. From the latter we can infer
   869   %
   880   %
   870   \[@{term "\<not> (\<exists>s\<^sub>3 s\<^sub>4. s\<^sub>3 \<noteq> [] \<and> s\<^sub>3 @ s\<^sub>4 = c # s \<and> s\<^sub>3 \<in> L r\<^sub>1 \<and> s\<^sub>4 \<in> L r\<^sub>2)"}\]
   881   \[@{term "\<not> (\<exists>s\<^sub>3 s\<^sub>4. s\<^sub>3 \<noteq> [] \<and> s\<^sub>3 @ s\<^sub>4 = c # s \<and> s\<^sub>3 \<in> L r\<^sub>1 \<and> s\<^sub>4 \<in> L r\<^sub>2)"}\]
   871 
   882 
   872   \noindent By Lem.~\ref{lemmkeps} we know @{term "[] \<in> r\<^sub>1 \<rightarrow> (mkeps r\<^sub>1)"}
   883   \noindent By Lemma~\ref{lemmkeps} we know @{term "[] \<in> r\<^sub>1 \<rightarrow> (mkeps r\<^sub>1)"}
   873   holds. Putting this all together, we can conclude with @{term "(c #
   884   holds. Putting this all together, we can conclude with @{term "(c #
   874   s) \<in> SEQ r\<^sub>1 r\<^sub>2 \<rightarrow> Seq (mkeps r\<^sub>1) (injval r\<^sub>2 c v\<^sub>1)"}, as required.
   885   s) \<in> SEQ r\<^sub>1 r\<^sub>2 \<rightarrow> Seq (mkeps r\<^sub>1) (injval r\<^sub>2 c v\<^sub>1)"}, as required.
   875 
   886 
   876   Finally suppose @{term "r = STAR r\<^sub>1"}. This case is very similar to the
   887   Finally suppose @{term "r = STAR r\<^sub>1"}. This case is very similar to the
   877   sequence case, except that we need to also ensure that @{term "flat (injval r\<^sub>1
   888   sequence case, except that we need to also ensure that @{term "flat (injval r\<^sub>1
   880   r\<^sub>1 \<rightarrow> v\<^sub>1"} and the induction hypothesis).\qed
   891   r\<^sub>1 \<rightarrow> v\<^sub>1"} and the induction hypothesis).\qed
   881   \end{itemize}
   892   \end{itemize}
   882   \end{proof}
   893   \end{proof}
   883 
   894 
   884   \noindent
   895   \noindent
   885   With Lem.~\ref{Posix2} in place, it is completely routine to establish
   896   With Lemma~\ref{Posix2} in place, it is completely routine to establish
   886   that the Sulzmann and Lu lexer satisfies our specification (returning
   897   that the Sulzmann and Lu lexer satisfies our specification (returning
   887   the null value @{term "None"} iff the string is not in the language of the regular expression,
   898   the null value @{term "None"} iff the string is not in the language of the regular expression,
   888   and returning a unique POSIX value iff the string \emph{is} in the language):
   899   and returning a unique POSIX value iff the string \emph{is} in the language):
   889 
   900 
   890   \begin{theorem}\mbox{}\smallskip\\\label{lexercorrect}
   901   \begin{theorem}\mbox{}\smallskip\\\label{lexercorrect}
   893   (2) & @{thm (lhs) lexer_correct_Some} if and only if @{thm (rhs) lexer_correct_Some}\\
   904   (2) & @{thm (lhs) lexer_correct_Some} if and only if @{thm (rhs) lexer_correct_Some}\\
   894   \end{tabular}
   905   \end{tabular}
   895   \end{theorem}
   906   \end{theorem}
   896 
   907 
   897   \begin{proof}
   908   \begin{proof}
   898   By induction on @{term s} using Lem.~\ref{lemmkeps} and \ref{Posix2}.\qed  
   909   By induction on @{term s} using Lemma~\ref{lemmkeps} and \ref{Posix2}.\qed  
   899   \end{proof}
   910   \end{proof}
   900 
   911 
   901   \noindent In (2) we further know by Thm.~\ref{posixdeterm} that the
   912   \noindent In (2) we further know by Theorem~\ref{posixdeterm} that the
   902   value returned by the lexer must be unique.   A simple corollary 
   913   value returned by the lexer must be unique.   A simple corollary 
   903   of our two theorems is:
   914   of our two theorems is:
   904 
   915 
   905   \begin{corollary}\mbox{}\smallskip\\\label{lexercorrectcor}
   916   \begin{corollary}\mbox{}\smallskip\\\label{lexercorrectcor}
   906   \begin{tabular}{ll}
   917   \begin{tabular}{ll}
   907   (1) & @{thm (lhs) lexer_correctness(2)} if and only if @{thm (rhs) lexer_correctness(2)}\\ 
   918   (1) & @{thm (lhs) lexer_correctness(2)} if and only if @{thm (rhs) lexer_correctness(2)}\\ 
   908   (2) & @{thm (lhs) lexer_correctness(1)} if and only if @{thm (rhs) lexer_correctness(1)}\\
   919   (2) & @{thm (lhs) lexer_correctness(1)} if and only if @{thm (rhs) lexer_correctness(1)}\\
   909   \end{tabular}
   920   \end{tabular}
   910   \end{corollary}
   921   \end{corollary}
   911 
   922 
   912   \noindent
   923   \noindent This concludes our correctness proof. Note that we have
   913   This concludes our
   924   not changed the algorithm of Sulzmann and Lu,\footnote{All
   914   correctness proof. Note that we have not changed the algorithm of
   925   deviations we introduced are harmless.} but introduced our own
   915   Sulzmann and Lu,\footnote{All deviations we introduced are
   926   specification for what a correct result---a POSIX value---should be.
   916   harmless.} but introduced our own specification for what a correct
   927   In the next section we show that our specification coincides with
   917   result---a POSIX value---should be. A strong point in favour of
   928   another one given by Okui and Suzuki using a different technique.
   918   Sulzmann and Lu's algorithm is that it can be extended in various
       
   919   ways.
       
   920 
   929 
   921 *}
   930 *}
   922 
   931 
   923 section {* Ordering of Values according to Okui and Suzuki*}
   932 section {* Ordering of Values according to Okui and Suzuki*}
   924 
   933 
   938   Okui and Suzuki \cite{OkuiSuzuki2010,OkuiSuzukiTech} described
   947   Okui and Suzuki \cite{OkuiSuzuki2010,OkuiSuzukiTech} described
   939   another ordering of values, which they use to establish the
   948   another ordering of values, which they use to establish the
   940   correctness of their automata-based algorithm for POSIX matching.
   949   correctness of their automata-based algorithm for POSIX matching.
   941   Their ordering resembles some aspects of the one given by Sulzmann
   950   Their ordering resembles some aspects of the one given by Sulzmann
   942   and Lu, but is quite different. To begin with, Okui and Suzuki
   951   and Lu, but is quite different. To begin with, Okui and Suzuki
   943   identify POSIX values as least, rather than maximal, elements in
   952   identify POSIX values as minimal, rather than maximal, elements in
   944   their ordering. A more substantial difference is that the ordering
   953   their ordering. A more substantial difference is that the ordering
   945   by Okui and Suzuki uses \emph{positions} in order to identify and
   954   by Okui and Suzuki uses \emph{positions} in order to identify and
   946   compare subvalues. Positions are lists of natural numbers. This
   955   compare subvalues. Positions are lists of natural numbers. This
   947   allows them to quite naturally formalise the Longest Match and
   956   allows them to quite naturally formalise the Longest Match and
   948   Priority rules of the informal POSIX standard.  Consider for example
   957   Priority rules of the informal POSIX standard.  Consider for example
   992   @{thm (lhs) Pos_stars} & @{text "\<equiv>"} & @{thm (rhs) Pos_stars}\\
  1001   @{thm (lhs) Pos_stars} & @{text "\<equiv>"} & @{thm (rhs) Pos_stars}\\
   993   \end{tabular}
  1002   \end{tabular}
   994   \end{center}
  1003   \end{center}
   995 
  1004 
   996   \noindent 
  1005   \noindent 
   997   In the last clause @{text len} stands for the length of a list. Clearly
  1006   whereby @{text len} stands for the length of a list. Clearly
   998   for every position inside a value there exists a subvalue at that position.
  1007   for every position inside a value there exists a subvalue at that position.
   999  
  1008  
  1000 
  1009 
  1001   To help understanding the ordering of Okui and Suzuki, consider again 
  1010   To help understanding the ordering of Okui and Suzuki, consider again 
  1002   the earlier value
  1011   the earlier value
  1025   @{thm pflat_len_def}
  1034   @{thm pflat_len_def}
  1026   \end{center}
  1035   \end{center}
  1027 
  1036 
  1028   \noindent where we take the length of the flattened value at
  1037   \noindent where we take the length of the flattened value at
  1029   position @{text p}, provided the position is inside @{text v}; if
  1038   position @{text p}, provided the position is inside @{text v}; if
  1030   not, then the norm is @{text "-1"}. The default is crucial
  1039   not, then the norm is @{text "-1"}. The default for outside
  1031   for the POSIX requirement of preferring a @{text Left}-value
  1040   positions is crucial for the POSIX requirement of preferring a
  1032   over a @{text Right}-value (if they can match the same string---see
  1041   @{text Left}-value over a @{text Right}-value (if they can match the
  1033   the Priority Rule from the Introduction). For this consider
  1042   same string---see the Priority Rule from the Introduction). For this
       
  1043   consider
  1034 
  1044 
  1035   \begin{center}
  1045   \begin{center}
  1036   @{term "v == Left (Char x)"} \qquad and \qquad @{term "w == Right (Char x)"}
  1046   @{term "v == Left (Char x)"} \qquad and \qquad @{term "w == Right (Char x)"}
  1037   \end{center}
  1047   \end{center}
  1038 
  1048 
  1039   \noindent Both values match @{text x}, but at position @{text "[0]"}
  1049   \noindent Both values match @{text x}, but at position @{text "[0]"}
  1040   the norm of @{term v} is @{text 1} (the subvalue matches @{text x}), but the
  1050   the norm of @{term v} is @{text 1} (the subvalue matches @{text x}),
  1041   norm of @{text w} is @{text "-1"} (the position is outside @{text w}
  1051   but the norm of @{text w} is @{text "-1"} (the position is outside
  1042   according to how we defined the `inside' positions of @{text Left}-
  1052   @{text w} according to how we defined the `inside' positions of
  1043   and @{text Right}-values).  Of course at position @{text "[1]"}, the
  1053   @{text Left}- and @{text Right}-values).  Of course at position
  1044   norms @{term "pflat_len v [1]"} and @{term "pflat_len w [1]"} are
  1054   @{text "[1]"}, the norms @{term "pflat_len v [1]"} and @{term
  1045   reversed, but the point is that subvalues will be analysed according to
  1055   "pflat_len w [1]"} are reversed, but the point is that subvalues
  1046   lexicographically orderd positions.  This order, written @{term
  1056   will be analysed according to lexicographically ordered
  1047   "DUMMY \<sqsubset>lex DUMMY"}, can be conveniently formalised by
  1057   positions. According to this ordering, the position @{text "[0]"}
  1048   three inference rules
  1058   takes precedence.  The lexicographic ordering of positions, written
       
  1059   @{term "DUMMY \<sqsubset>lex DUMMY"}, can be conveniently formalised
       
  1060   by three inference rules
  1049 
  1061 
  1050   \begin{center}
  1062   \begin{center}
  1051   \begin{tabular}{ccc}
  1063   \begin{tabular}{ccc}
  1052   @{thm [mode=Axiom] lex_list.intros(1)}\hspace{1cm} &
  1064   @{thm [mode=Axiom] lex_list.intros(1)}\hspace{1cm} &
  1053   @{thm [mode=Rule] lex_list.intros(3)[where ?p1.0="p\<^sub>1" and ?p2.0="p\<^sub>2" and
  1065   @{thm [mode=Rule] lex_list.intros(3)[where ?p1.0="p\<^sub>1" and ?p2.0="p\<^sub>2" and
  1054                                             ?ps1.0="ps\<^sub>1" and ?ps2.0="ps\<^sub>2"]}\hspace{1cm} &
  1066                                             ?ps1.0="ps\<^sub>1" and ?ps2.0="ps\<^sub>2"]}\hspace{1cm} &
  1055   @{thm [mode=Rule] lex_list.intros(2)[where ?ps1.0="ps\<^sub>1" and ?ps2.0="ps\<^sub>2"]}
  1067   @{thm [mode=Rule] lex_list.intros(2)[where ?ps1.0="ps\<^sub>1" and ?ps2.0="ps\<^sub>2"]}
  1056   \end{tabular}
  1068   \end{tabular}
  1057   \end{center}
  1069   \end{center}
  1058 
  1070 
  1059   With the norm and lexicographic order of positions in place,
  1071   With the norm and lexicographic order in place,
  1060   we can state the key definition of Okui and Suzuki
  1072   we can state the key definition of Okui and Suzuki
  1061   \cite{OkuiSuzuki2010}: a value @{term "v\<^sub>1"} is \emph{smaller} than
  1073   \cite{OkuiSuzuki2010}: a value @{term "v\<^sub>1"} is \emph{smaller} than
  1062   @{term "v\<^sub>2"} if and only if  $(i)$ the norm at position @{text p} is
  1074   @{term "v\<^sub>2"} if and only if  $(i)$ the norm at position @{text p} is
  1063   greater in @{term "v\<^sub>1"} (that is the string @{term "flat (at v\<^sub>1 p)"} is longer 
  1075   greater in @{term "v\<^sub>1"} (that is the string @{term "flat (at v\<^sub>1 p)"} is longer 
  1064   than @{term "flat (at v\<^sub>2 p)"}) and $(ii)$ all subvalues at 
  1076   than @{term "flat (at v\<^sub>2 p)"}) and $(ii)$ all subvalues at 
  1078 
  1090 
  1079  \noindent The position @{text p} in this definition acts as the
  1091  \noindent The position @{text p} in this definition acts as the
  1080   \emph{first distinct position} of @{text "v\<^sub>1"} and @{text
  1092   \emph{first distinct position} of @{text "v\<^sub>1"} and @{text
  1081   "v\<^sub>2"}, where both values match strings of different length
  1093   "v\<^sub>2"}, where both values match strings of different length
  1082   \cite{OkuiSuzuki2010}.  Since at @{text p} the values @{text
  1094   \cite{OkuiSuzuki2010}.  Since at @{text p} the values @{text
  1083   "v\<^sub>1"} and @{text "v\<^sub>2"} macth different strings, the
  1095   "v\<^sub>1"} and @{text "v\<^sub>2"} match different strings, the
  1084   ordering is irreflexive. Derived from the definition above
  1096   ordering is irreflexive. Derived from the definition above
  1085   are the following two orderings:
  1097   are the following two orderings:
  1086   
  1098   
  1087   \begin{center}
  1099   \begin{center}
  1088   \begin{tabular}{l}
  1100   \begin{tabular}{l}
  1089   @{thm PosOrd_ex_def[where ?v1.0="v\<^sub>1" and ?v2.0="v\<^sub>2"]}\\
  1101   @{thm PosOrd_ex_def[where ?v1.0="v\<^sub>1" and ?v2.0="v\<^sub>2"]}\\
  1090   @{thm PosOrd_ex_eq_def[where ?v1.0="v\<^sub>1" and ?v2.0="v\<^sub>2"]}
  1102   @{thm PosOrd_ex_eq_def[where ?v1.0="v\<^sub>1" and ?v2.0="v\<^sub>2"]}
  1091   \end{tabular}
  1103   \end{tabular}
  1092   \end{center}
  1104   \end{center}
  1093 
  1105 
  1094  While we encountred a number of obstacles for establishing properties like
  1106  While we encountered a number of obstacles for establishing properties like
  1095  transitivity for the ordering of Sulzmann and Lu (and which we failed
  1107  transitivity for the ordering of Sulzmann and Lu (and which we failed
  1096  to overcome), it is relatively straightforward to establish this
  1108  to overcome), it is relatively straightforward to establish this
  1097  property for the ordering by Okui and Suzuki.
  1109  property for the ordering by Okui and Suzuki.
  1098 
  1110 
  1099  \begin{lemma}[Transitivity]\label{transitivity}
  1111  \begin{lemma}[Transitivity]\label{transitivity}
  1115  @{term "pflat_len v\<^sub>1 p' = pflat_len v\<^sub>3 p'"} holds.
  1127  @{term "pflat_len v\<^sub>1 p' = pflat_len v\<^sub>3 p'"} holds.
  1116  Suppose @{term "p' \<in> Pos v\<^sub>1"}, then we can infer from the 
  1128  Suppose @{term "p' \<in> Pos v\<^sub>1"}, then we can infer from the 
  1117  first assumption that @{term "pflat_len v\<^sub>1 p' = pflat_len v\<^sub>2 p'"}.
  1129  first assumption that @{term "pflat_len v\<^sub>1 p' = pflat_len v\<^sub>2 p'"}.
  1118  But this means that @{term "p'"} must be in  @{term "Pos v\<^sub>2"} too.
  1130  But this means that @{term "p'"} must be in  @{term "Pos v\<^sub>2"} too.
  1119  Hence we can use the second assumption and infer  @{term "pflat_len v\<^sub>2 p' = pflat_len v\<^sub>3 p'"}, which concludes
  1131  Hence we can use the second assumption and infer  @{term "pflat_len v\<^sub>2 p' = pflat_len v\<^sub>3 p'"}, which concludes
  1120  this case with @{term "v\<^sub>1 :\<sqsubseteq>val v\<^sub>2"}. 
  1132  this case with @{term "v\<^sub>1 :\<sqsubset>val v\<^sub>3"}. 
  1121  The reasoning in the other cases is similar.\qed
  1133  The reasoning in the other cases is similar.\qed
  1122  \end{proof}
  1134  \end{proof}
  1123 
  1135 
  1124  \noindent We can show that @{term "DUMMY :\<sqsubseteq>val DUMMY"} is
  1136  \noindent It is straightforward to show that @{text "\<prec>"} and
  1125  a partial order.  Okui and Suzuki also show that it is a linear order
  1137  $\preccurlyeq$ are partial orders.  Okui and Suzuki also show that it
  1126  for lexical values \cite{OkuiSuzuki2010} of a given regular
  1138  is a linear order for lexical values \cite{OkuiSuzuki2010} of a given
  1127  expression and given string, but we have not done this. It is not
  1139  regular expression and given string, but we have not done this. It is
  1128  essential for our results. What we are going to show below is that
  1140  not essential for our results. What we are going to show below is
  1129  for a given @{text r} and @{text s}, the ordering has a unique
  1141  that for a given @{text r} and @{text s}, the ordering has a unique
  1130  minimal element on the set @{term "LV r s"}, which is the POSIX value
  1142  minimal element on the set @{term "LV r s"}, which is the POSIX value
  1131  we defined in the previous section.
  1143  we defined in the previous section.
  1132 
  1144 
  1133 
  1145 
  1134  Lemma 1
  1146  Lemma 1
  1140  @{thm [mode=IfThen] PosOrd_shorterI[where ?v1.0="v\<^sub>1" and ?v2.0="v\<^sub>2"]} 
  1152  @{thm [mode=IfThen] PosOrd_shorterI[where ?v1.0="v\<^sub>1" and ?v2.0="v\<^sub>2"]} 
  1141 
  1153 
  1142  
  1154  
  1143 
  1155 
  1144   Next we establish how Okui and Suzuki's ordering relates to our
  1156   Next we establish how Okui and Suzuki's ordering relates to our
  1145   definition of POSIX values.  Given a POSIX value @{text "v\<^sub>1"}
  1157   definition of POSIX values.  Given a @{text POSIX} value @{text "v\<^sub>1"}
  1146   for @{text r} and @{text s}, then any other lexical value @{text
  1158   for @{text r} and @{text s}, then any other lexical value @{text
  1147   "v\<^sub>2"} in @{term "LV r s"} is greater or equal than @{text
  1159   "v\<^sub>2"} in @{term "LV r s"} is greater or equal than @{text
  1148   "v\<^sub>1"}:
  1160   "v\<^sub>1"}, namely:
  1149 
  1161 
  1150 
  1162 
  1151   \begin{theorem}
  1163   \begin{theorem}\label{orderone}
  1152   @{thm [mode=IfThen] Posix_PosOrd[where ?v1.0="v\<^sub>1" and ?v2.0="v\<^sub>2"]}
  1164   @{thm [mode=IfThen] Posix_PosOrd[where ?v1.0="v\<^sub>1" and ?v2.0="v\<^sub>2"]}
  1153   \end{theorem}
  1165   \end{theorem}
  1154 
  1166 
  1155   \begin{proof} By induction on our POSIX rules. By
  1167   \begin{proof} By induction on our POSIX rules. By
  1156   Thm~\ref{posixdeterm} and the definition of @{const LV}, it is clear
  1168   Theorem~\ref{posixdeterm} and the definition of @{const LV}, it is clear
  1157   that @{text "v\<^sub>1"} and @{text "v\<^sub>2"} have the same
  1169   that @{text "v\<^sub>1"} and @{text "v\<^sub>2"} have the same
  1158   underlying string @{term s}.  The three base cases are
  1170   underlying string @{term s}.  The three base cases are
  1159   straightforward: for example for @{term "v\<^sub>1 = Void"}, we have
  1171   straightforward: for example for @{term "v\<^sub>1 = Void"}, we have
  1160   that @{term "v\<^sub>2 \<in> LV ONE []"} must also be of the form
  1172   that @{term "v\<^sub>2 \<in> LV ONE []"} must also be of the form
  1161   \mbox{@{term "v\<^sub>2 = Void"}}. Therefore we have @{term
  1173   \mbox{@{term "v\<^sub>2 = Void"}}. Therefore we have @{term
  1162   "v\<^sub>1 :\<sqsubseteq>val v\<^sub>2"}.  The inductive cases for
  1174   "v\<^sub>1 :\<sqsubseteq>val v\<^sub>2"}.  The inductive cases for
  1163   @{term "ALT r\<^sub>1 r\<^sub>2"} and @{term "SEQ r\<^sub>1
  1175   @{text r} being of the form @{term "ALT r\<^sub>1 r\<^sub>2"} and
  1164   r\<^sub>2"} are as follows:
  1176   @{term "SEQ r\<^sub>1 r\<^sub>2"} are as follows:
  1165 
  1177 
  1166 
  1178 
  1167   \begin{itemize} 
  1179   \begin{itemize} 
  1168 
  1180 
  1169   \item[$\bullet$] Case @{term "s \<in> (ALT r\<^sub>1 r\<^sub>2)
  1181   \item[$\bullet$] Case @{text "P+L"} with @{term "s \<in> (ALT r\<^sub>1 r\<^sub>2)
  1170   \<rightarrow> (Left w\<^sub>1)"}: In this case @{term "v\<^sub>1 =
  1182   \<rightarrow> (Left w\<^sub>1)"}: In this case @{term "v\<^sub>1 =
  1171   Left w\<^sub>1"} and the value @{term "v\<^sub>2"} is either of the
  1183   Left w\<^sub>1"} and the value @{term "v\<^sub>2"} is either of the
  1172   form @{term "Left w\<^sub>2"} or @{term "Right w\<^sub>2"}. In the
  1184   form @{term "Left w\<^sub>2"} or @{term "Right w\<^sub>2"}. In the
  1173   latter case we can immediately conclude with @{term "v\<^sub>1
  1185   latter case we can immediately conclude with @{term "v\<^sub>1
  1174   :\<sqsubseteq>val v\<^sub>2"} since a @{text Left}-value with the
  1186   :\<sqsubseteq>val v\<^sub>2"} since a @{text Left}-value with the
  1175   same underlying string @{text s} is always smaller or equal than a
  1187   same underlying string @{text s} is always smaller than a
  1176   @{text Right}-value.  In the former case we have @{term "w\<^sub>2
  1188   @{text Right}-value.  In the former case we have @{term "w\<^sub>2
  1177   \<in> LV r\<^sub>1 s"} and can use the induction hypothesis to infer
  1189   \<in> LV r\<^sub>1 s"} and can use the induction hypothesis to infer
  1178   @{term "w\<^sub>1 :\<sqsubseteq>val w\<^sub>2"}. Because @{term
  1190   @{term "w\<^sub>1 :\<sqsubseteq>val w\<^sub>2"}. Because @{term
  1179   "w\<^sub>1"} and @{term "w\<^sub>2"} have the same underlying string
  1191   "w\<^sub>1"} and @{term "w\<^sub>2"} have the same underlying string
  1180   @{text s}, we can conclude with @{term "Left w\<^sub>1
  1192   @{text s}, we can conclude with @{term "Left w\<^sub>1
  1181   :\<sqsubseteq>val Left w\<^sub>2"} by Prop ???.\smallskip
  1193   :\<sqsubseteq>val Left w\<^sub>2"} by Prop ???.\smallskip
  1182 
  1194 
  1183   \item[$\bullet$] Case @{term "s \<in> (ALT r\<^sub>1 r\<^sub>2)
  1195   \item[$\bullet$] Case @{text "P+R"} with @{term "s \<in> (ALT r\<^sub>1 r\<^sub>2)
  1184   \<rightarrow> (Right w\<^sub>1)"}: This case similar as the previous
  1196   \<rightarrow> (Right w\<^sub>1)"}: This case similar to the previous
  1185   case, except that we know that @{term "s \<notin> L
  1197   case, except that we additionally know @{term "s \<notin> L
  1186   r\<^sub>1"}. This is needed when @{term "v\<^sub>2"} is of the form
  1198   r\<^sub>1"}. This is needed when @{term "v\<^sub>2"} is of the form
  1187   @{term "Left w\<^sub>2"}: since \mbox{@{term "flat v\<^sub>2 = flat
  1199   @{term "Left w\<^sub>2"}. Since \mbox{@{term "flat v\<^sub>2 = flat
  1188   w\<^sub>2"} @{text "= s"}} and @{term "\<Turnstile> w\<^sub>2 :
  1200   w\<^sub>2"} @{text "= s"}} and @{term "\<Turnstile> w\<^sub>2 :
  1189   r\<^sub>1"}, we can derive a contradiction using
  1201   r\<^sub>1"}, we can derive a contradiction for \mbox{@{term "s \<notin> L
  1190   Prop.~\ref{inhabs}. So also in this case \mbox{@{term "v\<^sub>1
  1202   r\<^sub>1"}} using
       
  1203   Proposition~\ref{inhabs}. So also in this case \mbox{@{term "v\<^sub>1
  1191   :\<sqsubseteq>val v\<^sub>2"}}.\smallskip
  1204   :\<sqsubseteq>val v\<^sub>2"}}.\smallskip
  1192 
  1205 
  1193   \item[$\bullet$] Case @{term "(s\<^sub>1 @ s\<^sub>2) \<in> (SEQ
  1206   \item[$\bullet$] Case @{text "PS"} with @{term "(s\<^sub>1 @ s\<^sub>2) \<in> (SEQ
  1194   r\<^sub>1 r\<^sub>2) \<rightarrow> (Seq w\<^sub>1 w\<^sub>2)"}: We
  1207   r\<^sub>1 r\<^sub>2) \<rightarrow> (Seq w\<^sub>1 w\<^sub>2)"}: We
  1195   can assume @{term "v\<^sub>2 = Seq (u\<^sub>1) (u\<^sub>2)"} with
  1208   can assume @{term "v\<^sub>2 = Seq (u\<^sub>1) (u\<^sub>2)"} with
  1196   @{term "\<Turnstile> u\<^sub>1 : r\<^sub>1"} and \mbox{@{term
  1209   @{term "\<Turnstile> u\<^sub>1 : r\<^sub>1"} and \mbox{@{term
  1197   "\<Turnstile> u\<^sub>2 : r\<^sub>2"}}. We have @{term "s\<^sub>1 @
  1210   "\<Turnstile> u\<^sub>2 : r\<^sub>2"}}. We have @{term "s\<^sub>1 @
  1198   s\<^sub>2 = (flat u\<^sub>1) @ (flat u\<^sub>2)"}.  By the
  1211   s\<^sub>2 = (flat u\<^sub>1) @ (flat u\<^sub>2)"}.  By the
  1199   side-condition on out @{text PS}-rule we know that either @{term
  1212   side-condition of the @{text PS}-rule we know that either @{term
  1200   "s\<^sub>1 = flat u\<^sub>1"} or @{term "flat u\<^sub>1"} is a
  1213   "s\<^sub>1 = flat u\<^sub>1"} or that @{term "flat u\<^sub>1"} is a
  1201   strict prefix ??? of @{term "s\<^sub>1"}. In the latter case we can
  1214   strict prefix ??? of @{term "s\<^sub>1"}. In the latter case we can
  1202   infer @{term "w\<^sub>1 :\<sqsubset>val u\<^sub>1"} by ???  and from
  1215   infer @{term "w\<^sub>1 :\<sqsubset>val u\<^sub>1"} by ???  and from
  1203   this @{term "v\<^sub>1 :\<sqsubseteq>val v\<^sub>2"} by ???.  In the
  1216   this @{term "v\<^sub>1 :\<sqsubseteq>val v\<^sub>2"} by ???.  In the
  1204   former case we know @{term "u\<^sub>1 \<in> LV r\<^sub>1 s\<^sub>1"}
  1217   former case we know @{term "u\<^sub>1 \<in> LV r\<^sub>1 s\<^sub>1"}
  1205   and @{term "u\<^sub>2 \<in> LV r\<^sub>2 s\<^sub>2"}. With this we
  1218   and @{term "u\<^sub>2 \<in> LV r\<^sub>2 s\<^sub>2"}. With this we
  1206   can use the induction hypotheses to infer @{term "w\<^sub>1
  1219   can use the induction hypotheses to infer @{term "w\<^sub>1
  1207   :\<sqsubseteq>val u\<^sub>1"} and @{term "w\<^sub>2
  1220   :\<sqsubseteq>val u\<^sub>1"} and @{term "w\<^sub>2
  1208   :\<sqsubseteq>val u\<^sub>2"}. By ??? we can again infer @{term
  1221   :\<sqsubseteq>val u\<^sub>2"}. By ??? we can again infer @{term
  1209   "v\<^sub>1 :\<sqsubseteq>val v\<^sub>2"} and are done.
  1222   "v\<^sub>1 :\<sqsubseteq>val v\<^sub>2"}.
  1210 
  1223 
  1211   \end{itemize}
  1224   \end{itemize}
  1212 
  1225 
  1213   \noindent The case for @{text "P\<star>"} is similar to the @{text PS}-case.\qed
  1226   \noindent The case for @{text "P\<star>"} is similar to the @{text PS}-case and omitted.\qed
  1214   \end{proof}
  1227   \end{proof}
  1215 
  1228 
  1216   \noindent This theorem shows that our posix value @{text
  1229   \noindent This theorem shows that our @{text POSIX} value for a
  1217   "v\<^sub>1"} with @{term "s \<in> r \<rightarrow> v\<^sub>1"} is a
  1230   regular expression @{text r} and string @{term s} is in fact a
  1218   minimal element for the values in @{text "LV r s"}. By ??? we also
  1231   minimal element of the values in @{text "LV r s"}. By ??? we also
  1219   know that any value in @{text "LV s' r"}, with @{term "s'"} being a
  1232   know that any value in @{text "LV s' r"}, with @{term "s'"} being a
  1220   prefix, cannot be smaller than @{text "v\<^sub>1"}. The next theorem
  1233   prefix, cannot be smaller than @{text "v\<^sub>1"}. The next theorem
  1221   shows the opposite---namely any minimal element must be a @{text
  1234   shows the opposite---namely any minimal element in @{term "LV r s"}
  1222   POSIX}-value.  Given a lexical value @{text "v\<^sub>1"}, say, in
  1235   must be a @{text POSIX} value. For this it helps that we proved in
  1223   @{term "LV r s"}, for which there does not exists any smaller value
  1236   the previous section that whenever a string @{term "s \<in> L r"} then 
  1224   in @{term "LV r s"}, then this value is our @{text POSIX}-value:
  1237   a corresponding @{text POSIX} value exists and is a lexical value, 
       
  1238   see Theorem ??? and Lemma ???. 
  1225 
  1239 
  1226   \begin{theorem}
  1240   \begin{theorem}
  1227   @{thm [mode=IfThen] PosOrd_Posix[where ?v1.0="v\<^sub>1"]}
  1241   @{thm [mode=IfThen] PosOrd_Posix[where ?v1.0="v\<^sub>1"]} 
  1228   \end{theorem}
  1242   \end{theorem}
  1229 
  1243 
  1230   \begin{proof} By induction on @{text r}. The tree base cases are
  1244   \begin{proof} 
  1231   again straightforward.  For example if @{term "v\<^sub>1 \<in> LV
  1245   If @{thm (prem 1) PosOrd_Posix[where ?v1.0="v\<^sub>1"]} then 
  1232   ONE s"} then @{term "v\<^sub>1 = Void"} and @{term "s = []"}. We
  1246   @{term "s \<in> L r"} by Proposition~\ref{inhabs}. Hence by Theorem~\ref{lexercorrect}(2) 
  1233   know that @{term "[] \<in> ONE \<rightarrow> Void"} holds.  In the
  1247   there exists a
  1234   cases for @{term "ALT r\<^sub>1 r\<^sub>2"} and @{term "SEQ
  1248   @{text POSIX} value @{term "v\<^sub>P"} with @{term "s \<in> r \<rightarrow> v\<^sub>P"}
  1235   r\<^sub>1 r\<^sub>2"} we reason as follows:
  1249   and by Lemma~\ref{LVposix} we also have @{term "v\<^sub>P \<in> LV r s"}.
  1236 
  1250   By Theorem~\ref{orderone} we therefore have 
  1237   \begin{itemize}
  1251   @{term "v\<^sub>P :\<sqsubseteq>val v\<^sub>1"}. If @{term "v\<^sub>P = v\<^sub>1"} then
  1238 
  1252   we are done. Otherwise we have @{term "v\<^sub>P :\<sqsubset>val v\<^sub>1"} which 
  1239   \item[$\bullet$] Case @{term "v\<^sub>1 \<in> LV (ALT r\<^sub>1
  1253   however contradicts the second assumption and we are done too.\qed
  1240   r\<^sub>2) s"} with @{term "v\<^sub>1 = Left w\<^sub>1"} and @{term
       
  1241   "w\<^sub>1 \<in> LV r\<^sub>1 s"}: In order to use the induction
       
  1242   hypothesis we need to infer 
       
  1243 
       
  1244   \begin{center}
       
  1245   @{term "\<forall>v'
       
  1246   \<in> LV (ALT r\<^sub>1 r\<^sub>2) s. \<not> (v' :\<sqsubset>val
       
  1247   Left w\<^sub>1)"}
       
  1248   implies
       
  1249   @{term "\<forall>v' \<in> LV r\<^sub>1
       
  1250   s. \<not> (v' :\<sqsubset>val w\<^sub>1)"}
       
  1251   \end{center}
       
  1252 
       
  1253   \noindent This can be done because of ?? we can infer that for every
       
  1254   @{text v'} in @{term "LV r\<^sub>1 s"} and @{term "v'
       
  1255   :\<sqsubset>val w\<^sub>1"} also @{term "Left v' :\<sqsubset>val
       
  1256   Left w\<^sub>1"} holds. This gives a contradiction. Consequently, we
       
  1257   can use the induction hypothesis to obtain @{term "s \<in> r\<^sub>1
       
  1258   \<rightarrow> w\<^sub>1"} and then conclude this case with @{term "s
       
  1259   \<in> ALT r\<^sub>1 r\<^sub>2 \<rightarrow> v\<^sub>1"}.\smallskip
       
  1260 
       
  1261   \item[$\bullet$] Case @{term "v\<^sub>1 \<in> LV (ALT r\<^sub>1
       
  1262   r\<^sub>2) s"} with @{term "v\<^sub>1 = Right w\<^sub>1"} and @{term
       
  1263   "w\<^sub>1 \<in> LV r\<^sub>2 s"}: Like above we can use the 
       
  1264   induction hypothesis in order to infer @{term "s \<in> r\<^sub>2
       
  1265   \<rightarrow> w\<^sub>1"}. In order to conclude we still need to
       
  1266   obtain @{term "s \<notin> L r\<^sub>1"}. Let us suppose the opposite.
       
  1267   Then there is a @{term v'} such that @{term "v' \<in> LV r\<^sub>1 s"}
       
  1268   by Prop ??? and hence @{term "Left v' \<in> LV (ALT r\<^sub>1 r\<^sub>2) s"}.
       
  1269   But then we can use the second assumption of the theorem to infer that 
       
  1270   @{term "\<not>(Left v' :\<sqsubset>val Right w\<^sub>1)"}, which cannot be the case.
       
  1271   Therefore @{term "s \<notin> L r\<^sub>1"} must hold and we can also conclude in this
       
  1272   case.
       
  1273 
       
  1274  
       
  1275 
       
  1276 
       
  1277   \end{itemize}
       
  1278 
       
  1279   \end{proof}
  1254   \end{proof}
  1280 
  1255 
  1281   To sum up, we have shown that minimal elements according to the ordering
  1256   \begin{corollary}
  1282   by Okui and Suzuki are exactly the @{text POSIX}-values we defined inductively
  1257   @{thm [mode=IfThen] Least_existence1}
  1283   in Section ???
  1258   \end{corollary}
       
  1259 
       
  1260   \noindent To sum up, we have shown that minimal elements according
       
  1261   to the ordering by Okui and Suzuki are exactly the @{text POSIX}
       
  1262   values we defined inductively in Section~\ref{posixsec} 
       
  1263 
       
  1264 
       
  1265    IS THE minimal element unique? We have not shown totality.
  1284 *}
  1266 *}
  1285 
  1267 
  1286 
  1268 section {* Optimisations *}
  1287 section {* Extensions and Optimisations*}
       
  1288 
  1269 
  1289 text {*
  1270 text {*
  1290 
       
  1291   If we are interested in tokenising a string, then we need to not just
       
  1292   split up the string into tokens, but also ``classify'' the tokens (for
       
  1293   example whether it is a keyword or an identifier). This can be done with
       
  1294   only minor modifications to the algorithm by introducing \emph{record
       
  1295   regular expressions} and \emph{record values} (for example
       
  1296   \cite{Sulzmann2014b}):
       
  1297 
       
  1298   \begin{center}  
       
  1299   @{text "r :="}
       
  1300   @{text "..."} $\mid$
       
  1301   @{text "(l : r)"} \qquad\qquad
       
  1302   @{text "v :="}
       
  1303   @{text "..."} $\mid$
       
  1304   @{text "(l : v)"}
       
  1305   \end{center}
       
  1306   
       
  1307   \noindent where @{text l} is a label, say a string, @{text r} a regular
       
  1308   expression and @{text v} a value. All functions can be smoothly extended
       
  1309   to these regular expressions and values. For example \mbox{@{text "(l :
       
  1310   r)"}} is nullable iff @{term r} is, and so on. The purpose of the record
       
  1311   regular expression is to mark certain parts of a regular expression and
       
  1312   then record in the calculated value which parts of the string were matched
       
  1313   by this part. The label can then serve as classification for the tokens.
       
  1314   For this recall the regular expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} for
       
  1315   keywords and identifiers from the Introduction. With the record regular
       
  1316   expression we can form \mbox{@{text "((key : r\<^bsub>key\<^esub>) + (id : r\<^bsub>id\<^esub>))\<^sup>\<star>"}}
       
  1317   and then traverse the calculated value and only collect the underlying
       
  1318   strings in record values. With this we obtain finite sequences of pairs of
       
  1319   labels and strings, for example
       
  1320 
       
  1321   \[@{text "(l\<^sub>1 : s\<^sub>1), ..., (l\<^sub>n : s\<^sub>n)"}\]
       
  1322   
       
  1323   \noindent from which tokens with classifications (keyword-token,
       
  1324   identifier-token and so on) can be extracted.
       
  1325 
  1271 
  1326   Derivatives as calculated by \Brz's method are usually more complex
  1272   Derivatives as calculated by \Brz's method are usually more complex
  1327   regular expressions than the initial one; the result is that the
  1273   regular expressions than the initial one; the result is that the
  1328   derivative-based matching and lexing algorithms are often abysmally slow.
  1274   derivative-based matching and lexing algorithms are often abysmally slow.
  1329   However, various optimisations are possible, such as the simplifications
  1275   However, various optimisations are possible, such as the simplifications
  1458   particular for @{term "r"} being the derivative @{term "der c
  1404   particular for @{term "r"} being the derivative @{term "der c
  1459   r"}). Let @{term "r\<^sub>s"} be the simplified derivative regular expression, that is @{term
  1405   r"}). Let @{term "r\<^sub>s"} be the simplified derivative regular expression, that is @{term
  1460   "fst (simp (der c r))"}, and @{term "f\<^sub>r"} be the rectification
  1406   "fst (simp (der c r))"}, and @{term "f\<^sub>r"} be the rectification
  1461   function, that is @{term "snd (simp (der c r))"}.  We distinguish the cases
  1407   function, that is @{term "snd (simp (der c r))"}.  We distinguish the cases
  1462   whether (*) @{term "s \<in> L (der c r)"} or not. In the first case we
  1408   whether (*) @{term "s \<in> L (der c r)"} or not. In the first case we
  1463   have by Thm.~\ref{lexercorrect}(2) a value @{term "v"} so that @{term
  1409   have by Theorem~\ref{lexercorrect}(2) a value @{term "v"} so that @{term
  1464   "lexer (der c r) s = Some v"} and @{term "s \<in> der c r \<rightarrow> v"} hold.
  1410   "lexer (der c r) s = Some v"} and @{term "s \<in> der c r \<rightarrow> v"} hold.
  1465   By Lem.~\ref{slexeraux}(1) we can also infer from~(*) that @{term "s
  1411   By Lemma~\ref{slexeraux}(1) we can also infer from~(*) that @{term "s
  1466   \<in> L r\<^sub>s"} holds.  Hence we know by Thm.~\ref{lexercorrect}(2) that
  1412   \<in> L r\<^sub>s"} holds.  Hence we know by Theorem~\ref{lexercorrect}(2) that
  1467   there exists a @{term "v'"} with @{term "lexer r\<^sub>s s = Some v'"} and
  1413   there exists a @{term "v'"} with @{term "lexer r\<^sub>s s = Some v'"} and
  1468   @{term "s \<in> r\<^sub>s \<rightarrow> v'"}. From the latter we know by
  1414   @{term "s \<in> r\<^sub>s \<rightarrow> v'"}. From the latter we know by
  1469   Lem.~\ref{slexeraux}(2) that @{term "s \<in> der c r \<rightarrow> (f\<^sub>r v')"} holds.
  1415   Lemma~\ref{slexeraux}(2) that @{term "s \<in> der c r \<rightarrow> (f\<^sub>r v')"} holds.
  1470   By the uniqueness of the POSIX relation (Thm.~\ref{posixdeterm}) we
  1416   By the uniqueness of the POSIX relation (Theorem~\ref{posixdeterm}) we
  1471   can infer that @{term v} is equal to @{term "f\<^sub>r v'"}---that is the 
  1417   can infer that @{term v} is equal to @{term "f\<^sub>r v'"}---that is the 
  1472   rectification function applied to @{term "v'"}
  1418   rectification function applied to @{term "v'"}
  1473   produces the original @{term "v"}.  Now the case follows by the
  1419   produces the original @{term "v"}.  Now the case follows by the
  1474   definitions of @{const lexer} and @{const slexer}.
  1420   definitions of @{const lexer} and @{const slexer}.
  1475 
  1421 
  1476   In the second case where @{term "s \<notin> L (der c r)"} we have that
  1422   In the second case where @{term "s \<notin> L (der c r)"} we have that
  1477   @{term "lexer (der c r) s = None"} by Thm.~\ref{lexercorrect}(1).  We
  1423   @{term "lexer (der c r) s = None"} by Theorem~\ref{lexercorrect}(1).  We
  1478   also know by Lem.~\ref{slexeraux}(1) that @{term "s \<notin> L r\<^sub>s"}. Hence
  1424   also know by Lemma~\ref{slexeraux}(1) that @{term "s \<notin> L r\<^sub>s"}. Hence
  1479   @{term "lexer r\<^sub>s s = None"} by Thm.~\ref{lexercorrect}(1) and
  1425   @{term "lexer r\<^sub>s s = None"} by Theorem~\ref{lexercorrect}(1) and
  1480   by IH then also @{term "slexer r\<^sub>s s = None"}. With this we can
  1426   by IH then also @{term "slexer r\<^sub>s s = None"}. With this we can
  1481   conclude in this case too.\qed   
  1427   conclude in this case too.\qed   
  1482 
  1428 
  1483   \end{proof} 
  1429   \end{proof} 
       
  1430 
       
  1431 *}
       
  1432 
       
  1433 section {* Extensions*}
       
  1434 
       
  1435 text {*
       
  1436 
       
  1437   A strong point in favour of
       
  1438   Sulzmann and Lu's algorithm is that it can be extended in various
       
  1439   ways.
       
  1440 
       
  1441   If we are interested in tokenising a string, then we need to not just
       
  1442   split up the string into tokens, but also ``classify'' the tokens (for
       
  1443   example whether it is a keyword or an identifier). This can be done with
       
  1444   only minor modifications to the algorithm by introducing \emph{record
       
  1445   regular expressions} and \emph{record values} (for example
       
  1446   \cite{Sulzmann2014b}):
       
  1447 
       
  1448   \begin{center}  
       
  1449   @{text "r :="}
       
  1450   @{text "..."} $\mid$
       
  1451   @{text "(l : r)"} \qquad\qquad
       
  1452   @{text "v :="}
       
  1453   @{text "..."} $\mid$
       
  1454   @{text "(l : v)"}
       
  1455   \end{center}
       
  1456   
       
  1457   \noindent where @{text l} is a label, say a string, @{text r} a regular
       
  1458   expression and @{text v} a value. All functions can be smoothly extended
       
  1459   to these regular expressions and values. For example \mbox{@{text "(l :
       
  1460   r)"}} is nullable iff @{term r} is, and so on. The purpose of the record
       
  1461   regular expression is to mark certain parts of a regular expression and
       
  1462   then record in the calculated value which parts of the string were matched
       
  1463   by this part. The label can then serve as classification for the tokens.
       
  1464   For this recall the regular expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} for
       
  1465   keywords and identifiers from the Introduction. With the record regular
       
  1466   expression we can form \mbox{@{text "((key : r\<^bsub>key\<^esub>) + (id : r\<^bsub>id\<^esub>))\<^sup>\<star>"}}
       
  1467   and then traverse the calculated value and only collect the underlying
       
  1468   strings in record values. With this we obtain finite sequences of pairs of
       
  1469   labels and strings, for example
       
  1470 
       
  1471   \[@{text "(l\<^sub>1 : s\<^sub>1), ..., (l\<^sub>n : s\<^sub>n)"}\]
       
  1472   
       
  1473   \noindent from which tokens with classifications (keyword-token,
       
  1474   identifier-token and so on) can be extracted.
       
  1475 
       
  1476 
  1484 *}
  1477 *}
  1485 
  1478 
  1486 
  1479 
  1487 
  1480 
  1488 section {* The Correctness Argument by Sulzmann and Lu\label{argu} *}
  1481 section {* The Correctness Argument by Sulzmann and Lu\label{argu} *}