lexing: comparison thys/Journal/Paper.thy

equal deleted inserted replaced

-:f46ebc84408d
+:f16019b11179
 val.Right ("Right _" [1000] 78) and
 val.Seq ("Seq _ _" [79,79] 78) and
 val.Stars ("Stars _" [79] 78) and
 L ("L'(_')" [10] 78) and
+LV ("LV _ _" [80,73] 78) and
 der_syn ("_\\_" [79, 1000] 76) and
 ders_syn ("_\\_" [79, 1000] 76) and
 flat ("|_|" [75] 74) and
 Sequ ("_ @ _" [78,77] 63) and
 injval ("inj _ _ _" [79,77,79] 76) and
 (*
 comments not implemented
-p9. The condtion "not exists s3 s4..." appears often enough (in particular in
+p9. The condition "not exists s3 s4..." appears often enough (in particular in
 the proof of Lemma 3) to warrant a definition.
 *)
 (*>*)
 identifier. For @{text "if"} we obtain by the Priority Rule a keyword
 token, not an identifier token---even if @{text "r\<^bsub>id\<^esub>"}
 matches also. By the Star Rule we know @{text "(r\<^bsub>key\<^esub> +
 r\<^bsub>id\<^esub>)\<^sup>\<star>"} matches @{text "iffoo"},
 respectively @{text "if"}, in exactly one `iteration' of the star. The
-Empty String Rule is for cases where @{text
+Empty String Rule is for cases where, for example, @{text
-"(a\<^sup>\<star>)\<^sup>\<star>"}, for example, matches against the
+"(a\<^sup>\<star>)\<^sup>\<star>"} matches against the
 string @{text "bc"}. Then the longest initial matched substring is the
 empty string, which is matched by both the whole regular expression
-and the parenthesised sub-expression.
+and the parenthesised subexpression.
 One limitation of Brzozowski's matcher is that it only generates a
 YES/NO answer for whether a string is being matched by a regular
 expression.  Sulzmann and Lu~\cite{Sulzmann2014} extended this matcher
 to allow generation not just of a YES/NO answer but of an actual
-matching, called a [lexical] {\em value}. \marginpar{explain values;
+matching, called a [lexical] {\em value}. Assuming a regular
-who introduced them} They give a simple algorithm to calculate a value
+expression matches a string, values encode the information of
-that appears to be the value associated with POSIX matching.  The
+\emph{how} the string is matched by the regular expression---that is,
-challenge then is to specify that value, in an algorithm-independent
+which part of the string is matched by which part of the regular
-fashion, and to show that Sulzmann and Lu's derivative-based algorithm
+expression. For this consider again the the string @{text "xy"} and
-does indeed calculate a value that is correct according to the
+the regular expression \mbox{@{text "(x + (y +
-specification.
+xy))\<^sup>\<star>"}}. The POSIX value, which corresponds to using the
+star in only one repetition,
-The answer given by Sulzmann and Lu \cite{Sulzmann2014} is to define a
-relation (called an ``order relation'') on the set of values of @{term
-r}, and to show that (once a string to be matched is chosen) there is
+\marginpar{explain values; who introduced them}
-a maximum element and that it is computed by their derivative-based
-algorithm. This proof idea is inspired by work of Frisch and Cardelli
-\cite{Frisch2004} on a GREEDY regular expression matching
+Sulzmann and Lu give a simple algorithm to calculate a value that
-algorithm. However, we were not able to establish transitivity and
+appears to be the value associated with POSIX matching.  The challenge
-totality for the ``order relation'' by Sulzmann and Lu.
+then is to specify that value, in an algorithm-independent fashion,
-There are some inherent problems with their approach (of
+and to show that Sulzmann and Lu's derivative-based algorithm does
-which some of the proofs are not published in \cite{Sulzmann2014});
+indeed calculate a value that is correct according to the
-perhaps more importantly, we give in this paper a simple inductive (and
+specification.  The answer given by Sulzmann and Lu
-algorithm-independent) definition of what we call being a {\em POSIX
+\cite{Sulzmann2014} is to define a relation (called an ``order
-value} for a regular expression @{term r} and a string @{term s}; we
+relation'') on the set of values of @{term r}, and to show that (once
-show that the algorithm computes such a value and that such a value is
+a string to be matched is chosen) there is a maximum element and that
-unique. Our proofs are both done by hand and checked in Isabelle/HOL.  The
+it is computed by their derivative-based algorithm. This proof idea is
-experience of doing our proofs has been that this mechanical checking
+inspired by work of Frisch and Cardelli \cite{Frisch2004} on a GREEDY
-was absolutely essential: this subject area has hidden snares. This
+regular expression matching algorithm. However, we were not able to
-was also noted by Kuklewicz \cite{Kuklewicz} who found that nearly all
+establish transitivity and totality for the ``order relation'' by
-POSIX matching implementations are ``buggy'' \cite[Page
+Sulzmann and Lu.  There are some inherent problems with their approach
-203]{Sulzmann2014} and by Grathwohl et al \cite[Page 36]{CrashCourse2014}
+(of which some of the proofs are not published in
-who wrote:
+\cite{Sulzmann2014}); perhaps more importantly, we give in this paper
+a simple inductive (and algorithm-independent) definition of what we
+call being a {\em POSIX value} for a regular expression @{term r} and
+a string @{term s}; we show that the algorithm by Sulzmann and Lu
+computes such a value and that such a value is unique. Our proofs are
+both done by hand and checked in Isabelle/HOL.  The experience of
+doing our proofs has been that this mechanical checking was absolutely
+essential: this subject area has hidden snares. This was also noted by
+Kuklewicz \cite{Kuklewicz} who found that nearly all POSIX matching
+implementations are ``buggy'' \cite[Page 203]{Sulzmann2014} and by
+Grathwohl et al \cite[Page 36]{CrashCourse2014} who wrote:
 \begin{quote}
 \it{}``The POSIX strategy is more complicated than the greedy because of
 the dependence on information about the length of matched strings in the
 various subexpressions.''
 of values based on positions by Okui and Suzuki \cite{OkuiSuzuki2010}.
 Derivatives as calculated by Brzozowski's method are usually more complex
 regular expressions than the initial one; various optimisations are
 possible. We prove the correctness when simplifications of @{term "ALT ZERO
 r"}, @{term "ALT r ZERO"}, @{term "SEQ ONE r"} and @{term "SEQ r ONE"} to
-@{term r} are applied.
+@{term r} are applied. We extend our results to ???
 *}
 section {* Preliminaries *}
 \end{center}
 \noindent Sulzmann and Lu also define inductively an inhabitation relation
 that associates values to regular expressions. We define this relation as
 follows:\footnote{Note that the rule for @{term Stars} differs from our
-erlier paper \cite{AusafDyckhoffUrban2016}. There we used the original
+earlier paper \cite{AusafDyckhoffUrban2016}. There we used the original
 definition by Sulzmann and Lu which does not require that the values @{term "v \<in> set vs"}
 flatten to a non-empty string. The reason for introducing the
 more restricted version of lexical values is convenience later
 on when reasoning about
 an ordering relation for values.}
 \begin{theorem}\mbox{}\smallskip\\\label{posixdeterm}
 \begin{tabular}{ll}
-@{text "(1)"} & If @{thm (prem 1) Posix1(1)} then @{thm (concl)
+(1) & If @{thm (prem 1) Posix1(1)} then @{thm (concl)
 Posix1(1)} and @{thm (concl) Posix1(2)}.\\
-@{text "(2)"} & @{thm[mode=IfThen] Posix_determ(1)[of _ _ "v" "v'"]}
+(2) & @{thm[mode=IfThen] Posix_determ(1)[of _ _ "v" "v'"]}
 \end{tabular}
 \end{theorem}
 \begin{proof} Both by induction on the definition of @{term "s \<in> r \<rightarrow> v"}.
 The second parts follows by a case analysis of @{term "s \<in> r \<rightarrow> v'"} and
 Interesting is also the rule for sequence regular expressions (@{text
 "PS"}). The first two premises state that @{term "v\<^sub>1"} and @{term "v\<^sub>2"}
 are the POSIX values for @{term "(s\<^sub>1, r\<^sub>1)"} and @{term "(s\<^sub>2, r\<^sub>2)"}
 respectively. Consider now the third premise and note that the POSIX value
 of this rule should match the string \mbox{@{term "s\<^sub>1 @ s\<^sub>2"}}. According to the
-longest match rule, we want that the @{term "s\<^sub>1"} is the longest initial
+Longest Match Rule, we want that the @{term "s\<^sub>1"} is the longest initial
 split of \mbox{@{term "s\<^sub>1 @ s\<^sub>2"}} such that @{term "s\<^sub>2"} is still recognised
 by @{term "r\<^sub>2"}. Let us assume, contrary to the third premise, that there
 \emph{exist} an @{term "s\<^sub>3"} and @{term "s\<^sub>4"} such that @{term "s\<^sub>2"}
 can be split up into a non-empty string @{term "s\<^sub>3"} and a possibly empty
 string @{term "s\<^sub>4"}. Moreover the longer string @{term "s\<^sub>1 @ s\<^sub>3"} can be
 matched by @{text "r\<^sub>1"} and the shorter @{term "s\<^sub>4"} can still be
 matched by @{term "r\<^sub>2"}. In this case @{term "s\<^sub>1"} would \emph{not} be the
 longest initial split of \mbox{@{term "s\<^sub>1 @ s\<^sub>2"}} and therefore @{term "Seq v\<^sub>1
 v\<^sub>2"} cannot be a POSIX value for @{term "(s\<^sub>1 @ s\<^sub>2, SEQ r\<^sub>1 r\<^sub>2)"}.
-The main point is that our side-condition ensures the longest
+The main point is that our side-condition ensures the Longest
-match rule is satisfied.
+Match Rule is satisfied.
 A similar condition is imposed on the POSIX value in the @{text
 "P\<star>"}-rule. Also there we want that @{term "s\<^sub>1"} is the longest initial
 split of @{term "s\<^sub>1 @ s\<^sub>2"} and furthermore the corresponding value
 @{term v} cannot be flattened to the empty string. In effect, we require
 that in each ``iteration'' of the star, some non-empty substring needs to
 be ``chipped'' away; only in case of the empty string we accept @{term
 "Stars []"} as the POSIX value. Indeed we can show that our POSIX value
-is a lexical value which excludes those @{text Stars} containing values
+is a lexical value which excludes those @{text Stars} containing subvalues
 that flatten to the empty string.
-\begin{lemma}
+\begin{lemma}\label{LVposix}
 @{thm [mode=IfThen] Posix_LV}
 \end{lemma}
 \begin{proof}
 By routine induction on @{thm (prem 1) Posix_LV}.\qed
 "s \<notin> L (der c r\<^sub>1)"} and @{term "s \<in> der c r\<^sub>2 \<rightarrow> v'"}. In @{text "(a)"} we
 know @{term "s \<in> der c r\<^sub>1 \<rightarrow> v'"}, from which we can infer @{term "(c # s)
 \<in> r\<^sub>1 \<rightarrow> injval r\<^sub>1 c v'"} by induction hypothesis and hence @{term "(c #
 s) \<in> ALT r\<^sub>1 r\<^sub>2 \<rightarrow> injval (ALT r\<^sub>1 r\<^sub>2) c (Left v')"} as needed. Similarly
 in subcase @{text "(b)"} where, however, in addition we have to use
-Prop.~\ref{derprop}(2) in order to infer @{term "c # s \<notin> L r\<^sub>1"} from @{term
+Proposition~\ref{derprop}(2) in order to infer @{term "c # s \<notin> L r\<^sub>1"} from @{term
-"s \<notin> L (der c r\<^sub>1)"}.
+"s \<notin> L (der c r\<^sub>1)"}.\smallskip
 \item[$\bullet$] Case @{term "r = SEQ r\<^sub>1 r\<^sub>2"}. There are three subcases:
 \begin{quote}
 \begin{description}
 \noindent For @{text "(a)"} we know @{term "s\<^sub>1 \<in> der c r\<^sub>1 \<rightarrow> v\<^sub>1"} and
 @{term "s\<^sub>2 \<in> r\<^sub>2 \<rightarrow> v\<^sub>2"} as well as
 %
 \[@{term "\<not> (\<exists>s\<^sub>3 s\<^sub>4. s\<^sub>3 \<noteq> [] \<and> s\<^sub>3 @ s\<^sub>4 = s\<^sub>2 \<and> s\<^sub>1 @ s\<^sub>3 \<in> L (der c r\<^sub>1) \<and> s\<^sub>4 \<in> L r\<^sub>2)"}\]
-\noindent From the latter we can infer by Prop.~\ref{derprop}(2):
+\noindent From the latter we can infer by Proposition~\ref{derprop}(2):
 %
 \[@{term "\<not> (\<exists>s\<^sub>3 s\<^sub>4. s\<^sub>3 \<noteq> [] \<and> s\<^sub>3 @ s\<^sub>4 = s\<^sub>2 \<and> (c # s\<^sub>1) @ s\<^sub>3 \<in> L r\<^sub>1 \<and> s\<^sub>4 \<in> L r\<^sub>2)"}\]
 \noindent We can use the induction hypothesis for @{text "r\<^sub>1"} to obtain
 @{term "(c # s\<^sub>1) \<in> r\<^sub>1 \<rightarrow> injval r\<^sub>1 c v\<^sub>1"}. Putting this all together allows us to infer
 we have @{term "(c # s) \<in> r\<^sub>2 \<rightarrow> (injval r\<^sub>2 c v\<^sub>1)"} by induction hypothesis
 for @{term "r\<^sub>2"}. From the latter we can infer
 %
 \[@{term "\<not> (\<exists>s\<^sub>3 s\<^sub>4. s\<^sub>3 \<noteq> [] \<and> s\<^sub>3 @ s\<^sub>4 = c # s \<and> s\<^sub>3 \<in> L r\<^sub>1 \<and> s\<^sub>4 \<in> L r\<^sub>2)"}\]
-\noindent By Lem.~\ref{lemmkeps} we know @{term "[] \<in> r\<^sub>1 \<rightarrow> (mkeps r\<^sub>1)"}
+\noindent By Lemma~\ref{lemmkeps} we know @{term "[] \<in> r\<^sub>1 \<rightarrow> (mkeps r\<^sub>1)"}
 holds. Putting this all together, we can conclude with @{term "(c #
 s) \<in> SEQ r\<^sub>1 r\<^sub>2 \<rightarrow> Seq (mkeps r\<^sub>1) (injval r\<^sub>2 c v\<^sub>1)"}, as required.
 Finally suppose @{term "r = STAR r\<^sub>1"}. This case is very similar to the
 sequence case, except that we need to also ensure that @{term "flat (injval r\<^sub>1
 r\<^sub>1 \<rightarrow> v\<^sub>1"} and the induction hypothesis).\qed
 \end{itemize}
 \end{proof}
 \noindent
-With Lem.~\ref{Posix2} in place, it is completely routine to establish
+With Lemma~\ref{Posix2} in place, it is completely routine to establish
 that the Sulzmann and Lu lexer satisfies our specification (returning
 the null value @{term "None"} iff the string is not in the language of the regular expression,
 and returning a unique POSIX value iff the string \emph{is} in the language):
 \begin{theorem}\mbox{}\smallskip\\\label{lexercorrect}
 (2) & @{thm (lhs) lexer_correct_Some} if and only if @{thm (rhs) lexer_correct_Some}\\
 \end{tabular}
 \end{theorem}
 \begin{proof}
-By induction on @{term s} using Lem.~\ref{lemmkeps} and \ref{Posix2}.\qed
+By induction on @{term s} using Lemma~\ref{lemmkeps} and \ref{Posix2}.\qed
 \end{proof}
-\noindent In (2) we further know by Thm.~\ref{posixdeterm} that the
+\noindent In (2) we further know by Theorem~\ref{posixdeterm} that the
 value returned by the lexer must be unique.   A simple corollary
 of our two theorems is:
 \begin{corollary}\mbox{}\smallskip\\\label{lexercorrectcor}
 \begin{tabular}{ll}
 (1) & @{thm (lhs) lexer_correctness(2)} if and only if @{thm (rhs) lexer_correctness(2)}\\
 (2) & @{thm (lhs) lexer_correctness(1)} if and only if @{thm (rhs) lexer_correctness(1)}\\
 \end{tabular}
 \end{corollary}
-\noindent
+\noindent This concludes our correctness proof. Note that we have
-This concludes our
+not changed the algorithm of Sulzmann and Lu,\footnote{All
-correctness proof. Note that we have not changed the algorithm of
+deviations we introduced are harmless.} but introduced our own
-Sulzmann and Lu,\footnote{All deviations we introduced are
+specification for what a correct result---a POSIX value---should be.
-harmless.} but introduced our own specification for what a correct
+In the next section we show that our specification coincides with
-result---a POSIX value---should be. A strong point in favour of
+another one given by Okui and Suzuki using a different technique.
-Sulzmann and Lu's algorithm is that it can be extended in various
-ways.
 *}
 section {* Ordering of Values according to Okui and Suzuki*}
 Okui and Suzuki \cite{OkuiSuzuki2010,OkuiSuzukiTech} described
 another ordering of values, which they use to establish the
 correctness of their automata-based algorithm for POSIX matching.
 Their ordering resembles some aspects of the one given by Sulzmann
 and Lu, but is quite different. To begin with, Okui and Suzuki
-identify POSIX values as least, rather than maximal, elements in
+identify POSIX values as minimal, rather than maximal, elements in
 their ordering. A more substantial difference is that the ordering
 by Okui and Suzuki uses \emph{positions} in order to identify and
 compare subvalues. Positions are lists of natural numbers. This
 allows them to quite naturally formalise the Longest Match and
 Priority rules of the informal POSIX standard.  Consider for example
 @{thm (lhs) Pos_stars} & @{text "\<equiv>"} & @{thm (rhs) Pos_stars}\\
 \end{tabular}
 \end{center}
 \noindent
-In the last clause @{text len} stands for the length of a list. Clearly
+whereby @{text len} stands for the length of a list. Clearly
 for every position inside a value there exists a subvalue at that position.
 To help understanding the ordering of Okui and Suzuki, consider again
 the earlier value
 @{thm pflat_len_def}
 \end{center}
 \noindent where we take the length of the flattened value at
 position @{text p}, provided the position is inside @{text v}; if
-not, then the norm is @{text "-1"}. The default is crucial
+not, then the norm is @{text "-1"}. The default for outside
-for the POSIX requirement of preferring a @{text Left}-value
+positions is crucial for the POSIX requirement of preferring a
-over a @{text Right}-value (if they can match the same string---see
+@{text Left}-value over a @{text Right}-value (if they can match the
-the Priority Rule from the Introduction). For this consider
+same string---see the Priority Rule from the Introduction). For this
+consider
 \begin{center}
 @{term "v == Left (Char x)"} \qquad and \qquad @{term "w == Right (Char x)"}
 \end{center}
 \noindent Both values match @{text x}, but at position @{text "[0]"}
-the norm of @{term v} is @{text 1} (the subvalue matches @{text x}), but the
+the norm of @{term v} is @{text 1} (the subvalue matches @{text x}),
-norm of @{text w} is @{text "-1"} (the position is outside @{text w}
+but the norm of @{text w} is @{text "-1"} (the position is outside
-according to how we defined the `inside' positions of @{text Left}-
+@{text w} according to how we defined the `inside' positions of
-and @{text Right}-values).  Of course at position @{text "[1]"}, the
+@{text Left}- and @{text Right}-values).  Of course at position
-norms @{term "pflat_len v [1]"} and @{term "pflat_len w [1]"} are
+@{text "[1]"}, the norms @{term "pflat_len v [1]"} and @{term
-reversed, but the point is that subvalues will be analysed according to
+"pflat_len w [1]"} are reversed, but the point is that subvalues
-lexicographically orderd positions.  This order, written @{term
+will be analysed according to lexicographically ordered
-"DUMMY \<sqsubset>lex DUMMY"}, can be conveniently formalised by
+positions. According to this ordering, the position @{text "[0]"}
-three inference rules
+takes precedence.  The lexicographic ordering of positions, written
+@{term "DUMMY \<sqsubset>lex DUMMY"}, can be conveniently formalised
+by three inference rules
 \begin{center}
 \begin{tabular}{ccc}
 @{thm [mode=Axiom] lex_list.intros(1)}\hspace{1cm} &
 @{thm [mode=Rule] lex_list.intros(3)[where ?p1.0="p\<^sub>1" and ?p2.0="p\<^sub>2" and
 ?ps1.0="ps\<^sub>1" and ?ps2.0="ps\<^sub>2"]}\hspace{1cm} &
 @{thm [mode=Rule] lex_list.intros(2)[where ?ps1.0="ps\<^sub>1" and ?ps2.0="ps\<^sub>2"]}
 \end{tabular}
 \end{center}
-With the norm and lexicographic order of positions in place,
+With the norm and lexicographic order in place,
 we can state the key definition of Okui and Suzuki
 \cite{OkuiSuzuki2010}: a value @{term "v\<^sub>1"} is \emph{smaller} than
 @{term "v\<^sub>2"} if and only if  $(i)$ the norm at position @{text p} is
 greater in @{term "v\<^sub>1"} (that is the string @{term "flat (at v\<^sub>1 p)"} is longer
 than @{term "flat (at v\<^sub>2 p)"}) and $(ii)$ all subvalues at
 \noindent The position @{text p} in this definition acts as the
 \emph{first distinct position} of @{text "v\<^sub>1"} and @{text
 "v\<^sub>2"}, where both values match strings of different length
 \cite{OkuiSuzuki2010}.  Since at @{text p} the values @{text
-"v\<^sub>1"} and @{text "v\<^sub>2"} macth different strings, the
+"v\<^sub>1"} and @{text "v\<^sub>2"} match different strings, the
 ordering is irreflexive. Derived from the definition above
 are the following two orderings:
 \begin{center}
 \begin{tabular}{l}
 @{thm PosOrd_ex_def[where ?v1.0="v\<^sub>1" and ?v2.0="v\<^sub>2"]}\\
 @{thm PosOrd_ex_eq_def[where ?v1.0="v\<^sub>1" and ?v2.0="v\<^sub>2"]}
 \end{tabular}
 \end{center}
-While we encountred a number of obstacles for establishing properties like
+While we encountered a number of obstacles for establishing properties like
 transitivity for the ordering of Sulzmann and Lu (and which we failed
 to overcome), it is relatively straightforward to establish this
 property for the ordering by Okui and Suzuki.
 \begin{lemma}[Transitivity]\label{transitivity}
 @{term "pflat_len v\<^sub>1 p' = pflat_len v\<^sub>3 p'"} holds.
 Suppose @{term "p' \<in> Pos v\<^sub>1"}, then we can infer from the
 first assumption that @{term "pflat_len v\<^sub>1 p' = pflat_len v\<^sub>2 p'"}.
 But this means that @{term "p'"} must be in  @{term "Pos v\<^sub>2"} too.
 Hence we can use the second assumption and infer  @{term "pflat_len v\<^sub>2 p' = pflat_len v\<^sub>3 p'"}, which concludes
-this case with @{term "v\<^sub>1 :\<sqsubseteq>val v\<^sub>2"}.
+this case with @{term "v\<^sub>1 :\<sqsubset>val v\<^sub>3"}.
 The reasoning in the other cases is similar.\qed
 \end{proof}
-\noindent We can show that @{term "DUMMY :\<sqsubseteq>val DUMMY"} is
+\noindent It is straightforward to show that @{text "\<prec>"} and
-a partial order.  Okui and Suzuki also show that it is a linear order
+$\preccurlyeq$ are partial orders.  Okui and Suzuki also show that it
-for lexical values \cite{OkuiSuzuki2010} of a given regular
+is a linear order for lexical values \cite{OkuiSuzuki2010} of a given
-expression and given string, but we have not done this. It is not
+regular expression and given string, but we have not done this. It is
-essential for our results. What we are going to show below is that
+not essential for our results. What we are going to show below is
-for a given @{text r} and @{text s}, the ordering has a unique
+that for a given @{text r} and @{text s}, the ordering has a unique
 minimal element on the set @{term "LV r s"}, which is the POSIX value
 we defined in the previous section.
 Lemma 1
 @{thm [mode=IfThen] PosOrd_shorterI[where ?v1.0="v\<^sub>1" and ?v2.0="v\<^sub>2"]}
 Next we establish how Okui and Suzuki's ordering relates to our
-definition of POSIX values.  Given a POSIX value @{text "v\<^sub>1"}
+definition of POSIX values.  Given a @{text POSIX} value @{text "v\<^sub>1"}
 for @{text r} and @{text s}, then any other lexical value @{text
 "v\<^sub>2"} in @{term "LV r s"} is greater or equal than @{text
-"v\<^sub>1"}:
+"v\<^sub>1"}, namely:
-\begin{theorem}
+\begin{theorem}\label{orderone}
 @{thm [mode=IfThen] Posix_PosOrd[where ?v1.0="v\<^sub>1" and ?v2.0="v\<^sub>2"]}
 \end{theorem}
 \begin{proof} By induction on our POSIX rules. By
-Thm~\ref{posixdeterm} and the definition of @{const LV}, it is clear
+Theorem~\ref{posixdeterm} and the definition of @{const LV}, it is clear
 that @{text "v\<^sub>1"} and @{text "v\<^sub>2"} have the same
 underlying string @{term s}.  The three base cases are
 straightforward: for example for @{term "v\<^sub>1 = Void"}, we have
 that @{term "v\<^sub>2 \<in> LV ONE []"} must also be of the form
 \mbox{@{term "v\<^sub>2 = Void"}}. Therefore we have @{term
 "v\<^sub>1 :\<sqsubseteq>val v\<^sub>2"}.  The inductive cases for
-@{term "ALT r\<^sub>1 r\<^sub>2"} and @{term "SEQ r\<^sub>1
+@{text r} being of the form @{term "ALT r\<^sub>1 r\<^sub>2"} and
-r\<^sub>2"} are as follows:
+@{term "SEQ r\<^sub>1 r\<^sub>2"} are as follows:
 \begin{itemize}
-\item[$\bullet$] Case @{term "s \<in> (ALT r\<^sub>1 r\<^sub>2)
+\item[$\bullet$] Case @{text "P+L"} with @{term "s \<in> (ALT r\<^sub>1 r\<^sub>2)
 \<rightarrow> (Left w\<^sub>1)"}: In this case @{term "v\<^sub>1 =
 Left w\<^sub>1"} and the value @{term "v\<^sub>2"} is either of the
 form @{term "Left w\<^sub>2"} or @{term "Right w\<^sub>2"}. In the
 latter case we can immediately conclude with @{term "v\<^sub>1
 :\<sqsubseteq>val v\<^sub>2"} since a @{text Left}-value with the
-same underlying string @{text s} is always smaller or equal than a
+same underlying string @{text s} is always smaller than a
 @{text Right}-value.  In the former case we have @{term "w\<^sub>2
 \<in> LV r\<^sub>1 s"} and can use the induction hypothesis to infer
 @{term "w\<^sub>1 :\<sqsubseteq>val w\<^sub>2"}. Because @{term
 "w\<^sub>1"} and @{term "w\<^sub>2"} have the same underlying string
 @{text s}, we can conclude with @{term "Left w\<^sub>1
 :\<sqsubseteq>val Left w\<^sub>2"} by Prop ???.\smallskip
-\item[$\bullet$] Case @{term "s \<in> (ALT r\<^sub>1 r\<^sub>2)
+\item[$\bullet$] Case @{text "P+R"} with @{term "s \<in> (ALT r\<^sub>1 r\<^sub>2)
-\<rightarrow> (Right w\<^sub>1)"}: This case similar as the previous
+\<rightarrow> (Right w\<^sub>1)"}: This case similar to the previous
-case, except that we know that @{term "s \<notin> L
+case, except that we additionally know @{term "s \<notin> L
 r\<^sub>1"}. This is needed when @{term "v\<^sub>2"} is of the form
-@{term "Left w\<^sub>2"}: since \mbox{@{term "flat v\<^sub>2 = flat
+@{term "Left w\<^sub>2"}. Since \mbox{@{term "flat v\<^sub>2 = flat
 w\<^sub>2"} @{text "= s"}} and @{term "\<Turnstile> w\<^sub>2 :
-r\<^sub>1"}, we can derive a contradiction using
+r\<^sub>1"}, we can derive a contradiction for \mbox{@{term "s \<notin> L
-Prop.~\ref{inhabs}. So also in this case \mbox{@{term "v\<^sub>1
+r\<^sub>1"}} using
+Proposition~\ref{inhabs}. So also in this case \mbox{@{term "v\<^sub>1
 :\<sqsubseteq>val v\<^sub>2"}}.\smallskip
-\item[$\bullet$] Case @{term "(s\<^sub>1 @ s\<^sub>2) \<in> (SEQ
+\item[$\bullet$] Case @{text "PS"} with @{term "(s\<^sub>1 @ s\<^sub>2) \<in> (SEQ
 r\<^sub>1 r\<^sub>2) \<rightarrow> (Seq w\<^sub>1 w\<^sub>2)"}: We
 can assume @{term "v\<^sub>2 = Seq (u\<^sub>1) (u\<^sub>2)"} with
 @{term "\<Turnstile> u\<^sub>1 : r\<^sub>1"} and \mbox{@{term
 "\<Turnstile> u\<^sub>2 : r\<^sub>2"}}. We have @{term "s\<^sub>1 @
 s\<^sub>2 = (flat u\<^sub>1) @ (flat u\<^sub>2)"}.  By the
-side-condition on out @{text PS}-rule we know that either @{term
+side-condition of the @{text PS}-rule we know that either @{term
-"s\<^sub>1 = flat u\<^sub>1"} or @{term "flat u\<^sub>1"} is a
+"s\<^sub>1 = flat u\<^sub>1"} or that @{term "flat u\<^sub>1"} is a
 strict prefix ??? of @{term "s\<^sub>1"}. In the latter case we can
 infer @{term "w\<^sub>1 :\<sqsubset>val u\<^sub>1"} by ???  and from
 this @{term "v\<^sub>1 :\<sqsubseteq>val v\<^sub>2"} by ???.  In the
 former case we know @{term "u\<^sub>1 \<in> LV r\<^sub>1 s\<^sub>1"}
 and @{term "u\<^sub>2 \<in> LV r\<^sub>2 s\<^sub>2"}. With this we
 can use the induction hypotheses to infer @{term "w\<^sub>1
 :\<sqsubseteq>val u\<^sub>1"} and @{term "w\<^sub>2
 :\<sqsubseteq>val u\<^sub>2"}. By ??? we can again infer @{term
-"v\<^sub>1 :\<sqsubseteq>val v\<^sub>2"} and are done.
+"v\<^sub>1 :\<sqsubseteq>val v\<^sub>2"}.
 \end{itemize}
-\noindent The case for @{text "P\<star>"} is similar to the @{text PS}-case.\qed
+\noindent The case for @{text "P\<star>"} is similar to the @{text PS}-case and omitted.\qed
 \end{proof}
-\noindent This theorem shows that our posix value @{text
+\noindent This theorem shows that our @{text POSIX} value for a
-"v\<^sub>1"} with @{term "s \<in> r \<rightarrow> v\<^sub>1"} is a
+regular expression @{text r} and string @{term s} is in fact a
-minimal element for the values in @{text "LV r s"}. By ??? we also
+minimal element of the values in @{text "LV r s"}. By ??? we also
 know that any value in @{text "LV s' r"}, with @{term "s'"} being a
 prefix, cannot be smaller than @{text "v\<^sub>1"}. The next theorem
-shows the opposite---namely any minimal element must be a @{text
+shows the opposite---namely any minimal element in @{term "LV r s"}
-POSIX}-value.  Given a lexical value @{text "v\<^sub>1"}, say, in
+must be a @{text POSIX} value. For this it helps that we proved in
-@{term "LV r s"}, for which there does not exists any smaller value
+the previous section that whenever a string @{term "s \<in> L r"} then
-in @{term "LV r s"}, then this value is our @{text POSIX}-value:
+a corresponding @{text POSIX} value exists and is a lexical value,
+see Theorem ??? and Lemma ???.
 \begin{theorem}
 @{thm [mode=IfThen] PosOrd_Posix[where ?v1.0="v\<^sub>1"]}
 \end{theorem}
-\begin{proof} By induction on @{text r}. The tree base cases are
+\begin{proof}
-again straightforward.  For example if @{term "v\<^sub>1 \<in> LV
+If @{thm (prem 1) PosOrd_Posix[where ?v1.0="v\<^sub>1"]} then
-ONE s"} then @{term "v\<^sub>1 = Void"} and @{term "s = []"}. We
+@{term "s \<in> L r"} by Proposition~\ref{inhabs}. Hence by Theorem~\ref{lexercorrect}(2)
-know that @{term "[] \<in> ONE \<rightarrow> Void"} holds.  In the
+there exists a
-cases for @{term "ALT r\<^sub>1 r\<^sub>2"} and @{term "SEQ
+@{text POSIX} value @{term "v\<^sub>P"} with @{term "s \<in> r \<rightarrow> v\<^sub>P"}
-r\<^sub>1 r\<^sub>2"} we reason as follows:
+and by Lemma~\ref{LVposix} we also have @{term "v\<^sub>P \<in> LV r s"}.
+By Theorem~\ref{orderone} we therefore have
-\begin{itemize}
+@{term "v\<^sub>P :\<sqsubseteq>val v\<^sub>1"}. If @{term "v\<^sub>P = v\<^sub>1"} then
+we are done. Otherwise we have @{term "v\<^sub>P :\<sqsubset>val v\<^sub>1"} which
-\item[$\bullet$] Case @{term "v\<^sub>1 \<in> LV (ALT r\<^sub>1
+however contradicts the second assumption and we are done too.\qed
-r\<^sub>2) s"} with @{term "v\<^sub>1 = Left w\<^sub>1"} and @{term
-"w\<^sub>1 \<in> LV r\<^sub>1 s"}: In order to use the induction
-hypothesis we need to infer
-\begin{center}
-@{term "\<forall>v'
-\<in> LV (ALT r\<^sub>1 r\<^sub>2) s. \<not> (v' :\<sqsubset>val
-Left w\<^sub>1)"}
-implies
-@{term "\<forall>v' \<in> LV r\<^sub>1
-s. \<not> (v' :\<sqsubset>val w\<^sub>1)"}
-\end{center}
-\noindent This can be done because of ?? we can infer that for every
-@{text v'} in @{term "LV r\<^sub>1 s"} and @{term "v'
-:\<sqsubset>val w\<^sub>1"} also @{term "Left v' :\<sqsubset>val
-Left w\<^sub>1"} holds. This gives a contradiction. Consequently, we
-can use the induction hypothesis to obtain @{term "s \<in> r\<^sub>1
-\<rightarrow> w\<^sub>1"} and then conclude this case with @{term "s
-\<in> ALT r\<^sub>1 r\<^sub>2 \<rightarrow> v\<^sub>1"}.\smallskip
-\item[$\bullet$] Case @{term "v\<^sub>1 \<in> LV (ALT r\<^sub>1
-r\<^sub>2) s"} with @{term "v\<^sub>1 = Right w\<^sub>1"} and @{term
-"w\<^sub>1 \<in> LV r\<^sub>2 s"}: Like above we can use the
-induction hypothesis in order to infer @{term "s \<in> r\<^sub>2
-\<rightarrow> w\<^sub>1"}. In order to conclude we still need to
-obtain @{term "s \<notin> L r\<^sub>1"}. Let us suppose the opposite.
-Then there is a @{term v'} such that @{term "v' \<in> LV r\<^sub>1 s"}
-by Prop ??? and hence @{term "Left v' \<in> LV (ALT r\<^sub>1 r\<^sub>2) s"}.
-But then we can use the second assumption of the theorem to infer that
-@{term "\<not>(Left v' :\<sqsubset>val Right w\<^sub>1)"}, which cannot be the case.
-Therefore @{term "s \<notin> L r\<^sub>1"} must hold and we can also conclude in this
-case.
-\end{itemize}
 \end{proof}
-To sum up, we have shown that minimal elements according to the ordering
+\begin{corollary}
-by Okui and Suzuki are exactly the @{text POSIX}-values we defined inductively
+@{thm [mode=IfThen] Least_existence1}
-in Section ???
+\end{corollary}
+\noindent To sum up, we have shown that minimal elements according
+to the ordering by Okui and Suzuki are exactly the @{text POSIX}
+values we defined inductively in Section~\ref{posixsec}
+IS THE minimal element unique? We have not shown totality.
 *}
+section {* Optimisations *}
-section {* Extensions and Optimisations*}
 text {*
-If we are interested in tokenising a string, then we need to not just
-split up the string into tokens, but also ``classify'' the tokens (for
-example whether it is a keyword or an identifier). This can be done with
-only minor modifications to the algorithm by introducing \emph{record
-regular expressions} and \emph{record values} (for example
-\cite{Sulzmann2014b}):
-\begin{center}
-@{text "r :="}
-@{text "..."} $\mid$
-@{text "(l : r)"} \qquad\qquad
-@{text "v :="}
-@{text "..."} $\mid$
-@{text "(l : v)"}
-\end{center}
-\noindent where @{text l} is a label, say a string, @{text r} a regular
-expression and @{text v} a value. All functions can be smoothly extended
-to these regular expressions and values. For example \mbox{@{text "(l :
-r)"}} is nullable iff @{term r} is, and so on. The purpose of the record
-regular expression is to mark certain parts of a regular expression and
-then record in the calculated value which parts of the string were matched
-by this part. The label can then serve as classification for the tokens.
-For this recall the regular expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} for
-keywords and identifiers from the Introduction. With the record regular
-expression we can form \mbox{@{text "((key : r\<^bsub>key\<^esub>) + (id : r\<^bsub>id\<^esub>))\<^sup>\<star>"}}
-and then traverse the calculated value and only collect the underlying
-strings in record values. With this we obtain finite sequences of pairs of
-labels and strings, for example
-\[@{text "(l\<^sub>1 : s\<^sub>1), ..., (l\<^sub>n : s\<^sub>n)"}\]
-\noindent from which tokens with classifications (keyword-token,
-identifier-token and so on) can be extracted.
 Derivatives as calculated by \Brz's method are usually more complex
 regular expressions than the initial one; the result is that the
 derivative-based matching and lexing algorithms are often abysmally slow.
 However, various optimisations are possible, such as the simplifications
 particular for @{term "r"} being the derivative @{term "der c
 r"}). Let @{term "r\<^sub>s"} be the simplified derivative regular expression, that is @{term
 "fst (simp (der c r))"}, and @{term "f\<^sub>r"} be the rectification
 function, that is @{term "snd (simp (der c r))"}.  We distinguish the cases
 whether (*) @{term "s \<in> L (der c r)"} or not. In the first case we
-have by Thm.~\ref{lexercorrect}(2) a value @{term "v"} so that @{term
+have by Theorem~\ref{lexercorrect}(2) a value @{term "v"} so that @{term
 "lexer (der c r) s = Some v"} and @{term "s \<in> der c r \<rightarrow> v"} hold.
-By Lem.~\ref{slexeraux}(1) we can also infer from~(*) that @{term "s
+By Lemma~\ref{slexeraux}(1) we can also infer from~(*) that @{term "s
-\<in> L r\<^sub>s"} holds.  Hence we know by Thm.~\ref{lexercorrect}(2) that
+\<in> L r\<^sub>s"} holds.  Hence we know by Theorem~\ref{lexercorrect}(2) that
 there exists a @{term "v'"} with @{term "lexer r\<^sub>s s = Some v'"} and
 @{term "s \<in> r\<^sub>s \<rightarrow> v'"}. From the latter we know by
-Lem.~\ref{slexeraux}(2) that @{term "s \<in> der c r \<rightarrow> (f\<^sub>r v')"} holds.
+Lemma~\ref{slexeraux}(2) that @{term "s \<in> der c r \<rightarrow> (f\<^sub>r v')"} holds.
-By the uniqueness of the POSIX relation (Thm.~\ref{posixdeterm}) we
+By the uniqueness of the POSIX relation (Theorem~\ref{posixdeterm}) we
 can infer that @{term v} is equal to @{term "f\<^sub>r v'"}---that is the
 rectification function applied to @{term "v'"}
 produces the original @{term "v"}.  Now the case follows by the
 definitions of @{const lexer} and @{const slexer}.
 In the second case where @{term "s \<notin> L (der c r)"} we have that
-@{term "lexer (der c r) s = None"} by Thm.~\ref{lexercorrect}(1).  We
+@{term "lexer (der c r) s = None"} by Theorem~\ref{lexercorrect}(1).  We
-also know by Lem.~\ref{slexeraux}(1) that @{term "s \<notin> L r\<^sub>s"}. Hence
+also know by Lemma~\ref{slexeraux}(1) that @{term "s \<notin> L r\<^sub>s"}. Hence
-@{term "lexer r\<^sub>s s = None"} by Thm.~\ref{lexercorrect}(1) and
+@{term "lexer r\<^sub>s s = None"} by Theorem~\ref{lexercorrect}(1) and
 by IH then also @{term "slexer r\<^sub>s s = None"}. With this we can
 conclude in this case too.\qed
 \end{proof}
+*}
+section {* Extensions*}
+text {*
+A strong point in favour of
+Sulzmann and Lu's algorithm is that it can be extended in various
+ways.
+If we are interested in tokenising a string, then we need to not just
+split up the string into tokens, but also ``classify'' the tokens (for
+example whether it is a keyword or an identifier). This can be done with
+only minor modifications to the algorithm by introducing \emph{record
+regular expressions} and \emph{record values} (for example
+\cite{Sulzmann2014b}):
+\begin{center}
+@{text "r :="}
+@{text "..."} $\mid$
+@{text "(l : r)"} \qquad\qquad
+@{text "v :="}
+@{text "..."} $\mid$
+@{text "(l : v)"}
+\end{center}
+\noindent where @{text l} is a label, say a string, @{text r} a regular
+expression and @{text v} a value. All functions can be smoothly extended
+to these regular expressions and values. For example \mbox{@{text "(l :
+r)"}} is nullable iff @{term r} is, and so on. The purpose of the record
+regular expression is to mark certain parts of a regular expression and
+then record in the calculated value which parts of the string were matched
+by this part. The label can then serve as classification for the tokens.
+For this recall the regular expression @{text "(r\<^bsub>key\<^esub> + r\<^bsub>id\<^esub>)\<^sup>\<star>"} for
+keywords and identifiers from the Introduction. With the record regular
+expression we can form \mbox{@{text "((key : r\<^bsub>key\<^esub>) + (id : r\<^bsub>id\<^esub>))\<^sup>\<star>"}}
+and then traverse the calculated value and only collect the underlying
+strings in record values. With this we obtain finite sequences of pairs of
+labels and strings, for example
+\[@{text "(l\<^sub>1 : s\<^sub>1), ..., (l\<^sub>n : s\<^sub>n)"}\]
+\noindent from which tokens with classifications (keyword-token,
+identifier-token and so on) can be extracted.
 *}
 section {* The Correctness Argument by Sulzmann and Lu\label{argu} *}

changeset 272	f16019b11179
parent 271	f46ebc84408d
child 273	e85099ac4c6c