regexp: comparison Paper/Paper.thy

equal deleted inserted replaced

-:62fdb4bf7239
+:ec774952190c
 \hspace{7mm}
 @{thm pow.simps(2)[THEN eq_reflection, where A1="A" and n1="n"]}
 \end{center}
 \noindent
-where @{text "@"} is the usual list-append operation. The Kleene-star of a language @{text A}
+where @{text "@"} is the list-append operation. The Kleene-star of a language @{text A}
 is defined as the union over all powers, namely @{thm Star_def}. In the paper
 we will make use of the following properties of these constructions.
 \begin{proposition}\label{langprops}\mbox{}\\
 \begin{tabular}{@ {}ll}
 \end{tabular}
 \end{proposition}
 \noindent
 In @{text "(ii)"} we use the notation @{term "length s"} for the length of a
-string.  This property states that if @{term "[] \<notin> A"} then the lengths of
+string; this property states that if @{term "[] \<notin> A"} then the lengths of
 the strings in @{term "A \<up> (Suc n)"} must be longer than @{text n}.  We omit
 the proofs for these properties, but invite the reader to consult our
 formalisation.\footnote{Available at ???}
 The notation in Isabelle/HOL for the quotient of a language @{text A} according to an
 \noindent
 This allows us to prove
 \begin{lemma}\label{ardenable}
+Given an equation @{text "X = rhs"}.
 If @{text "X = \<Union>\<calL> ` rhs"},
 @{thm (prem 2) Arden_keeps_eq} and
 @{thm (prem 3) Arden_keeps_eq}, then
 @{text "X = \<Union>\<calL> ` (Arden X rhs)"}
 \end{lemma}
 \noindent
 This principle states that given an invariant (which we will specify below)
 we can prove a property
 @{text "P"} involving @{const Solve}. For this we have to discharge the following
 proof obligations: first the
-initial equational system satisfies the invariant; second that the iteration
+initial equational system satisfies the invariant; second the iteration
 step @{text "Iter"} preserves the the invariant as long as the condition @{term Cond} holds;
-third that @{text "Iter"} decreases the termination order, and fourth that
+third @{text "Iter"} decreases the termination order, and fourth that
 once the condition does not hold anymore then the property @{text P} must hold.
 The property @{term P} in our proof will state that @{term "Solve X (Init (UNIV // \<approx>A))"}
 returns with a single equation @{text "X = xrhs"} for some @{text "xrhs"}, and
 that this equational system still satisfies the invariant. In order to get
 \begin{proof}
 Finiteness is given by the assumption and the way how we set up the
 initial equational system. Soundness is proved in Lem.~\ref{inv}. Distinctness
 follows from the fact that the equivalence classes are disjoint. The ardenable
-property also follows from the setup of the equational system, as does
+property also follows from the setup of the initial equational system, as does
 validity.\qed
 \end{proof}
+\noindent
+Next we show that @{text Iter} preserves the invariant.
 \begin{lemma}\label{iterone}
 @{thm[mode=IfThen] iteration_step_invariant[where xrhs="rhs"]}
 \end{lemma}
 \end{center}
 \noindent
 Finiteness is straightforward, as @{const Subst} and @{const Arden} operations
 keep the equational system finite. These operation also preserve soundness
-distinctness (we proved soundness for @{const Arden} in Lem.~\ref{ardenable}).
+and distinctness (we proved soundness for @{const Arden} in Lem.~\ref{ardenable}).
-The property Ardenable is clearly preserved because the append-operation
+The property ardenable is clearly preserved because the append-operation
 cannot make a regular expression to match the empty string. Validity is
 given because @{const Arden} removes an equivalence class from @{text yrhs}
 and then @{const Subst_all} removes @{text Y} from the equational system.
 Having proved the implication above, we can replace @{text "ES"} with @{text "ES - {(Y, yrhs)}"}
 which matches with our proof-obligation of @{const "Subst_all"}. Since
 \mbox{@{term "ES = ES - {(Y, yrhs)} \<union> {(Y, yrhs)}"}}, we can use our assumption
 to complete the proof.\qed
 \end{proof}
+\noindent
+We also need the fact that @{text Iter} decreases the termination measure.
 \begin{lemma}\label{itertwo}
 @{thm[mode=IfThen] iteration_step_measure[simplified (no_asm), where xrhs="rhs"]}
 \end{lemma}
 \begin{proof}
 that @{term "Y \<noteq> X"}. We further know that @{text "Remove ES Y yrhs"}
 removes the equation @{text "Y = yrhs"} from the system, and therefore
 the cardinality of @{const Iter} strictly decreases.\qed
 \end{proof}
+\noindent
+This brings us to our property we want establish for @{text Solve}.
 \begin{lemma}
 If @{thm (prem 1) Solve} and @{thm (prem 2) Solve} then there exists
 a @{text rhs} such that  @{term "Solve X (Init (UNIV // \<approx>A)) = {(X, rhs)}"}
 and @{term "invariant {(X, rhs)}"}.
 \end{lemma}
 \begin{proof}
 In order to prove this lemma using \eqref{whileprinciple}, we have to use a slightly
 stronger invariant since Lem.~\ref{iterone} and \ref{itertwo} have the precondition
 that @{term "(X, rhs) \<in> ES"} for some @{text rhs}. This precondition is needed
 in order to choose in the @{const Iter}-step an equation that is not \mbox{@{term "X = rhs"}}.
-Therefore our invariant is cannot be just @{term "invariant ES"}, but must be
+Therefore our invariant cannot be just @{term "invariant ES"}, but must be
 @{term "invariant ES \<and> (\<exists>rhs. (X, rhs) \<in> ES)"}. By assumption
 @{thm (prem 2) Solve} and Lem.~\ref{invzero}, the more general invariant holds for
 the initial equational system. This is premise 1 of~\eqref{whileprinciple}.
 Premise 2 is given by Lem.~\ref{iterone} and the fact that @{const Iter} might
 modify the @{text rhs} in the equation @{term "X = rhs"}, but does not remove it.
 Premise 3 of~\eqref{whileprinciple} is by Lem.~\ref{itertwo}. Now in premise 4
 we like to show that there exists a @{text rhs} such that @{term "ES = {(X, rhs)}"}
 and that @{text "invariant {(X, rhs)}"} holds, provided the condition @{text "Cond"}
-does not hols. By the stronger invariant we know there exists such a @{text "rhs"}
+does not holds. By the stronger invariant we know there exists such a @{text "rhs"}
 with @{term "(X, rhs) \<in> ES"}. Because @{text Cond} is not true, we know the cardinality
 of @{text ES} is @{text 1}. This means @{text "ES"} must actually be the equation @{text "X = rhs"},
 for which the invariant holds. This allows us to conclude that
-@{term "Solve X (Init (UNIV // \<approx>A)) = {(X, rhs)}"} and @{term "invariant {(X, rhs)}"} hold.\qed
+@{term "Solve X (Init (UNIV // \<approx>A)) = {(X, rhs)}"} and @{term "invariant {(X, rhs)}"} hold,
+as needed.\qed
 \end{proof}
 \noindent
 With this lemma in place we can show that for every equivalence class in @{term "UNIV // \<approx>A"}
 there exists a regular expression.
 know @{text "X = \<Union>\<calL> ` rhs"}. We further know that
 this is equal to \mbox{@{text "\<Union>\<calL> ` (Arden X rhs)"}} using the properties of the
 invariant and Lem.\ref{ardenable}. Using the validity property for the equation @{text "X = rhs"},
 we can infer that @{term "rhss rhs \<subseteq> {X}"} and because the arden operation
 removes that @{text X} from @{text rhs}, that @{term "rhss (Arden X rhs) = {}"}.
-That means @{term "Arden X rhs"} can only consist of terms of the form @{term "Lam r"}.
+This means the right-hand side @{term "Arden X rhs"} can only consist of terms of the form @{term "Lam r"}.
 So we can collect those (finitely many) regular expressions and have @{term "X = L (\<Uplus>rs)"}.
 With this we can conclude the proof.\qed
 \end{proof}
 \noindent
 equivalence classes generated by the Myhill-Nerode relation (this is usually
 the purpose of the pumping lemma \cite{Kozen97}).  We can also use it to
 establish the standard textbook results about closure properties of regular
 languages. Interesting is the case of closure under complement, because
 it seems difficult to construct a regular expression for the complement
-language by direct means. However the existence can be easily proved using
+language by direct means. However the existence of such a regular expression
-the Myhill-Nerode theorem since clearly
+can be easily proved using the Myhill-Nerode theorem since
 \begin{center}
 @{term "s\<^isub>1 \<approx>A s\<^isub>2"} if and only if @{term "s\<^isub>1 \<approx>(-A) s\<^isub>2"}
 \end{center}
 \noindent
 holds for any strings @{text "s\<^isub>1"} and @{text
 "s\<^isub>2"}. Therefore @{text A} and @{term "-A"} give rise to the same
-partitions.  From the closure under complementation follows also the closure
+partitions.
-under intersection and set difference by some simple set calculations.
 Proving the same result via automata would be quite involved. It includes the
 steps: regular expression @{text "\<Rightarrow>"} non-deterministic automaton @{text
 "\<Rightarrow>"} deterministic automaton @{text "\<Rightarrow>"} complement automaton @{text "\<Rightarrow>"}
 regular expression.
-Our formalisation consists of ??? lines of Isar code for the first
+Our formalisation consists of ??? lines of Isabelle/Isar code for the first
-direction and ??? for the second. While this might be seen as too large
+direction and ??? for the second. While this might be seen as too large to
-to count as a concise proof pearl, this should be seen in the context
+count as a concise proof pearl, this should be seen in the context of the
-of the work done by Constable at al \cite{Constable00} who formalised
+work done by Constable at al \cite{Constable00} who formalised the
-the Myhill-Nerode theorem in Nuprl using automata. They write that
+Myhill-Nerode theorem in Nuprl using automata. This is the most
-their four-member team needed something on the magnitute of 18 months
+They write that their
-to formalise the Myhill-Nerode theorem. Our estimate is that we needed
+four-member team needed something on the magnitute of 18 months to formalise
-approximately 3 months for our fomalisation and this included the time
+the Myhill-Nerode theorem. The estimate for our formalisation is that we
-to find our proof arguments, as we could not find them in the literature.
+needed approximately 3 months and this included the time to find our proof
-So for us the formalisation was not the bottleneck. It is hard for us
+arguments. Unlike Constable et al, who were able to follow the proofs from
-to gauge the size of a formalisation in Nurpl, but from what is shown in
+\cite{???}, we had to find our own arguments.  So for us the formalisation
-the Nuprl Math Library their development seems substantially larger.
+was not the bottleneck. It is hard to gauge the size of a formalisation in
+Nurpl, but from what is shown in the Nuprl Math Library about their
+development it seems substantially larger than ours. The code of
+ours can be found under
+\begin{center}
+???
+\end{center}
 Our proof of the first direction is very much inspired by \emph{Brzozowski's
 algebraic mehod} used to convert a finite automaton to a regular
-expression. The close connection can be seen by considering the equivalence
+expression \cite{Brzozowski64}. The close connection can be seen by considering the equivalence
 classes as the states of the minimal automaton for the regular language.
 However there are some subtle differences. If we identify equivalence
 classes with the states of the automaton, then the most natural choice is to
 characterise each state with the set of strings starting from the initial
-state leading up to that state. Usually the states are characterised as the
+state leading up to that state. Usually, however, the states are characterised as the
 ones starting from that state leading to the terminal states.  The first
 choice has consequences how the initial equational system is set up. We have
 the $\lambda$-term on our ``initial state'', while Brzozowski has it on the
 terminal states. This means we also need to reverse the direction of Arden's
 lemma.
 We briefly considered using the method Brzozowski presented in the Appendix
-of \cite{Brzozowski64} in order to prove the second direction of the
+of~\cite{Brzozowski64} in order to prove the second direction of the
 Myhill-Nerode theorem. There he calculates the derivatives for regular
 expressions and shows that there can be only finitely many of them. We could
 use as the tag of a string @{text s} the derivative of a regular expression
 generated with respect to @{text s}.  Using the fact that two strings are
 Myhill-Nerode related whenever their derivative is the same together with
 the fact that there are only finitely many derivatives for a regular
-expression would give us the same argument. However it seems not so easy to
+expression would give us the same argument as ours. However it seems not so easy to
 calculate the derivatives and then to count them. Therefore we preferred our
 direct method of using tagging-functions involving equivalence classes. This
 is also where our method shines, because we can completely side-step the
 standard argument \cite{Kozen97} where automata need to be composed, which
-is not so convenient to formalise in a HOL-based theorem prover.
+as stated in the Introduction is not so convenient to formalise in a
+HOL-based theorem prover.
 While regular expressions are convenient in formalisations, they have some
 limitations. One is that there seems to be no notion of a minimal regular
-expression, like there is a notion of a minimal automaton for a regular
+expression, like there is for automata. For an implementation of a simple
-expression.
+regular expression matcher, whose correctness has been formally
+established, we refer the reader to \cite{OwensSlind08}.
 *}
 (*<*)

changeset 113	ec774952190c
parent 112	62fdb4bf7239
child 114	c5eb5f3065ae