regexp: comparison Paper/Paper.thy

equal deleted inserted replaced

-:79b37ef9505f
+:e500cab16be4
 \noindent
 Moreover, it is not so clear how to conveniently impose a finiteness condition
 upon functions in order to represent \emph{finite} automata. The best is
 probably to resort to more advanced reasoning frameworks, such as \emph{locales}
 or \emph{type classes},
-which are not avaiable in \emph{all} HOL-based theorem provers.
+which are \emph{not} avaiable in all HOL-based theorem provers.
 Because of these problems to do with representing automata, there seems
 to be no substantial formalisation of automata theory and regular languages
 carried out in HOL-based theorem provers. Nipkow establishes in
 \cite{Nipkow98} the link between regular expressions and automata in
 A language @{text A} is \emph{regular}, provided there is a regular expression that matches all
 strings of @{text "A"}.
 \end{definition}
 \noindent
-The reason is that regular expressions, unlike graphs, matrices and functons, can
+The reason is that regular expressions, unlike graphs, matrices and functions, can
 be easily defined as inductive datatype. Consequently a corresponding reasoning
 infrastructure comes for free. This has recently been exploited in HOL4 with a formalisation
 of regular expression matching based on derivatives \cite{OwensSlind08} and
 with an equivalence checker for regular expressions in Isabelle/HOL \cite{KraussNipkow11}.
 The purpose of this paper is to
 Given a finite set of regular expressions @{text rs}, we will make use of the operation of generating
 a regular expression that matches all languages of @{text rs}. We only need to know the existence
 of such a regular expression and therefore we use Isabelle/HOL's @{const "fold_graph"} and Hilbert's
 @{text "\<epsilon>"} to define @{term "\<Uplus>rs"}. This operation, roughly speaking, folds @{const ALT} over the
 set @{text rs} with @{const NULL} for the empty set. We can prove that for a finite set @{text rs}
+%
-\begin{center}
+\begin{equation}\label{uplus}
-@{thm (lhs) folds_alt_simp} @{text "= \<Union> (\<calL> ` rs)"}
+\mbox{@{thm (lhs) folds_alt_simp} @{text "= \<Union> (\<calL> ` rs)"}}
-\end{center}
+\end{equation}
 \noindent
 holds, whereby @{text "\<calL> ` rs"} stands for the
 image of the set @{text rs} under function @{text "\<calL>"}.
 *}
 \noindent
 which means that if we concatenate the character @{text c} to the end of all
 strings in the equivalence class @{text Y}, we obtain a subset of
 @{text X}. Note that we do not define an automaton here, we merely relate two sets
-(with respect to a character). In our concrete example we have
+(with the help of a character). In our concrete example we have
 @{term "X\<^isub>1 \<Turnstile>c\<Rightarrow> X\<^isub>2"}, @{term "X\<^isub>1 \<Turnstile>d\<Rightarrow> X\<^isub>3"} with @{text d} being any
 other character than @{text c}, and @{term "X\<^isub>3 \<Turnstile>d\<Rightarrow> X\<^isub>3"} for any @{text d}.
-Next we build an \emph{initial} equational system that
+Next we build an \emph{initial equational system} that
 contains an equation for each equivalence class. Suppose we have
 the equivalence classes @{text "X\<^isub>1,\<dots>,X\<^isub>n"}, there must be one and only one that
 contains the empty string @{text "[]"} (since equivalence classes are disjoint).
 Let us assume @{text "[] \<in> X\<^isub>1"}. We build the following equational system
 \noindent
 where the terms @{text "(Y\<^isub>i\<^isub>j, CHAR c\<^isub>i\<^isub>j)"}
 stand for all transitions @{term "Y\<^isub>i\<^isub>j \<Turnstile>c\<^isub>i\<^isub>j\<Rightarrow>
 X\<^isub>i"}.   There can only be
-finitely many such terms in a right-hand side since there are only finitely many
+finitely many such terms in a right-hand side since by assumption there are only finitely many
 equivalence classes and only finitely many characters.  The term @{text
 "\<lambda>(EMPTY)"} in the first equation acts as a marker for the equivalence class
 containing @{text "[]"}.\footnote{Note that we mark, roughly speaking, the
 single ``initial'' state in the equational system, which is different from
 the method by Brzozowski \cite{Brzozowski64}, where he marks the
 in Lem.~\ref{inv}. From the solved form we will be able to read
 off the regular expressions.
 In order to transform an equational system into solved form, we have two
 operations: one that takes an equation of the form @{text "X = rhs"} and removes
-any recursive occurences of @{text X} in the @{text rhs} using our variant of Arden's
+any recursive occurrences of @{text X} in the @{text rhs} using our variant of Arden's
 Lemma. The other operation takes an equation @{text "X = rhs"}
 and substitutes @{text X} throughout the rest of the equational system
-adjusting the remaining regular expressions approriately. To define this adjustment
+adjusting the remaining regular expressions appropriately. To define this adjustment
 we define the \emph{append-operation} taking a term and a regular expression as argument
 \begin{center}
 @{thm append_rexp.simps(2)[where X="Y" and r="r\<^isub>1" and rexp="r\<^isub>2", THEN eq_reflection]}\hspace{10mm}
 @{thm append_rexp.simps(1)[where r="r\<^isub>1" and rexp="r\<^isub>2", THEN eq_reflection]}
 \noindent
 We lift this operation to entire right-hand sides of equations, written as
 @{thm (lhs) append_rhs_rexp_def[where rexp="r"]}. With this we can define
 the \emph{arden-operation} for an equation of the form @{text "X = rhs"} as:
+%
-\begin{center}
+\begin{equation}\label{arden_def}
-\begin{tabular}{rc@ {\hspace{2mm}}r@ {\hspace{1mm}}l}
+\mbox{\begin{tabular}{rc@ {\hspace{2mm}}r@ {\hspace{1mm}}l}
 @{thm (lhs) Arden_def} & @{text "\<equiv>"}~~\mbox{} & \multicolumn{2}{@ {\hspace{-2mm}}l}{@{text "let"}}\\
 & & @{text "rhs' ="} & @{term "rhs - {Trn X r | r. Trn X r \<in> rhs}"} \\
 & & @{text "r' ="}   & @{term "STAR (\<Uplus> {r. Trn X r \<in> rhs})"}\\
 & &  \multicolumn{2}{@ {\hspace{-2mm}}l}{@{text "in"}~~@{term "append_rhs_rexp rhs' r'"}}\\
-\end{tabular}
+\end{tabular}}
-\end{center}
+\end{equation}
 \noindent
 In this definition, we first delete all terms of the form @{text "(X, r)"} from @{text rhs};
-then we calculate the combinded regular expressions for all @{text r} coming
+then we calculate the combined regular expressions for all @{text r} coming
 from the deleted @{text "(X, r)"}, and take the @{const STAR} of it;
 finally we append this regular expression to @{text rhs'}. It can be easily seen
-that this operation mimics Arden's lemma on the level of equations.
+that this operation mimics Arden's lemma on the level of equations. To ensure
+the non-emptiness condition of Arden's lemma we say that a right-hand side is
+\emph{ardenable} provided
+\begin{center}
+@{thm ardenable_def}
+\end{center}
+\noindent
+This allows us to prove
+\begin{lemma}\label{ardenable}
+If @{text "X = \<Union>\<calL> ` rhs"},
+@{thm (prem 2) Arden_keeps_eq} and
+@{thm (prem 3) Arden_keeps_eq}, then
+@{text "X = \<Union>\<calL> ` (Arden X rhs)"}
+\end{lemma}
+\noindent
 The \emph{substituion-operation} takes an equation
 of the form @{text "X = xrhs"} and substitutes it into the right-hand side @{text rhs}.
 \begin{center}
 \begin{tabular}{rc@ {\hspace{2mm}}r@ {\hspace{1mm}}l}
 & &  \multicolumn{2}{@ {\hspace{-2mm}}l}{@{text "in"}~~@{term "rhs' \<union> append_rhs_rexp xrhs r'"}}\\
 \end{tabular}
 \end{center}
 \noindent
-We again delete first all occurence of @{text "(X, r)"} in @{text rhs}; we then calculate
+We again delete first all occurrence of @{text "(X, r)"} in @{text rhs}; we then calculate
 the regular expression corresponding to the deleted terms; finally we append this
 regular expression to @{text "xrhs"} and union it up with @{text rhs'}. When we use
 the substitution operation we will arrange it so that @{text "xrhs"} does not contain
-any occurence of @{text X}.
+any occurrence of @{text X}.
 With these two operation in place, we can define the operation that removes one equation
 from an equational systems @{text ES}. The operation @{const Subst_all}
 substitutes an equation @{text "X = xrhs"} throughout an equational system @{text ES};
 @{const Remove} then completely removes such an equation from @{text ES} by substituting
-it to the rest of the equational system, but first eliminating all recursive occurences
+it to the rest of the equational system, but first eliminating all recursive occurrences
 of @{text X} by applying @{const Arden} to @{text "xrhs"}.
 \begin{center}
 \begin{tabular}{rcl}
 @{thm (lhs) Subst_all_def} & @{text "\<equiv>"} & @{thm (rhs) Subst_all_def}\\
 @{thm (lhs) Remove_def}    & @{text "\<equiv>"} & @{thm (rhs) Remove_def}
 \end{tabular}
 \end{center}
 \noindent
-Finially, we can define how an equational system should be solved. For this
+Finally, we can define how an equational system should be solved. For this
 we will need to iterate the process of eliminating equations until only one equation
 will be left in the system. However, we not just want to have any equation
 as being the last one, but the one involving the equivalence class for
 which we want to calculate the regular
 expression. Let us suppose this equivalence class is @{text X}.
 & &  \multicolumn{2}{@ {\hspace{-4mm}}l}{@{text "in"}~~@{term "Remove ES Y yrhs"}}\\
 \end{tabular}
 \end{center}
 \noindent
-The last definition we need applies @{term Iter} over and over again until a condition
+The last definition we need applies @{term Iter} over and over until a condition
-@{text COND} is \emph{not} satisfied anymore. The condition states that there
+@{text Cond} is \emph{not} satisfied anymore. The condition states that there
-are more than one equation left in the equational system @{text ES}. For this
+are more than one equation left in the equational system @{text ES}. To solve
-we use Isabelle/HOL's @{text while}-operator as follows:
+an equational system we use Isabelle/HOL's @{text while}-operator as follows:
 \begin{center}
 @{thm Solve_def}
 \end{center}
 of the equational system @{text ES}. This enables us to prove
 properties about our definition of @{const Solve} when we ``call'' it with
 the equivalence class @{text X} and the initial equational system
 @{term "Init (UNIV // \<approx>A)"} from
 \eqref{initcs} using the principle:
+%
+\begin{equation}\label{whileprinciple}
-\begin{center}
+\mbox{\begin{tabular}{l}
-\begin{tabular}{l}
 @{term "invariant (Init (UNIV // \<approx>A))"} \\
 @{term "\<forall>ES. invariant ES \<and> Cond ES \<longrightarrow> invariant (Iter X ES)"}\\
 @{term "\<forall>ES. invariant ES \<and> Cond ES \<longrightarrow> card (Iter X ES) < card ES"}\\
 @{term "\<forall>ES. invariant ES \<and> \<not> Cond ES \<longrightarrow> P ES"}\\
 \hline
 \multicolumn{1}{c}{@{term "P (Solve X (Init (UNIV // \<approx>A)))"}}
-\end{tabular}
+\end{tabular}}
-\end{center}
+\end{equation}
 \noindent
 This principle states that given an invariant (which we will specify below)
 we can prove a property
 @{text "P"} involving @{const Solve}. For this we have to discharge the following
 @{term "finite ES"} & @{text "(finiteness)"}\\
 & @{text "\<and>"} & @{thm (rhs) finite_rhs_def} & @{text "(finiteness rhs)"}\\
 & @{text "\<and>"} & @{text "\<forall>(X, rhs)\<in>ES. X = \<Union>\<calL> ` rhs"} & @{text "(soundness)"}\\
 & @{text "\<and>"} & @{thm (rhs) distinct_equas_def}\\
 &             &  & @{text "(distinctness)"}\\
-& @{text "\<and>"} & @{thm (rhs) ardenable_def} & @{text "(ardenable)"}\\
+& @{text "\<and>"} & @{thm (rhs) ardenable_all_def} & @{text "(ardenable)"}\\
 & @{text "\<and>"} & @{thm (rhs) valid_eqs_def} & @{text "(validity)"}\\
 \end{tabular}
 \end{center}
 \noindent
 The first two ensure that the equational system is always finite (number of equations
 and number of terms in each equation); the second makes sure the ``meaning'' of the
 equations is preserved under our transformations. The other properties are a bit more
 technical, but are needed to get our proof through. Distinctness states that every
-equation in the system is distinct; @{text "ardenable"} ensures that we can always
+equation in the system is distinct. Ardenable ensures that we can always
-apply the arden operation. For this we have to make sure that in every @{text rhs},
+apply the arden operation.
-terms of the form @{term "Trn Y r"} cannot have a regular expresion that matches the
-empty string. Therefore @{text "rhs_nonempty"} is defined as
-\begin{center}
-@{thm rhs_nonempty_def}
-\end{center}
-\noindent
 The last property states that every @{text rhs} can only contain equivalence classes
 for which there is an equation. Therefore @{text lhss} is just the set containing
 the first components of an equational system,
 while @{text "rhss"} collects all equivalence classes @{text X} in the terms of the
-form @{term "Trn X r"} (that means @{thm (lhs) lhss_def}~@{text "\<equiv> {X | (X, rhs) \<in> ES}"}
+form @{term "Trn X r"}. That means @{thm (lhs) lhss_def}~@{text "\<equiv> {X | (X, rhs) \<in> ES}"}
-and @{thm (lhs) rhss_def}~@{text "\<equiv> {X | (X, r) \<in> rhs}"}).
+and @{thm (lhs) rhss_def}~@{text "\<equiv> {X | (X, r) \<in> rhs}"}.
-It is straightforward to prove that the inital equational system satisfies the
+It is straightforward to prove that the initial equational system satisfies the
 invariant.
-\begin{lemma}
+\begin{lemma}\label{invzero}
 @{thm[mode=IfThen] Init_ES_satisfies_invariant}
 \end{lemma}
 \begin{proof}
 Finiteness is given by the assumption and the way how we set up the
 initial equational system. Soundness is proved in Lem.~\ref{inv}. Distinctness
 follows from the fact that the equivalence classes are disjoint. The ardenable
-property also follows from the setup of the equational system as does
+property also follows from the setup of the equational system, as does
 validity.\qed
 \end{proof}
-\begin{lemma}
+\begin{lemma}\label{iterone}
 @{thm[mode=IfThen] iteration_step_invariant[where xrhs="rhs"]}
 \end{lemma}
 \begin{proof}
-???
+This boils down to choosing an equation @{text "Y = yrhs"} to be eliminated
+and to show that @{term "Subst_all (ES - {(Y, yrhs)}) Y (Arden Y yrhs)"}
+preserves the invariant.
+We prove this as follows:
+\begin{center}
+@{text "\<forall> ES."} @{thm (prem 1) Subst_all_satisfies_invariant} implies
+@{thm (concl) Subst_all_satisfies_invariant}
+\end{center}
+\noindent
+Finiteness is straightforward, as @{const Subst} and @{const Arden} operations
+keep the equational system finite. These operation also preserve soundness
+distinctness (we proved soundness for @{const Arden} in Lem.~\ref{ardenable}).
+The property Ardenable is clearly preserved because the append-operation
+cannot make a regular expression to match the empty string. Validity is
+given because @{const Arden} removes an equivalence class from @{text yrhs}
+and then @{const Subst_all} removes @{text Y} from the equational system.
+Having proved the implication above, we can replace @{text "ES"} with @{text "ES - {(Y, yrhs)}"}
+which matches with our proof-obligation of @{const "Subst_all"}. Since
+\mbox{@{term "ES = ES - {(Y, yrhs)} \<union> {(Y, yrhs)}"}}, we can use our assumption
+to complete the proof.\qed
 \end{proof}
-\begin{lemma}
+\begin{lemma}\label{itertwo}
 @{thm[mode=IfThen] iteration_step_measure[simplified (no_asm), where xrhs="rhs"]}
 \end{lemma}
 \begin{proof}
 By assumption we know that @{text "ES"} is finite and has more than one element.
 Therefore there must be an element @{term "(Y, yrhs) \<in> ES"} with
-@{term "(Y, yrhs) \<noteq> (X, rhs)"}. Using the distictness property we can infer
+@{term "(Y, yrhs) \<noteq> (X, rhs)"}. Using the distinctness property we can infer
 that @{term "Y \<noteq> X"}. We further know that @{text "Remove ES Y yrhs"}
 removes the equation @{text "Y = yrhs"} from the system, and therefore
 the cardinality of @{const Iter} strictly decreases.\qed
 \end{proof}
 a @{text rhs} such that  @{term "Solve X (Init (UNIV // \<approx>A)) = {(X, rhs)}"}
 and @{term "invariant {(X, rhs)}"}.
 \end{lemma}
 \begin{proof}
-???
+In order to prove this lemma using \eqref{whileprinciple}, we have to use a slightly
+stronger invariant since Lem.~\ref{iterone} and \ref{itertwo} have the precondition
+that @{term "(X, rhs) \<in> ES"} for some @{text rhs}. This precondition is needed
+in order to choose in the @{const Iter}-step an equation that is not \mbox{@{term "X = rhs"}}.
+Therefore our invariant is cannot be just @{term "invariant ES"}, but must be
+@{term "invariant ES \<and> (\<exists>rhs. (X, rhs) \<in> ES)"}. By assumption
+@{thm (prem 2) Solve} and Lem.~\ref{invzero}, the more general invariant holds for
+the initial equational system. This is premise 1 of~\eqref{whileprinciple}.
+Premise 2 is given by Lem.~\ref{iterone} and the fact that @{const Iter} might
+modify the @{text rhs} in the equation @{term "X = rhs"}, but does not remove it.
+Premise 3 of~\eqref{whileprinciple} is by Lem.~\ref{itertwo}. Now in premise 4
+we like to show that there exists a @{text rhs} such that @{term "ES = {(X, rhs)}"}
+and that @{text "invariant {(X, rhs)}"} holds, provided the condition @{text "Cond"}
+does not hols. By the stronger invariant we know there exists such a @{text "rhs"}
+with @{term "(X, rhs) \<in> ES"}. Because @{text Cond} is not true, we know the cardinality
+of @{text ES} is @{text 1}. This means @{text "ES"} must actually be the equation @{text "X = rhs"},
+for which the invariant holds. This allows us to conclude that
+@{term "Solve X (Init (UNIV // \<approx>A)) = {(X, rhs)}"} and @{term "invariant {(X, rhs)}"} hold.\qed
 \end{proof}
 \noindent
 With this lemma in place we can show that for every equivalence class in @{term "UNIV // \<approx>A"}
 there exists a regular expression.
 By the preceeding Lemma, we know that there exists a @{text "rhs"} such
 that @{term "Solve X (Init (UNIV // \<approx>A))"} returns the equation @{text "X = rhs"},
 and that the invariant holds for this equation. That means we
 know @{text "X = \<Union>\<calL> ` rhs"}. We further know that
 this is equal to \mbox{@{text "\<Union>\<calL> ` (Arden X rhs)"}} using the properties of the
-invariant and Lem.~???. Using the validity property for the equation @{text "X = rhs"},
+invariant and Lem.\ref{ardenable}. Using the validity property for the equation @{text "X = rhs"},
 we can infer that @{term "rhss rhs \<subseteq> {X}"} and because the arden operation
 removes that @{text X} from @{text rhs}, that @{term "rhss (Arden X rhs) = {}"}.
 That means @{term "Arden X rhs"} can only consist of terms of the form @{term "Lam r"}.
 So we can collect those (finitely many) regular expressions and have @{term "X = L (\<Uplus>rs)"}.
 With this we can conclude the proof.\qed
 of the Myhill-Nerode theorem.
 \begin{proof}[of Thm.~\ref{myhillnerodeone}]
 By Lem.~\ref{every_eqcl_has_reg} we know that there exists a regular language for
 every equivalence class in @{term "UNIV // \<approx>A"}. Since @{text "finals A"} is
-a subset of  @{term "UNIV // \<approx>A"}, we also know that for every equvalence class
+a subset of  @{term "UNIV // \<approx>A"}, we also know that for every equivalence class
 in @{term "finals A"} there exists a regular language. Moreover by assumption
 we know that @{term "finals A"} must be finite, and therefore there must be a finite
 set of regular expressions @{text "rs"} such that
 \begin{center}

changeset 110	e500cab16be4
parent 109	79b37ef9505f
child 111	d65d071798ff