regexp: comparison Paper/Paper.thy

equal deleted inserted replaced

-:ab6637008963
+:23c0e6f2929d
 \noindent
 In @{text "(ii)"} we use the notation @{term "length s"} for the length of a
 string; this property states that if @{term "[] \<notin> A"} then the lengths of
 the strings in @{term "A \<up> (Suc n)"} must be longer than @{text n}.  We omit
 the proofs for these properties, but invite the reader to consult our
-formalisation.\footnote{Available at \url{http://www4.in.tum.de/~urbanc/cgi-bin/repos.cgi/regexp/}}
+formalisation.\footnote{Available at \url{http://www4.in.tum.de/~urbanc/regexp.html}}
 The notation in Isabelle/HOL for the quotient of a language @{text A} according to an
 equivalence relation @{term REL} is @{term "A // REL"}. We will write
 @{text "\<lbrakk>x\<rbrakk>\<^isub>\<approx>"} for the equivalence class defined
 as @{text "{y | y \<approx> x}"}.
 Central to our proof will be the solution of equational systems
 involving equivalence classes of languages. For this we will use Arden's lemma \cite{Brzozowski64},
 which solves equations of the form @{term "X = A ;; X \<union> B"} provided
 @{term "[] \<notin> A"}. However we will need the following `reverse'
-version of Arden's lemma (`reverse' in the sense of changing the order of @{text "\<cdot>"}).
+version of Arden's lemma (`reverse' in the sense of changing the order of @{term "A ;; X"} to
+\mbox{@{term "X ;; A"}}).
 \begin{lemma}[Reverse Arden's Lemma]\label{arden}\mbox{}\\
 If @{thm (prem 1) arden} then
 @{thm (lhs) arden} if and only if
 @{thm (rhs) arden}.
 \emph{Myhill-Nerode relation}, which states that w.r.t.~a language two
 strings are related, provided there is no distinguishing extension in this
 language. This can be defined as tertiary relation.
 \begin{definition}[Myhill-Nerode Relation] Given a language @{text A}, two strings @{text x} and
-@{text y} are related provided
+@{text y} are Myhill-Nerode related provided
 \begin{center}
 @{thm str_eq_def[simplified str_eq_rel_def Pair_Collect]}
 \end{center}
 \end{definition}
 containing @{text "[]"}.\footnote{Note that we mark, roughly speaking, the
 single `initial' state in the equational system, which is different from
 the method by Brzozowski \cite{Brzozowski64}, where he marks the
 `terminal' states. We are forced to set up the equational system in our
 way, because the Myhill-Nerode relation determines the `direction' of the
-transitions. The successor `state' of an equivalence class @{text Y} can
+transitions---the successor `state' of an equivalence class @{text Y} can
-be reached by adding characters to the end of @{text Y}. This is also the
+be reached by adding a character to the end of @{text Y}. This is also the
 reason why we have to use our reverse version of Arden's lemma.}
 Overloading the function @{text \<calL>} for the two kinds of terms in the
 equational system, we have
 \begin{center}
 apply the arden operation.
 The last property states that every @{text rhs} can only contain equivalence classes
 for which there is an equation. Therefore @{text lhss} is just the set containing
 the first components of an equational system,
 while @{text "rhss"} collects all equivalence classes @{text X} in the terms of the
-form @{term "Trn X r"}. That means @{thm (lhs) lhss_def}~@{text "\<equiv> {X | (X, rhs) \<in> ES}"}
+form @{term "Trn X r"}. That means formally @{thm (lhs) lhss_def}~@{text "\<equiv> {X | (X, rhs) \<in> ES}"}
 and @{thm (lhs) rhss_def}~@{text "\<equiv> {X | (X, r) \<in> rhs}"}.
 It is straightforward to prove that the initial equational system satisfies the
 invariant.
 Premise 3 of~\eqref{whileprinciple} is by Lem.~\ref{itertwo}. Now in premise 4
 we like to show that there exists a @{text rhs} such that @{term "ES = {(X, rhs)}"}
 and that @{text "invariant {(X, rhs)}"} holds, provided the condition @{text "Cond"}
 does not holds. By the stronger invariant we know there exists such a @{text "rhs"}
 with @{term "(X, rhs) \<in> ES"}. Because @{text Cond} is not true, we know the cardinality
-of @{text ES} is @{text 1}. This means @{text "ES"} must actually be the equation @{text "X = rhs"},
+of @{text ES} is @{text 1}. This means @{text "ES"} must actually be the set @{text "{(X, rhs)}"},
 for which the invariant holds. This allows us to conclude that
 @{term "Solve X (Init (UNIV // \<approx>A)) = {(X, rhs)}"} and @{term "invariant {(X, rhs)}"} hold,
 as needed.\qed
 \end{proof}
 By the preceeding Lemma, we know that there exists a @{text "rhs"} such
 that @{term "Solve X (Init (UNIV // \<approx>A))"} returns the equation @{text "X = rhs"},
 and that the invariant holds for this equation. That means we
 know @{text "X = \<Union>\<calL> ` rhs"}. We further know that
 this is equal to \mbox{@{text "\<Union>\<calL> ` (Arden X rhs)"}} using the properties of the
-invariant and Lem.\ref{ardenable}. Using the validity property for the equation @{text "X = rhs"},
+invariant and Lem.~\ref{ardenable}. Using the validity property for the equation @{text "X = rhs"},
 we can infer that @{term "rhss rhs \<subseteq> {X}"} and because the arden operation
 removes that @{text X} from @{text rhs}, that @{term "rhss (Arden X rhs) = {}"}.
 This means the right-hand side @{term "Arden X rhs"} can only consist of terms of the form @{term "Lam r"}.
 So we can collect those (finitely many) regular expressions and have @{term "X = L (\<Uplus>rs)"}.
 With this we can conclude the proof.\qed
 \noindent
 Lem.~\ref{every_eqcl_has_reg} allows us to finally give a proof for the first direction
 of the Myhill-Nerode theorem.
 \begin{proof}[of Thm.~\ref{myhillnerodeone}]
-By Lem.~\ref{every_eqcl_has_reg} we know that there exists a regular language for
+By Lem.~\ref{every_eqcl_has_reg} we know that there exists a regular expression for
 every equivalence class in @{term "UNIV // \<approx>A"}. Since @{text "finals A"} is
 a subset of  @{term "UNIV // \<approx>A"}, we also know that for every equivalence class
-in @{term "finals A"} there exists a regular language. Moreover by assumption
+in @{term "finals A"} there exists a regular expression. Moreover by assumption
 we know that @{term "finals A"} must be finite, and therefore there must be a finite
 set of regular expressions @{text "rs"} such that
 \begin{center}
 @{term "\<Union>(finals A) = L (\<Uplus>rs)"}
 Our method will rely on some
 \emph{tagging functions} defined over strings. Given the inductive hypothesis, it will
 be easy to prove that the range of these tagging functions is finite
 (the range of a function @{text f} is defined as @{text "range f \<equiv> f ` UNIV"}).
-With this we will be able to infer that the tagging functions, seen as a relation,
+With this we will be able to infer that the tagging functions, seen as relations,
 give rise to finitely many equivalence classes of @{const UNIV}. Finally we
 will show that the tagging relation is more refined than @{term "\<approx>(L r)"}, which
-implies that @{term "UNIV // \<approx>(L r)"} must also be finite.
+implies that @{term "UNIV // \<approx>(L r)"} must also be finite (a relation @{text "R\<^isub>1"}
-A relation @{text "R\<^isub>1"} is said to \emph{refine} @{text "R\<^isub>2"} provided @{text "R\<^isub>1 \<subseteq> R\<^isub>2"}.
+is said to \emph{refine} @{text "R\<^isub>2"} provided @{text "R\<^isub>1 \<subseteq> R\<^isub>2"}).
-We formally define the notion of a \emph{tag-relation} as follows.
+We formally define the notion of a \emph{tagging-relation} as follows.
-\begin{definition}[Tagging-Relation] Given a tag-function @{text tag}, then two strings @{text x}
+\begin{definition}[Tagging-Relation] Given a tagging-function @{text tag}, then two strings @{text x}
 and @{text y} are \emph{tag-related} provided
 \begin{center}
 @{text "x =tag= y \<equiv> tag x = tag y"}
 \end{center}
 \end{definition}
 \noindent
-In order to establish finiteness of a set @{text A} we shall use the following powerful
+In order to establish finiteness of a set @{text A}, we shall use the following powerful
 principle from Isabelle/HOL's library.
 %
 \begin{equation}\label{finiteimageD}
 @{thm[mode=IfThen] finite_imageD}
 \end{equation}
 \noindent
-It states that if an image of a set under an injective function @{text f} (over this set)
+It states that if an image of a set under an injective function @{text f} (injective over this set)
 is finite, then @{text A} itself must be finite. We can use it to establish the following
 two lemmas.
 \begin{lemma}\label{finone}
 @{thm[mode=IfThen] finite_eq_tag_rel}
 \end{lemma}
 \begin{proof}
 We set in \eqref{finiteimageD}, @{text f} to be @{text "X \<mapsto> tag ` X"}. We have
-@{text "range f"} to be subset of @{term "Pow (range tag)"}, which we know must be
+@{text "range f"} to be a subset of @{term "Pow (range tag)"}, which we know must be
 finite by assumption. Now @{term "f (UNIV // =tag=)"} is a subset of @{text "range f"},
 and so also finite. Injectivity amounts to showing that @{text "X = Y"} under the
 assumptions that @{text "X, Y \<in> "}~@{term "UNIV // =tag="} and @{text "f X = f Y"}.
-From the assumption we can obtain a @{text "x \<in> X"} and @{text "y \<in> Y"} with
+From the assumption we can obtain @{text "x \<in> X"} and @{text "y \<in> Y"} with
-@{text "tag x = tag y"}. This in turn means that the equivalence classes @{text X}
+@{text "tag x = tag y"}. Since @{text x} and @{text y} are tag-related, this in
+turn means that the equivalence classes @{text X}
 and @{text Y} must be equal.\qed
 \end{proof}
 \begin{lemma}\label{fintwo}
-Given two equivalence relations @{text "R\<^isub>1"} and @{text "R\<^isub>2"}, where
+Given two equivalence relations @{text "R\<^isub>1"} and @{text "R\<^isub>2"}, whereby
 @{text "R\<^isub>1"} refines @{text "R\<^isub>2"}.
 If @{thm (prem 1) refined_partition_finite[where ?R1.0="R\<^isub>1" and ?R2.0="R\<^isub>2"]}
 then @{thm (concl) refined_partition_finite[where ?R1.0="R\<^isub>1" and ?R2.0="R\<^isub>2"]}.
 \end{lemma}
 \begin{proof}
-We prove this lemma again using \eqref{finiteimageD}. For this we set @{text f} to
+We prove this lemma again using \eqref{finiteimageD}. This time we set @{text f} to
 be @{text "X \<mapsto>"}~@{term "{R\<^isub>1 `` {x} | x. x \<in> X}"}. It is easy to see that
 @{text "finite (f ` (UNIV // R\<^isub>2))"} because it is a subset of @{text "Pow (UNIV // R\<^isub>1)"},
 which is finite by assumption. What remains to be shown is that @{text f} is injective
 on @{term "UNIV // R\<^isub>2"}. This is equivalent to showing that two equivalence
 classes, say @{text "X"} and @{text Y}, in @{term "UNIV // R\<^isub>2"} are equal, provided
 @{text "f X = f Y"}. For @{text "X = Y"} to be equal, we have to find two elements
 @{text "x \<in> X"} and @{text "y \<in> Y"} such that they are @{text R\<^isub>2} related.
 We know there exists a @{text x} with @{term "X = R\<^isub>2 `` {x}"}
 and @{text "x \<in> X"}. From the latter fact we can infer that @{term "R\<^isub>1 ``{x} \<in> f X"}
-and further @{term "R\<^isub>1 ``{x} \<in> f Y"}. From this we can obtain a @{text y}
+and further @{term "R\<^isub>1 ``{x} \<in> f Y"}. This means we can obtain a @{text y}
-such that @{term "R\<^isub>1 `` {x} = R\<^isub>1 `` {y}"} holds. This means @{text x} and @{text y}
+such that @{term "R\<^isub>1 `` {x} = R\<^isub>1 `` {y}"} holds. Consequently @{text x} and @{text y}
 are @{text "R\<^isub>1"}-related. Since by assumption @{text "R\<^isub>1"} refines @{text "R\<^isub>2"},
 they must also be @{text "R\<^isub>2"}-related, as we need to show.\qed
 \end{proof}
 \noindent
 Chaining Lem.~\ref{finone} and \ref{fintwo} together, means in order to show
 that @{term "UNIV // \<approx>(L r)"} is finite, we have to find a tagging function whose
 range can be shown to be finite and whose tagging-relation refines @{term "\<approx>(L r)"}.
-Let us attempt the @{const ALT}-case.
+Let us attempt the @{const ALT}-case first.
 \begin{proof}[@{const "ALT"}-Case]
 We take as tagging function
 \begin{center}
 \noindent
 where @{text "A"} and @{text "B"} are some arbitrary languages.
 We can show in general, if @{term "finite (UNIV // \<approx>A)"} and @{term "finite (UNIV // \<approx>B)"}
 then @{term "finite ((UNIV // \<approx>A) \<times> (UNIV // \<approx>B))"} holds. The range of
-@{term "tag_str_ALT A B"} is a subset of this product set. It remains to show
+@{term "tag_str_ALT A B"} is a subset of this product set, so finite. It remains to be shown
 that @{text "=tag\<^isub>A\<^isub>L\<^isub>T A B="} refines @{term "\<approx>(A \<union> B)"}. This amounts to
 showing
 %
 \begin{center}
 @{term "tag\<^isub>A\<^isub>L\<^isub>T A B x = tag\<^isub>A\<^isub>L\<^isub>T A B y \<longrightarrow> x \<approx>(A \<union> B) y"}
 @{text "\<forall>z. tag\<^isub>A\<^isub>L\<^isub>T A B x = tag\<^isub>A\<^isub>L\<^isub>T A B y \<and> x @ z \<in> A \<union> B \<longrightarrow> y @ z \<in> A \<union> B"}
 \end{equation}
 \noindent
 since both @{text "=tag\<^isub>A\<^isub>L\<^isub>T A B="} and @{term "\<approx>(A \<union> B)"} are symmetric. To solve
-\eqref{pattern} we just have to unfold the tagging-relation and analyse
+\eqref{pattern} we just have to unfold the definition of the tagging-relation and analyse
-in which set, @{text A} or @{text B}, the string @{term "x @ z"} is. Fianlly we
+in which set, @{text A} or @{text B}, the string @{term "x @ z"} is.
+The definition of the tagging-function will give us in each case the
+information to infer that @{text "y @ z \<in> A \<union> B"}.
+Finally we
 can discharge this case by setting @{text A} to @{term "L r\<^isub>1"} and @{text B} to @{term "L r\<^isub>2"}.\qed
 \end{proof}
 \noindent
 The pattern in \eqref{pattern} is repeated for the other two cases. Unfortunately,
-they are slightly more complicated.
+they are slightly more complicated. In the @{const SEQ}-case we essentially have
+to be able to infer that
+\begin{center}
+@{term "x @ z \<in> A ;; B \<longrightarrow> y @ z \<in> A ;; B"}
+\end{center}
+\noindent
+using the information given by the appropriate tagging function. The complication
+is to find out what the possible splits of @{text "x @ z"} are to be in @{term "A ;; B"}.
 \begin{proof}[@{const SEQ}-Case]
 \begin{center}
 "\<Rightarrow>"} deterministic automaton @{text "\<Rightarrow>"} complement automaton @{text "\<Rightarrow>"}
 regular expression.
 While regular expressions are convenient in formalisations, they have some
 limitations. One is that there seems to be no method of calculating a
-minimal regular expression (for example in terms of length), like there is
+minimal regular expression (for example in terms of length) for a regular
-for automata. For an implementation of a simple regular expression matcher,
+language, like there is
+for automata. On the other hand, efficient regular expression matching,
+without using automata, poses no problem \cite{OwensReppyTuron09}.
+For an implementation of a simple regular expression matcher,
 whose correctness has been formally established, we refer the reader to
 Owens and Slind \cite{OwensSlind08}.
 Our formalisation consists of 790 lines of Isabelle/Isar code for the first
-direction and 475 for the second, plus 300 lines of standard material about
+direction and 475 for the second, plus around 300 lines of standard material about
 regular languages. While this might be seen as too large to count as a
 concise proof pearl, this should be seen in the context of the work done by
 Constable at al \cite{Constable00} who formalised the Myhill-Nerode theorem
 in Nuprl using automata. They write that their four-member team needed
 something on the magnitute of 18 months for their formalisation. The
 from what is shown in the Nuprl Math Library about their development it
 seems substantially larger than ours. The code of ours can be found in the
 Mercurial Repository at
 \begin{center}
-\url{http://www4.in.tum.de/~urbanc/cgi-bin/repos.cgi/regexp/}
+\url{http://www4.in.tum.de/~urbanc/regexp.html}
 \end{center}
 Our proof of the first direction is very much inspired by \emph{Brzozowski's
 algebraic mehod} used to convert a finite automaton to a regular
 classes as the states of the minimal automaton for the regular language.
 However there are some subtle differences. Since we identify equivalence
 classes with the states of the automaton, then the most natural choice is to
 characterise each state with the set of strings starting from the initial
 state leading up to that state. Usually, however, the states are characterised as the
-ones starting from that state leading to the terminal states.  The first
+strings starting from that state leading to the terminal states.  The first
-choice has consequences how the initial equational system is set up. We have
+choice has consequences about how the initial equational system is set up. We have
 the $\lambda$-term on our `initial state', while Brzozowski has it on the
 terminal states. This means we also need to reverse the direction of Arden's
 lemma.
 We briefly considered using the method Brzozowski presented in the Appendix
 of~\cite{Brzozowski64} in order to prove the second direction of the
 Myhill-Nerode theorem. There he calculates the derivatives for regular
 expressions and shows that there can be only finitely many of them. We could
 have used as the tag of a string @{text s} the derivative of a regular expression
 generated with respect to @{text s}.  Using the fact that two strings are
-Myhill-Nerode related whenever their derivative is the same together with
+Myhill-Nerode related whenever their derivative is the same, together with
 the fact that there are only finitely many derivatives for a regular
 expression would give us a similar argument as ours. However it seems not so easy to
 calculate the derivatives and then to count them. Therefore we preferred our
-direct method of using tagging-functions involving equivalence classes. This
+direct method of using tagging-functions. This
 is also where our method shines, because we can completely side-step the
 standard argument \cite{Kozen97} where automata need to be composed, which
 as stated in the Introduction is not so convenient to formalise in a
 HOL-based theorem prover. However, it is also the direction where we had to
-spend most of the `conceptual' time, as our proof-argument based on tagging functions
+spend most of the `conceptual' time, as our proof-argument based on tagging-functions
-is new for establishing the Myhill-Nerode theorem.
+is new for establishing the Myhill-Nerode theorem. All standard proofs
+of this direction proceed by arguments over automata.
 *}
 (*<*)

changeset 123	23c0e6f2929d
parent 122	ab6637008963
child 124	8233510cab6c