regexp: comparison Journal/Paper.thy


-:a9598a206c41
+:40b8d485ce8d
 lexing. Berghofer and Reiter \cite{BerghoferReiter09} formalise automata
 working over bit strings in the context of Presburger arithmetic.  The only
 larger formalisations of automata theory are carried out in Nuprl
 \cite{Constable00} and in Coq, e.g.~\cite{Filliatre97,Almeidaetal10}.
-Also one might consider that automata are convenient `vehicles' for
+Also, one might consider automata as just convenient `vehicles' for
 establishing properties about regular languages.  However, paper proofs
 about automata often involve subtle side-conditions which are easily
 overlooked, but which make formal reasoning rather painful. For example,
 Kozen's proof of the Myhill-Nerode theorem requires that automata do not
 have inaccessible states \cite{Kozen97}. Another subtle side-condition is
 that if we define a regular language as one for which there exists \emph{a}
 finite automaton that recognises all its strings (see
 Definition~\ref{baddef}), then we need a lemma which ensures that another
 equivalent one can be found satisfying the side-condition, and also need to
 make sure our operations on automata preserve them. Unfortunately, such
-`little' and `obvious' lemmas make a formalisation of automata theory a
+`little' and `obvious' lemmas make formalisations of automata theory
-hair-pulling experience.
+hair-pulling experiences.
 In this paper, we will not attempt to formalise automata theory in
 Isabelle/HOL nor will we attempt to formalise automata proofs from the
 literature, but take a different approach to regular languages than is
 usually taken. Instead of defining a regular language as one where there
 main purpose of this paper is to show that a central result about regular
 languages---the Myhill-Nerode theorem---can be recreated by only using
 regular expressions. This theorem gives necessary and sufficient conditions
 for when a language is regular. As a corollary of this theorem we can easily
 establish the usual closure properties, including complementation, for
-regular languages. We also use in one example the continuation lemma, which
+regular languages. We use the continuation lemma \cite{Rosenberg06},
-is based on Myhill-Nerode, for establishing non-regularity of languages
+which is also a corollary of the Myhill-Nerode theorem, for establishing
-\cite{Rosenberg06}.\medskip
+the non-regularity of the language @{text "a\<^isup>nb\<^isup>n"}.\medskip
 \noindent
 {\bf Contributions:} There is an extensive literature on regular languages.
 To the best of our knowledge, our proof of the Myhill-Nerode theorem is the first
 that is based on regular expressions only. The part of this theorem stating
 that finitely many partitions imply regularity of the language is proved by
 an argument about solving equational systems.  This argument appears to be
 folklore. For the other part, we give two proofs: one direct proof using
 certain tagging-functions, and another indirect proof using Antimirov's
 partial derivatives \cite{Antimirov95}. Again, to the best of our knowledge, the
-tagging-functions have not been used before to establish the Myhill-Nerode
+tagging-functions have not been used before for establishing the Myhill-Nerode
 theorem. Derivatives of regular expressions have been used recently quite
 widely in the literature; partial derivatives, in contrast, attract much
 less attention. However, partial derivatives are more suitable in the
 context of the Myhill-Nerode theorem, since it is easier to establish
 formally their finiteness result. We are not aware of any proof that uses
 Since the left-hand side is equal to @{text A}, we can use @{term "\<Uplus>rs"}
 as the regular expression that is needed in the theorem.
 \end{proof}
 \noindent
-Note that solving our equational system also gives us a method for
+Note that our algorithm for solving equational systems also provides a method for
-calculating the regular expression for a complement of a regular language:
+calculating a regular expression for the complement of a regular language:
-similar to the construction on automata, if we combine all regular
+if we combine all regular
 expressions corresponding to equivalence classes not in @{term "finals A"},
-we obtain a regular expression for the complement @{term "- A"}.
+then we obtain a regular expression for the complement language @{term "- A"}.
+This is similar to the usual construction of a `complement automaton'.
 *}
 equivalence classes. To show that there are only finitely many of them, it
 suffices to show in each induction step that another relation, say @{text
 R}, has finitely many equivalence classes and refines @{term "\<approx>(lang r)"}.
 \begin{dfntn}
-A relation @{text "R\<^isub>1"} is said to \emph{refine} @{text "R\<^isub>2"}
+A relation @{text "R\<^isub>1"} \emph{refines} @{text "R\<^isub>2"}
 provided @{text "R\<^isub>1 \<subseteq> R\<^isub>2"}.
 \end{dfntn}
 \noindent
 For constructing @{text R}, we will rely on some \emph{tagging-functions}
 %   & @{thm (rhs) Derivs_simps(3)[where ?s1.0="s\<^isub>1" and ?s2.0="s\<^isub>2"]}\\
 \end{tabular}}
 \end{equation}
 \noindent
-where @{text "\<Delta>"} in the fifth line is a function that tests whether the
+Note that in the last equation we use the list-cons operator written
-empty string is in the language and returns @{term "{[]}"} or @{term "{}"},
-accordingly.  Note also that in the last equation we use the list-cons operator written
 \mbox{@{text "_ :: _"}}.  The only interesting case is the case of @{term "A\<star>"}
 where we use Property~\ref{langprops}@{text "(i)"} in order to infer that
 @{term "Deriv c (A\<star>) = Deriv c (A \<cdot> A\<star>)"}. We can then complete the proof by
-using the fifth equation and noting that @{term "Delta A \<cdot> Deriv c (A\<star>) \<subseteq> (Deriv
+using the fifth equation and noting that @{term "Deriv c (A\<star>) \<subseteq> (Deriv
-c A) \<cdot> A\<star>"}.
+c A) \<cdot> A\<star>"} provided @{text "[] \<in> A"}.
 Brzozowski observed that the left-quotients for languages of
 regular expressions can be calculated directly using the notion of
 \emph{derivatives of a regular expression} \cite{Brzozowski64}. We define
 this notion in Isabelle/HOL as follows:
 \noindent
 These two properties confirm the observation made earlier
 that by using sets, partial derivatives have the @{text "ACI"}-identities
 of derivatives already built in.
-Antimirov also proved that for every language and regular expression
+Antimirov also proved that for every language and every regular expression
 there are only finitely many partial derivatives, whereby the set of partial
 derivatives of @{text r} w.r.t.~a language @{text A} is defined as
 \begin{equation}\label{Pdersdef}
 @{thm pderivs_lang_def}
 \end{center}
 \noindent
 Now the range of @{term "\<lambda>x. pderivs x r"} is a subset of @{term "Pow (pderivs_lang UNIV r)"},
 which we know is finite by Theorem~\ref{antimirov}. Consequently there
-are only finitely many equivalence classes of @{text "\<^raw:$\threesim$>\<^bsub>(\<lambda>x. pders x r)\<^esub>"},
+are only finitely many equivalence classes of @{text "\<^raw:$\threesim$>\<^bsub>(\<lambda>x. pders x r)\<^esub>"}.
-which refines @{term "\<approx>(lang r)"}, and therefore we can again conclude the
+This relation refines @{term "\<approx>(lang r)"}, and therefore we can again conclude the
 second part of the Myhill-Nerode theorem.
 \end{proof}
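The finiteness argument above can be illustrated with a small sketch. It is not part of the Isabelle/HOL development: the tuple encoding of regular expressions and the names `pderiv`, `pderivs`, `matches` and `reachable` are assumptions chosen for this illustration, loosely mirroring the notation of the text. The sketch computes Antimirov-style partial derivatives as sets and collects all partial derivatives reachable from a given regular expression over a fixed alphabet.

```python
from itertools import chain

# Hypothetical tuple encoding of regular expressions (illustration only).
ZERO = ("zero",)
ONE = ("one",)
def atom(c): return ("atom", c)
def plus(r, s): return ("plus", r, s)
def times(r, s): return ("times", r, s)
def star(r): return ("star", r)

def nullable(r):
    """True iff the empty string is in the language of r."""
    tag = r[0]
    if tag in ("one", "star"):
        return True
    if tag in ("zero", "atom"):
        return False
    if tag == "plus":
        return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])  # times

def pderiv(c, r):
    """Set of partial derivatives of r with respect to the character c."""
    tag = r[0]
    if tag in ("zero", "one"):
        return frozenset()
    if tag == "atom":
        return frozenset({ONE}) if r[1] == c else frozenset()
    if tag == "plus":
        return pderiv(c, r[1]) | pderiv(c, r[2])
    if tag == "times":
        left = frozenset(times(d, r[2]) for d in pderiv(c, r[1]))
        if nullable(r[1]):
            return left | pderiv(c, r[2])
        return left
    return frozenset(times(d, r) for d in pderiv(c, r[1]))  # star

def pderivs(s, r):
    """Partial derivatives of r with respect to the string s."""
    current = frozenset({r})
    for c in s:
        current = frozenset(chain.from_iterable(pderiv(c, d) for d in current))
    return current

def matches(r, s):
    """s is in the language of r iff some partial derivative is nullable."""
    return any(nullable(d) for d in pderivs(s, r))

def reachable(r, alphabet):
    """All partial derivatives of r with respect to strings over alphabet."""
    seen, todo = {r}, [r]
    while todo:
        d = todo.pop()
        for c in alphabet:
            for e in pderiv(c, d):
                if e not in seen:
                    seen.add(e)
                    todo.append(e)
    return seen
```

Since `reachable` terminates only when no new partial derivative appears, its termination on every input reflects the finiteness result of Theorem~\ref{antimirov}; the sketch itself proves nothing and merely makes the set in question concrete.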
 *}
 section {* Closure Properties of Regular Languages\label{closure} *}
 "Deriv_lang B A"} is regular. To see this consider the following argument
 using partial derivatives: From @{text A} being regular we know there exists
 a regular expression @{text r} such that @{term "A = lang r"}. We also know
 that @{term "pderivs_lang B r"} is finite for every language @{text B} and
 regular expression @{text r} (recall Theorem~\ref{antimirov}). By definition
-and \eqref{Derspders} therefore
+and \eqref{Derspders} we have
 \begin{equation}\label{eqq}
 @{term "Deriv_lang B (lang r) = (\<Union> lang ` (pderivs_lang B r))"}
 \end{equation}
 @{thm[mode=Rule] emb1[where as="x" and b="c" and bs="y"]}\hspace{10mm}
 @{thm[mode=Rule] emb2[where as="x" and a="c" and bs="y"]}
 \end{center}
 \noindent
-It can be easily proved that @{text "\<preceq>"} is a partial order. Now define the
+It is straightforward to prove that @{text "\<preceq>"} is a partial order. Now define the
 \emph{language of substrings} and \emph{superstrings} of a language @{text A}
 respectively as
 \begin{center}
 \begin{tabular}{l}
 \noindent
 We would like to establish
 \begin{lmm}\label{subseqreg}
-For every language @{text A}, the languages @{term "SUBSEQ A"} and @{term "SUPSEQ A"}
+For every language @{text A}, the languages @{text "(i)"} @{term "SUBSEQ A"} and
+@{text "(ii)"} @{term "SUPSEQ A"}
 are regular.
 \end{lmm}
 \noindent
-Our proof follows the one given in \cite[92--95]{Shallit08}, except that we use
+Our proof follows the one given in \cite[Pages 92--95]{Shallit08}, except that we use
 Higman's Lemma, which is already proved in the Isabelle/HOL library \cite{Berghofer03}.
 Higman's Lemma allows us to infer that every set @{text A} of antichains, satisfying
 \begin{equation}\label{higman}
 @{text "\<forall>x, y \<in> A."}~@{term "x \<noteq> y \<longrightarrow> \<not>(x \<preceq> y) \<and> \<not>(y \<preceq> x)"}
 \noindent
 is finite.
 The first step in our proof of Lemma~\ref{subseqreg} is to establish the
-following properties for @{term SUPSEQ}
+following simple properties for @{term SUPSEQ}
 \begin{equation}\label{supseqprops}
 \mbox{\begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
 @{thm (lhs) SUPSEQ_simps(1)} & @{text "\<equiv>"} & @{thm (rhs) SUPSEQ_simps(1)}\\
 @{thm (lhs) SUPSEQ_simps(2)} & @{text "\<equiv>"} & @{thm (rhs) SUPSEQ_simps(2)}\\
 By Higman's Lemma \eqref{higman} we know
 that @{term "M \<equiv> {x \<in> A. minimal x A}"} is finite, since every minimal element is incomparable,
 except with itself.
 It is also straightforward to show that @{term "SUPSEQ M \<subseteq> SUPSEQ A"}. For
 the other direction we have  @{term "x \<in> SUPSEQ A"}. From this we obtain
-a @{text y} such that @{term "y \<in> A"} and @{term "y \<preceq> x"}. Since we know that
+a @{text y} such that @{term "y \<in> A"} and @{term "y \<preceq> x"}. Since we have that
 the relation \mbox{@{term "{(y, x). y \<preceq> x \<and> x \<noteq> y}"}} is well-founded, there must
 be a minimal element @{text "z"} such that @{term "z \<in> A"} and @{term "z \<preceq> y"},
 and hence by transitivity also \mbox{@{term "z \<preceq> x"}} (here we deviate from the argument
 given in \cite{Shallit08}, because Isabelle/HOL already provides an extensive infrastructure
 for reasoning about well-foundedness). Since @{term "z"} is
 we established already that regularity is preserved under complement, also @{term "SUBSEQ A"}
 must be regular.
 \end{proof}
 Finally, we would like to show that the Myhill-Nerode theorem is also convenient for establishing
-non-regularity of languages. For this we use the following version of the Continuation
+the non-regularity of languages. For this we use the following version of the Continuation
 Lemma (see for example~\cite{Rosenberg06}).
 \begin{lmm}[Continuation Lemma]
-If the language @{text A} is regular and the set @{text B} is infinite,
+If a language @{text A} is regular and a set @{text B} is infinite,
 then there exist two distinct strings @{text x} and @{text y} in @{text B}
 such that @{term "x \<approx>A y"}.
 \end{lmm}
 \noindent
 This lemma can be easily deduced from the Myhill-Nerode theorem and the Pigeonhole
 Principle: Since @{text A} is regular, there can be only finitely many
-equivalence classes by the Myhill-Nerode relation. Hence an infinite set must contain
+equivalence classes. Hence an infinite set must contain
 at least two strings that are in the same equivalence class, that is
 they need to be related by the Myhill-Nerode relation.
 Using this lemma, it is straightforward to establish that the language
-\mbox{@{text "A \<equiv> \<Union>\<^isub>n a\<^sup>n @ b\<^sup>n"}}, where @{text "a\<^sup>n"} stands
+\mbox{@{text "A \<equiv> \<Union>\<^isub>n a\<^sup>n @ b\<^sup>n"}} is not regular (@{text "a\<^sup>n"} stands
-for the strings consisting of @{text n} times the character a, is not
+for the strings consisting of @{text n} times the character a; similarly for
-regular. For this consider the infinite set @{text "B \<equiv> \<Union>\<^isub>n a\<^sup>n"}.
+@{text "b\<^sup>n"}). For this consider the infinite set @{text "B \<equiv> \<Union>\<^isub>n a\<^sup>n"}.
 \begin{lmm}
 No two distinct strings in @{text "B"} are Myhill-Nerode related by @{text A}.
 \end{lmm}
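The idea behind this lemma can be checked on small instances with the following sketch. It is an illustration under stated assumptions, not the formal proof: the helper names `in_A` and `distinguishing_extension` are hypothetical, and strings are modelled as Python `str` values over the characters a and b. For two distinct strings of the form a^i and a^j, the extension b^i takes the first into A but not the second, so the two strings cannot be Myhill-Nerode related by A.

```python
def in_A(s: str) -> bool:
    """Membership test for A = { a^n b^n | n >= 0 }."""
    n = len(s) // 2
    return len(s) % 2 == 0 and s == "a" * n + "b" * n

def distinguishing_extension(i: int, j: int) -> str:
    """For i != j, return z such that a^i z is in A but a^j z is not."""
    if i == j:
        raise ValueError("the strings a^i and a^j must be distinct")
    return "b" * i

# Check the witness for all pairs of distinct exponents below a small bound.
for i in range(6):
    for j in range(6):
        if i != j:
            z = distinguishing_extension(i, j)
            assert in_A("a" * i + z) and not in_A("a" * j + z)
```

Together with the Continuation Lemma, such a witness for every pair of distinct strings in B is what rules out regularity of A.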
 (for example in terms of length) for a regular language, like there is for
 automata. On the other hand, efficient regular expression matching, without
 using automata, poses no problem \cite{OwensReppyTuron09}.  For an
 implementation of a simple regular expression matcher, whose correctness has
 been formally established, we refer the reader to Owens and Slind
-\cite{OwensSlind08}.
+\cite{OwensSlind08}. In our opinion, their formalisation is considerably
+slicker than for example the approach to regular expression matching taken
+in \cite{Harper99} and \cite{Yi06}.
 Our proof of the first direction is very much inspired by \emph{Brzozowski's
 algebraic method} used to convert a finite automaton to a regular expression
 \cite{Brzozowski64}. The close connection can be seen by considering the
 equivalence classes as the states of the minimal automaton for the regular
 our first proof based on tagging-functions is new for establishing the
 Myhill-Nerode theorem. All standard proofs of this direction proceed by
 arguments over automata.
 The indirect proof for the second direction arose from our interest in
-Brzozowski's derivatives for regular expression matching. A corresponding
+Brzozowski's derivatives for regular expression matching.  While Brzozowski
-regular expression matcher has been formalised by Owens and Slind in HOL4
+already established that there are only
-\cite{OwensSlind08}. In our opinion, their formalisation is considerably
-slicker than for example the approach to regular expression matching taken
-in \cite{Harper99} and \cite{Yi06}. While Brzozowski's derivatives lead to a
-simple regular expression matcher and he established that there are only
 finitely many dissimilar derivatives for every regular expression, this
 result is not as straightforward to formalise in a theorem prover as one
 might wish. The reason is that the set of dissimilar derivatives is not
 defined inductively, but in terms of an ACI-equivalence relation. This
 difficulty prevented, for example, Krauss and Nipkow from proving termination of
 their equivalence checker for regular expressions
 \cite{KraussNipkow11}. Their checker is based on Brzozowski's derivatives
 and for their argument the lack of a formal proof of termination is not
 crucial (it merely lets them ``sleep better'' \cite{KraussNipkow11}).  We
 expect that their development simplifies by using partial derivatives,
-instead of derivatives, and that termination of the algorithm can be
+instead of derivatives, and that the termination of the algorithm can be
-formally established (the main incredience is
+formally established (the main ingredient is
 Theorem~\ref{antimirov}). However, since partial derivatives use sets of
 regular expressions, one needs to carefully analyse whether the resulting
 algorithm is still executable. Given the existing infrastructure for
 executable sets in Isabelle/HOL \cite{Haftmann09}, it should be.
 While our formalisation might appear large, it should be seen
 in the context of the work done by Constable et al \cite{Constable00} who
 formalised the Myhill-Nerode theorem in Nuprl using automata. They write
 that their four-member team needed something on the magnitude of 18 months
-for their formalisation. Also, Filli\^atre reports that his formalisation in
+for their formalisation. It is hard to gauge the size of a
-Coq of automata theory and Kleene's theorem is ``rather big''.
+formalisation in Nuprl, but from what is shown in the Nuprl Math Library
-\cite{Filliatre97} More recently, Almeida et al reported about another
+about their development it seems substantially larger than ours. We attribute
+this to our use of regular expressions, which meant we did not need to `fight'
+the theorem prover.
+Also, Filli\^atre reports that his formalisation in
+Coq of automata theory and Kleene's theorem is ``rather big''
+\cite{Filliatre97}. More recently, Almeida et al reported on another
 formalisation of regular languages in Coq \cite{Almeidaetal10}. Their
 main result is the
 correctness of Mirkin's construction of an automaton from a regular
 expression using partial derivatives. This took approximately 10600 lines
-of code.  The estimate for our formalisation is that we
+of code.  In terms of time, the estimate for our formalisation is that we
 needed approximately 3 months and this included the time to find our proof
 arguments. Unlike Constable et al, who were able to follow the Myhill-Nerode
 proof from \cite{HopcroftUllman69}, we had to find our own arguments.  So for us the
-formalisation was not the bottleneck. It is hard to gauge the size of a
+formalisation was not the bottleneck.  The code of
-formalisation in Nurpl, but from what is shown in the Nuprl Math Library
-about their development it seems substantially larger than ours. We attribute
-this to our use of regular expressions, which meant we did not need to `fight'
-the theorem prover. The code of
 our formalisation can be found in the Archive of Formal Proofs at
-\mbox{\url{http://afp.sourceforge.net/devel-entries/Myhill-Nerode.shtml}} \cite{myhillnerodeafp11}.\medskip
+\mbox{\url{http://afp.sourceforge.net/devel-entries/Myhill-Nerode.shtml}}
+\cite{myhillnerodeafp11}.\medskip
 \noindent
 {\bf Acknowledgements:}
 We are grateful for the comments we received from Larry Paulson.  Tobias
 Nipkow made us aware of the properties in Lemma~\ref{subseqreg} and Tjark
-Weber helped us with their proofs.
+Weber helped us with proving them.
 *}
 (*<*)
 end

changeset 245	40b8d485ce8d
parent 242	093e45c44d91
child 247	087e6c255e33