regexp: comparison Journal/Paper.thy

equal deleted inserted replaced

-:087e6c255e33
+:47446f111550
 \noindent
 changes the type---the disjoint union is not a set, but a set of
 pairs. Using this definition for disjoint union means we do not have a
 single type for the states of automata. As a result we will not be able to
-define in our fomalisation a regular language as one for which there exists
+define a regular language as one for which there exists
 an automaton that recognises all its strings (Definition~\ref{baddef}). This
 is because we cannot make a definition in HOL that is only polymorphic in
 the state type, but not in the predicate for regularity; and there is no
 type quantification available in HOL (unlike in Coq, for
 example).\footnote{Slind already pointed out this problem in an email to the
 Also, one might consider automata as just convenient `vehicles' for
 establishing properties about regular languages.  However, paper proofs
 about automata often involve subtle side-conditions which are easily
 overlooked, but which make formal reasoning rather painful. For example
-Kozen's proof of the Myhill-Nerode theorem requires that automata do not
+Kozen's proof of the Myhill-Nerode Theorem requires that automata do not
 have inaccessible states \cite{Kozen97}. Another subtle side-condition is
 completeness of automata, that is automata need to have total transition
 functions and at most one `sink' state from which there is no connection to
 a final state (Brzozowski mentions this side-condition in the context of
 state complexity of automata \cite{Brzozowski10}). Such side-conditions mean
 like we need for automata. This convenience of regular expressions has
 recently been exploited in HOL4 with a formalisation of regular expression
 matching based on derivatives \cite{OwensSlind08} and with an equivalence
 checker for regular expressions in Isabelle/HOL \cite{KraussNipkow11}.  The
 main purpose of this paper is to show that a central result about regular
-languages---the Myhill-Nerode theorem---can be recreated by only using
+languages---the Myhill-Nerode Theorem---can be recreated by only using
 regular expressions. This theorem gives necessary and sufficient conditions
 for when a language is regular. As a corollary of this theorem we can easily
 establish the usual closure properties, including complementation, for
 regular languages. We use the continuation lemma \cite{Rosenberg06},
-which is also a corollary of the Myhill-Nerode theorem, for establishing
+which is also a corollary of the Myhill-Nerode Theorem, for establishing
 the non-regularity of the language @{text "a\<^isup>nb\<^isup>n"}.\medskip
 \noindent
 {\bf Contributions:} There is an extensive literature on regular languages.
-To our best knowledge, our proof of the Myhill-Nerode theorem is the first
+To our best knowledge, our proof of the Myhill-Nerode Theorem is the first
 that is based on regular expressions, only. The part of this theorem stating
 that finitely many partitions imply regularity of the language is proved by
 an argument about solving equational systems.  This argument appears to be
 folklore. For the other part, we give two proofs: one direct proof using
 certain tagging-functions, and another indirect proof using Antimirov's
 partial derivatives \cite{Antimirov95}. Again to our best knowledge, the
 tagging-functions have not been used before for establishing the Myhill-Nerode
-theorem. Derivatives of regular expressions have been used recently quite
+Theorem. Derivatives of regular expressions have been used recently quite
 widely in the literature; partial derivatives, in contrast, attract much
 less attention. However, partial derivatives are more suitable in the
-context of the Myhill-Nerode theorem, since it is easier to establish
+context of the Myhill-Nerode Theorem, since it is easier to establish
 formally their finiteness result. We are not aware of any proof that uses
-either of them for proving the Myhill-Nerode theorem.
+either of them for proving the Myhill-Nerode Theorem.
 *}
 section {* Preliminaries *}
 text {*
 section {* The Myhill-Nerode Theorem, First Part *}
 text {*
 \noindent
-The key definition in the Myhill-Nerode theorem is the
+The key definition in the Myhill-Nerode Theorem is the
-\emph{Myhill-Nerode relation}, which states that w.r.t.~a language two
+\emph{Myhill-Nerode Relation}, which states that w.r.t.~a language two
 strings are related, provided there is no distinguishing extension in this
 language. This can be defined as a tertiary relation.
 \begin{dfntn}[Myhill-Nerode Relation]\label{myhillneroderel}
 Given a language @{text A}, two strings @{text x} and
 @{text "X\<^isub>2 = {[c]}"}\\
 @{text "X\<^isub>3 = UNIV - {[], [c]}"}
 \end{tabular}
 \end{center}
-One direction of the Myhill-Nerode theorem establishes
+One direction of the Myhill-Nerode Theorem establishes
 that if there are finitely many equivalence classes, like in the example above, then
 the language is regular. In our setting we therefore have to show:
 \begin{thrm}\label{myhillnerodeone}
 @{thm[mode=IfThen] Myhill_Nerode1}
 is the equivalence class
 containing the empty string @{text "[]"}.\footnote{Note that we mark, roughly speaking, the
 single `initial' state in the equational system, which is different from
 the method by Brzozowski \cite{Brzozowski64}, where he marks the
 `terminal' states. We are forced to set up the equational system in our
-way, because the Myhill-Nerode relation determines the `direction' of the
+way, because the Myhill-Nerode Relation determines the `direction' of the
 transitions---the successor `state' of an equivalence class @{text Y} can
 be reached by adding a character to the end of @{text Y}. This is also the
 reason why we have to use our reversed version of Arden's Lemma.}
 In our running example we have the initial equational system
 With this we can conclude the proof.
 \end{proof}
 \noindent
 Lemma~\ref{every_eqcl_has_reg} allows us to finally give a proof for the first direction
-of the Myhill-Nerode theorem.
+of the Myhill-Nerode Theorem.
 \begin{proof}[Proof of Theorem~\ref{myhillnerodeone}]
 By Lemma~\ref{every_eqcl_has_reg} we know that there exists a regular expression for
 every equivalence class in @{term "UNIV // \<approx>A"}. Since @{text "finals A"} is
 a subset of  @{term "UNIV // \<approx>A"}, we also know that for every equivalence class
 section {* Myhill-Nerode, Second Part *}
 text {*
 \noindent
 In this section we will give a proof for establishing the second
-part of the Myhill-Nerode theorem. It can be formulated in our setting as follows:
+part of the Myhill-Nerode Theorem. It can be formulated in our setting as follows:
 \begin{thrm}\label{myhillnerodetwo}
 Given @{text "r"} is a regular expression, then @{thm Myhill_Nerode2}.
 \end{thrm}
 A relation @{text "R\<^isub>1"} \emph{refines} @{text "R\<^isub>2"}
 provided @{text "R\<^isub>1 \<subseteq> R\<^isub>2"}.
 \end{dfntn}
 \noindent
-For constructing @{text R}, will rely on some \emph{tagging-functions}
+For constructing @{text R}, we will rely on some \emph{tagging-functions}
 defined over strings. Given the inductive hypothesis, it will be easy to
 prove that the \emph{range} of these tagging-functions is finite. The range
 of a function @{text f} is defined as
 \begin{center}
 B"}. As shown in the pictures above, there are two cases to be
 considered. First, there exists a @{text "z\<^isub>p"} and @{text
 "z\<^isub>s"} such that @{term "x @ z\<^isub>p \<in> A"} and @{text "z\<^isub>s
 \<in> B"}.  By the assumption about @{term "tag_Times A B"} we have @{term "\<approx>A
 `` {x} = \<approx>A `` {y}"} and thus @{term "x \<approx>A y"}. Hence by the Myhill-Nerode
-relation @{term "y @ z\<^isub>p \<in> A"} holds. Using @{text "z\<^isub>s \<in> B"},
+Relation @{term "y @ z\<^isub>p \<in> A"} holds. Using @{text "z\<^isub>s \<in> B"},
 we can conclude in this case with @{term "y @ z \<in> A \<cdot> B"} (recall @{text
 "z\<^isub>p @ z\<^isub>s = z"}).
 Second there exists a partition @{text "x\<^isub>p"} and @{text "x\<^isub>s"} with
 @{text "x\<^isub>p \<in> A"} and @{text "x\<^isub>s @ z \<in> B"}. We therefore have
 \end{center}
 \noindent
 This means there must be a partition @{text "y\<^isub>p"} and @{text "y\<^isub>s"}
 such that @{term "y\<^isub>p \<in> A"} and @{term "\<approx>B `` {x\<^isub>s} = \<approx>B ``
-{y\<^isub>s}"}. Unfolding the Myhill-Nerode relation and together with the
+{y\<^isub>s}"}. Unfolding the Myhill-Nerode Relation and together with the
 facts that @{text "x\<^isub>p \<in> A"} and \mbox{@{text "x\<^isub>s @ z \<in> B"}}, we
 obtain @{term "y\<^isub>p \<in> A"} and @{text "y\<^isub>s @ z \<in> B"}, as needed in
 this case.  We again can complete the @{const TIMES}-case by setting @{text
 A} to @{term "lang r\<^isub>1"} and @{text B} to @{term "lang r\<^isub>2"}.
 \end{proof}
 \end{center}
 \noindent
 From this we know there exist a partition @{text "y\<^isub>p"} and @{text
 "y\<^isub>s"} with @{term "y\<^isub>p \<in> A\<star>"} and also @{term "x\<^isub>s \<approx>A
-y\<^isub>s"}. Unfolding the Myhill-Nerode relation we know @{term
+y\<^isub>s"}. Unfolding the Myhill-Nerode Relation we know @{term
 "y\<^isub>s @ z\<^isub>a \<in> A"}. We also know that @{term "z\<^isub>b \<in> A\<star>"}.
 Therefore @{term "y\<^isub>p @ (y\<^isub>s @ z\<^isub>a) @ z\<^isub>b \<in>
 A\<star>"}, which means @{term "y @ z \<in> A\<star>"}. The last step is to set
 @{text "A"} to @{term "lang r"} and thus complete the proof.
 \end{proof}
 section {* Second Part proved using Partial Derivatives\label{derivatives} *}
 text {*
 \noindent
 As we have seen in the previous section, in order to establish
-the second direction of the Myhill-Nerode theorem, it is sufficient to find
+the second direction of the Myhill-Nerode Theorem, it is sufficient to find
 a more refined relation than @{term "\<approx>(lang r)"} for which we can
 show that there are only finitely many equivalence classes. So far we
 showed this directly by induction on @{text "r"} using tagging-functions.
 However, there is also  an indirect method to come up with such a refined
 relation by using derivatives of regular expressions~\cite{Brzozowski64}.
 @{abbrev "Derivss s As"}
 \end{center}
 \noindent
 where we apply the left-quotient to a set of languages and then combine the results.
-Clearly we have the following equivalence between the Myhill-Nerode relation
+Clearly we have the following equivalence between the Myhill-Nerode Relation
 (Definition~\ref{myhillneroderel}) and left-quotients
 \begin{equation}\label{mhders}
 @{term "x \<approx>A y"} \hspace{4mm}\text{if and only if}\hspace{4mm} @{term "Derivs x A = Derivs y A"}
 \end{equation}
 @{thm Derivs_derivs}
 \end{tabular}}
 \end{equation}
 \noindent
-The importance of this fact in the context of the Myhill-Nerode theorem is that
+The importance of this fact in the context of the Myhill-Nerode Theorem is that
 we can use \eqref{mhders} and \eqref{Dersders} in order to
 establish that
 \begin{center}
 @{term "x \<approx>(lang r) y"} \hspace{4mm}if and only if\hspace{4mm}
 \end{equation}
 \noindent
 This means the right-hand side (seen as a relation) refines the Myhill-Nerode
-relation.  Consequently, we can use @{text
+Relation.  Consequently, we can use @{text
 "\<^raw:$\threesim$>\<^bsub>(\<lambda>x. ders x r)\<^esub>"} as a
 tagging-relation. However, in order to be useful for the second part of the
-Myhill-Nerode theorem, we have to be able to establish that for the
+Myhill-Nerode Theorem, we have to be able to establish that for the
 corresponding language there are only finitely many derivatives---thus
 ensuring that there are only finitely many equivalence
 classes. Unfortunately, this is not true in general. Sakarovitch gives an
 example where a regular expression has infinitely many derivatives
 w.r.t.~the language @{text "(ab)\<^isup>\<star> \<union> (ab)\<^isup>\<star>a"}, which is formally
 closely in our formalisation (only the last two cases of
 \eqref{pdersunivone} involve some non-routine induction arguments), we omit
 the details.
 Let us now return to our proof for the second direction in the Myhill-Nerode
-theorem. The point of the above calculations is to use
+Theorem. The point of the above calculations is to use
 @{text "\<^raw:$\threesim$>\<^bsub>(\<lambda>x. pders x r)\<^esub>"} as tagging-relation.
 \begin{proof}[Proof of Theorem~\ref{myhillnerodetwo} (second version)]
 Using \eqref{mhders}
 \noindent
 Now the range of @{term "\<lambda>x. pderivs x r"} is a subset of @{term "Pow (pderivs_lang UNIV r)"},
 which we know is finite by Theorem~\ref{antimirov}. Consequently there
 are only finitely many equivalence classes of @{text "\<^raw:$\threesim$>\<^bsub>(\<lambda>x. pders x r)\<^esub>"}.
 This relation refines @{term "\<approx>(lang r)"}, and therefore we can again conclude the
-second part of the Myhill-Nerode theorem.
+second part of the Myhill-Nerode Theorem.
 \end{proof}
 *}
 section {* Closure Properties of Regular Languages\label{closure} *}
 operations. Closure under union, concatenation and Kleene-star are trivial
 to establish given our definition of regularity (recall Definition~\ref{regular}).
 More interesting in our setting is the closure under complement, because it seems difficult
 to construct a regular expression for the complement language by direct
 means. However the existence of such a regular expression can now be easily
-proved using both parts of the Myhill-Nerode theorem, since
+proved using both parts of the Myhill-Nerode Theorem, since
 \begin{center}
 @{term "s\<^isub>1 \<approx>A s\<^isub>2"} if and only if @{term "s\<^isub>1 \<approx>(-A) s\<^isub>2"}
 \end{center}
 \end{thrm}
 \noindent
 Our proof follows the one given in \cite[Pages 92--95]{Shallit08}, except that we use
 Higman's Lemma, which is already proved in the Isabelle/HOL library
-\cite{Berghofer03}.\footnote{Unfortunately, Berghofer's formalisation of Higman's lemma
+\cite{Berghofer03}.\footnote{Unfortunately, Berghofer's formalisation of Higman's Lemma
 is restricted to 2-letter alphabets,
 which means also our formalisation of Theorem~\ref{subseqreg} is `tainted' with
 this constraint. However our methodology is applicable to any alphabet of finite size.}
 Higman's Lemma allows us to infer that every set @{text A} of antichains, satisfying
 is regular, which means @{term "- SUBSEQ A"} is regular. But since
 we established already that regularity is preserved under complement, also @{term "SUBSEQ A"}
 must be regular.
 \end{proof}
-Finally we like to show that the Myhill-Nerode theorem is also convenient for establishing
+Finally we like to show that the Myhill-Nerode Theorem is also convenient for establishing
 the non-regularity of languages. For this we use the following version of the Continuation
 Lemma (see for example~\cite{Rosenberg06}).
 \begin{lmm}[Continuation Lemma]
 If a language @{text A} is regular and a set @{text B} is infinite,
 then there exist two distinct strings @{text x} and @{text y} in @{text B}
 such that @{term "x \<approx>A y"}.
 \end{lmm}
 \noindent
-This lemma can be easily deduced from the Myhill-Nerode theorem and the Pigeonhole
+This lemma can be easily deduced from the Myhill-Nerode Theorem and the Pigeonhole
 Principle: Since @{text A} is regular, there can be only finitely many
 equivalence classes. Hence an infinite set must contain
 at least two strings that are in the same equivalence class, that is
-they need to be related by the Myhill-Nerode relation.
+they need to be related by the Myhill-Nerode Relation.
 Using this lemma, it is straightforward to establish that the language
 \mbox{@{text "A \<equiv> \<Union>\<^isub>n a\<^sup>n @ b\<^sup>n"}} is not regular (@{text "a\<^sup>n"} stands
 for the strings consisting of @{text n} times the character a; similarly for
 @{text "b\<^isup>n"}). For this consider the infinite set @{text "B \<equiv> \<Union>\<^isub>n a\<^sup>n"}.
 the string on the right-hand side satisfies this property, but not the one on
 the left-hand side. Therefore the strings cannot be equal and we have a contradiction.
 \end{proof}
 \noindent
-To conclude, this lemma and the Continuation Lemma leads to a contradiction assuming @{text A}
+To conclude the proof on non-regularity of language @{text A}, the
-is regular. Therefore the language @{text A} is not regular, as we wanted to show.
+Continuation Lemma and the lemma above lead to a contradiction assuming
+@{text A} is regular. Therefore the language @{text A} is not regular, as we
+wanted to show.
 *}
 section {* Conclusion and Related Work *}
 In this paper we took the view that a regular language is one where there
 exists a regular expression that matches all of its strings. Regular
 expressions can conveniently be defined as a datatype in HOL-based theorem
 provers. For us it was therefore interesting to find out how far we can push
 this point of view. We have established in Isabelle/HOL both directions
-of the Myhill-Nerode theorem.
+of the Myhill-Nerode Theorem.
 %
 \begin{thrm}[The Myhill-Nerode Theorem]\mbox{}\\
 A language @{text A} is regular if and only if @{thm (rhs) Myhill_Nerode}.
 \end{thrm}
 \noindent
 Having formalised this theorem means we pushed our point of view quite
 far. Using this theorem we can obviously prove when a language is \emph{not}
 regular---by establishing that it has infinitely many equivalence classes
-generated by the Myhill-Nerode relation (this is usually the purpose of the
+generated by the Myhill-Nerode Relation (this is usually the purpose of the
 Pumping Lemma \cite{Kozen97}).  We can also use it to establish the standard
 textbook results about closure properties of regular languages. Interesting
 is the case of closure under complement, because it seems difficult to
 construct a regular expression for the complement language by direct
 means. However the existence of such a regular expression can be easily
-proved using the Myhill-Nerode theorem.
+proved using the Myhill-Nerode Theorem.
-Our insistence on regular expressions for proving the Myhill-Nerode theorem
+Our insistence on regular expressions for proving the Myhill-Nerode Theorem
-arose from the limitations of HOL, used in the popular theorem provers HOL4,
+arose from the limitations of HOL, which is the logic underlying the popular theorem provers HOL4,
 HOLlight and Isabelle/HOL. In order to guarantee consistency,
 formalisations in HOL can only extend the logic with definitions that introduce a new concept in
 terms of already existing notions. A convenient definition for automata
 (based on graphs) uses a polymorphic type for the state nodes. This allows
 us to use the standard operation for disjoint union whenever we need to compose two
 automata. Unfortunately, we cannot use such a polymorphic definition
 in HOL as part of the definition for regularity of a language (a predicate
-over set of strings).  Consider for example the following attempt:
+over sets of strings).  Consider for example the following attempt:
 \begin{center}
 @{text "is_regular A \<equiv> \<exists>M(\<alpha>). is_dfa (M) \<and> \<calL>(M) = A"}
 \end{center}
 \noindent
 In this definifion, the definiens is polymorphic in the type of the automata
-@{text "M"} (indicated by dependency on @{text "\<alpha>"}), but the definiendum
+@{text "M"} (indicated by dependency on the type-variable @{text "\<alpha>"}), but the definiendum
-@{text "is_regular"} is not. Such definitions are excluded in HOL, because
+@{text "is_regular"} is not. Such definitions are excluded from HOL, because
 they can lead easily to inconsistencies (see \cite{PittsHOL4} for a simple
 example). Also HOL does not contain type-quantifiers which would allow us to
 get rid of the polymorphism by quantifying over the type-variable @{text
 "\<alpha>"}. Therefore when defining regularity in terms of automata, the only
 natural way out in HOL is to resort to state nodes with an identity, for
 unpleasant. Regular expressions proved much more convenient for reasoning in
 HOL since they can be defined as inductive datatype and a reasoning
 infrastructure comes for free. The definition of regularity in terms of
 regular expressions poses no problem at all for HOL.  We showed in this
 paper that they can be used for establishing the central result in regular
-language theory---the Myhill-Nerode theorem.
+language theory---the Myhill-Nerode Theorem.
 While regular expressions are convenient, they have some limitations. One is
 that there seems to be no method of calculating a minimal regular expression
 (for example in terms of length) for a regular language, like there is for
 automata. On the other hand, efficient regular expression matching, without
 Our proof of the first direction is very much inspired by \emph{Brzozowski's
 algebraic method} used to convert a finite automaton to a regular expression
 \cite{Brzozowski64}. The close connection can be seen by considering the
 equivalence classes as the states of the minimal automaton for the regular
 language.  However there are some subtle differences. Because our equivalence
-classes (or correspondingly states) arise from the Myhil-Nerode Relation, the most natural
+classes (or correspondingly states) arise from the Myhill-Nerode Relation, the most natural
 choice is to characterise each state with the set of strings starting from
 the initial state leading up to that state. Usually, however, the states are
 characterised as the strings starting from that state leading to the
 terminal states.  The first choice has consequences about how the initial
 equational system is set up. We have the $\lambda$-term on our `initial
 state', while Brzozowski has it on the terminal states. This means we also
 need to reverse the direction of Arden's Lemma. We have not found anything
 in the literature about our way of proving the first direction of the
-Myhill-Nerode theorem, but it appears to be folklore.
+Myhill-Nerode Theorem, but it appears to be folklore.
 We presented two proofs for the second direction of the Myhill-Nerode
-theorem. One direct proof using tagging-functions and another using partial
+Theorem. One direct proof using tagging-functions and another using partial
 derivatives. This part of our work is where our method using regular
 expressions shines, because we can completely side-step the standard
 argument \cite{Kozen97} where automata need to be composed. However, it is
 also the direction where we had to spend most of the `conceptual' time, as
 our first proof based on tagging-functions is new for establishing the
-Myhill-Nerode theorem. All standard proofs of this direction proceed by
+Myhill-Nerode Theorem. All standard proofs of this direction proceed by
 arguments over automata.
 The indirect proof for the second direction arose from our interest in
 Brzozowski's derivatives for regular expression matching.  While Brzozowski
 already established that there are only
 Theorem~\ref{antimirov}). However, since partial derivatives use sets of
 regular expressions, one needs to carefully analyse whether the resulting
 algorithm is still executable. Given the existing infrastructure for
 executable sets in Isabelle/HOL \cite{Haftmann09}, it should.
-Our formalisation of the Myhill-Nerode theorem consists of 780 lines of
+Our formalisation of the Myhill-Nerode Theorem consists of 780 lines of
 Isabelle/Isar code for the first direction and 460 for the second (the one
 based on tagging-functions), plus around 300 lines of standard material
 about regular languages. The formalisation of derivatives and partial
 derivatives shown in Section~\ref{derivatives} consists of 390 lines of
 code.  The closure properties in Section~\ref{closure} (except Theorem~\ref{subseqreg})
 The algorithm for solving equational systems, which we
 used in the first direction, is conceptually relatively simple. Still the
 use of sets over which the algorithm operates means it is not as easy to
 formalise as one might hope. However, it seems sets cannot be avoided since
 the `input' of the algorithm consists of equivalence classes and we cannot
-see how to reformulate the theory so that we can use lists. Lists would be
+see how to reformulate the theory so that we can use lists or matrices. Lists would be
 much easier to reason about, since we can define functions over them by
 recursion. For sets we have to use set-comprehensions, which is slightly
-unwieldy.
+unwieldy. Matrices would allow us to use the slick formalisation by
+Nipkow of the Gauss-Jordan algorithm \cite{Nipkow11}.
 While our formalisation might appear large, it should be seen
 in the context of the work done by Constable at al \cite{Constable00} who
-formalised the Myhill-Nerode theorem in Nuprl using automata. They write
+formalised the Myhill-Nerode Theorem in Nuprl using automata. They write
 that their four-member team needed something on the magnitude of 18 months
 for their formalisation. It is hard to gauge the size of a
 formalisation in Nurpl, but from what is shown in the Nuprl Math Library
 about their development it seems substantially larger than ours. We attribute
 this to our use of regular expressions, which meant we did not need to `fight'

changeset 248	47446f111550
parent 247	087e6c255e33
child 249	061b32d78471