# HG changeset patch # User urbanc # Date 1315387693 0 # Node ID 40b8d485ce8d85099386f2127a63447a9dfc379c # Parent a9598a206c41418abf42341e109331a7acb82d11 polished a bit the journal paper diff -r a9598a206c41 -r 40b8d485ce8d Journal/Paper.thy --- a/Journal/Paper.thy Tue Sep 06 02:52:26 2011 +0000 +++ b/Journal/Paper.thy Wed Sep 07 09:28:13 2011 +0000 @@ -382,7 +382,7 @@ larger formalisations of automata theory are carried out in Nuprl \cite{Constable00} and in Coq, e.g.~\cite{Filliatre97,Almeidaetal10}. - Also one might consider that automata are convenient `vehicles' for + Also, one might consider automata as just convenient `vehicles' for establishing properties about regular languages. However, paper proofs about automata often involve subtle side-conditions which are easily overlooked, but which make formal reasoning rather painful. For example @@ -397,8 +397,8 @@ Definition~\ref{baddef}), then we need a lemma which ensures that another equivalent one can be found satisfying the side-condition, and also need to make sure our operations on automata preserve them. Unfortunately, such - `little' and `obvious' lemmas make a formalisation of automata theory a - hair-pulling experience. + `little' and `obvious' lemmas make formalisations of automata theory + hair-pulling experiences. In this paper, we will not attempt to formalise automata theory in Isabelle/HOL nor will we attempt to formalise automata proofs from the @@ -426,9 +426,9 @@ regular expressions. This theorem gives necessary and sufficient conditions for when a language is regular. As a corollary of this theorem we can easily establish the usual closure properties, including complementation, for - regular languages. We also use in one example the continuation lemma, which - is based on Myhill-Nerode, for establishing non-regularity of languages - \cite{Rosenberg06}.\medskip + regular languages. We use the continuation lemma \cite{Rosenberg06}, + which is also a corollary of the Myhill-Nerode theorem, for establishing + the non-regularity of the language @{text "a\<^isup>nb\<^isup>n"}.\medskip \noindent {\bf Contributions:} There is an extensive literature on regular languages. @@ -439,7 +439,7 @@ folklore. For the other part, we give two proofs: one direct proof using certain tagging-functions, and another indirect proof using Antimirov's partial derivatives \cite{Antimirov95}. Again to our best knowledge, the - tagging-functions have not been used before to establish the Myhill-Nerode + tagging-functions have not been used before for establishing the Myhill-Nerode theorem. Derivatives of regular expressions have been used recently quite widely in the literature; partial derivatives, in contrast, attract much less attention. However, partial derivatives are more suitable in the @@ -1166,11 +1166,12 @@ \end{proof} \noindent - Note that solving our equational system also gives us a method for - calculating the regular expression for a complement of a regular language: - similar to the construction on automata, if we combine all regular + Note that our algorithm for solving equational systems provides also a method for + calculating a regular expression for the complement of a regular language: + if we combine all regular expressions corresponding to equivalence classes not in @{term "finals A"}, - we obtain a regular expression for the complement @{term "- A"}. + then we obtain a regular expression for the complement language @{term "- A"}. + This is similar to the usual construction of a `complement automaton'. *} @@ -1252,7 +1253,7 @@ R}, has finitely many equivalence classes and refines @{term "\(lang r)"}. \begin{dfntn} - A relation @{text "R\<^isub>1"} is said to \emph{refine} @{text "R\<^isub>2"} + A relation @{text "R\<^isub>1"} \emph{refines} @{text "R\<^isub>2"} provided @{text "R\<^isub>1 \ R\<^isub>2"}. \end{dfntn} @@ -1698,14 +1699,12 @@ \end{equation} \noindent - where @{text "\"} in the fifth line is a function that tests whether the - empty string is in the language and returns @{term "{[]}"} or @{term "{}"}, - accordingly. Note also that in the last equation we use the list-cons operator written + Note that in the last equation we use the list-cons operator written \mbox{@{text "_ :: _"}}. The only interesting case is the case of @{term "A\"} where we use Property~\ref{langprops}@{text "(i)"} in order to infer that @{term "Deriv c (A\) = Deriv c (A \ A\)"}. We can then complete the proof by - using the fifth equation and noting that @{term "Delta A \ Deriv c (A\) \ (Deriv - c A) \ A\"}. + using the fifth equation and noting that @{term "Deriv c (A\) \ (Deriv + c A) \ A\"} provided @{text "[] \ A"}. Brzozowski observed that the left-quotients for languages of regular expressions can be calculated directly using the notion of @@ -1928,7 +1927,7 @@ that by using sets, partial derivatives have the @{text "ACI"}-identities of derivatives already built in. - Antimirov also proved that for every language and regular expression + Antimirov also proved that for every language and every regular expression there are only finitely many partial derivatives, whereby the set of partial derivatives of @{text r} w.r.t.~a language @{text A} is defined as @@ -2005,8 +2004,8 @@ \noindent Now the range of @{term "\x. pderivs x r"} is a subset of @{term "Pow (pderivs_lang UNIV r)"}, which we know is finite by Theorem~\ref{antimirov}. Consequently there - are only finitely many equivalence classes of @{text "\<^raw:$\threesim$>\<^bsub>(\x. pders x r)\<^esub>"}, - which refines @{term "\(lang r)"}, and therefore we can again conclude the + are only finitely many equivalence classes of @{text "\<^raw:$\threesim$>\<^bsub>(\x. pders x r)\<^esub>"}. + This relation refines @{term "\(lang r)"}, and therefore we can again conclude the second part of the Myhill-Nerode theorem. \end{proof} *} @@ -2099,7 +2098,7 @@ a regular expression @{text r} such that @{term "A = lang r"}. We also know that @{term "pderivs_lang B r"} is finite for every language @{text B} and regular expression @{text r} (recall Theorem~\ref{antimirov}). By definition - and \eqref{Derspders} therefore + and \eqref{Derspders} we have \begin{equation}\label{eqq} @@ -2126,7 +2125,7 @@ \end{center} \noindent - It can be easily proved that @{text "\"} is a partial order. Now define the + It is straightforward to prove that @{text "\"} is a partial order. Now define the \emph{language of substrings} and \emph{superstrings} of a language @{text A} respectively as @@ -2141,12 +2140,13 @@ We like to establish \begin{lmm}\label{subseqreg} - For every language @{text A}, the languages @{term "SUBSEQ A"} and @{term "SUPSEQ A"} + For every language @{text A}, the languages @{text "(i)"} @{term "SUBSEQ A"} and + @{text "(ii)"} @{term "SUPSEQ A"} are regular. \end{lmm} \noindent - Our proof follows the one given in \cite[92--95]{Shallit08}, except that we use + Our proof follows the one given in \cite[Pages 92--95]{Shallit08}, except that we use Higman's Lemma, which is already proved in the Isabelle/HOL library \cite{Berghofer03}. Higman's Lemma allows us to infer that every set @{text A} of antichains, satisfying @@ -2158,7 +2158,7 @@ is finite. The first step in our proof of Lemma~\ref{subseqreg} is to establish the - following properties for @{term SUPSEQ} + following simple properties for @{term SUPSEQ} \begin{equation}\label{supseqprops} \mbox{\begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l} @@ -2226,7 +2226,7 @@ except with itself. It is also straightforward to show that @{term "SUPSEQ M \ SUPSEQ A"}. For the other direction we have @{term "x \ SUPSEQ A"}. From this we obtain - a @{text y} such that @{term "y \ A"} and @{term "y \ x"}. Since we know that + a @{text y} such that @{term "y \ A"} and @{term "y \ x"}. Since we have that the relation \mbox{@{term "{(y, x). y \ x \ x \ y}"}} is well-founded, there must be a minimal element @{text "z"} such that @{term "z \ A"} and @{term "z \ y"}, and hence by transitivity also \mbox{@{term "z \ x"}} (here we deviate from the argument @@ -2265,11 +2265,11 @@ \end{proof} Finally we like to show that the Myhill-Nerode theorem is also convenient for establishing - non-regularity of languages. For this we use the following version of the Continuation + the non-regularity of languages. For this we use the following version of the Continuation Lemma (see for example~\cite{Rosenberg06}). \begin{lmm}[Continuation Lemma] - If the language @{text A} is regular and the set @{text B} is infinite, + If a language @{text A} is regular and a set @{text B} is infinite, then there exist two distinct strings @{text x} and @{text y} in @{text B} such that @{term "x \A y"}. \end{lmm} @@ -2277,14 +2277,14 @@ \noindent This lemma can be easily deduced from the Myhill-Nerode theorem and the Pigeonhole Principle: Since @{text A} is regular, there can be only finitely many - equivalence classes by the Myhill-Nerode relation. Hence an infinite set must contain + equivalence classes. Hence an infinite set must contain at least two strings that are in the same equivalence class, that is they need to be related by the Myhill-Nerode relation. Using this lemma, it is straightforward to establish that the language - \mbox{@{text "A \ \\<^isub>n a\<^sup>n @ b\<^sup>n"}}, where @{text "a\<^sup>n"} stands - for the strings consisting of @{text n} times the character a, is not - regular. For this consider the infinite set @{text "B \ \\<^isub>n a\<^sup>n"}. + \mbox{@{text "A \ \\<^isub>n a\<^sup>n @ b\<^sup>n"}} is not regular (@{text "a\<^sup>n"} stands + for the strings consisting of @{text n} times the character a; similarly for + @{text "b\<^isub>n"}). For this consider the infinite set @{text "B \ \\<^isub>n a\<^sup>n"}. \begin{lmm} No two distinct strings in @{text "B"} are Myhill-Nerode related by @{text A}. @@ -2373,7 +2373,9 @@ using automata, poses no problem \cite{OwensReppyTuron09}. For an implementation of a simple regular expression matcher, whose correctness has been formally established, we refer the reader to Owens and Slind - \cite{OwensSlind08}. + \cite{OwensSlind08}. In our opinion, their formalisation is considerably + slicker than for example the approach to regular expression matching taken + in \cite{Harper99} and \cite{Yi06}. Our proof of the first direction is very much inspired by \emph{Brzozowski's algebraic method} used to convert a finite automaton to a regular expression @@ -2402,12 +2404,8 @@ arguments over automata. The indirect proof for the second direction arose from our interest in - Brzozowski's derivatives for regular expression matching. A corresponding - regular expression matcher has been formalised by Owens and Slind in HOL4 - \cite{OwensSlind08}. In our opinion, their formalisation is considerably - slicker than for example the approach to regular expression matching taken - in \cite{Harper99} and \cite{Yi06}. While Brzozowski's derivatives lead to a - simple regular expression matcher and he established that there are only + Brzozowski's derivatives for regular expression matching. While Brzozowski + already established that there are only finitely many dissimilar derivatives for every regular expression, this result is not as straightforward to formalise in a theorem prover as one might wish. The reason is that the set of dissimilar derivatives is not @@ -2418,8 +2416,8 @@ and for their argument the lack of a formal proof of termination is not crucial (it merely lets them ``sleep better'' \cite{KraussNipkow11}). We expect that their development simplifies by using partial derivatives, - instead of derivatives, and that termination of the algorithm can be - formally established (the main incredience is + instead of derivatives, and that the termination of the algorithm can be + formally established (the main ingredient is Theorem~\ref{antimirov}). However, since partial derivatives use sets of regular expressions, one needs to carefully analyse whether the resulting algorithm is still executable. Given the existing infrastructure for @@ -2445,32 +2443,32 @@ in the context of the work done by Constable at al \cite{Constable00} who formalised the Myhill-Nerode theorem in Nuprl using automata. They write that their four-member team needed something on the magnitude of 18 months - for their formalisation. Also, Filli\^atre reports that his formalisation in - Coq of automata theory and Kleene's theorem is ``rather big''. - \cite{Filliatre97} More recently, Almeida et al reported about another + for their formalisation. It is hard to gauge the size of a + formalisation in Nurpl, but from what is shown in the Nuprl Math Library + about their development it seems substantially larger than ours. We attribute + this to our use of regular expressions, which meant we did not need to `fight' + the theorem prover. + Also, Filli\^atre reports that his formalisation in + Coq of automata theory and Kleene's theorem is ``rather big'' + \cite{Filliatre97}. More recently, Almeida et al reported about another formalisation of regular languages in Coq \cite{Almeidaetal10}. Their main result is the correctness of Mirkin's construction of an automaton from a regular expression using partial derivatives. This took approximately 10600 lines - of code. The estimate for our formalisation is that we + of code. In terms of time, the estimate for our formalisation is that we needed approximately 3 months and this included the time to find our proof arguments. Unlike Constable et al, who were able to follow the Myhill-Nerode proof from \cite{HopcroftUllman69}, we had to find our own arguments. So for us the - formalisation was not the bottleneck. It is hard to gauge the size of a - formalisation in Nurpl, but from what is shown in the Nuprl Math Library - about their development it seems substantially larger than ours. We attribute - this to our use of regular expressions, which meant we did not need to `fight' - the theorem prover. The code of + formalisation was not the bottleneck. The code of our formalisation can be found in the Archive of Formal Proofs at - \mbox{\url{http://afp.sourceforge.net/devel-entries/Myhill-Nerode.shtml}} \cite{myhillnerodeafp11}.\medskip + \mbox{\url{http://afp.sourceforge.net/devel-entries/Myhill-Nerode.shtml}} + \cite{myhillnerodeafp11}.\medskip \noindent {\bf Acknowledgements:} We are grateful for the comments we received from Larry Paulson. Tobias Nipkow made us aware of the properties in Lemma~\ref{subseqreg} and Tjark - Weber helped us with their proofs. - - + Weber helped us with proving them. *} diff -r a9598a206c41 -r 40b8d485ce8d Journal/document/root.bib --- a/Journal/document/root.bib Tue Sep 06 02:52:26 2011 +0000 +++ b/Journal/document/root.bib Wed Sep 07 09:28:13 2011 +0000 @@ -12,7 +12,7 @@ @article{Gasarch09, author = {S.~A.~Fenner and W.~I.~Gasarch and B.~Postow}, title = {{T}he {C}omplexity of {F}inding {SUBSEQ(A)}}, - journal = {Theory Computing Systems}, + journal = {Theory of Computing Systems}, volume = {45}, number = {3}, year = {2009}, diff -r a9598a206c41 -r 40b8d485ce8d Journal/document/root.tex --- a/Journal/document/root.tex Tue Sep 06 02:52:26 2011 +0000 +++ b/Journal/document/root.tex Wed Sep 07 09:28:13 2011 +0000 @@ -58,7 +58,8 @@ expressions can be defined conveniently as a datatype and a corresponding reasoning infrastructure comes for free. We show in this paper that a central result from formal language theory---the Myhill-Nerode theorem---can be -recreated using only regular expressions. +recreated using only regular expressions. From this theorem many closure +properties of regular languages follow. \end{abstract} \maketitle