diff -r 5bbe63876f84 -r fa8d33d13cb6 Journal/Paper.thy --- a/Journal/Paper.thy Fri Aug 12 17:08:58 2011 +0000 +++ b/Journal/Paper.thy Mon Aug 15 21:09:08 2011 +0000 @@ -183,7 +183,6 @@ we have to be able to combine automata. Consider for example the operation of sequencing two automata, say $A_1$ and $A_2$, by connecting the accepting states of $A_1$ to the initial state of $A_2$: - % \begin{center} \begin{tabular}{ccc} @@ -252,12 +251,14 @@ \noindent changes the type---the disjoint union is not a set, but a set of pairs. Using this definition for disjoint union means we do not have a - single type for automata. As a result we will not be able to define a regular - language as one for which there exists an automaton that recognises all its - strings (Definition~\ref{baddef}). This is because we cannot make a definition in HOL that is polymorphic in - the state type and there is no type quantification available in HOL (unlike - in Coq, for example).\footnote{Slind already pointed out this problem in an email - to the HOL4 mailing list on 21st April 2005.} + single type for automata. As a result we will not be able to define a + regular language as one for which there exists an automaton that recognises + all its strings (Definition~\ref{baddef}). This is because we cannot make a + definition in HOL that is polymorphic in the state type and there is no type + quantification available in HOL (unlike in Coq, for example).\footnote{Slind + already pointed out this problem in an email to the HOL4 mailing list on + 21st April 2005.} + An alternative, which provides us with a single type for automata, is to give every state node an identity, for example a natural @@ -1888,7 +1889,7 @@ @{text "\<^raw:$\threesim$>\<^bsub>(\x. ders x r)\<^esub>"} as tagging-relation. - \begin{proof}[Proof of Theorem~\ref{myhillnerodetwo}] + \begin{proof}[Proof of Theorem~\ref{myhillnerodetwo} (second version)] Using \eqref{mhders} and \eqref{Derspders} we can easily infer that @@ -1916,18 +1917,18 @@ \end{proof} *} -section {* Closure Properties *} +section {* Closure Properties of Regular Languages *} text {* \noindent - The real beauty of regular languages is that they are closed - under almost all set operations. Closure under union, concatenation and Kleene-star - are trivial to establish given our definition of regularity (Definition~\ref{regular}). - More interesting is the closure under complement, because - it seems difficult to construct a regular expression for the complement - language by direct means. However the existence of such a regular expression - can now be easily proved using the Myhill-Nerode theorem since - + The beauty of regular languages is that they are closed under many set + operations. Closure under union, concatenation and Kleene-star are trivial + to establish given our definition of regularity (recall Definition~\ref{regular}). + More interesting is the closure under complement, because it seems difficult + to construct a regular expression for the complement language by direct + means. However the existence of such a regular expression can now be easily + proved using the Myhill-Nerode theorem since + \begin{center} @{term "s\<^isub>1 \A s\<^isub>2"} if and only if @{term "s\<^isub>1 \(-A) s\<^isub>2"} \end{center} @@ -1935,8 +1936,13 @@ \noindent holds for any strings @{text "s\<^isub>1"} and @{text "s\<^isub>2"}. Therefore @{text A} and the complement language @{term "-A"} - give rise to the same partitions. So if one is finite, the other is too and the - other way around. + give rise to the same partitions. So if one is finite, the other is too, and + the other way around. Proving the existence of such a regular expression via + automata using the standard method would be quite involved. It includes the + steps: regular expression @{text "\"} non-deterministic automaton @{text + "\"} deterministic automaton @{text "\"} complement automaton @{text "\"} + regular expression. Clearly not something you want to formalise in a theorem + prover in which it is cumbersome to reason about automata. Once closure under complement is established, closure under intersection and set difference is also easy, because @@ -1949,15 +1955,15 @@ \end{center} \noindent - Closure of regular languages under reversal, which means + Closure of regular languages under reversal, that is \begin{center} @{text "A\<^bsup>-1\<^esup> \ {s\<^bsup>-1\<^esup> | s \ A}"} \end{center} \noindent - can be shown with the help of the following operation defined on regular - expressions + can be shown with the help of the following operation defined recursively over + regular expressions \begin{center} \begin{tabular}{r@ {\hspace{1mm}}c@ {\hspace{1mm}}l} @@ -1973,38 +1979,42 @@ \end{center} \noindent - For this operation we can so + For this operation we can show \begin{center} @{text "(\(r))\<^bsup>-1\<^esup>"}~@{text "="}~@{thm (rhs) rev_lang} \end{center} \noindent - from which closure under reversal follows. + from which closure under reversal of regular languages follows. - The perhaps the most surprising fact is that regular languages are closed under any - left-quotient. Define + A perhaps surprising fact is that regular languages are closed under any + left-quotient. Define \begin{center} @{abbrev "Ders_lang B A"} \end{center} \noindent - and assume @{text A} is regular. From this we know there exists a regular - expression @{text r} such that @{term "A = lang r"}. We also know that - @{term "pders_lang B r"} is finite. By definition and Lemma~\ref{Derspders} + and assume @{text B} is any language and @{text A} is regular, then @{term + "Ders_lang B A"} is regular. To see this consider the following argument + using partial derivatives: From @{text A} being regular we know there exists + a regular expression @{text r} such that @{term "A = lang r"}. We also know + that @{term "pders_lang B r"} is finite for every language @{text B} and + regular expression @{text r} (recall Theorem~\ref{antimirov}). By definition + and Lemma~\ref{Derspders} therefore + \begin{equation}\label{eqq} @{term "Ders_lang B (lang r) = (\ lang ` (pders_lang B r))"} \end{equation} \noindent - Since there are only finitely many regular expressions in @{term "pders_lang B r"} - by Theorem~\ref{antimirov}, we know that the right-hand side of \eqref{eqq}, is - equal to @{term "lang (\(pders_lang B r))"} using \eqref{uplus}. Hence - the regular expression @{term "pders_lang B r"} verifies that @{term "Ders_lang B A"} - is regular. - + Since there are only finitely many regular expressions in @{term "pders_lang + B r"}, we know by \eqref{uplus} that there exists a regular expression that + the right-hand side of \eqref{eqq} is equal to \mbox{@{term "lang (\(pders_lang B + r))"}}. Thus the regular expression @{term "\(pders_lang B r)"} verifies that + @{term "Ders_lang B A"} is regular. *} @@ -2033,29 +2043,61 @@ is the case of closure under complement, because it seems difficult to construct a regular expression for the complement language by direct means. However the existence of such a regular expression can be easily - proved using the Myhill-Nerode theorem. Proving the existence of such a - regular expression via automata using the standard method would be quite - involved. It includes the steps: regular expression @{text "\"} - non-deterministic automaton @{text "\"} deterministic automaton @{text "\"} - complement automaton @{text "\"} regular expression. + proved using the Myhill-Nerode theorem. + + Our insistence on regular expressions for proving the Myhill-Nerode theorem + arose from the limitations of HOL on which the popular theorem provers HOL4, + HOLlight and Isabelle/HOL are based. In order to guarantee consistency, + formalisations can only extend HOL by definitions that introduce a notion in + terms of already existing concepts. A convenient definition for automata + (based on graphs) use a polymorphic type for the state nodes. This allows us + to use the standard operation of disjoint union in order to compose two + automata. But we cannot use such a polymorphic definition of automata in HOL + as part of the definition for regularity of a language (a set of strings). + Consider the following attempt - While regular expressions are convenient in formalisations, they have some - limitations. One is that there seems to be no method of calculating a - minimal regular expression (for example in terms of length) for a regular - language, like there is - for automata. On the other hand, efficient regular expression matching, - without using automata, poses no problem \cite{OwensReppyTuron09}. - For an implementation of a simple regular expression matcher, - whose correctness has been formally established, we refer the reader to - Owens and Slind \cite{OwensSlind08}. + \begin{center} + @{text "is_regular A \ \M(\). is_finite_automata (M) \ \(M) = A"} + \end{center} + \noindent + which means the definiens is polymorphic in the type of the automata @{text + "M"}, but the definiendum @{text "is_regular"} is not. Such definitions are + excluded in HOL, because they lead easily to inconsistencies (see + \cite{PittsHOL4} for a simple example). Also HOL does not contain + type-quantifiers which would allow us to get rid of the polymorphism by + quantifying over the type-variable @{text "\"}. Therefore when defining + regularity in terms of automata, the only natural way out in HOL is to use state + nodes with an identity, for example a natural number. Unfortunatly, the + consequence is that we have to be careful when combining two automata so + that there is no clash between two such states. This makes formalisations + quite fiddly and unpleasant. Regular expressions proved much more convenient + for reasoning in HOL and we showed they can be used for establishing the + Myhill-Nerode theorem. + + While regular expressions are convenient, they have some limitations. One is + that there seems to be no method of calculating a minimal regular expression + (for example in terms of length) for a regular language, like there is for + automata. On the other hand, efficient regular expression matching, without + using automata, poses no problem \cite{OwensReppyTuron09}. For an + implementation of a simple regular expression matcher, whose correctness has + been formally established, we refer the reader to Owens and Slind + \cite{OwensSlind08}. Our formalisation consists of 780 lines of Isabelle/Isar code for the first direction and 460 for the second, plus around 300 lines of standard material - about regular languages. The formalisation about derivatives and partial - derivatives shown in Section~\ref{derivatives} consists of 390 lines of code. - While this might be seen large, it should be seen + about regular languages. The formalisation about derivatives and partial + derivatives shown in Section~\ref{derivatives} consists of 390 lines of + code. The algorithm for solving equational systems, which we used in the + first direction, is conceptually not that complicated. Still the use of sets + over which the algorithm operates, means it is not as easy to formalise as + one might wish. It seems sets cannot be avoided since the `input' of the + algorithm consists of equivalence classes and we cannot see how to + reformulate the theory so that we can use lists, which are usually easier to + reason about in a theorem prover. + + While our formalisation might be seen large, it should be seen in the context of the work done by Constable at al \cite{Constable00} who formalised the Myhill-Nerode theorem in Nuprl using automata. They write that their four-member team needed something on the magnitude of 18 months @@ -2069,29 +2111,46 @@ ours can be found in the Mercurial Repository at \mbox{\url{http://www4.in.tum.de/~urbanc/regexp.html}}. + Our proof of the first direction is very much inspired by \emph{Brzozowski's + algebraic method} used to convert a finite automaton to a regular expression + \cite{Brzozowski64}. The close connection can be seen by considering the + equivalence classes as the states of the minimal automaton for the regular + language. However there are some subtle differences. Since we identify + equivalence classes with the states of the automaton, then the most natural + choice is to characterise each state with the set of strings starting from + the initial state leading up to that state. Usually, however, the states are + characterised as the strings starting from that state leading to the + terminal states. The first choice has consequences about how the initial + equational system is set up. We have the $\lambda$-term on our `initial + state', while Brzozowski has it on the terminal states. This means we also + need to reverse the direction of Arden's Lemma. We have not found anything + in the literature about this way of proving the first direction of the + Myhill-Nerode theorem, but it appears to be folklore. - Our proof of the first direction is very much inspired by \emph{Brzozowski's - algebraic method} used to convert a finite automaton to a regular - expression \cite{Brzozowski64}. The close connection can be seen by considering the equivalence - classes as the states of the minimal automaton for the regular language. - However there are some subtle differences. Since we identify equivalence - classes with the states of the automaton, then the most natural choice is to - characterise each state with the set of strings starting from the initial - state leading up to that state. Usually, however, the states are characterised as the - strings starting from that state leading to the terminal states. The first - choice has consequences about how the initial equational system is set up. We have - the $\lambda$-term on our `initial state', while Brzozowski has it on the - terminal states. This means we also need to reverse the direction of Arden's - Lemma. + We presented two proofs for the second direction of the Myhill-Nerode + theorem. One direct proof using tagging-functions and another using partial + derivatives. These proofs is where our method shines, because we can + completely side-step the standard argument \cite{Kozen97} where automata + need to be composed. However, it is also the direction where we had to spend + most of the `conceptual' time, as our first proof based on + tagging-functions is new for establishing the Myhill-Nerode theorem. All + standard proofs of this direction proceed by arguments over automata. + + Our indirect proof for the second direction arose from the interest in + Brzozowski's derivatives for regular expression matching. A corresponding + regular expression matcher has been formalised in HOL4 in + \cite{OwensSlind08}. In our opinion, this formalisation is considerably + slicker than for example the approach to regular expression matching taken + in \cite{Harper99} and \cite{Yi06}. While Brzozowski's derivatives lead to + simple regular expression matchers and he proved that there are only + finitely many dissimilar derivatives for every regular expression, this + result is not as straightforward to formalise in a theorem prover. The + reason is that the set of dissimilar derivatives is not defined inductively, + but in terms of an ACI-equivalence relation. - This is also where our method shines, because we can completely - side-step the standard argument \cite{Kozen97} where automata need - to be composed, which as stated in the Introduction is not so easy - to formalise in a HOL-based theorem prover. However, it is also the - direction where we had to spend most of the `conceptual' time, as - our proof-argument based on tagging-functions is new for - establishing the Myhill-Nerode theorem. All standard proofs of this - direction proceed by arguments over automata.\medskip + + + \medskip We expect that the development of Krauss \& Nipkow gets easier by using partial derivatives.\medskip