regexp: comparison Journal/Paper.thy

equal deleted inserted replaced

-:48b231495281
+:2c56b20032a7
 An alternative, which provides us with a single type for states of automata,
 is to give every state node an identity, for example a natural number, and
 then be careful to rename these identities apart whenever connecting two
 automata. This results in clunky proofs establishing that properties are
 invariant under renaming. Similarly, connecting two automata represented as
-matrices results in very adhoc constructions, which are not pleasant to
+matrices results in messy constructions, which are not pleasant to
-reason about.
+formally reason about. \citeN[Page 67]{Braibant12}, for example, writes that there are no
+problems with reasoning about matrices, but that there is an
+``intrinsic difficulty of working with rectangular matrices'' in some
+parts of his formalisation of formal languages in Coq.
 Functions are much better supported in Isabelle/HOL, but they still lead to
 similar problems as with graphs.  Composing, for example, two
 non-deterministic automata in parallel requires also the formalisation of
 disjoint unions. \citeN{Nipkow98} dismisses for this the option of
 %condition upon functions in order to represent \emph{finite} automata. The
 %best is probably to resort to more advanced reasoning frameworks, such as
 %\emph{locales} or \emph{type classes}, which are \emph{not} available in all
 %HOL-based theorem provers.
-Because of these problems to do with representing automata, there seems to
+Because of these problems to do with representing automata, formalising
-be no substantial formalisation of automata theory and regular languages
+automata theory is surprisingly not as easy as one might think, despite the
-carried out in HOL-based theorem provers. \citeN{Nipkow98} establishes
+sometimes very detailed, but informal, textbook proofs. \citeN{LammichTuerk12}
+formalised Hopcroft's algorithm using an automata library of 27 kloc in
+Isabelle/HOL. There they use matrices for representing automata.
+Functions are used by \citeN{Nipkow98}, who establishes
 the link between regular expressions and automata in the context of
-lexing. \citeN{BerghoferReiter09} formalise automata
+lexing. \citeN{BerghoferReiter09} use functions as well for formalising automata
-working over bit strings in the context of Presburger arithmetic.  The only
+working over bit strings in the context of Presburger arithmetic.
-larger formalisations of automata theory are carried out in Nuprl by
+A Larger formalisation of automata theory, including the Myhill-Nerode theorem,
-\citeN{Constable00} and in Coq by \citeN{Filliatre97}, \citeN{Almeidaetal10}
+was carried out in Nuprl by \citeN{Constable00} which also uses functions.
-and \citeN{Braibant12}.
+Other large formailsations of automata theory in Coq are by
+\citeN{Filliatre97} who essentially uses graphs and by \citeN{Almeidaetal10}
+and \citeN{Braibant12}, who both use matrices. Many of these works,
+like \citeN{Nipkow98} or \citeN{Braibant12}, mention intrinsic problems with
+their representation of
+automata which made them `fight' their respective  theorem prover.
 %Also, one might consider automata as just convenient `vehicles' for
 %establishing properties about regular languages.  However, paper proofs
 %about automata often involve subtle side-conditions which are easily
 %overlooked, but which make formal reasoning rather painful. For example
 HOL. %Moreover, no side-conditions will be needed for regular expressions,
 %like we need for automata.
 This convenience of regular expressions has
 recently been exploited in HOL4 with a formalisation of regular expression
 matching based on derivatives by \citeN{OwensSlind08}, and with an equivalence
-checker for regular expressions in Isabelle/HOL by \citeN{KraussNipkow11}.  The
+checker for regular expressions in Isabelle/HOL by \citeN{KraussNipkow11}
+and in Matida by \citeN{Asperti12} and in Coq by \citeN{CoquandSiles12}.  The
 main purpose of this paper is to show that a central result about regular
 languages---the Myhill-Nerode Theorem---can be recreated by only using
 regular expressions. This theorem gives necessary and sufficient conditions
 for when a language is regular. As a corollary of this theorem we can easily
 establish the usual closure properties, including complementation, for
 hold, which shows that @{term "UNIV // \<approx>(lang r)"} must be finite.
 \end{proof}
 \noindent
 Much more interesting, however, are the inductive cases. They seem hard to be solved
-directly. The reader is invited to try.
+directly. The reader is invited to try.\footnote{The induction hypothesis is not strong enough
+to make any progress with the TIME and STAR cases.}
 In order to see how our proof proceeds consider the following suggestive picture
 given by \citeN{Constable00}:
 %
 \begin{equation}\label{pics}
 \end{center}
 \noindent
 and assume @{text B} is any language and @{text A} is regular, then @{term
 "Deriv_lang B A"} is regular. To see this consider the following argument
-using partial derivatives: From @{text A} being regular we know there exists
+using partial derivatives (which we used in Section~\ref{derivatives}): From @{text A}
+being regular we know there exists
 a regular expression @{text r} such that @{term "A = lang r"}. We also know
 that @{term "pderivs_lang B r"} is finite for every language @{text B} and
 regular expression @{text r} (recall Theorem~\ref{antimirov}). By definition
 and \eqref{Derspders} we have
 %
 holds. Now the first part of Theorem~\ref{subseqreg} is a simple consequence of the second part.
 \begin{proof}[of the First Part of Theorem~\ref{subseqreg}]
 By the second part, we know the right-hand side of \eqref{compl}
 is regular, which means @{term "- SUBSEQ A"} is regular. But since
-we established already that regularity is preserved under complement, also @{term "SUBSEQ A"}
+we established already that regularity is preserved under complement (using Myhill-Nerode),
-must be regular.
+also @{term "SUBSEQ A"} must be regular.
 \end{proof}
 Finally we like to show that the Myhill-Nerode Theorem is also convenient for establishing
 the non-regularity of languages. For this we use the following version of the Continuation
 Lemma (see for example~\cite{Rosenberg06}).
 Theorem~\ref{antimirov}). However, since partial derivatives use sets of
 regular expressions, one needs to carefully analyse whether the resulting
 algorithm is still executable. Given the infrastructure for
 executable sets  introduced by \citeN{Haftmann09} in Isabelle/HOL, it should.
+We started out by claiming that in a theorem prover it is eaiser to reason
+about regular expressions than about automta. Here are some numbers:
 Our formalisation of the Myhill-Nerode Theorem consists of 780 lines of
 Isabelle/Isar code for the first direction and 460 for the second (the one
 based on tagging-functions), plus around 300 lines of standard material
 about regular languages. The formalisation of derivatives and partial
 derivatives shown in Section~\ref{derivatives} consists of 390 lines of
 100 lines of code. The Continuation Lemma and the non-regularity of @{text "a\<^sup>n b\<^sup>n"}
 require 70 lines of code.
 The algorithm for solving equational systems, which we
 used in the first direction, is conceptually relatively simple. Still the
 use of sets over which the algorithm operates means it is not as easy to
-formalise as one might hope. However, it seems sets cannot be avoided since
+formalise as one might wish. However, it seems sets cannot be avoided since
 the `input' of the algorithm consists of equivalence classes and we cannot
 see how to reformulate the theory so that we can use lists or matrices. Lists would be
 much easier to reason about, since we can define functions over them by
 recursion. For sets we have to use set-comprehensions, which is slightly
 unwieldy. Matrices would allow us to use the slick formalisation by
 formalised the Myhill-Nerode Theorem in Nuprl using automata. They write
 that their four-member team would need something on the magnitude of 18 months
 for their formalisation of the first eleven chapters of the textbook by \citeN{HopcroftUllman69},
 which includes the Myhill-Nerode theorem. It is hard to gauge the size of a
 formalisation in Nurpl, but from what is shown in the Nuprl Math Library
-about their development it seems substantially larger than ours. We attribute
+about their development it seems \emph{substantially} larger than ours. We attribute
 this to our use of regular expressions, which meant we did not need to `fight'
-the theorem prover.
+the theorem prover. Recently, \citeN{LammichTuerk12} formalised Hopcroft's
+algorithm in Isabelle/HOL (in 7000 lines of code) using an automata
+library of 27000 lines of code.
 Also, \citeN{Filliatre97} reports that his formalisation in
 Coq of automata theory and Kleene's theorem is ``rather big''.
-More recently, \citeN{Almeidaetal10}  reported about another
+\citeN{Almeidaetal10}  reported about another
 formalisation of regular languages in Coq. Their
 main result is the
 correctness of Mirkin's construction of an automaton from a regular
 expression using partial derivatives. This took approximately 10600 lines
-of code.  Also \citeN{Braibant12} formalised a large part of regular language
+of code.  \citeN{Braibant12} formalised a large part of regular language
 theory and Kleene algebras in Coq. While he is mainly interested
 in implementing decision procedures for Kleene algebras, his library
 includes a proof of the Myhill-Nerode theorem. He reckons that our
 ``development is more concise'' than his one based on matrices \cite[Page 67]{Braibant12}.
+He writes that there is no conceptual problems with formally reasoning
+about matrices for automata, but notes ``intrinsic difficult[ies]'' when working
+with matrices in Coq, which is the sort of `fighting' one would encounter also
+in other theorem provers.
 In terms of time, the estimate for our formalisation is that we
 needed approximately 3 months and this included the time to find our proof
 arguments. Unlike \citeN{Constable00}, who were able to follow the Myhill-Nerode
 proof by \citeN{HopcroftUllman69}, we had to find our own arguments.  So for us the
 formalisation was not the bottleneck.  The code of
-our formalisation can be found in the Archive of Formal Proofs at
+our formalisation \cite{myhillnerodeafp11} can be found in the Archive of Formal Proofs at
-\mbox{\url{http://afp.sourceforge.net/entries/Myhill-Nerode.shtml}}
+\mbox{\url{http://afp.sourceforge.net/entries/Myhill-Nerode.shtml}}.\smallskip
-\cite{myhillnerodeafp11}.\smallskip
 \noindent
 {\bf Acknowledgements:}
 We are grateful for the comments we received from Larry Paulson.  Tobias
 Nipkow made us aware of the properties in Theorem~\ref{subseqreg} and Tjark

changeset 372	2c56b20032a7
parent 350	8ce9a432680b
child 374	01d223421ba0