regexp: comparison Journal/Paper.thy


-:2b414a8a7132
+:edc642266a82
 abbreviation "ZERO \<equiv> Zero"
 abbreviation "ONE \<equiv> One"
 abbreviation "ATOM \<equiv> Atom"
 abbreviation "PLUS \<equiv> Plus"
 abbreviation "TIMES \<equiv> Times"
+abbreviation "TIMESS \<equiv> Times_set"
 abbreviation "STAR \<equiv> Star"
 notation (latex output)
 str_eq_rel ("\<approx>\<^bsub>_\<^esub>") and
 text {*
 \noindent
 Regular languages are an important and well-understood subject in Computer
 Science, with many beautiful theorems and many useful algorithms. There is a
 wide range of textbooks on this subject, many of which are aimed at students
-and contain very detailed `pencil-and-paper' proofs
+and contain very detailed `pencil-and-paper' proofs (e.g.~\cite{Kozen97,
-(e.g.~\cite{Kozen97, HopcroftUllman69}). It seems natural to exercise theorem provers by
+HopcroftUllman69}). It seems natural to exercise theorem provers by
 formalising the theorems and by verifying formally the algorithms.  A
 popular choice for a theorem prover would be one based on Higher-Order Logic
-(HOL), for example HOL4, HOLlight and Isabelle/HOL. For our development
+(HOL), for example HOL4, HOLlight or Isabelle/HOL. For the development
-we will use the latter. One distinguishing feature of HOL is it's
+presented in this paper we will use the latter. HOL is a predicate calculus
-type system, which is based on Church's Simple Theory of Types \cite{Church40}.  The
+that allows quantification over predicate variables. Its type system is
-limitations of this type system are one of the underlying motivations for the
+based on Church's Simple Theory of Types \cite{Church40}.  Although
-work presented in this paper.
+many mathematical concepts can be conveniently expressed in HOL, there are some
+limitations that hurt badly if one attempts a simple-minded formalisation
-The typical approach to regular languages is to
+of regular languages in it.
-introduce finite automata and then define everything in terms of them.  For
-example, a regular language is normally defined as one whose strings are
+The typical approach to regular languages is to introduce finite automata
-recognised by a finite deterministic automaton. This approach has many
+and then define everything in terms of them \cite{Kozen97}.  For example,
-benefits. Among them is the fact that it is easy to convince oneself that
+a regular language is normally defined as:
-regular languages are closed under complementation: one just has to exchange
-the accepting and non-accepting states in the corresponding automaton to
+\begin{dfntn}\label{baddef}
-obtain an automaton for the complement language.  The problem, however, lies
+A language @{text A} is \emph{regular}, provided there is a
-with formalising such reasoning in a HOL-based theorem prover. Automata are
+finite deterministic automaton that recognises all strings of @{text "A"}.
-built up from states and transitions that need to be represented as graphs,
+\end{dfntn}
-matrices or functions, none of which can be defined as an inductive
-datatype.
+\noindent
+This approach has many benefits. Among them is the fact that it is easy to
+convince oneself that regular languages are closed under complementation:
+one just has to exchange the accepting and non-accepting states in the
+corresponding automaton to obtain an automaton for the complement language.
+The problem, however, lies with formalising such reasoning in a HOL-based
+theorem prover. Automata are built up from states and transitions that need
+to be represented as graphs, matrices or functions, none of which can be
+defined as an inductive datatype.
 In the case of graphs and matrices, this means we have to build our own
 reasoning infrastructure for them, as neither Isabelle/HOL nor HOL4 nor
 HOLlight supports them with libraries. Even worse, reasoning about graphs and
 matrices can be a real hassle in HOL-based theorem provers, because
 \noindent
 changes the type---the disjoint union is not a set, but a set of
 pairs. Using this definition for disjoint union means we do not have a
 single type for automata. As a result we will not be able to define a regular
 language as one for which there exists an automaton that recognises all its
-strings, since there is no type quantification available in HOL (unlike in Coq, for
+strings. This is because we cannot make a definition in HOL that is polymorphic in
-example).
+the state type, and there is no type quantification available in HOL (unlike
+in Coq, for example).
 An alternative, which provides us with a single type for automata, is to give every
 state node an identity, for example a natural
 number, and then be careful to rename these identities apart whenever
 connecting two automata. This results in clunky proofs
 lexing. Berghofer and Reiter \cite{BerghoferReiter09} formalise automata
 working over bit strings in the context of Presburger arithmetic.  The only
 larger formalisations of automata theory are carried out in Nuprl
 \cite{Constable00} and in Coq \cite{Filliatre97}.
-One might also consider the Myhill-Nerode theorem as well-worn stock
+One might also consider automata theory as well-worn stock material where
-material where everything is clear. However, paper proofs of this theorem
+everything is crystal clear. However, paper proofs about automata often
-often involve subtle side-conditions which are easily overlooked, but which
+involve subtle side-conditions which are easily overlooked, but which make
-make formal reasoning rather painful. For example Kozen's proof requires
+formal reasoning rather painful. For example Kozen's proof of the
-that the automata do not have inaccessible states \cite{Kozen97}. Another
+Myhill-Nerode theorem requires that the automata do not have inaccessible
-subtle side-condition is completeness of automata:
+states \cite{Kozen97}. Another subtle side-condition is completeness of
-automata need to have total transition functions and at most one `sink'
+automata: automata need to have total transition functions and at most one
-state from which there is no connection to a final state (Brozowski mentions
+`sink' state from which there is no connection to a final state (Brzozowski
-this side-condition in connection with state complexity
+mentions this side-condition in connection with state complexity
 \cite{Brozowski10}). Such side-conditions mean that if we define a regular
-language as one for which there exists \emph{any} finite automaton, then we
+language as one for which there exists \emph{a} finite automaton that
-need a lemma which ensures that another equivalent can be found satisfying the
+recognises all its strings (Def.~\ref{baddef}), then we need a lemma which
-side-condition. Unfortunately, such `little' and `obvious' lemmas make
+ensures that another equivalent can be found satisfying the
-formalisations of results in automata theory hair-pulling experiences.
+side-condition. Unfortunately, such `little' and `obvious' lemmas make
+formalisations of automata theory hair-pulling experiences.
 In this paper, we will not attempt to formalise automata theory in
 Isabelle/HOL nor will we attempt to formalise automata proofs from the
 literature, but take a different approach to regular languages than is
 usually taken. Instead of defining a regular language as one where there
 exists an automaton that recognises all strings of the language, we define a
 strings of @{text "A"}.
 \end{dfntn}
 \noindent
 The reason is that regular expressions, unlike graphs, matrices and
-functions, can be easily defined as an inductive datatype. No side-conditions
+functions, can be easily defined as an inductive datatype. A reasoning
-will be needed for regular expressions. Moreover, a reasoning infrastructure
+infrastructure (like induction and recursion) then comes for free in
-(like induction and recursion) comes for free in HOL-based theorem provers.
+HOL. Moreover, no side-conditions will be needed for regular expressions,
-This has recently been exploited in HOL4 with a formalisation of
+as are usually needed for automata. This convenience of regular expressions has
-regular expression matching based on derivatives \cite{OwensSlind08} and
+recently been exploited in HOL4 with a formalisation of regular expression
-with an equivalence checker for regular expressions in Isabelle/HOL
+matching based on derivatives \cite{OwensSlind08} and with an equivalence
-\cite{KraussNipkow11}.  The main purpose of this paper is to show that a central
+checker for regular expressions in Isabelle/HOL \cite{KraussNipkow11}.  The
-result about regular languages---the Myhill-Nerode theorem---can be
+main purpose of this paper is to show that a central result about regular
-recreated by only using regular expressions. This theorem gives necessary
+languages---the Myhill-Nerode theorem---can be recreated by only using
-and sufficient conditions for when a language is regular. As a corollary of
+regular expressions. This theorem gives necessary and sufficient conditions
-this theorem we can easily establish the usual closure properties, including
+for when a language is regular. As a corollary of this theorem we can easily
-complementation, for regular languages.\medskip
+establish the usual closure properties, including complementation, for
+regular languages.\medskip
 \noindent
-{\bf Contributions:} There is an extensive literature on regular
+{\bf Contributions:} There is an extensive literature on regular languages.
-languages.  To our best knowledge, our proof of the Myhill-Nerode theorem is
+To the best of our knowledge, our proof of the Myhill-Nerode theorem is the first
-the first that is based on regular expressions, only. The part of this
+that is based on regular expressions only. The part of this theorem stating
-theorem stating that finitely many partitions imply regularity of the
+that finitely many partitions imply regularity of the language is proved by
-language is proved by an argument about solving equational sytems.  This
+an argument about solving equational systems.  This argument appears to be
-argument appears to be folklore. For the other part, we give two proofs: one
+folklore. For the other part, we give two proofs: one direct proof using
-direct proof using certain tagging-functions, and another indirect proof
+certain tagging-functions, and another indirect proof using Antimirov's
-using Antimirov's partial derivatives \cite{Antimirov95}. Again to our best
+partial derivatives \cite{Antimirov95}. Again, to the best of our knowledge, the
-knowledge, the tagging-functions have not been used before to establish the
+tagging-functions have not been used before to establish the Myhill-Nerode
-Myhill-Nerode theorem. Derivatives of regular expressions have been used
+theorem. Derivatives of regular expressions have been used widely in the
-extensively in the literature, unlike partial derivatives. However, partial
+literature about regular expressions. However, partial derivatives are more
-derivatives are more suitable in the context of the Myhill-Nerode theorem,
+suitable in the context of the Myhill-Nerode theorem, since it is easier to
-since it is easier to establish formally their finiteness result.
+establish formally their finiteness result.
 *}
 section {* Preliminaries *}
 text {*
 \noindent
 Without the identification, we unfortunately obtain infinitely many
 derivations (an example is given in \cite[Page~141]{Sakarovitch09}).
 Reasoning modulo ACI can be done, but it is very painful in a theorem prover.
+Fortunately, there is a much simpler approach using partial
-in order to prove the second
+derivatives introduced by Antimirov \cite{Antimirov95}.
-direction of the Myhill-Nerode theorem. There he calculates the
-derivatives for regular expressions and shows that for every
+\begin{center}
-language there can be only finitely many of them %derivations (if
+\begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1.5mm}}l@ {}}
-regarded equal modulo ACI). We could have used as tagging-function
+@{thm (lhs) pder.simps(1)}  & @{text "\<equiv>"} & @{thm (rhs) pder.simps(1)}\\
-the set of derivatives of a regular expression with respect to a
+@{thm (lhs) pder.simps(2)}  & @{text "\<equiv>"} & @{thm (rhs) pder.simps(2)}\\
-language.  Using the fact that two strings are Myhill-Nerode related
+@{thm (lhs) pder.simps(3)[where c'="d"]}  & @{text "\<equiv>"} & @{thm (rhs) pder.simps(3)[where c'="d"]}\\
-whenever their derivative is the same, together with the fact that
+@{thm (lhs) pder.simps(4)[where ?r1.0="r\<^isub>1" and ?r2.0="r\<^isub>2"]}
-there are only finitely such derivatives would give us a similar
+& @{text "\<equiv>"} & @{thm (rhs) pder.simps(4)[where ?r1.0="r\<^isub>1" and ?r2.0="r\<^isub>2"]}\\
-argument as ours. However it seems not so easy to calculate the set
+@{thm (lhs) pder.simps(5)[where ?r1.0="r\<^isub>1" and ?r2.0="r\<^isub>2"]}
-of derivatives modulo ACI. Therefore we preferred our direct method
+& @{text "\<equiv>"}\\
-of using tagging-functions.
+\multicolumn{3}{@ {\hspace{20mm}}l@ {}}{@{thm (rhs) pder.simps(5)[where ?r1.0="r\<^isub>1" and ?r2.0="r\<^isub>2"]}}\\
+@{thm (lhs) pder.simps(6)}  & @{text "\<equiv>"} & @{thm (rhs) pder.simps(6)}\smallskip\\
-The problem of finiteness modulo ACI can be avoided by using partial
+@{thm (lhs) pders.simps(1)}  & @{text "\<equiv>"} & @{thm (rhs) pders.simps(1)}\\
-derivatives introduced by Antimirov \cite{Antimirov}.
+@{thm (lhs) pders.simps(2)}  & @{text "\<equiv>"} & @{thm (rhs) pders.simps(2)}\\
+\end{tabular}
+\end{center}
 *}
 section {* Closure Properties *}

changeset 175	edc642266a82
parent 174	2b414a8a7132
child 176	6969de1eb96b
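
Note on the partial-derivative table added in this changeset: the diff only shows the antiquotations (pder.simps, pders.simps), not the rendered equations. The following is a minimal Python sketch of what such a definition might look like, assuming the textbook form of Antimirov's partial derivatives. The helper names nullable, timess and matches are hypothetical; timess is only a guess at the role of Times_set, and none of this is taken from the theory file itself.

```python
from dataclasses import dataclass


# Regular expressions as an inductive datatype. The class names mirror the
# constructors Zero, One, Atom, Plus, Times, Star abbreviated in the theory.
@dataclass(frozen=True)
class Zero:
    pass


@dataclass(frozen=True)
class One:
    pass


@dataclass(frozen=True)
class Atom:
    c: str


@dataclass(frozen=True)
class Plus:
    r1: object
    r2: object


@dataclass(frozen=True)
class Times:
    r1: object
    r2: object


@dataclass(frozen=True)
class Star:
    r: object


def nullable(r):
    """True if r accepts the empty string (assumed helper)."""
    if isinstance(r, (One, Star)):
        return True
    if isinstance(r, (Zero, Atom)):
        return False
    if isinstance(r, Plus):
        return nullable(r.r1) or nullable(r.r2)
    return nullable(r.r1) and nullable(r.r2)  # Times


def timess(rs, r):
    """Append r to every expression in the set rs (assumed reading of Times_set)."""
    return frozenset(Times(q, r) for q in rs)


def pder(c, r):
    """Set of partial derivatives of r with respect to the character c."""
    if isinstance(r, (Zero, One)):
        return frozenset()
    if isinstance(r, Atom):
        return frozenset({One()}) if r.c == c else frozenset()
    if isinstance(r, Plus):
        return pder(c, r.r1) | pder(c, r.r2)
    if isinstance(r, Times):
        left = timess(pder(c, r.r1), r.r2)
        return left | pder(c, r.r2) if nullable(r.r1) else left
    return timess(pder(c, r.r), r)  # Star


def pders(s, r):
    """Partial derivatives with respect to a string, one character at a time."""
    rs = frozenset({r})
    for c in s:
        rs = frozenset(q for p in rs for q in pder(c, p))
    return rs


def matches(s, r):
    """s is in the language of r iff some partial derivative is nullable."""
    return any(nullable(q) for q in pders(s, r))
```

Because pder returns a set of expressions rather than a single expression, no identification modulo ACI is needed in this sketch; duplicates are absorbed by the set itself. For instance, with r = Times(Star(Plus(Atom("a"), Atom("b"))), Atom("a")), matches("ba", r) holds while matches("ab", r) does not.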