Paper/Paper.thy
changeset 71 426070e68b21
parent 70 8ab3a06577cf
child 75 d63baacbdb16
  Suc ("_+1" [100] 100) and
  quotient ("_ \<^raw:\ensuremath{\!\sslash\!}> _" [90, 90] 90) and
  REL ("\<approx>") and
  UPLUS ("_ \<^raw:\ensuremath{\uplus}> _" [90, 90] 90) and
  L ("L '(_')" [0] 101) and
  EClass ("\<lbrakk>_\<rbrakk>\<^bsub>_\<^esub>" [100, 100] 100) and
  transition ("_ \<^raw:\ensuremath{\stackrel{\text{>_\<^raw:}}{\Longrightarrow}}> _" [100, 100, 100] 100)
(*>*)


section {* Introduction *}

  There is however a problem: the typical approach to regular languages is to
  introduce finite automata and then define everything in terms of them.  For
  example, a regular language is normally defined as one whose strings are
  recognised by a finite deterministic automaton. This approach has many
  benefits. Among them is the fact that it is easy to convince oneself that
  regular languages are closed under complementation: one just has to exchange
  the accepting and non-accepting states in the corresponding automaton to
  obtain an automaton for the complement language.  The problem, however, lies with
  formalising such reasoning in a HOL-based theorem prover, in our case
  Isabelle/HOL. Automata are built up from states and transitions that 

  Because of these problems with representing automata, there seems
  to be no substantial formalisation of automata theory and regular languages 
  carried out in a HOL-based theorem prover. We are only aware of the 
  large formalisation of automata theory in Nuprl \cite{Constable00} and 
  some smaller formalisations in Coq (for example \cite{Filliatre97}).

  In this paper, we will not attempt to formalise automata theory, but take a completely 
  different approach to regular languages. Instead of defining a regular language as one 
  where there exists an automaton that recognises all strings of the language, we define 
  a regular language as:
  strings of @{text "A"}.
  \end{definition}

  \noindent
  The reason is that regular expressions, unlike graphs and matrices, can
  be easily defined as an inductive datatype. Consequently a corresponding reasoning 
  infrastructure comes for free. This has recently been exploited in HOL4 with a formalisation
  of regular expression matching based on derivatives \cite{OwensSlind08}.  The purpose of this paper is to
  show that a central result about regular languages---the Myhill-Nerode theorem---can 
  be recreated using only regular expressions. This theorem gives necessary
  and sufficient conditions for when a language is regular. As a corollary of this
  theorem we can easily establish the usual closure properties, including 
  complementation, for regular languages.\smallskip

  \noindent
  {\bf Contributions:} To our knowledge, our proof of the Myhill-Nerode theorem is the

text {*
  Strings in Isabelle/HOL are lists of characters with the \emph{empty string}
  being represented by the empty list, written @{term "[]"}. \emph{Languages}
  are sets of strings. The language containing all strings is written in
  Isabelle/HOL as @{term "UNIV::string set"}. The concatenation of two languages 
  is written @{term "A ;; B"} and a language raised to the power $n$ is written 
  @{term "A \<up> n"}. Their definitions are

  \begin{center}
  @{thm Seq_def[THEN eq_reflection, where A1="A" and B1="B"]}
  \hspace{7mm}
  @{thm pow.simps(1)[THEN eq_reflection, where A1="A"]}
  @{thm pow.simps(2)[THEN eq_reflection, where A1="A" and n1="n"]}
  \end{center}

  \noindent
  where @{text "@"} is the usual list-append operation. The Kleene-star of a language @{text A}
  is defined as the union over all powers, namely @{thm Star_def}. In the paper
  we will often make use of the following properties.

  \begin{proposition}\label{langprops}\mbox{}\\
  \begin{tabular}{@ {}ll@ {\hspace{10mm}}ll}
  (i)   & @{thm star_cases}      & (ii)  & @{thm[mode=IfThen] pow_length}\\
  (iii) & @{thm seq_Union_left}  & 
  \end{tabular}
  \end{proposition}

  \noindent
  We omit the proofs of these properties, but invite the reader to consult
  our formalisation.\footnote{Available at ???}

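  \noindent
  Spelled out, the three operations could be defined along the following lines.
  This is only a sketch: the constant names @{text Seq}, @{text pow} and @{text Star}
  are the ones referred to by the antiquotations above, but the concrete types and
  syntax annotations may differ in our formalisation.

    (* sketch of the language operations; not the literal text of the formalisation *)
    definition Seq :: "string set \<Rightarrow> string set \<Rightarrow> string set" (infixr ";;" 100)
      where "A ;; B = {s1 @ s2 | s1 s2. s1 \<in> A \<and> s2 \<in> B}"

    fun pow :: "string set \<Rightarrow> nat \<Rightarrow> string set" (infixl "\<up>" 100)
      where "A \<up> 0 = {[]}"
          | "A \<up> (Suc n) = A ;; (A \<up> n)"

    definition Star :: "string set \<Rightarrow> string set" ("_\<star>" [101] 102)
      where "A\<star> = (\<Union>n. A \<up> n)"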
       
  The notation for the quotient of a language @{text A} according to an 
  equivalence relation @{term REL} is @{term "A // REL"}. We will write 
  @{text "\<lbrakk>x\<rbrakk>\<^isub>\<approx>"} for the equivalence class defined 
  as @{text "{y | y \<approx> x}"}.

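  \noindent
  The quotient @{term "A // REL"} is the one provided by Isabelle/HOL's library of
  equivalence relations; assuming the library's @{text quotient_def}, it unfolds as
  in the following sketch.

    (* sketch: A // REL collects the REL-image, i.e. the equivalence class,
       of every element of A *)
    lemma "A // REL = (\<Union>x \<in> A. {REL `` {x}})"
      by (simp add: quotient_def)
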
  Central to our proof will be the solution of equational systems
  involving regular expressions. For this we will use Arden's lemma \cite{}
  which solves equations of the form @{term "X = A ;; X \<union> B"} provided
  @{term "[] \<notin> A"}. However we will need the following ``reverse'' 
  version of Arden's lemma.

  \begin{lemma}[Reverse Arden's Lemma]\mbox{}\\
  If @{thm (prem 1) ardens_revised} then
  @{thm (lhs) ardens_revised} has the unique solution
  @{thm (rhs) ardens_revised}.
  \end{lemma}
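
  \noindent
  As a formal statement this might be rendered as follows; the lemma name
  @{text ardens_revised} is the one used in the antiquotations above, while the
  exact formulation given here is a sketch.

    (* sketch: the equation X = X ;; A \<union> B has B ;; A\<star> as its only solution *)
    lemma ardens_revised:
      assumes "[] \<notin> A"
      shows "(X = X ;; A \<union> B) \<longleftrightarrow> (X = B ;; A\<star>)"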

  \begin{proof}
  For the right-to-left direction we assume @{thm (rhs) ardens_revised} and show
  that @{thm (lhs) ardens_revised} holds. From Prop.~\ref{langprops}$(i)$ 
  we have @{term "A\<star> = {[]} \<union> A ;; A\<star>"},
  which is equal to @{term "A\<star> = {[]} \<union> A\<star> ;; A"}. Concatenating @{text B} on the left of both 
  sides gives @{term "B ;; A\<star> = B ;; ({[]} \<union> A\<star> ;; A)"}, whose right-hand side
  is equal to @{term "(B ;; A\<star>) ;; A \<union> B"}. This completes this direction. 

  For the other direction we assume @{thm (lhs) ardens_revised}. By a simple induction
  @{text "(*)"}\hspace{5mm} @{thm (concl) ardens_helper}
  \end{center}

  \noindent
  Using this property we can show that @{term "B ;; (A \<up> n) \<subseteq> X"} holds for
  all @{text n}. From this we can infer @{term "B ;; A\<star> \<subseteq> X"} using the definition
  of @{text "\<star>"}.
  For the inclusion in the other direction we assume a string @{text s}
  with length @{text k} that is an element of @{text X}. Since @{thm (prem 1) ardens_revised}
  we know by Prop.~\ref{langprops}$(ii)$ that 
  @{term "s \<notin> X ;; (A \<up> Suc k)"} since its length is only @{text k}
  (the strings in @{term "X ;; (A \<up> Suc k)"} are all longer). 
  From @{text "(*)"} it then follows that
  @{term s} must be an element of @{term "(\<Union>m\<in>{0..k}. B ;; (A \<up> m))"}. This in turn
  implies that @{term s} is in @{term "(\<Union>n. B ;; (A \<up> n))"}. Using Prop.~\ref{langprops}$(iii)$ 
  this is equal to @{term "B ;; A\<star>"}, as we needed to show.\qed
  \end{proof}

  \noindent
  Regular expressions are defined as the following inductive datatype

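  \noindent
  For concreteness, one conventional way of setting up such a datatype is sketched
  below; the constructor names are only indicative and need not coincide with the
  ones used in our formalisation.

    (* sketch: a standard datatype of regular expressions *)
    datatype rexp =
      NULL
    | EMPTY
    | CHAR char
    | SEQ rexp rexp
    | ALT rexp rexp
    | STAR rexp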
text {*
  \begin{definition}[Myhill-Nerode Relation]\mbox{}\\
  @{thm str_eq_rel_def[simplified]}
  \end{definition}

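  \noindent
  Unfolded, the definition above relates two strings whenever no suffix can
  distinguish them with respect to membership in @{text A}. As a sketch (the
  concrete syntax of @{text str_eq_rel} may differ in the formalisation):

    (* sketch: x and y are related iff every extension z treats them the same *)
    definition str_eq_rel :: "string set \<Rightarrow> (string \<times> string) set" ("\<approx>_" [100] 100)
      where "\<approx>A = {(x, y). \<forall>z. (x @ z \<in> A) = (y @ z \<in> A)}"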
  \noindent
  It is easy to see that @{term "\<approx>A"} is an equivalence relation, which partitions
  the set of all strings, @{text "UNIV"}, into a set of equivalence classes. We define
  the set @{term "finals A"} as those equivalence classes that contain strings of 
  @{text A}, namely

  \begin{equation} 
  @{thm finals_def}
  \end{equation}

  \noindent
  It is easy to show that @{thm lang_is_union_of_finals} holds. We can also define 
  a notion of \emph{transition} between equivalence classes as

  \begin{equation} 
  @{thm transition_def}
  \end{equation}

  \noindent
  which means that if we add the character @{text c} to all strings in the equivalence
  class @{text Y}, we obtain a set of strings that is contained in the equivalence class @{text X}.

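  \noindent
  A sketch of the two definitions referred to above is given below; the bodies
  reflect our reading of @{text finals_def} and @{text transition_def} and may be
  phrased differently in the formalisation.

    (* sketch: the equivalence classes of \<approx>A that contain at least one string of A *)
    definition finals :: "string set \<Rightarrow> string set set"
      where "finals A = {\<approx>A `` {s} | s. s \<in> A}"

    (* sketch: appending the character c to every string in Y stays inside X *)
    definition transition :: "string set \<Rightarrow> char \<Rightarrow> string set \<Rightarrow> bool"
      where "transition Y c X = (Y ;; {[c]} \<subseteq> X)"
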
  \begin{theorem}
  Given a language @{text A}.
  @{thm[mode=IfThen] hard_direction}
  \end{theorem}