regexp: comparison Journal/Paper.thy


-:48b452a2d4df
+:b73478aaf33e
 \noindent
 changes the type---the disjoint union is not a set, but a set of
 pairs. Using this definition for disjoint union means we do not have a
 single type for automata. As a result we will not be able to define a regular
 language as one for which there exists an automaton that recognises all its
-strings. This is because we cannot make a definition in HOL that is polymorphic in
+strings (Def.~\ref{baddef}). This is because we cannot make a definition in HOL that is polymorphic in
 the state type and there is no type quantification available in HOL (unlike
 in Coq, for example).\footnote{Slind already pointed out this problem in an email
 to the HOL4 mailing list on 21st April 2005.}
 An alternative, which provides us with a single type for automata, is to give every
 lexing. Berghofer and Reiter \cite{BerghoferReiter09} formalise automata
 working over bit strings in the context of Presburger arithmetic.  The only
 larger formalisations of automata theory are carried out in Nuprl
 \cite{Constable00} and in Coq \cite{Filliatre97}.
-Also one might consider automata theory as a well-worn stock subject where
+Also one might consider automata theory and regular languages as a well-worn
-everything is crystal clear. However, paper proofs about automata often
+stock subject where everything is crystal clear. However, paper proofs about
-involve subtle side-conditions which are easily overlooked, but which make
+automata often involve subtle side-conditions which are easily overlooked,
-formal reasoning rather painful. For example Kozen's proof of the
+but which make formal reasoning rather painful. For example Kozen's proof of
-Myhill-Nerode theorem requires that automata do not have inaccessible
+the Myhill-Nerode theorem requires that automata do not have inaccessible
 states \cite{Kozen97}. Another subtle side-condition is completeness of
-automata, that is automata need to have total transition functions and at most one
+automata, that is automata need to have total transition functions and at
-`sink' state from which there is no connection to a final state (Brzozowski
+most one `sink' state from which there is no connection to a final state
-mentions this side-condition in the context of state complexity
+(Brzozowski mentions this side-condition in the context of state complexity
-of automata \cite{Brzozowski10}). Such side-conditions mean that if we define a regular
+of automata \cite{Brzozowski10}). Such side-conditions mean that if we
-language as one for which there exists \emph{a} finite automaton that
+define a regular language as one for which there exists \emph{a} finite
-recognises all its strings (see Def.~\ref{baddef}), then we need a lemma which
+automaton that recognises all its strings (see Def.~\ref{baddef}), then we
-ensures that another equivalent one can be found satisfying the
+need a lemma which ensures that another equivalent one can be found
-side-condition. Unfortunately, such `little' and `obvious' lemmas make
+satisfying the side-condition. Unfortunately, such `little' and `obvious'
-a formalisation of automata theory a hair-pulling experience.
+lemmas make a formalisation of automata theory a hair-pulling experience.
 In this paper, we will not attempt to formalise automata theory in
 Isabelle/HOL nor will we attempt to formalise automata proofs from the
 literature, but take a different approach to regular languages than is
 folklore. For the other part, we give two proofs: one direct proof using
 certain tagging-functions, and another indirect proof using Antimirov's
 partial derivatives \cite{Antimirov95}. Again to our best knowledge, the
 tagging-functions have not been used before to establish the Myhill-Nerode
 theorem. Derivatives of regular expressions have been used recently quite
-widely in the literature; partial derivatives, in contrast, attracted much
+widely in the literature; partial derivatives, in contrast, attract much
 less attention. However, partial derivatives are more suitable in the
 context of the Myhill-Nerode theorem, since it is easier to establish
-formally their finiteness result. We have not found any proof that uses
+formally their finiteness result. We are not aware of any proof that uses
-either of them in order to prove the Myhill-Nerode theorem.
+either of them for proving the Myhill-Nerode theorem.
 *}
 section {* Preliminaries *}
 text {*
 \end{prpstn}
 \noindent
 In @{text "(ii)"} we use the notation @{term "length s"} for the length of a
 string; this property states that if \mbox{@{term "[] \<notin> A"}} then the lengths of
-the strings in @{term "A \<up> (Suc n)"} must be longer than @{text n}.  We omit
+the strings in @{term "A \<up> (Suc n)"} must be longer than @{text n}.
+Property @{text "(iv)"} states that a non-empty string in @{term "A\<star>"} can
+always be split up into a non-empty prefix belonging to @{text "A"} and the
+rest being in @{term "A\<star>"}. We omit
 the proofs for these properties, but invite the reader to consult our
 formalisation.\footnote{Available at \url{http://www4.in.tum.de/~urbanc/regexp.html}}
 The notation in Isabelle/HOL for the quotient of a language @{text A}
 according to an equivalence relation @{term REL} is @{term "A // REL"}. We
 \mbox{@{thm (lhs) folds_alt_simp} @{text "= \<Union> (\<calL> ` rs)"}}
 \end{equation}
 \noindent
 holds, whereby @{text "\<calL> ` rs"} stands for the
-image of the set @{text rs} under function @{text "\<calL>"}.
+image of the set @{text rs} under function @{text "\<calL>"} defined as
+\begin{center}
+@{term "lang ` rs \<equiv> {lang r | r. r \<in> rs}"}
+\end{center}
+\noindent
+In what follows we shall use this convenient short-hand notation for images of sets
+also with other functions.
 *}
 section {* The Myhill-Nerode Theorem, First Part *}
 @{term UNIV} & @{term "UNIV // (\<approx>(lang r))"} & @{term "UNIV // R"}
 \end{tabular}}
 \end{equation}
 \noindent
-The relation @{term "\<approx>(lang r)"} partitions the set of all strings into some
+The relation @{term "\<approx>(lang r)"} partitions the set of all strings, @{term UNIV}, into some
 equivalence classes. To show that there are only finitely many of them, it
 suffices to show in each induction step that another relation, say @{text
 R}, has finitely many equivalence classes and refines @{term "\<approx>(lang r)"}.
 \begin{dfntn}
 The @{const TIMES}-case is slightly more complicated. We first prove the
 following lemma, which will aid the proof about refinement.
 \begin{lmm}\label{refinement}
 The relation @{text "\<^raw:$\threesim$>\<^bsub>tag\<^esub>"} refines @{term "\<approx>A"}, provided for
-all strings @{text x}, @{text y} and @{text z} we have \mbox{@{text "x \<^raw:$\threesim$>\<^bsub>tag\<^esub> y"}}
+all strings @{text x}, @{text y} and @{text z} we have that \mbox{@{text "x \<^raw:$\threesim$>\<^bsub>tag\<^esub> y"}}
 and @{term "x @ z \<in> A"} imply @{text "y @ z \<in> A"}.
 \end{lmm}
 \noindent
 We therefore can analyse how the strings @{text "x @ z"} are in the language
 @{text A} and then construct an appropriate tagging-function to infer that
-@{term "y @ z"} are also in @{text A}.  For this we sill need the notion of
+@{term "y @ z"} are also in @{text A}.  For this we will use the notion of
-the set of all possible \emph{partitions} of a string
+the set of all possible \emph{partitions} of a string:
 \begin{equation}
 @{thm Partitions_def}
 \end{equation}
 \noindent
 If we know that @{text "(x\<^isub>p, x\<^isub>s) \<in> Partitions x"}, we will
 refer to @{text "x\<^isub>p"} as the \emph{prefix} of the string @{text x},
-respectively to @{text "x\<^isub>s"} as the \emph{suffix}.
+and to @{text "x\<^isub>s"} as the \emph{suffix}.
 Now assuming @{term "x @ z \<in> A \<cdot> B"}, there are only two possible ways to `split'
 this string to be in @{term "A \<cdot> B"}:
 %
 \begin{center}
 @{text "\<lbrakk>x\<^isub>s\<rbrakk>\<^bsub>\<approx>A\<^esub> \<in> {\<lbrakk>y\<^isub>s\<rbrakk>\<^bsub>\<approx>A\<^esub> | y\<^bsub>p\<^esub> < y \<and> y\<^bsub>p\<^esub> \<in> A\<^isup>\<star> \<and> (y\<^bsub>p\<^esub>, y\<^isub>s) \<in> Partitions y}"}
 \end{center}
 \noindent
-From this we know there exist partitions @{text "y\<^isub>p"} and @{text
+From this we know there exists a partition @{text "y\<^isub>p"} and @{text
 "y\<^isub>s"} with @{term "y\<^isub>p \<in> A\<star>"} and also @{term "x\<^isub>s \<approx>A
 y\<^isub>s"}. Unfolding the Myhill-Nerode relation we know @{term
 "y\<^isub>s @ z\<^isub>a \<in> A"}. We also know that @{term "z\<^isub>b \<in> A\<star>"}.
 Therefore @{term "y\<^isub>p @ (y\<^isub>s @ z\<^isub>a) @ z\<^isub>b \<in>
-A\<star>"}, which means @{term "y @ z \<in> A\<star>"}. As the last step we have to set
+A\<star>"}, which means @{term "y @ z \<in> A\<star>"}. The last step is to set
 @{text "A"} to @{term "lang r"} and thus complete the proof.
 \end{proof}
 *}
 section {* Second Part proved using Partial Derivatives *}
 As we have seen in the previous section, in order to establish
 the second direction of the Myhill-Nerode theorem, we need to find
 a more refined relation than @{term "\<approx>(lang r)"} for which we can
 show that there are only finitely many equivalence classes. So far we
 showed this by induction on @{text "r"}. However, there is also
-an indirect method to come up with such a refined relation based on
+an indirect method to come up with such a refined relation by using
 derivatives of regular expressions \cite{Brzozowski64}.
 Assume the following two definitions for a \emph{left-quotient} of a language,
 which we write as @{term "Der c A"} and @{term "Ders s A"} where @{text c}
 is a character and @{text s} a string:
 @{thm (lhs) Ders_def} & @{text "\<equiv>"} & @{thm (rhs) Ders_def}\\
 \end{tabular}
 \end{center}
 \noindent
-In order to aid readability, we shall also make use of the following abbreviation:
+In order to aid readability, we shall also make use of the following abbreviation
 \begin{center}
-@{abbrev "Derss s A"}
+@{abbrev "Derss s As"}
 \end{center}
+\noindent
-\noindent
+where we apply the left-quotient to a set of languages and then combine the results.
 Clearly we have the following relation between the Myhill-Nerode relation
 (Def.~\ref{myhillneroderel}) and left-quotients
 \begin{equation}\label{mhders}
 @{term "x \<approx>A y"} \hspace{4mm}\text{if and only if}\hspace{4mm} @{term "Ders x A = Ders y A"}
 \end{equation}
 \noindent
-It is straightforward to establish the following properties for left-quotients:
+It is also straightforward to establish the following properties for left-quotients:
 \begin{equation}
 \mbox{\begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{2mm}}l}
 @{thm (lhs) Der_simps(1)} & $=$ & @{thm (rhs) Der_simps(1)}\\
 @{thm (lhs) Der_simps(2)} & $=$ & @{thm (rhs) Der_simps(2)}\\
 \end{tabular}}
 \end{equation}
 \noindent
 where @{text "\<Delta>"} is a function that tests whether the empty string
-is in the language and returns @{term "{[]}"} or @{term "{}"}, respectively.
+is in the language and returns @{term "{[]}"} or @{term "{}"}, accordingly.
-The only interesting case above is the last one where we use Prop.~\ref{langprops}
+The only interesting case above is the last one where we use Prop.~\ref{langprops}@{text "(i)"}
 in order to infer that @{term "Der c (A\<star>) = Der c (A \<cdot> A\<star>)"}. We can
 then complete the proof by observing that @{term "Delta A \<cdot> Der c (A\<star>) \<subseteq> (Der c A) \<cdot> A\<star>"}.
 Brzozowski observed that the left-quotients for languages of regular
 expressions can be calculated directly via the notion of \emph{derivatives
 @{thm (lhs) ders.simps(2)}  & @{text "\<equiv>"} & @{thm (rhs) ders.simps(2)}\\
 \end{tabular}
 \end{center}
 \noindent
-The last two clauses extend derivatives for characters to strings (list of
+The last two clauses extend derivatives from characters to strings, i.e.~lists of
-characters). The list-cons operator is written \mbox{@{text "_ :: _"}}. The
+characters. The list-cons operator is written \mbox{@{text "_ :: _"}}. The
 function @{term "nullable r"} needed in the @{const Times}-case tests
 whether a regular expression can recognise the empty string:
 \begin{center}
 \begin{tabular}{c@ {\hspace{10mm}}c}
 \end{tabular}
 \end{tabular}
 \end{center}
 \noindent
-By induction on the regular expression @{text r}, respectively the string @{text s},
+By induction on the regular expression @{text r}, respectively on the string @{text s},
 one can easily show that left-quotients and derivatives relate as follows
 \cite{Sakarovitch09}:
 \begin{equation}\label{Dersders}
 \mbox{\begin{tabular}{c}
 \noindent
 The importance in the context of the Myhill-Nerode theorem is that
 we can use \eqref{mhders} and \eqref{Dersders} in order to
 establish that @{term "x \<approx>(lang r) y"} is equivalent to
-@{term "lang (ders x r) = lang (ders y r)"}. From this we obtain
+@{term "lang (ders x r) = lang (ders y r)"}. Hence
 \begin{equation}
 @{term "x \<approx>(lang r) y"}\hspace{4mm}\mbox{provided}\hspace{4mm}@{term "ders x r = ders y r"}
 \end{equation}
 \noindent
 which means the right-hand side (seen as relation) refines the
 Myhill-Nerode relation.  Consequently, we can use
-@{text "\<^raw:$\threesim$>\<^bsub>(\<lambda>x. ders x r)\<^esub>"} as a potential tagging-relation
+@{text "\<^raw:$\threesim$>\<^bsub>(\<lambda>x. ders x r)\<^esub>"} as a tagging-relation
 for the regular expression @{text r}. However, in
-order to be useful in the Myhill-Nerode theorem, we also have to show that
+order to be useful for the second part of the Myhill-Nerode theorem, we also have to show that
-for the corresponding language there are only finitely many derivatives---ensuring
+for the corresponding language there are only finitely many derivatives---thus ensuring
 that there are only finitely many equivalence classes. Unfortunately, this
 is not true in general. Sakarovitch gives an example where a regular
 expression  has infinitely many derivatives w.r.t.~a language
 \cite[Page~141]{Sakarovitch09}. What Brzozowski \cite{Brzozowski64} proved
 is that for every language there \emph{are} only finitely many `dissimilar'
 \end{tabular}}
 \end{equation}
 \noindent
 Carrying this idea through, we must not consider the set of all derivatives,
-but the ones modulo @{text "ACI"}.  In principle, this can be formally
+but the ones modulo @{text "ACI"}.  In principle, this can be done formally,
-defined, but it is very painful in a theorem prover (since there is no
+but it is very painful in a theorem prover (since there is no
 direct characterisation of the set of dissimilar derivatives).
 Fortunately, there is a much simpler approach using \emph{partial
 derivatives}. They were introduced by Antimirov \cite{Antimirov95} and can be defined
 \noindent
 in order to `sequence' a regular expression with a set of regular
 expressions. Note that in the last clause we first build the set of partial
 derivatives w.r.t.~the character @{text c}, then build the image of this set under the
 function @{term "pders s"} and finally `union up' all resulting sets. It will be
-convenient to introduce the following abbreviation
+convenient to introduce the following abbreviation for this
 \begin{center}
 @{abbrev "pderss s A"}
 \end{center}
 taking the partial derivatives of the
 regular expressions in \eqref{ACI} gives us in each case
 equal sets.  Antimirov \cite{Antimirov95} showed a similar result to
 \eqref{Dersders} for partial derivatives:
-\begin{equation}
+\begin{equation}\label{Derspders}
 \mbox{\begin{tabular}{lc}
 @{text "(i)"}  & @{thm Der_pder}\\
 @{text "(ii)"} & @{thm Ders_pders}
 \end{tabular}}
 \end{equation}
 \begin{center}
 \begin{tabular}{r@ {\hspace{1.5mm}}c@ {\hspace{1.5mm}}ll}
 @{term "Ders (c # s) (lang r)"}
 & @{text "="} & @{term "Ders s (Der c (lang r))"} & by def.\\
-& @{text "="} & @{term "Ders s (\<Union> lang ` (pder c r))"} & by @{text "(i)"}\\
+& @{text "="} & @{term "Ders s (\<Union> lang ` (pder c r))"} & by @{text "("}\ref{Derspders}@{text ".i)"}\\
 & @{text "="} & @{term "\<Union> (Ders s) ` (lang ` (pder c r))"} & by def.~of @{text "Ders"}\\
 & @{text "="} & @{term "\<Union> lang ` (\<Union> pders s ` (pder c r))"} & by IH\\
 & @{text "="} & @{term "\<Union> lang ` (pders (c # s) r)"} & by def.\\
 \end{tabular}
 \end{center}
 \noindent
-In order to apply the induction hypothesis in the fourth step, we need the generalisation
+Note that in order to apply the induction hypothesis in the fourth equation, we
-over all regular expressions @{text r}. The case for the empty string is routine and omitted.
+need the generalisation over all regular expressions @{text r}. The case for
+the empty string is routine and omitted.
 \end{proof}
-Antimirov also proved that for every language and regular expression there are only finitely
+\noindent
-many partial derivatives.
+Taking \eqref{Dersders} and \eqref{Derspders} together gives the relationship
+between languages of derivatives and partial derivatives
+\begin{equation}
+\mbox{\begin{tabular}{lc}
+@{text "(i)"}  & @{thm der_pder[symmetric]}\\
+@{text "(ii)"} & @{thm ders_pders[symmetric]}
+\end{tabular}}
+\end{equation}
+\noindent
+These two properties confirm the observation made earlier
+that by using sets, partial derivatives have the @{text "ACI"}-identities
+of derivatives already built in.
+Antimirov also proved that for every language and regular expression
+there are only finitely many partial derivatives.
 *}
 section {* Closure Properties *}
 text {*

changeset 190	b73478aaf33e
parent 187	9f46a9571e37
child 193	2a5ac68db24b