Journal/Paper.thy
changeset 173 d371536861bc
parent 172 21ee3a852a02
child 174 2b414a8a7132
equal deleted inserted replaced
172:21ee3a852a02 173:d371536861bc
    27 translations
    27 translations
    28   "_Collect p P"      <= "{p. P}"
    28   "_Collect p P"      <= "{p. P}"
    29   "_Collect p P"      <= "{p|xs. P}"
    29   "_Collect p P"      <= "{p|xs. P}"
    30   "_CollectIn p A P"  <= "{p : A. P}"
    30   "_CollectIn p A P"  <= "{p : A. P}"
    31 
    31 
    32 abbreviation "NULL \<equiv> Zero"
    32 abbreviation "ZERO \<equiv> Zero"
    33 abbreviation "EMPTY \<equiv> One"
    33 abbreviation "ONE \<equiv> One"
    34 abbreviation "CHAR \<equiv> Atom"
    34 abbreviation "ATOM \<equiv> Atom"
    35 abbreviation "ALT \<equiv> Plus"
    35 abbreviation "PLUS \<equiv> Plus"
    36 abbreviation "SEQ \<equiv> Times"
    36 abbreviation "TIMES \<equiv> Times"
    37 abbreviation "STAR \<equiv> Star"
    37 abbreviation "STAR \<equiv> Star"
    38 
    38 
    39 
       
    40 ML {* @{term "op ^^"} *}
       
    41 
    39 
    42 notation (latex output)
    40 notation (latex output)
    43   str_eq_rel ("\<approx>\<^bsub>_\<^esub>") and
    41   str_eq_rel ("\<approx>\<^bsub>_\<^esub>") and
    44   str_eq ("_ \<approx>\<^bsub>_\<^esub> _") and
    42   str_eq ("_ \<approx>\<^bsub>_\<^esub> _") and
    45   conc (infixr "\<cdot>" 100) and
    43   conc (infixr "\<cdot>" 100) and
   130   wide range of textbooks on this subject, many of which are aimed at students
   128   wide range of textbooks on this subject, many of which are aimed at students
   131   and contain very detailed `pencil-and-paper' proofs
   129   and contain very detailed `pencil-and-paper' proofs
   132   (e.g.~\cite{Kozen97, HopcroftUllman69}). It seems natural to exercise theorem provers by
   130   (e.g.~\cite{Kozen97, HopcroftUllman69}). It seems natural to exercise theorem provers by
   133   formalising the theorems and by verifying formally the algorithms.  A
   131   formalising the theorems and by verifying formally the algorithms.  A
   134   popular choice for a theorem prover would be one based on Higher-Order Logic
   132   popular choice for a theorem prover would be one based on Higher-Order Logic
   135   (HOL), such as HOL4, HOLlight and Isabelle/HOL. For our development we will
   133   (HOL), for example HOL4, HOLlight and Isabelle/HOL. For our development 
   136   use the latter theorem prover. One distinguishing feature of HOL is it's
   134   we will use the latter. One distinguishing feature of HOL is its
   137   type system based on Church's Simple Theory of Types \cite{Church40}.  The
   135   type system, which is based on Church's Simple Theory of Types \cite{Church40}.  The
   138   limitations of this type system are one of the main motivations
   136   limitations of this type system are one of the underlying motivations for the 
   139   behind the work presented in this paper.
   137   work presented in this paper.
   140 
   138 
   141   The typical approach to regular languages is to
   139   The typical approach to regular languages is to
   142   introduce finite automata and then define everything in terms of them.  For
   140   introduce finite automata and then define everything in terms of them.  For
   143   example, a regular language is normally defined as one whose strings are
   141   example, a regular language is normally defined as one whose strings are
   144   recognised by a finite deterministic automaton. This approach has many
   142   recognised by a finite deterministic automaton. This approach has many
   223   \begin{equation}\label{disjointunion}
   221   \begin{equation}\label{disjointunion}
   224   @{text "A\<^isub>1 \<uplus> A\<^isub>2 \<equiv> {(1, x) | x \<in> A\<^isub>1} \<union> {(2, y) | y \<in> A\<^isub>2}"}
   222   @{text "A\<^isub>1 \<uplus> A\<^isub>2 \<equiv> {(1, x) | x \<in> A\<^isub>1} \<union> {(2, y) | y \<in> A\<^isub>2}"}
   225   \end{equation}
   223   \end{equation}
   226 
   224 
   227   \noindent
   225   \noindent
   228   changes the type---the disjoint union is not a set, but a set of pairs. 
   226   changes the type---the disjoint union is not a set, but a set of
   229   Using this definition for disjoint union means we do not have a single type for automata
   227   pairs. Using this definition for disjoint union means we do not have a
   230   and hence will not be able to state certain properties about \emph{all}
   228   single type for automata. As a result we will not be able to define a regular
   231   automata, since there is no type quantification available in HOL (unlike in Coq, for example). 
   229   language as one for which there exists an automaton that recognises all its
   232   As a result, we would not be able to define a language being regular
   230   strings. Similarly, we cannot state properties about \emph{all} automata,
   233   in terms of the existence of an automata.
   231   since there is no type quantification available in HOL (unlike in Coq, for
       
   232   example).
   234 
   233 
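  To make the type change concrete, such a disjoint union could be written in
  Isabelle/HOL roughly as follows; this is only a sketch and the name
  @{text "dunion"} is hypothetical, not part of the formalisation:

      (* the result type is (nat \<times> 'a) set rather than 'a set, which is
         exactly the change of type described above *)
      definition dunion :: "'a set \<Rightarrow> 'a set \<Rightarrow> (nat \<times> 'a) set" where
        "dunion A B \<equiv> {(1, x) | x. x \<in> A} \<union> {(2, y) | y. y \<in> B}"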
   235   An alternative, which provides us with a single type for automata, is to give every 
   234   An alternative, which provides us with a single type for automata, is to give every 
   236   state node an identity, for example a natural
   235   state node an identity, for example a natural
   237   number, and then be careful to rename these identities apart whenever
   236   number, and then be careful to rename these identities apart whenever
   238   connecting two automata. This results in clunky proofs
   237   connecting two automata. This results in clunky proofs
   281   lexing. Berghofer and Reiter \cite{BerghoferReiter09} formalise automata
   280   lexing. Berghofer and Reiter \cite{BerghoferReiter09} formalise automata
   282   working over bit strings in the context of Presburger arithmetic.  The only
   281   working over bit strings in the context of Presburger arithmetic.  The only
   283   larger formalisations of automata theory are carried out in Nuprl
   282   larger formalisations of automata theory are carried out in Nuprl
   284   \cite{Constable00} and in Coq \cite{Filliatre97}.
   283   \cite{Constable00} and in Coq \cite{Filliatre97}.
   285 
   284 
   286   Also one might consider the Myhill-Nerode theorem as well-worn stock 
   285   One might also consider the Myhill-Nerode theorem as well-worn stock
   287   material for a computer scientists, but paper proofs of this theorem contain some 
   286   material where everything is clear. However, paper proofs of this theorem
   288   subtle side-conditions which are easily overlooked and which make formal reasoning 
   287   often involve subtle side-conditions which are easily overlooked, but which
   289   painful. For example in Kozen's proof \cite{Kozen97} it is not sufficient to
   288   make formal reasoning rather painful. For example Kozen's proof requires
   290   have just any automata recognising a regular language, but one which does
   289   that the automata do not have inaccessible states \cite{Kozen97}. Another
   291   not have inaccessible states. This means if we define a regular language
   290   subtle side-condition is completeness of automata: 
   292   for which \emph{any} finite automaton exists, then one has to ensure that
   291   automata need to have total transition functions and at most one `sink'
   293   another equivalent can be found satisfying the side-conditions. Similarly
   292   state from which there is no connection to a final state (Brzozowski mentions
   294   Brozowski mentiones in \cite{Brozowski10} side-conditions of finite automata
   293   this side-condition in connection with state complexity
   295   in connection of state-complexity: there it is required that automata
   294   \cite{Brozowski10}). Such side-conditions mean that if we define a regular
   296   must be complete in the sense of having a total transition function. 
   295   language as one for which there exists \emph{any} finite automaton, then we
   297   Furthermore, if a `sink' state is present which accepts no word, then there
   296   need a lemma which ensures that another equivalent can be found satisfying the
   298   must be only one such state. . . .
   297   side-condition. Unfortunately, such `little' and `obvious' lemmas make 
   299   Such 'little' lemmas make formalisation of these results in a theroem
   298   formalisations of results in automata theory hair-pulling experiences.
   300   prover hair-pulling experiences.
   299 
   301   
   300   
   302   In this paper, we will not attempt to formalise automata theory in
   301   In this paper, we will not attempt to formalise automata theory in
   303   Isabelle/HOL nor attempt to formalise any automata proof from the
   302   Isabelle/HOL nor will we attempt to formalise automata proofs from the
   304   literature, but take a different approach to regular languages than is
   303   literature, but take a different approach to regular languages than is
   305   usually taken. Instead of defining a regular language as one where there
   304   usually taken. Instead of defining a regular language as one where there
   306   exists an automaton that recognises all strings of the language, we define a
   305   exists an automaton that recognises all strings of the language, we define a
   307   regular language as:
   306   regular language as:
   308 
   307 
   311   strings of @{text "A"}.
   310   strings of @{text "A"}.
   312   \end{dfntn}
   311   \end{dfntn}
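  Spelled out in Isabelle/HOL, this definition amounts to roughly the following
  sketch, where the name @{text "regular"} is hypothetical and @{text "lang"}
  stands for the function assigning to a regular expression the language it
  matches (defined later in the paper):

      definition regular :: "'a list set \<Rightarrow> bool" where
        "regular A \<equiv> \<exists>r. lang r = A"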
   313   
   312   
   314   \noindent
   313   \noindent
   315   The reason is that regular expressions, unlike graphs, matrices and
   314   The reason is that regular expressions, unlike graphs, matrices and
   316   functions, can be easily defined as inductive datatype. Consequently a
   315   functions, can be easily defined as an inductive datatype. No side-conditions
   317   corresponding reasoning infrastructure (like induction or recursion) comes
   316   will be needed for regular expressions. Moreover, a reasoning infrastructure 
   318   for free. This has recently been exploited in HOL4 with a formalisation of
   317   (like induction and recursion) comes for free in HOL-based theorem provers. 
       
   318   This has recently been exploited in HOL4 with a formalisation of
   319   regular expression matching based on derivatives \cite{OwensSlind08} and
   319   regular expression matching based on derivatives \cite{OwensSlind08} and
   320   with an equivalence checker for regular expressions in Isabelle/HOL
   320   with an equivalence checker for regular expressions in Isabelle/HOL
   321   \cite{KraussNipkow11}.  The purpose of this paper is to show that a central
   321   \cite{KraussNipkow11}.  The purpose of this paper is to show that a central
   322   result about regular languages---the Myhill-Nerode theorem---can be
   322   result about regular languages---the Myhill-Nerode theorem---can be
   323   recreated by only using regular expressions. This theorem gives necessary
   323   recreated by only using regular expressions. This theorem gives necessary
   327   
   327   
   328   \noindent
   328   \noindent
   329   {\bf Contributions:} There is an extensive literature on regular languages.
   329   {\bf Contributions:} There is an extensive literature on regular languages.
   330   To our best knowledge, our proof of the Myhill-Nerode theorem is the first
   330   To our best knowledge, our proof of the Myhill-Nerode theorem is the first
   331   that is based only on regular expressions. The part of this theorem stating
   331   that is based only on regular expressions. The part of this theorem stating
   332   that finitely many partitions of a language w.r.t.~to the Myhill-Nerode
   332   that finitely many partitions imply regularity of the language is proved by 
   333   relation imply that the language is regular is proved by a folklore argument
   333   an argument about solving equational systems.  This argument seems to be folklore.
   334   of solving equational sytems.  For the other part we give two proofs: a
   334   For the other part, we give two proofs: a
   335   direct proof using certain tagging-functions, and an indirect proof using
   335   direct proof using certain tagging-functions, and an indirect proof using
   336   Antimirov's partial derivatives \cite{Antimirov95} (also earlier Russian work). 
   336   Antimirov's partial derivatives \cite{Antimirov95} (also earlier Russian work). 
   337   Again to our best knowledge, the tagging-functions have not been used before to
   337   Again to our best knowledge, the tagging-functions have not been used before to
   338   establish the Myhill-Nerode theorem.
   338   establish the Myhill-Nerode theorem. 
   339 
   339 
   340 *}
   340 *}
   341 
   341 
   342 section {* Preliminaries *}
   342 section {* Preliminaries *}
   343 
   343 
   430   \noindent
   430   \noindent
   431   Regular expressions are defined as the inductive datatype
   431   Regular expressions are defined as the inductive datatype
   432 
   432 
   433   \begin{center}
   433   \begin{center}
   434   @{text r} @{text "::="}
   434   @{text r} @{text "::="}
   435   @{term NULL}\hspace{1.5mm}@{text"|"}\hspace{1.5mm} 
   435   @{term ZERO}\hspace{1.5mm}@{text"|"}\hspace{1.5mm} 
   436   @{term EMPTY}\hspace{1.5mm}@{text"|"}\hspace{1.5mm} 
   436   @{term ONE}\hspace{1.5mm}@{text"|"}\hspace{1.5mm} 
   437   @{term "CHAR c"}\hspace{1.5mm}@{text"|"}\hspace{1.5mm} 
   437   @{term "ATOM c"}\hspace{1.5mm}@{text"|"}\hspace{1.5mm} 
   438   @{term "SEQ r r"}\hspace{1.5mm}@{text"|"}\hspace{1.5mm} 
   438   @{term "TIMES r r"}\hspace{1.5mm}@{text"|"}\hspace{1.5mm} 
   439   @{term "ALT r r"}\hspace{1.5mm}@{text"|"}\hspace{1.5mm} 
   439   @{term "PLUS r r"}\hspace{1.5mm}@{text"|"}\hspace{1.5mm} 
   440   @{term "STAR r"}
   440   @{term "STAR r"}
   441   \end{center}
   441   \end{center}
   442 
   442 
   443   \noindent
   443   \noindent
   444   and the language matched by a regular expression is defined as
   444   and the language matched by a regular expression is defined as
   464 
   464 
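  For reference, such a datatype together with a language function could be set
  up along the following lines; this is only a sketch with hypothetical names
  (@{text "rexp"}, @{text "lang"}), and it assumes an operation @{text "conc"}
  for language concatenation and @{text "star"} for Kleene iteration:

      datatype 'a rexp =
          Zero
        | One
        | Atom 'a
        | Plus "'a rexp" "'a rexp"
        | Times "'a rexp" "'a rexp"
        | Star "'a rexp"

      (* the language matched by a regular expression *)
      primrec lang :: "'a rexp \<Rightarrow> 'a list set" where
        "lang Zero = {}"
      | "lang One = {[]}"
      | "lang (Atom a) = {[a]}"
      | "lang (Plus r s) = lang r \<union> lang s"
      | "lang (Times r s) = conc (lang r) (lang s)"
      | "lang (Star r) = star (lang r)"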
   465   Given a finite set of regular expressions @{text rs}, we will make use of the operation of generating 
   465   Given a finite set of regular expressions @{text rs}, we will make use of the operation of generating 
   466   a regular expression that matches the union of all languages of @{text rs}. We only need to know the 
   466   a regular expression that matches the union of all languages of @{text rs}. We only need to know the 
   467   existence
   467   existence
   468   of such a regular expression and therefore we use Isabelle/HOL's @{const "fold_graph"} and Hilbert's
   468   of such a regular expression and therefore we use Isabelle/HOL's @{const "fold_graph"} and Hilbert's
   469   @{text "\<epsilon>"} to define @{term "\<Uplus>rs"}. This operation, roughly speaking, folds @{const ALT} over the 
   469   @{text "\<epsilon>"} to define @{term "\<Uplus>rs"}. This operation, roughly speaking, folds @{const PLUS} over the 
   470   set @{text rs} with @{const NULL} for the empty set. We can prove that for a finite set @{text rs}
   470   set @{text rs} with @{const ZERO} for the empty set. We can prove that for a finite set @{text rs}
   471   %
   471   %
   472   \begin{equation}\label{uplus}
   472   \begin{equation}\label{uplus}
   473   \mbox{@{thm (lhs) folds_alt_simp} @{text "= \<Union> (\<calL> ` rs)"}}
   473   \mbox{@{thm (lhs) folds_alt_simp} @{text "= \<Union> (\<calL> ` rs)"}}
   474   \end{equation}
   474   \end{equation}
   475 
   475 
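  To convey the idea behind @{term "\<Uplus>rs"}, here is a list-based analogue with
  the hypothetical name @{text "bigplus"}; the formalisation itself works on
  finite sets via @{const "fold_graph"} and Hilbert's @{text "\<epsilon>"}, since only
  the existence of such a regular expression matters:

      (* fold PLUS over the list, starting from ZERO; for every list rs one
         then expects lang (bigplus rs) = \<Union> (lang ` set rs), which mirrors
         \eqref{uplus} above *)
      definition bigplus :: "'a rexp list \<Rightarrow> 'a rexp" where
        "bigplus rs \<equiv> foldr Plus rs Zero"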
   565   contains the empty string @{text "[]"} (since equivalence classes are disjoint).
   565   contains the empty string @{text "[]"} (since equivalence classes are disjoint).
   566   Let us assume @{text "[] \<in> X\<^isub>1"}. We build the following equational system
   566   Let us assume @{text "[] \<in> X\<^isub>1"}. We build the following equational system
   567   
   567   
   568   \begin{center}
   568   \begin{center}
   569   \begin{tabular}{rcl}
   569   \begin{tabular}{rcl}
   570   @{text "X\<^isub>1"} & @{text "="} & @{text "(Y\<^isub>1\<^isub>1, CHAR c\<^isub>1\<^isub>1) + \<dots> + (Y\<^isub>1\<^isub>p, CHAR c\<^isub>1\<^isub>p) + \<lambda>(EMPTY)"} \\
   570   @{text "X\<^isub>1"} & @{text "="} & @{text "(Y\<^isub>1\<^isub>1, ATOM c\<^isub>1\<^isub>1) + \<dots> + (Y\<^isub>1\<^isub>p, ATOM c\<^isub>1\<^isub>p) + \<lambda>(ONE)"} \\
   571   @{text "X\<^isub>2"} & @{text "="} & @{text "(Y\<^isub>2\<^isub>1, CHAR c\<^isub>2\<^isub>1) + \<dots> + (Y\<^isub>2\<^isub>o, CHAR c\<^isub>2\<^isub>o)"} \\
   571   @{text "X\<^isub>2"} & @{text "="} & @{text "(Y\<^isub>2\<^isub>1, ATOM c\<^isub>2\<^isub>1) + \<dots> + (Y\<^isub>2\<^isub>o, ATOM c\<^isub>2\<^isub>o)"} \\
   572   & $\vdots$ \\
   572   & $\vdots$ \\
   573   @{text "X\<^isub>n"} & @{text "="} & @{text "(Y\<^isub>n\<^isub>1, CHAR c\<^isub>n\<^isub>1) + \<dots> + (Y\<^isub>n\<^isub>q, CHAR c\<^isub>n\<^isub>q)"}\\
   573   @{text "X\<^isub>n"} & @{text "="} & @{text "(Y\<^isub>n\<^isub>1, ATOM c\<^isub>n\<^isub>1) + \<dots> + (Y\<^isub>n\<^isub>q, ATOM c\<^isub>n\<^isub>q)"}\\
   574   \end{tabular}
   574   \end{tabular}
   575   \end{center}
   575   \end{center}
   576 
   576 
   577   \noindent
   577   \noindent
   578   where the terms @{text "(Y\<^isub>i\<^isub>j, CHAR c\<^isub>i\<^isub>j)"}
   578   where the terms @{text "(Y\<^isub>i\<^isub>j, ATOM c\<^isub>i\<^isub>j)"}
   579   stand for all transitions @{term "Y\<^isub>i\<^isub>j \<Turnstile>c\<^isub>i\<^isub>j\<Rightarrow>
   579   stand for all transitions @{term "Y\<^isub>i\<^isub>j \<Turnstile>c\<^isub>i\<^isub>j\<Rightarrow>
   580   X\<^isub>i"}. 
   580   X\<^isub>i"}. 
   581   %The intuition behind the equational system is that every 
   581   %The intuition behind the equational system is that every 
   582   %equation @{text "X\<^isub>i = rhs\<^isub>i"} in this system
   582   %equation @{text "X\<^isub>i = rhs\<^isub>i"} in this system
   583   %corresponds roughly to a state of an automaton whose name is @{text X\<^isub>i} and its predecessor states 
   583   %corresponds roughly to a state of an automaton whose name is @{text X\<^isub>i} and its predecessor states 
   584   %are the @{text "Y\<^isub>i\<^isub>j"}; the @{text "c\<^isub>i\<^isub>j"} are the labels of the transitions from these 
   584   %are the @{text "Y\<^isub>i\<^isub>j"}; the @{text "c\<^isub>i\<^isub>j"} are the labels of the transitions from these 
   585   %predecessor states to @{text X\<^isub>i}. 
   585   %predecessor states to @{text X\<^isub>i}. 
   586   There can only be
   586   There can only be
   587   finitely many terms of the form @{text "(Y\<^isub>i\<^isub>j, CHAR c\<^isub>i\<^isub>j)"} in a right-hand side 
   587   finitely many terms of the form @{text "(Y\<^isub>i\<^isub>j, ATOM c\<^isub>i\<^isub>j)"} in a right-hand side 
   588   since by assumption there are only finitely many
   588   since by assumption there are only finitely many
   589   equivalence classes and only finitely many characters.
   589   equivalence classes and only finitely many characters.
   590   The term @{text "\<lambda>(EMPTY)"} in the first equation acts as a marker for the initial state, that
   590   The term @{text "\<lambda>(ONE)"} in the first equation acts as a marker for the initial state, that
   591   is the equivalence class
   591   is the equivalence class
   592   containing @{text "[]"}.\footnote{Note that we mark, roughly speaking, the
   592   containing @{text "[]"}.\footnote{Note that we mark, roughly speaking, the
   593   single `initial' state in the equational system, which is different from
   593   single `initial' state in the equational system, which is different from
   594   the method by Brzozowski \cite{Brzozowski64}, where he marks the
   594   the method by Brzozowski \cite{Brzozowski64}, where he marks the
   595   `terminal' states. We are forced to set up the equational system in our
   595   `terminal' states. We are forced to set up the equational system in our
   596   way, because the Myhill-Nerode relation determines the `direction' of the
   596   way, because the Myhill-Nerode relation determines the `direction' of the
   597   transitions---the successor `state' of an equivalence class @{text Y} can
   597   transitions---the successor `state' of an equivalence class @{text Y} can
   598   be reached by adding a character to the end of @{text Y}. This is also the
   598   be reached by adding a character to the end of @{text Y}. This is also the
   599   reason why we have to use our reverse version of Arden's Lemma.}
   599   reason why we have to use our reverse version of Arden's Lemma.}
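  As a small, purely illustrative instance of this construction (not taken from
  the formalisation), consider the language @{text "A"} of all strings over a
  two-letter alphabet @{text "{a, b}"} that end in @{text a}. The Myhill-Nerode
  relation has two equivalence classes: @{text "X\<^isub>1"}, the strings not ending
  in @{text a} (among them @{text "[]"}), and @{text "X\<^isub>2"}, the strings ending
  in @{text a}. The construction above yields the equational system

  \begin{center}
  \begin{tabular}{rcl}
  @{text "X\<^isub>1"} & @{text "="} & @{text "(X\<^isub>1, ATOM b) + (X\<^isub>2, ATOM b) + \<lambda>(ONE)"}\\
  @{text "X\<^isub>2"} & @{text "="} & @{text "(X\<^isub>1, ATOM a) + (X\<^isub>2, ATOM a)"}
  \end{tabular}
  \end{center}

  \noindent
  in which, as described above, only the equation for the class containing
  @{text "[]"} carries the @{text "\<lambda>(ONE)"}-marker.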
   600   %In our initial equation system there can only be
   600   %In our initial equation system there can only be
   601   %finitely many terms of the form @{text "(Y\<^isub>i\<^isub>j, CHAR c\<^isub>i\<^isub>j)"} in a right-hand side 
   601   %finitely many terms of the form @{text "(Y\<^isub>i\<^isub>j, ATOM c\<^isub>i\<^isub>j)"} in a right-hand side 
   602   %since by assumption there are only finitely many
   602   %since by assumption there are only finitely many
   603   %equivalence classes and only finitely many characters. 
   603   %equivalence classes and only finitely many characters. 
   604   Overloading the function @{text \<calL>} for the two kinds of terms in the
   604   Overloading the function @{text \<calL>} for the two kinds of terms in the
   605   equational system, we have
   605   equational system, we have
   606   
   606   
   612 
   612 
   613   \noindent
   613   \noindent
   614   and we can prove for @{text "X\<^isub>2\<^isub>.\<^isub>.\<^isub>n"} that the following equations
   614   and we can prove for @{text "X\<^isub>2\<^isub>.\<^isub>.\<^isub>n"} that the following equations
   615   %
   615   %
   616   \begin{equation}\label{inv1}
   616   \begin{equation}\label{inv1}
   617   @{text "X\<^isub>i = \<calL>(Y\<^isub>i\<^isub>1, CHAR c\<^isub>i\<^isub>1) \<union> \<dots> \<union> \<calL>(Y\<^isub>i\<^isub>q, CHAR c\<^isub>i\<^isub>q)"}.
   617   @{text "X\<^isub>i = \<calL>(Y\<^isub>i\<^isub>1, ATOM c\<^isub>i\<^isub>1) \<union> \<dots> \<union> \<calL>(Y\<^isub>i\<^isub>q, ATOM c\<^isub>i\<^isub>q)"}.
   618   \end{equation}
   618   \end{equation}
   619 
   619 
   620   \noindent
   620   \noindent
   621   hold. Similarly for @{text "X\<^isub>1"} we can show the following equation
   621   hold. Similarly for @{text "X\<^isub>1"} we can show the following equation
   622   %
   622   %
   623   \begin{equation}\label{inv2}
   623   \begin{equation}\label{inv2}
   624   @{text "X\<^isub>1 = \<calL>(Y\<^isub>1\<^isub>1, CHAR c\<^isub>1\<^isub>1) \<union> \<dots> \<union> \<calL>(Y\<^isub>1\<^isub>p, CHAR c\<^isub>1\<^isub>p) \<union> \<calL>(\<lambda>(EMPTY))"}.
   624   @{text "X\<^isub>1 = \<calL>(Y\<^isub>1\<^isub>1, ATOM c\<^isub>1\<^isub>1) \<union> \<dots> \<union> \<calL>(Y\<^isub>1\<^isub>p, ATOM c\<^isub>1\<^isub>p) \<union> \<calL>(\<lambda>(ONE))"}.
   625   \end{equation}
   625   \end{equation}
   626 
   626 
   627   \noindent
   627   \noindent
   628   holds. The reason for adding the @{text \<lambda>}-marker to our initial equational system is 
   628   holds. The reason for adding the @{text \<lambda>}-marker to our initial equational system is 
   629   to obtain this equation: it only holds with the marker, since none of 
   629   to obtain this equation: it only holds with the marker, since none of 
   639   %
   639   %
   640   \begin{equation}\label{initcs}
   640   \begin{equation}\label{initcs}
   641   \mbox{\begin{tabular}{rcl}     
   641   \mbox{\begin{tabular}{rcl}     
   642   @{thm (lhs) Init_rhs_def} & @{text "\<equiv>"} & 
   642   @{thm (lhs) Init_rhs_def} & @{text "\<equiv>"} & 
   643   @{text "if"}~@{term "[] \<in> X"}\\
   643   @{text "if"}~@{term "[] \<in> X"}\\
   644   & & @{text "then"}~@{term "{Trn Y (CHAR c) | Y c. Y \<in> CS \<and> Y \<Turnstile>c\<Rightarrow> X} \<union> {Lam EMPTY}"}\\
   644   & & @{text "then"}~@{term "{Trn Y (ATOM c) | Y c. Y \<in> CS \<and> Y \<Turnstile>c\<Rightarrow> X} \<union> {Lam ONE}"}\\
   645   & & @{text "else"}~@{term "{Trn Y (CHAR c)| Y c. Y \<in> CS \<and> Y \<Turnstile>c\<Rightarrow> X}"}\\
   645   & & @{text "else"}~@{term "{Trn Y (ATOM c)| Y c. Y \<in> CS \<and> Y \<Turnstile>c\<Rightarrow> X}"}\\
   646   @{thm (lhs) Init_def}     & @{text "\<equiv>"} & @{thm (rhs) Init_def}
   646   @{thm (lhs) Init_def}     & @{text "\<equiv>"} & @{thm (rhs) Init_def}
   647   \end{tabular}}
   647   \end{tabular}}
   648   \end{equation}
   648   \end{equation}
   649 
   649 
   650   
   650   
   984 
   984 
   985 
   985 
   986 section {* Myhill-Nerode, Second Part *}
   986 section {* Myhill-Nerode, Second Part *}
   987 
   987 
   988 text {*
   988 text {*
       
   989   \noindent
   989   We will prove in this section the second part of the Myhill-Nerode
   990   We will prove in this section the second part of the Myhill-Nerode
   990   theorem. It can be formulated in our setting as follows:
   991   theorem. It can be formulated in our setting as follows:
   991 
   992 
   992   \begin{thrm}
   993   \begin{thrm}
   993   If @{text "r"} is a regular expression, then @{thm Myhill_Nerode2}.
   994   If @{text "r"} is a regular expression, then @{thm Myhill_Nerode2}.
   997   The proof will be by induction on the structure of @{text r}. It turns out
   998   The proof will be by induction on the structure of @{text r}. It turns out
   998   the base cases are straightforward.
   999   the base cases are straightforward.
   999 
  1000 
  1000 
  1001 
  1001   \begin{proof}[Base Cases]
  1002   \begin{proof}[Base Cases]
  1002   The cases for @{const NULL}, @{const EMPTY} and @{const CHAR} are routine, because 
  1003   The cases for @{const ZERO}, @{const ONE} and @{const ATOM} are routine, because 
  1003   we can easily establish that
  1004   we can easily establish that
  1004 
  1005 
  1005   \begin{center}
  1006   \begin{center}
  1006   \begin{tabular}{l}
  1007   \begin{tabular}{l}
  1007   @{thm quot_zero_eq}\\
  1008   @{thm quot_zero_eq}\\
  1091 
  1092 
  1092   \noindent
  1093   \noindent
  1093   Chaining Lem.~\ref{finone} and \ref{fintwo} together means that, in order to show
  1094   Chaining Lem.~\ref{finone} and \ref{fintwo} together means that, in order to show
  1094   that @{term "UNIV // \<approx>(L r)"} is finite, we have to find a tagging-function whose
  1095   that @{term "UNIV // \<approx>(L r)"} is finite, we have to find a tagging-function whose
  1095   range can be shown to be finite and whose tagging-relation refines @{term "\<approx>(L r)"}.
  1096   range can be shown to be finite and whose tagging-relation refines @{term "\<approx>(L r)"}.
  1096   Let us attempt the @{const ALT}-case first.
  1097   Let us attempt the @{const PLUS}-case first.
  1097 
  1098 
  1098   \begin{proof}[@{const "ALT"}-Case] 
  1099   \begin{proof}[@{const "PLUS"}-Case] 
  1099   We take as tagging-function 
  1100   We take as tagging-function 
  1100   %
  1101   %
  1101   \begin{center}
  1102   \begin{center}
  1102   @{thm tag_str_Plus_def[where A="A" and B="B", THEN meta_eq_app]}
  1103   @{thm tag_str_Plus_def[where A="A" and B="B", THEN meta_eq_app]}
  1103   \end{center}
  1104   \end{center}
  1104 
  1105 
  1105   \noindent
  1106   \noindent
  1106   where @{text "A"} and @{text "B"} are some arbitrary languages.
  1107   where @{text "A"} and @{text "B"} are some arbitrary languages.
  1107   We can show in general that if @{term "finite (UNIV // \<approx>A)"} and @{term "finite (UNIV // \<approx>B)"}
  1108   We can show in general that if @{term "finite (UNIV // \<approx>A)"} and @{term "finite (UNIV // \<approx>B)"}
  1108   then @{term "finite ((UNIV // \<approx>A) \<times> (UNIV // \<approx>B))"} holds. The range of
  1109   then @{term "finite ((UNIV // \<approx>A) \<times> (UNIV // \<approx>B))"} holds. The range of
  1109   @{term "tag_str_ALT A B"} is a subset of this product set---so finite. It remains to be shown
  1110   @{term "tag_str_PLUS A B"} is a subset of this product set---so finite. It remains to be shown
  1110   that @{text "=tag\<^isub>A\<^isub>L\<^isub>T A B="} refines @{term "\<approx>(A \<union> B)"}. This amounts to 
  1111   that @{text "=tag\<^isub>A\<^isub>L\<^isub>T A B="} refines @{term "\<approx>(A \<union> B)"}. This amounts to 
  1111   showing
  1112   showing
  1112   %
  1113   %
  1113   \begin{center}
  1114   \begin{center}
  1114   @{term "tag\<^isub>A\<^isub>L\<^isub>T A B x = tag\<^isub>A\<^isub>L\<^isub>T A B y \<longrightarrow> x \<approx>(A \<union> B) y"}
  1115   @{term "tag\<^isub>A\<^isub>L\<^isub>T A B x = tag\<^isub>A\<^isub>L\<^isub>T A B y \<longrightarrow> x \<approx>(A \<union> B) y"}
  1132   \end{proof}
  1133   \end{proof}
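  To give a flavour of what such a tagging-function looks like as a definition,
  here is a sketch in which @{text "lang_equiv"} plays the role of the
  Myhill-Nerode relation @{text "\<approx>A"} and @{text "tag_Plus"} the role of the
  tagging-function above; both names are hypothetical and the formalisation's
  definitions may differ in detail:

      (* lang_equiv A models the Myhill-Nerode relation on strings *)
      definition lang_equiv :: "'a list set \<Rightarrow> ('a list \<times> 'a list) set" where
        "lang_equiv A \<equiv> {(x, y). \<forall>z. (x @ z \<in> A) = (y @ z \<in> A)}"

      (* pairs up the equivalence classes of x w.r.t. A and B, so the range
         is contained in (UNIV // \<approx>A) \<times> (UNIV // \<approx>B) *)
      definition tag_Plus ::
        "'a list set \<Rightarrow> 'a list set \<Rightarrow> 'a list \<Rightarrow> 'a list set \<times> 'a list set" where
        "tag_Plus A B x \<equiv> (lang_equiv A `` {x}, lang_equiv B `` {x})"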
  1133 
  1134 
  1134 
  1135 
  1135   \noindent
  1136   \noindent
  1136   The pattern in \eqref{pattern} is repeated for the other two cases. Unfortunately,
  1137   The pattern in \eqref{pattern} is repeated for the other two cases. Unfortunately,
  1137   they are slightly more complicated. In the @{const SEQ}-case we essentially have
  1138   they are slightly more complicated. In the @{const TIMES}-case we essentially have
  1138   to be able to infer that 
  1139   to be able to infer that 
  1139   %
  1140   %
  1140   \begin{center}
  1141   \begin{center}
  1141   @{text "\<dots>"}@{term "x @ z \<in> A \<cdot> B \<longrightarrow> y @ z \<in> A \<cdot> B"}
  1142   @{text "\<dots>"}@{term "x @ z \<in> A \<cdot> B \<longrightarrow> y @ z \<in> A \<cdot> B"}
  1142   \end{center}
  1143   \end{center}
  1237 
  1238 
  1238   \noindent
  1239   \noindent
  1239   with the idea that in the first split we have to make sure that @{text "(x - x') @ z"}
  1240   with the idea that in the first split we have to make sure that @{text "(x - x') @ z"}
  1240   is in the language @{text B}.
  1241   is in the language @{text B}.
  1241 
  1242 
  1242   \begin{proof}[@{const SEQ}-Case]
  1243   \begin{proof}[@{const TIMES}-Case]
  1243   If @{term "finite (UNIV // \<approx>A)"} and @{term "finite (UNIV // \<approx>B)"}
  1244   If @{term "finite (UNIV // \<approx>A)"} and @{term "finite (UNIV // \<approx>B)"}
  1244   then @{term "finite ((UNIV // \<approx>A) \<times> (Pow (UNIV // \<approx>B)))"} holds. The range of
  1245   then @{term "finite ((UNIV // \<approx>A) \<times> (Pow (UNIV // \<approx>B)))"} holds. The range of
  1245   @{term "tag_str_SEQ A B"} is a subset of this product set, and therefore finite.
  1246   @{term "tag_str_TIMES A B"} is a subset of this product set, and therefore finite.
  1246   We have to show injectivity of this tagging-function as
  1247   We have to show injectivity of this tagging-function as
  1247   %
  1248   %
  1248   \begin{center}
  1249   \begin{center}
  1249   @{term "\<forall>z. tag_str_SEQ A B x = tag_str_SEQ A B y \<and> x @ z \<in> A \<cdot> B \<longrightarrow> y @ z \<in> A \<cdot> B"}
  1250   @{term "\<forall>z. tag_str_TIMES A B x = tag_str_TIMES A B y \<and> x @ z \<in> A \<cdot> B \<longrightarrow> y @ z \<in> A \<cdot> B"}
  1250   \end{center}
  1251   \end{center}
  1251   %
  1252   %
  1252   \noindent
  1253   \noindent
  1253   There are two cases to be considered (see pictures above). First, there exists 
  1254   There are two cases to be considered (see pictures above). First, there exists 
  1254   an @{text "x'"} such that
  1255   an @{text "x'"} such that
  1257   \begin{center}
  1258   \begin{center}
  1258   @{term "(\<approx>B `` {x - x'}) \<in> ({\<approx>B `` {x - x'} |x'. x' \<le> x \<and> x' \<in> A})"}
  1259   @{term "(\<approx>B `` {x - x'}) \<in> ({\<approx>B `` {x - x'} |x'. x' \<le> x \<and> x' \<in> A})"}
  1259   \end{center}
  1260   \end{center}
  1260   %
  1261   %
  1261   \noindent
  1262   \noindent
  1262   and by the assumption about @{term "tag_str_SEQ A B"} also 
  1263   and by the assumption about @{term "tag_str_TIMES A B"} also 
  1263   %
  1264   %
  1264   \begin{center}
  1265   \begin{center}
  1265   @{term "(\<approx>B `` {x - x'}) \<in> ({\<approx>B `` {y - y'} |y'. y' \<le> y \<and> y' \<in> A})"}
  1266   @{term "(\<approx>B `` {x - x'}) \<in> ({\<approx>B `` {y - y'} |y'. y' \<le> y \<and> y' \<in> A})"}
  1266   \end{center}
  1267   \end{center}
  1267   %
  1268   %
  1272   relation and together with the fact that @{text "(x - x') @ z \<in> B"}, we
  1273   relation and together with the fact that @{text "(x - x') @ z \<in> B"}, we
  1273   have @{text "(y - y') @ z \<in> B"}. We already know @{text "y' \<in> A"}, therefore
  1274   have @{text "(y - y') @ z \<in> B"}. We already know @{text "y' \<in> A"}, therefore
  1274   @{term "y @ z \<in> A \<cdot> B"}, as needed in this case.
  1275   @{term "y @ z \<in> A \<cdot> B"}, as needed in this case.
  1275 
  1276 
  1276   Second, there exists a @{text "z'"} such that @{term "x @ z' \<in> A"} and @{text "z - z' \<in> B"}.
  1277   Second, there exists a @{text "z'"} such that @{term "x @ z' \<in> A"} and @{text "z - z' \<in> B"}.
  1277   By the assumption about @{term "tag_str_SEQ A B"} we have
  1278   By the assumption about @{term "tag_str_TIMES A B"} we have
  1278   @{term "\<approx>A `` {x} = \<approx>A `` {y}"} and thus @{term "x \<approx>A y"}, which means by the Myhill-Nerode
  1279   @{term "\<approx>A `` {x} = \<approx>A `` {y}"} and thus @{term "x \<approx>A y"}, which means by the Myhill-Nerode
  1279   relation that @{term "y @ z' \<in> A"} holds. Using @{text "z - z' \<in> B"}, we can conclude also in this case
  1280   relation that @{term "y @ z' \<in> A"} holds. Using @{text "z - z' \<in> B"}, we can conclude also in this case
  1280   with @{term "y @ z \<in> A \<cdot> B"}. We again can complete the @{const SEQ}-case
  1281   with @{term "y @ z \<in> A \<cdot> B"}. We again can complete the @{const TIMES}-case
  1281   by setting @{text A} to @{term "L r\<^isub>1"} and @{text B} to @{term "L r\<^isub>2"}.\qed 
  1282   by setting @{text A} to @{term "L r\<^isub>1"} and @{text B} to @{term "L r\<^isub>2"}.\qed 
  1282   \end{proof}
  1283   \end{proof}
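  In the same illustrative spirit, a tagging-function for the @{const TIMES}-case
  can be sketched as follows, using an explicit split @{text "x = x\<^isub>1 @ x\<^isub>2"} in
  place of the notation @{text "x' \<le> x"} and @{text "x - x'"} used above; the
  name @{text "tag_Times"} is again hypothetical:

      (* the second component collects the \<approx>B-classes of all suffixes of x
         whose corresponding prefix lies in A, so the range is contained in
         (UNIV // \<approx>A) \<times> Pow (UNIV // \<approx>B) *)
      definition tag_Times ::
        "'a list set \<Rightarrow> 'a list set \<Rightarrow> 'a list \<Rightarrow> 'a list set \<times> 'a list set set" where
        "tag_Times A B x \<equiv>
           (lang_equiv A `` {x}, {lang_equiv B `` {x2} | x1 x2. x = x1 @ x2 \<and> x1 \<in> A})"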
  1283   
  1284   
  1284   \noindent
  1285   \noindent
  1285   The case for @{const STAR} is similar to @{const SEQ}, but poses a few extra challenges. When
  1286   The case for @{const STAR} is similar to @{const TIMES}, but poses a few extra challenges. When
  1286   we analyse the case that @{text "x @ z"} is an element in @{term "A\<star>"} and @{text x} is not the 
  1287   we analyse the case that @{text "x @ z"} is an element in @{term "A\<star>"} and @{text x} is not the 
  1287   empty string, we 
  1288   empty string, we 
  1288   have the following picture:
  1289   have the following picture:
  1289   %
  1290   %
  1290   \begin{center}
  1291   \begin{center}
  1388 *}
  1389 *}
  1389 
  1390 
  1390 section {* Second Part based on Partial Derivatives *}
  1391 section {* Second Part based on Partial Derivatives *}
  1391 
  1392 
  1392 text {*
  1393 text {*
  1393    We briefly considered using the method Brzozowski presented in the
  1394   \noindent
  1394   Appendix of~\cite{Brzozowski64} in order to prove the second
  1395   As we have seen in the previous section, in order to establish
       
  1396   the second direction of the Myhill-Nerode theorem, we need to find 
       
  1397   a more refined relation (more refined than the Myhill-Nerode relation) for which we can
       
  1398   show that there are only finitely many equivalence classes.
       
  1399   Brzozowski presented a method in the Appendix
       
  1400   of~\cite{Brzozowski64} that can be used to prove
       
  1401   the second
  1395   direction of the Myhill-Nerode theorem. There he calculates the
  1402   direction of the Myhill-Nerode theorem. There he calculates the
  1396   derivatives for regular expressions and shows that for every
  1403   derivatives for regular expressions and shows that for every
  1397   language there can be only finitely many of them %derivations (if
  1404   language there can be only finitely many of them %derivations (if
  1398   regarded equal modulo ACI). We could have used as tagging-function
  1405   regarded equal modulo ACI). We could have used as tagging-function
  1399   the set of derivatives of a regular expression with respect to a
  1406   the set of derivatives of a regular expression with respect to a
  1401   whenever their derivative is the same, together with the fact that
  1408   whenever their derivative is the same, together with the fact that
  1402   there are only finitely many such derivatives would give us a similar
  1409   there are only finitely many such derivatives would give us a similar
  1403   argument to ours. However, it seems not so easy to calculate the set
  1410   argument to ours. However, it seems not so easy to calculate the set
  1404   of derivatives modulo ACI. Therefore we preferred our direct method
  1411   of derivatives modulo ACI. Therefore we preferred our direct method
  1405   of using tagging-functions. 
  1412   of using tagging-functions. 
       
  1413 
       
  1414   The problem of finiteness modulo ACI can be avoided by using partial
       
  1415   derivatives introduced by Antimirov \cite{Antimirov95}.
  1406 
  1416 
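  For illustration, Antimirov-style partial derivatives can be defined over the
  regular-expression datatype sketched in the preliminaries roughly as follows;
  the names @{text "pderiv"}, @{text "nullable"} and @{text "Timess"} are
  hypothetical and details may differ from existing formalisations:

      (* does a regular expression match the empty string? *)
      primrec nullable :: "'a rexp \<Rightarrow> bool" where
        "nullable Zero = False"
      | "nullable One = True"
      | "nullable (Atom a) = False"
      | "nullable (Plus r s) = (nullable r \<or> nullable s)"
      | "nullable (Times r s) = (nullable r \<and> nullable s)"
      | "nullable (Star r) = True"

      (* attach a fixed regular expression to the right of every element *)
      definition Timess :: "'a rexp set \<Rightarrow> 'a rexp \<Rightarrow> 'a rexp set" where
        "Timess rs s \<equiv> {Times r s | r. r \<in> rs}"

      (* partial derivative of a regular expression w.r.t. a character:
         a finite set of regular expressions rather than a single one *)
      primrec pderiv :: "'a \<Rightarrow> 'a rexp \<Rightarrow> 'a rexp set" where
        "pderiv c Zero = {}"
      | "pderiv c One = {}"
      | "pderiv c (Atom a) = (if c = a then {One} else {})"
      | "pderiv c (Plus r s) = pderiv c r \<union> pderiv c s"
      | "pderiv c (Times r s) =
           (if nullable r then Timess (pderiv c r) s \<union> pderiv c s
            else Timess (pderiv c r) s)"
      | "pderiv c (Star r) = Timess (pderiv c r) (Star r)"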
  1407 *}
  1417 *}
  1408 
  1418 
  1409 section {* Closure Properties *}
  1419 section {* Closure Properties *}
  1410 
  1420 
  1499   direction where we had to spend most of the `conceptual' time, as
  1509   direction where we had to spend most of the `conceptual' time, as
  1500   our proof-argument based on tagging-functions is new for
  1510   our proof-argument based on tagging-functions is new for
  1501   establishing the Myhill-Nerode theorem. All standard proofs of this
  1511   establishing the Myhill-Nerode theorem. All standard proofs of this
  1502   direction proceed by arguments over automata.\medskip
  1512   direction proceed by arguments over automata.\medskip
  1503 
  1513 
  1504   We expect that the development of Krauss \& Nipkoe gets easier by
  1514   We expect that the development of Krauss \& Nipkow gets easier by
  1505   using partial derivatives.
  1515   using partial derivatives.\medskip
  1506   
  1516   
  1507   \noindent
  1517   \noindent
  1508   {\bf Acknowledgements:} We are grateful for the comments we received from Larry
  1518   {\bf Acknowledgements:}
       
  1519   We are grateful for the comments we received from Larry
  1509   Paulson.
  1520   Paulson.
  1510 
  1521 
  1511 *}
  1522 *}
  1512 
  1523 
  1513 
  1524