regexp: comparison Paper/Paper.thy

equal deleted inserted replaced

-:37ab56205097
+:a9ebc410a5c8
 REL :: "(string \<times> string) \<Rightarrow> bool"
 UPLUS :: "'a set \<Rightarrow> 'a set \<Rightarrow> (nat \<times> 'a) set"
 abbreviation
 "EClass x R \<equiv> R `` {x}"
+abbreviation
+"append_rexp2 r_itm r \<equiv> append_rexp r r_itm"
 notation (latex output)
 str_eq_rel ("\<approx>\<^bsub>_\<^esub>") and
 str_eq ("_ \<approx>\<^bsub>_\<^esub> _") and
 Seq (infixr "\<cdot>" 100) and
 L ("\<^raw:\ensuremath{\cal{L}}>'(_')" [0] 101) and
 Lam ("\<lambda>'(_')" [100] 100) and
 Trn ("'(_, _')" [100, 100] 100) and
 EClass ("\<lbrakk>_\<rbrakk>\<^bsub>_\<^esub>" [100, 100] 100) and
 transition ("_ \<^raw:\ensuremath{\stackrel{\text{>_\<^raw:}}{\Longmapsto}}> _" [100, 100, 100] 100) and
-Setalt ("\<^raw:\ensuremath{\bigplus}>_" [1000] 999)
+Setalt ("\<^raw:\ensuremath{\bigplus}>_" [1000] 999) and
+append_rexp2 ("_ \<^raw:\ensuremath{\triangleleft}> _" [100, 100] 100) and
+append_rhs_rexp ("_ \<^raw:\ensuremath{\triangleleft}> _" [100, 100] 100)
 (*>*)
 section {* Introduction *}
 \end{equation}
 \noindent
 changes the type---the disjoint union is not a set, but a set of pairs.
 Using this definition for disjoint unions means we do not have a single type for automata
-and hence will not be able to state properties about \emph{all}
+and hence will not be able to state certain properties about \emph{all}
 automata, since there is no type quantification available in HOL. An
 alternative, which provides us with a single type for automata, is to give every
 state node an identity, for example a natural
 number, and then be careful to rename these identities apart whenever
 connecting two automata. This results in clunky proofs
 Functions are much better supported in Isabelle/HOL, but they still lead to similar
 problems as with graphs.  Composing, for example, two non-deterministic automata in parallel
 poses again the problem of how to implement disjoint unions. Nipkow \cite{Nipkow98}
 dismisses the option of using identities, because it leads to ``messy proofs''. He
-opts for a variant of \eqref{disjointunion}, but writes
+opts for a variant of \eqref{disjointunion} using bitlists, but writes
 \begin{quote}
 \it ``If the reader finds the above treatment in terms of bit lists revoltingly
-concrete, I cannot disagree.''
+concrete, \phantom{``}I cannot disagree.''
 \end{quote}
 \noindent
 Moreover, it is not so clear how to conveniently impose a finiteness condition
 upon functions in order to represent \emph{finite} automata. The best is
-probably to resort to more advanced reasoning frameworks, such as \emph{locales}.
+probably to resort to more advanced reasoning frameworks, such as \emph{locales}
+or \emph{type classes},
+which are not avaiable in all HOL-based theorem provers.
 Because of these problems to do with representing automata, there seems
 to be no substantial formalisation of automata theory and regular languages
 carried out in a HOL-based theorem prover. Nipkow establishes in
 \cite{Nipkow98} the link between regular expressions and automata in
-the context of lexing. The only larger formalisations of automata theory
+the restricted context of lexing. The only larger formalisations of automata theory
 are carried out in Nuprl \cite{Constable00} and in Coq (for example
 \cite{Filliatre97}).
 In this paper, we will not attempt to formalise automata theory in
 Isabelle/HOL, but take a completely different approach to regular
 section {* Preliminaries *}
 text {*
 Strings in Isabelle/HOL are lists of characters with the \emph{empty string}
-being represented by the empty list, written @{term "[]"}. \emph{Languages}
+being represented by the empty list, written @{term "[]"}.  \emph{Languages}
 are sets of strings. The language containing all strings is written in
 Isabelle/HOL as @{term "UNIV::string set"}. The concatenation of two languages
 is written @{term "A ;; B"} and a language raised to the power @{text n} is written
 @{term "A \<up> n"}. Their definitions are
 where @{text "@"} is the usual list-append operation. The Kleene-star of a language @{text A}
 is defined as the union over all powers, namely @{thm Star_def}. In the paper
 we will make use of the following properties of these constructions.
 \begin{proposition}\label{langprops}\mbox{}\\
-\begin{tabular}{@ {}ll@ {\hspace{10mm}}ll}
+\begin{tabular}{@ {}ll}
-(i)   & @{thm star_cases}      & (ii)  & @{thm[mode=IfThen] pow_length}\\
+(i)   & @{thm star_cases}     \\
-(iii) & @{thm seq_Union_left}  &
+(ii)  & @{thm[mode=IfThen] pow_length}\\
+(iii) & @{thm seq_Union_left} \\
 \end{tabular}
 \end{proposition}
 \noindent
-We omit the proofs, but invite the reader to consult
+In @{text "(ii)"} we use the notation @{term "length s"} for the length of a string.
+We omit the proofs for these properties, but invite the reader to consult
 our formalisation.\footnote{Available at ???}
 The notation in Isabelle/HOL for the quotient of a language @{text A} according to an
 equivalence relation @{term REL} is @{term "A // REL"}. We will write
 @{thm (rhs) L_rexp.simps(6)[where r="r"]}\\
 \end{tabular}
 \end{tabular}
 \end{center}
-Given a set of regular expressions @{text rs}, we will need the operation of generating
+Given a set of regular expressions @{text rs}, we will make use of the operation of generating
-a corresponding regular expressions that matches all languages of @{text rs}. We only need the existence
+a regular expression that matches all languages of @{text rs}. We only need to know the existence
-of such a regular expressions and therefore we use Isabelle's @{const "fold_graph"} and Hilbert's
+of such a regular expression and therefore we use Isabelle/HOL's @{const "fold_graph"} and Hilbert's
 @{text "\<epsilon>"} to define @{term "\<Uplus>rs"}. This function, roughly speaking, folds @{const ALT} over the
 set @{text rs} with @{const NULL} for the empty set. We can prove that for finite sets @{text rs}
 \begin{center}
 @{thm (lhs) folds_alt_simp}@{text "= \<Union> (\<calL> ` rs)"}
 \noindent
 It is easy to see that @{term "\<approx>A"} is an equivalence relation, which
 partitions the set of all strings, @{text "UNIV"}, into a set of disjoint
 equivalence classes. An example is the regular language containing just
-the string @{text "[c]"}, then @{term "\<approx>({[c]})"} partitions @{text UNIV}
+the string @{text "[c]"}. The relation @{term "\<approx>({[c]})"} partitions @{text UNIV}
 into the three equivalence classes @{text "X\<^isub>1"}, @{text "X\<^isub>2"} and  @{text "X\<^isub>3"}
 as follows
 \begin{center}
 @{text "X\<^isub>1 = {[]}"}\hspace{5mm}
 \begin{equation}
 @{thm finals_def}
 \end{equation}
 \noindent
-In our running example, @{text "X\<^isub>1"} is the only equivalence class in @{term "finals {[c]}"}.
+In our running example, @{text "X\<^isub>2"} is the only equivalence class in @{term "finals {[c]}"}.
 It is straightforward to show that in general @{thm lang_is_union_of_finals} and
 @{thm finals_in_partitions} hold.
 Therefore if we know that there exists a regular expression for every
 equivalence class in @{term "finals A"} (which by assumption must be
-a finite set), then we can obtain a regular expression using @{text "\<bigplus>"}
+a finite set), then we can using @{text "\<bigplus>"} obtain a regular expression
-that matches every string in @{text A}.
+using that matches every string in @{text A}.
 Our proof of Thm.~\ref{myhillnerodeone} relies on a method that can calculate a
 regular expression for \emph{every} equivalence class, not just the ones
 in @{term "finals A"}. We
-first define the notion of \emph{transition} between equivalence classes
+first define the notion of \emph{one-character-transition} between equivalence classes
 %
 \begin{equation}
 @{thm transition_def}
 \end{equation}
 \noindent
-which means that if we concatenate all strings matching the regular expression @{text r}
+which means that if we concatenate the character @{text c} to the end of all
-to the end of all strings in the equivalence class @{text Y}, we obtain a subset of
+strings in the equivalence class @{text Y}, we obtain a subset of
 @{text X}. Note that we do not define an automaton here, we merely relate two sets
 (with the help of a regular expression). In our concrete example we have
-@{term "X\<^isub>1 \<Turnstile>(CHAR c)\<Rightarrow> X\<^isub>2"} and @{term "X\<^isub>1 \<Turnstile>r\<Rightarrow> X\<^isub>3"} with @{text r} being any
+@{term "X\<^isub>1 \<Turnstile>c\<Rightarrow> X\<^isub>2"}, @{term "X\<^isub>1 \<Turnstile>d\<Rightarrow> X\<^isub>3"} with @{text d} being any
-other regular expression than @{const EMPTY} and @{term "CHAR c"}.
+other character than @{text c}, and @{term "X\<^isub>3 \<Turnstile>d\<Rightarrow> X\<^isub>3"} for any @{text c}.
 Next we build an equational system that
 contains an equation for each equivalence class. Suppose we have
 the equivalence classes @{text "X\<^isub>1,\<dots>,X\<^isub>n"}, there must be one and only one that
 contains the empty string @{text "[]"} (since equivalence classes are disjoint).
 \end{tabular}
 \end{center}
 \noindent
 where the pairs @{text "(Y\<^isub>i\<^isub>j, CHAR c\<^isub>i\<^isub>j)"} stand for all transitions
-@{term "Y\<^isub>i\<^isub>j \<Turnstile>(CHAR c\<^isub>i\<^isub>j)\<Rightarrow> X\<^isub>i"}.  The term @{text "\<lambda>(EMPTY)"} acts as a marker for the equivalence
+@{term "Y\<^isub>i\<^isub>j \<Turnstile>c\<^isub>i\<^isub>j\<Rightarrow> X\<^isub>i"}.  There can only be finitely many such
-class containing @{text "[]"}. (Note that we mark, roughly speaking, the
+transitions since there are only finitely many equivalence classes
+and finitely many characters.
+The term @{text "\<lambda>(EMPTY)"} in the first equation acts as a marker for the equivalence
+class containing @{text "[]"}. (As an aside note that we mark, roughly speaking, the
 single ``initial'' state in the equational system, which is different from
-the method by Brzozowski \cite{Brzozowski64}, since he marks the ``terminal''
+the method by Brzozowski \cite{Brzozowski64}: he marks the ``terminal''
 states. We are forced to set up the equational system in this way, because
 the Myhill-Nerode relation determines the ``direction'' of the transitions.
 The successor ``state'' of an equivalence class @{text Y} can be reached by adding
 characters to the end of @{text Y}. This is also the reason why we have to use
 our reverse version of Arden's lemma.)
 Overloading the function @{text L} for the two kinds of terms in the
-equational system as follows
+equational system, we have
 \begin{center}
-@{thm L_rhs_item.simps(2)[where X="Y" and r="r", THEN eq_reflection]}\hspace{10mm}
+@{text "\<calL>(Y, r) \<equiv>"} %
+@{thm (rhs) L_rhs_item.simps(2)[where X="Y" and r="r", THEN eq_reflection]}\hspace{10mm}
 @{thm L_rhs_item.simps(1)[where r="r", THEN eq_reflection]}
 \end{center}
 \noindent
 we can prove for @{text "X\<^isub>2\<^isub>.\<^isub>.\<^isub>n"} that the following equations
 @{text "X\<^isub>1 = \<calL>(Y\<^isub>i\<^isub>1, CHAR c\<^isub>i\<^isub>1) \<union> \<dots> \<union> \<calL>(Y\<^isub>i\<^isub>p, CHAR c\<^isub>i\<^isub>p) \<union> \<calL>(\<lambda>(EMPTY))"}.
 \end{equation}
 \noindent
 The reason for adding the @{text \<lambda>}-marker to our equational system is
-to obtain this equation, which only holds in this form since none of
+to obtain this equation: it only holds in this form since none of
 the other terms contain the empty string.
-Our proof of Thm.~\ref{myhillnerodeone} will be by transforming the
+Our proof of Thm.~\ref{myhillnerodeone} will proceed by transforming the
 equational system into a \emph{solved form} maintaining the invariants
 \eqref{inv1} and \eqref{inv2}. From the solved form we will be able to read
 off the regular expressions.
 In order to transform an equational system into solved form, we have two main
 operations: one that takes an equation of the form @{text "X = rhs"} and removes
 the recursive occurences of @{text X} in @{text rhs} using our variant of Arden's
-Lemma (Lem.~\ref{arden}). The other operation takes an equation @{text "X = rhs"}
+Lemma. The other operation takes an equation @{text "X = rhs"}
 and substitutes @{text X} throughout the rest of the equational system
-adjusting the regular expressions approriately. To define them we need the
+adjusting the remaining regular expressions approriately. To define this adjustment
-operation
+we define the \emph{append-operation}
 \begin{center}
-@{thm attach_rexp.simps(2)[where X="Y" and r="r\<^isub>1" and rexp="r\<^isub>2", THEN eq_reflection]}\hspace{10mm}
+@{thm append_rexp.simps(2)[where X="Y" and r="r\<^isub>1" and rexp="r\<^isub>2", THEN eq_reflection]}\hspace{10mm}
-@{thm attach_rexp.simps(1)[where r="r\<^isub>1" and rexp="r\<^isub>2", THEN eq_reflection]}
+@{thm append_rexp.simps(1)[where r="r\<^isub>1" and rexp="r\<^isub>2", THEN eq_reflection]}
 \end{center}
+\noindent
+which we also lift to entire right-hand sides of equations, written as
+@{thm (lhs) append_rhs_rexp_def[where rexp="r"]}.
+\begin{center}
+@{thm arden_op_def}
+\end{center}
 *}
 section {* Regular Expressions Generate Finitely Many Partitions *}
 text {*
 *}
 section {* Conclusion and Related Work *}
+text {*
+In this paper we took the view that a regular language as one where there exists
+a regular expression that matches all its strings. For us it was important to find
+out how far we can push this point of view. Having formalised the Myhill-Nerode
+theorem means pushed very far. Having the Myhill-Nerode theorem means we can
+formalise much of the textbook results in this subject.
+*}
 (*<*)
 end
 (*>*)

changeset 92	a9ebc410a5c8
parent 90	97b783438316
child 93	2aa3756dcc9f