regexp: comparison Paper/Paper.thy

equal deleted inserted replaced

-:a9ebc410a5c8
+:2aa3756dcc9f
 connecting two automata represented as matrices results in very adhoc
 constructions, which are not pleasant to reason about.
 Functions are much better supported in Isabelle/HOL, but they still lead to similar
 problems as with graphs.  Composing, for example, two non-deterministic automata in parallel
-poses again the problem of how to implement disjoint unions. Nipkow \cite{Nipkow98}
+requires also the formalisation of disjoint unions. Nipkow \cite{Nipkow98}
-dismisses the option of using identities, because it leads to ``messy proofs''. He
+dismisses the option of using identities, because it leads according to him to ``messy proofs''. He
 opts for a variant of \eqref{disjointunion} using bitlists, but writes
 \begin{quote}
-\it ``If the reader finds the above treatment in terms of bit lists revoltingly
+\it%
-concrete, \phantom{``}I cannot disagree.''
+\begin{tabular}{@ {}l@ {}p{0.88\textwidth}@ {}}
+`` & If the reader finds the above treatment in terms of bit lists revoltingly
+concrete, I cannot disagree.''\\
+`` & All lemmas appear obvious given a picture of the composition of automata\ldots
+Yet their proofs require a painful amount of detail.''
+\end{tabular}
 \end{quote}
 \noindent
 Moreover, it is not so clear how to conveniently impose a finiteness condition
 upon functions in order to represent \emph{finite} automata. The best is
 probably to resort to more advanced reasoning frameworks, such as \emph{locales}
 or \emph{type classes},
-which are not avaiable in all HOL-based theorem provers.
+which are not avaiable in \emph{all} HOL-based theorem provers.
 Because of these problems to do with representing automata, there seems
 to be no substantial formalisation of automata theory and regular languages
-carried out in a HOL-based theorem prover. Nipkow establishes in
+carried out in HOL-based theorem provers. Nipkow establishes in
 \cite{Nipkow98} the link between regular expressions and automata in
 the restricted context of lexing. The only larger formalisations of automata theory
 are carried out in Nuprl \cite{Constable00} and in Coq (for example
 \cite{Filliatre97}).
 Strings in Isabelle/HOL are lists of characters with the \emph{empty string}
 being represented by the empty list, written @{term "[]"}.  \emph{Languages}
 are sets of strings. The language containing all strings is written in
 Isabelle/HOL as @{term "UNIV::string set"}. The concatenation of two languages
 is written @{term "A ;; B"} and a language raised to the power @{text n} is written
-@{term "A \<up> n"}. Their definitions are
+@{term "A \<up> n"}. They are defined as usual
 \begin{center}
 @{thm Seq_def[THEN eq_reflection, where A1="A" and B1="B"]}
 \hspace{7mm}
 @{thm pow.simps(1)[THEN eq_reflection, where A1="A"]}
 as @{text "{y | y \<approx> x}"}.
 Central to our proof will be the solution of equational systems
 involving sets of languages. For this we will use Arden's lemma \cite{Brzozowski64}
-which solves equations of the form @{term "X = A ;; X \<union> B"} in case
+which solves equations of the form @{term "X = A ;; X \<union> B"} provided
 @{term "[] \<notin> A"}. However we will need the following ``reverse''
 version of Arden's lemma.
 \begin{lemma}[Reverse Arden's Lemma]\label{arden}\mbox{}\\
 If @{thm (prem 1) arden} then
 \end{center}
 Given a set of regular expressions @{text rs}, we will make use of the operation of generating
 a regular expression that matches all languages of @{text rs}. We only need to know the existence
 of such a regular expression and therefore we use Isabelle/HOL's @{const "fold_graph"} and Hilbert's
-@{text "\<epsilon>"} to define @{term "\<Uplus>rs"}. This function, roughly speaking, folds @{const ALT} over the
+@{text "\<epsilon>"} to define @{term "\<Uplus>rs"}. This operation, roughly speaking, folds @{const ALT} over the
 set @{text rs} with @{const NULL} for the empty set. We can prove that for finite sets @{text rs}
 \begin{center}
-@{thm (lhs) folds_alt_simp}@{text "= \<Union> (\<calL> ` rs)"}
+@{thm (lhs) folds_alt_simp} @{text "= \<Union> (\<calL> ` rs)"}
 \end{center}
 \noindent
 holds, whereby @{text "\<calL> ` rs"} stands for the
 image of the set @{text rs} under function @{text "\<calL>"}.
 *}
 section {* Finite Partitions Imply Regularity of a Language *}
 text {*
 \end{definition}
 \noindent
 It is easy to see that @{term "\<approx>A"} is an equivalence relation, which
 partitions the set of all strings, @{text "UNIV"}, into a set of disjoint
-equivalence classes. An example is the regular language containing just
+equivalence classes. Let us give an example: consider the regular language containing just
 the string @{text "[c]"}. The relation @{term "\<approx>({[c]})"} partitions @{text UNIV}
 into the three equivalence classes @{text "X\<^isub>1"}, @{text "X\<^isub>2"} and  @{text "X\<^isub>3"}
 as follows
 \begin{center}
 @{text "X\<^isub>2 = {[c]}"}\hspace{5mm}
 @{text "X\<^isub>3 = UNIV - {[], [c]}"}
 \end{center}
 One direction of the Myhill-Nerode theorem establishes
-that if there are finitely many equivalence classes, then the language is
+that if there are finitely many equivalence classes, like in the example above, then
-regular. In our setting we therefore have to show:
+the language is regular. In our setting we therefore have to show:
 \begin{theorem}\label{myhillnerodeone}
 @{thm[mode=IfThen] hard_direction}
 \end{theorem}
 In our running example, @{text "X\<^isub>2"} is the only equivalence class in @{term "finals {[c]}"}.
 It is straightforward to show that in general @{thm lang_is_union_of_finals} and
 @{thm finals_in_partitions} hold.
 Therefore if we know that there exists a regular expression for every
 equivalence class in @{term "finals A"} (which by assumption must be
-a finite set), then we can using @{text "\<bigplus>"} obtain a regular expression
+a finite set), then we can use @{text "\<bigplus>"} to obtain a regular expression
 using that matches every string in @{text A}.
 Our proof of Thm.~\ref{myhillnerodeone} relies on a method that can calculate a
 regular expression for \emph{every} equivalence class, not just the ones
 in @{term "finals A"}. We
-first define the notion of \emph{one-character-transition} between equivalence classes
+first define the notion of \emph{one-character-transition} between
+two equivalence classes
 %
 \begin{equation}
 @{thm transition_def}
 \end{equation}
 which means that if we concatenate the character @{text c} to the end of all
 strings in the equivalence class @{text Y}, we obtain a subset of
 @{text X}. Note that we do not define an automaton here, we merely relate two sets
 (with the help of a regular expression). In our concrete example we have
 @{term "X\<^isub>1 \<Turnstile>c\<Rightarrow> X\<^isub>2"}, @{term "X\<^isub>1 \<Turnstile>d\<Rightarrow> X\<^isub>3"} with @{text d} being any
-other character than @{text c}, and @{term "X\<^isub>3 \<Turnstile>d\<Rightarrow> X\<^isub>3"} for any @{text c}.
+other character than @{text c}, and @{term "X\<^isub>3 \<Turnstile>d\<Rightarrow> X\<^isub>3"} for any @{text d}.
 Next we build an equational system that
 contains an equation for each equivalence class. Suppose we have
 the equivalence classes @{text "X\<^isub>1,\<dots>,X\<^isub>n"}, there must be one and only one that
 contains the empty string @{text "[]"} (since equivalence classes are disjoint).
 \end{tabular}
 \end{center}
 \noindent
 where the pairs @{text "(Y\<^isub>i\<^isub>j, CHAR c\<^isub>i\<^isub>j)"} stand for all transitions
-@{term "Y\<^isub>i\<^isub>j \<Turnstile>c\<^isub>i\<^isub>j\<Rightarrow> X\<^isub>i"}.  There can only be finitely many such
+@{term "Y\<^isub>i\<^isub>j \<Turnstile>c\<^isub>i\<^isub>j\<Rightarrow> X\<^isub>i"}. Our internal represeantation for the right-hand
-transitions since there are only finitely many equivalence classes
+sides are sets of terms.
-and finitely many characters.
+There can only be finitely many such
+terms since there are only finitely many equivalence classes
+and only finitely many characters.
 The term @{text "\<lambda>(EMPTY)"} in the first equation acts as a marker for the equivalence
-class containing @{text "[]"}. (As an aside note that we mark, roughly speaking, the
+class containing @{text "[]"}.\footnote{Note that we mark, roughly speaking, the
 single ``initial'' state in the equational system, which is different from
-the method by Brzozowski \cite{Brzozowski64}: he marks the ``terminal''
+the method by Brzozowski \cite{Brzozowski64}, where he marks the ``terminal''
-states. We are forced to set up the equational system in this way, because
+states. We are forced to set up the equational system in our way, because
 the Myhill-Nerode relation determines the ``direction'' of the transitions.
 The successor ``state'' of an equivalence class @{text Y} can be reached by adding
 characters to the end of @{text Y}. This is also the reason why we have to use
-our reverse version of Arden's lemma.)
+our reverse version of Arden's lemma.}
-Overloading the function @{text L} for the two kinds of terms in the
+Overloading the function @{text \<calL>} for the two kinds of terms in the
 equational system, we have
 \begin{center}
 @{text "\<calL>(Y, r) \<equiv>"} %
 @{thm (rhs) L_rhs_item.simps(2)[where X="Y" and r="r", THEN eq_reflection]}\hspace{10mm}
 \end{equation}
 \noindent
 The reason for adding the @{text \<lambda>}-marker to our equational system is
 to obtain this equation: it only holds in this form since none of
-the other terms contain the empty string.
+the other terms contain the empty string. Since we use sets for representing
+the right-hans side we can write \eqref{inv1} and \eqref{inv2} more
+concisely for an equation of the form @{text "X = rhs"} as
+%
+\begin{equation}\label{inv}
+\mbox{@{text "X = \<Union> (\<calL> ` rhs)"}}
+\end{equation}
+\noindent
 Our proof of Thm.~\ref{myhillnerodeone} will proceed by transforming the
-equational system into a \emph{solved form} maintaining the invariants
+equational system into a \emph{solved form} maintaining the invariant
-\eqref{inv1} and \eqref{inv2}. From the solved form we will be able to read
+\eqref{inv}. From the solved form we will be able to read
 off the regular expressions.
 In order to transform an equational system into solved form, we have two main
 operations: one that takes an equation of the form @{text "X = rhs"} and removes
-the recursive occurences of @{text X} in @{text rhs} using our variant of Arden's
+the recursive occurences of @{text X} in the @{text rhs} using our variant of Arden's
 Lemma. The other operation takes an equation @{text "X = rhs"}
 and substitutes @{text X} throughout the rest of the equational system
 adjusting the remaining regular expressions approriately. To define this adjustment
 we define the \emph{append-operation}
 @{thm append_rexp.simps(1)[where r="r\<^isub>1" and rexp="r\<^isub>2", THEN eq_reflection]}
 \end{center}
 \noindent
 which we also lift to entire right-hand sides of equations, written as
-@{thm (lhs) append_rhs_rexp_def[where rexp="r"]}.
+@{thm (lhs) append_rhs_rexp_def[where rexp="r"]}. With this we can define
+the \emph{arden-operation}, which takes an equivalence class @{text X} and
+a rhs of an equation.
 \begin{center}
 @{thm arden_op_def}
 \end{center}
+\noindent
+The term left of $\triangleright$ deletes all terms of the form @{text "(X, r)"};
+the term on the right calculates first the combinded regular expressions for all
+@{text r} in the @{text "(X, r)"}, forms the star and appends it to the remaining
+terms in the @{text rhs}. It can be easily seen that this operation mimics Arden's
+lemma on the level of equations.
 *}
 section {* Regular Expressions Generate Finitely Many Partitions *}
 text {*

changeset 93	2aa3756dcc9f
parent 92	a9ebc410a5c8
child 94	5b12cd0a3b3c