regexp: comparison Paper/Paper.thy

-:c5eb5f3065ae
+:c5f138b5fc88
 text {*
 Regular languages are an important and well-understood subject in Computer
 Science, with many beautiful theorems and many useful algorithms. There is a
 wide range of textbooks on this subject, many of which are aimed at students
-and contain very detailed ``pencil-and-paper'' proofs
+and contain very detailed `pencil-and-paper' proofs
 (e.g.~\cite{Kozen97}). It seems natural to exercise theorem provers by
 formalising the theorems and by formally verifying the algorithms.
 There is however a problem: the typical approach to regular languages is to
 introduce finite automata and then define everything in terms of them.  For
 benefits. Among them is the fact that it is easy to convince oneself that
 regular languages are closed under complementation: one just has to exchange
 the accepting and non-accepting states in the corresponding automaton to
 obtain an automaton for the complement language.  The problem, however, lies with
 formalising such reasoning in a HOL-based theorem prover, in our case
-Isabelle/HOL. Automata are build up from states and transitions that
+Isabelle/HOL. Automata are built up from states and transitions that
 need to be represented as graphs, matrices or functions, none
 of which can be defined as an inductive datatype.
 In the case of graphs and matrices, this means we have to build our own
 reasoning infrastructure for them, as neither Isabelle/HOL nor HOL4 nor
 \end{tabular}
 \end{center}
 \noindent
-On ``paper'' we can define the corresponding graph in terms of the disjoint
+On `paper' we can define the corresponding graph in terms of the disjoint
 union of the state nodes. Unfortunately in HOL, the standard definition for disjoint
 union, namely
 %
 \begin{equation}\label{disjointunion}
 @{term "UPLUS A\<^isub>1 A\<^isub>2 \<equiv> {(1, x) | x. x \<in> A\<^isub>1} \<union> {(2, y) | y. y \<in> A\<^isub>2}"}
 or \emph{type classes},
 which are \emph{not} available in all HOL-based theorem provers.
 Because of these problems with representing automata, there seems
 to be no substantial formalisation of automata theory and regular languages
-carried out in HOL-based theorem provers. Nipkow establishes in
+carried out in HOL-based theorem provers. Nipkow \cite{Nipkow98} establishes
-\cite{Nipkow98} the link between regular expressions and automata in
+the link between regular expressions and automata in
-the context of lexing. Berghofer and Reiter formalise automata working over
+the context of lexing. Berghofer and Reiter \cite{BerghoferReiter09}
-bit strings in the context of Presburger arithmetic \cite{BerghoferReiter09}.
+formalise automata working over
+bit strings in the context of Presburger arithmetic.
 The only larger formalisations of automata theory
-are carried out in Nuprl \cite{Constable00} and in Coq (for example
+are carried out in Nuprl \cite{Constable00} and in Coq \cite{Filliatre97}.
-\cite{Filliatre97}).
 In this paper, we will not attempt to formalise automata theory in
 Isabelle/HOL, but take a completely different approach to regular
 languages. Instead of defining a regular language as one where there exists
 an automaton that recognises all strings of the language, we define a
 @{text "\<lbrakk>x\<rbrakk>\<^isub>\<approx>"} for the equivalence class defined
 as @{text "{y | y \<approx> x}"}.
 Central to our proof will be the solution of equational systems
-involving equivalence classes of languages. For this we will use Arden's lemma \cite{Brzozowski64}
+involving equivalence classes of languages. For this we will use Arden's lemma \cite{Brzozowski64},
 which solves equations of the form @{term "X = A ;; X \<union> B"} provided
-@{term "[] \<notin> A"}. However we will need the following ``reverse''
+@{term "[] \<notin> A"}. However, we will need the following `reverse'
 version of Arden's lemma.
 \begin{lemma}[Reverse Arden's Lemma]\label{arden}\mbox{}\\
 If @{thm (prem 1) arden} then
-@{thm (lhs) arden} has the unique solution
+@{thm (lhs) arden} if and only if
 @{thm (rhs) arden}.
 \end{lemma}
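 \noindent
 To illustrate the statement with an example of our own (not part of the
 formalisation, and written in plain \LaTeX{} notation), we assume that the
 lemma relates the equation $X = X \cdot A \cup B$ to the solution
 $X = B \cdot A^\star$, that is, the `reverse' of the form
 mentioned above. Let $A = \{a\}$ and $B = \{b\}$ for two characters $a$ and $b$.
 Then $[] \notin A$ and
 \[
 B \cdot A^\star = \{b, ba, baa, \ldots\}, \qquad
 (B \cdot A^\star) \cdot A \cup B = \{ba, baa, \ldots\} \cup \{b\} = B \cdot A^\star ,
 \]
 so $X = B \cdot A^\star$ satisfies the equation. The side condition matters:
 for $A = \{[]\}$ the equation degenerates to $X = X \cup B$, which holds for
 every $X \supseteq B$ and therefore does not determine $X$.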
 \begin{proof}
 For the right-to-left direction we assume @{thm (rhs) arden} and show
 text {*
 The key definition in the Myhill-Nerode theorem is the
 \emph{Myhill-Nerode relation}, which states that w.r.t.~a language two
 strings are related, provided there is no distinguishing extension in this
-language. This can be defined as:
+language. This can be defined as a ternary relation:
 \begin{definition}[Myhill-Nerode Relation]\mbox{}\\
 @{thm str_eq_def[simplified str_eq_rel_def Pair_Collect]}
 \end{definition}
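 \noindent
 As a small illustration (our own example, not part of the formalisation, and
 written in plain \LaTeX{} notation rather than Isabelle output), take the
 alphabet $\{a\}$ and let $A$ be the set of strings of even length. Writing
 $xz$ for the concatenation of $x$ and $z$, we have for every extension $z$
 \[
 a\,z \in A \;\Longleftrightarrow\; |z| \mbox{ is odd} \;\Longleftrightarrow\; aaa\,z \in A ,
 \]
 so $a$ and $aaa$ are related. In contrast, the empty string and $a$ are
 distinguished by the extension $z = []$, since $[] \in A$ but $a \notin A$.
 For this choice of $A$ the relation has exactly two equivalence classes,
 namely the strings of even length and the strings of odd length.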
 X\<^isub>i"}.   There can only be
 finitely many such terms in a right-hand side since by assumption there are only finitely many
 equivalence classes and only finitely many characters.  The term @{text
 "\<lambda>(EMPTY)"} in the first equation acts as a marker for the equivalence class
 containing @{text "[]"}.\footnote{Note that we mark, roughly speaking, the
-single ``initial'' state in the equational system, which is different from
+single `initial' state in the equational system, which is different from
 the method by Brzozowski \cite{Brzozowski64}, where he marks the
-``terminal'' states. We are forced to set up the equational system in our
+`terminal' states. We are forced to set up the equational system in our
-way, because the Myhill-Nerode relation determines the ``direction'' of the
+way, because the Myhill-Nerode relation determines the `direction' of the
-transitions. The successor ``state'' of an equivalence class @{text Y} can
+transitions. The successor `state' of an equivalence class @{text Y} can
 be reached by adding characters to the end of @{text Y}. This is also the
 reason why we have to use our reverse version of Arden's lemma.}
 Overloading the function @{text \<calL>} for the two kinds of terms in the
 equational system, we have
 This allows us to prove
 \begin{lemma}\label{ardenable}
 Given an equation @{text "X = rhs"}.
 If @{text "X = \<Union>\<calL> ` rhs"},
-@{thm (prem 2) Arden_keeps_eq} and
+@{thm (prem 2) Arden_keeps_eq}, and
 @{thm (prem 3) Arden_keeps_eq}, then
 @{text "X = \<Union>\<calL> ` (Arden X rhs)"}.
 \end{lemma}
 \noindent
 @{thm Solve_def}
 \end{center}
 \noindent
 We are not concerned here with the definition of this operator
-(see \cite{BerghoferNipkow00}), but note that we eliminate
+(see Berghofer and Nipkow \cite{BerghoferNipkow00}), but note that we eliminate
 in each @{const Iter}-step a single equation, and therefore
 have a well-founded termination order by taking the cardinality
 of the equational system @{text ES}. This enables us to prove
-properties about our definition of @{const Solve} when we ``call'' it with
+properties about our definition of @{const Solve} when we `call' it with
 the equivalence class @{text X} and the initial equational system
 @{term "Init (UNIV // \<approx>A)"} from
 \eqref{initcs} using the principle:
 %
 \begin{equation}\label{whileprinciple}
 \end{tabular}
 \end{center}
 \noindent
 The first two ensure that the equational system is always finite (number of equations
-and number of terms in each equation); the second makes sure the ``meaning'' of the
+and number of terms in each equation); the second makes sure the `meaning' of the
 equations is preserved under our transformations. The other properties are a bit more
 technical, but are needed to get our proof through. Distinctness states that every
 equation in the system is distinct. Ardenable ensures that we can always
 apply the Arden operation.
 The last property states that every @{text rhs} can only contain equivalence classes
 section {* Conclusion and Related Work *}
 text {*
 In this paper we took the view that a regular language is one where there
-exists a regular expression that matches all its strings. Regular
+exists a regular expression that matches all of its strings. Regular
-expressions can be conveniently defined as a datatype in a HOL-based theorem
+expressions can conveniently be defined as a datatype in a HOL-based theorem
 prover. For us it was therefore interesting to find out how far we can push
 this point of view.
 Having formalised the Myhill-Nerode theorem means we
 pushed quite far. Using this theorem we can obviously prove when a language
 classes with the states of the automaton, then the most natural choice is to
 characterise each state with the set of strings starting from the initial
 state leading up to that state. Usually, however, the states are characterised by the
 strings starting from that state and leading to the terminal states.  The first
 choice has consequences for how the initial equational system is set up. We have
-the $\lambda$-term on our ``initial state'', while Brzozowski has it on the
+the $\lambda$-term on our `initial state', while Brzozowski has it on the
 terminal states. This means we also need to reverse the direction of Arden's
 lemma.
 We briefly considered using the method Brzozowski presented in the Appendix
 of~\cite{Brzozowski64} in order to prove the second direction of the
 While regular expressions are convenient in formalisations, they have some
 limitations. One is that there seems to be no notion of a minimal regular
 expression, like there is for automata. For an implementation of a simple
 regular expression matcher, whose correctness has been formally
-established, we refer the reader to \cite{OwensSlind08}.
+established, we refer the reader to Owens and Slind \cite{OwensSlind08}.
 *}
 (*<*)

changeset 115	c5f138b5fc88
parent 114	c5eb5f3065ae
child 116	342983676c8f