regexp: comparison Paper/Paper.thy

equal deleted inserted replaced

-:828ea293b61f
+:7478be786f87
 Star ("_\<^bsup>\<star>\<^esup>") and
 pow ("_\<^bsup>_\<^esup>" [100, 100] 100) and
 Suc ("_+1" [100] 100) and
 quotient ("_ \<^raw:\ensuremath{\!\sslash\!}> _" [90, 90] 90) and
 REL ("\<approx>") and
-UPLUS ("_ \<^raw:\ensuremath{\uplus}> _" [90, 90] 90)
+UPLUS ("_ \<^raw:\ensuremath{\uplus}> _" [90, 90] 90) and
+L ("L '(_')" [0] 101)
 (*>*)
 section {* Introduction *}
 text {*
 recognised by a finite deterministic automaton. This approach has many
 benefits. Among them is that it is easy to convince oneself from the fact that
 regular languages are closed under complementation: one just has to exchange
 the accepting and non-accepting states in the corresponding automaton to
 obtain an automaton for the complement language.  The problem, however, lies with
-formalising such reasoning in a theorem prover, in our case
+formalising such reasoning in a HOL-based theorem prover, in our case
-Isabelle/HOL. Automata need to be represented as graphs or matrices, neither
+Isabelle/HOL. Automata consist of states and transitions. They need to be represented
+as graphs or matrices, neither
 of which can be defined as inductive datatype.\footnote{In some works
 functions are used to represent state transitions, but also they are not
 inductive datatypes.} This means we have to build our own reasoning
 infrastructure for them, as neither Isabelle/HOL nor HOL4 nor HOLlight support
 them with libraries.
 Even worse, reasoning about graphs and matrices can be a real hassle in HOL-based
 theorem provers.  Consider for example the operation of sequencing
 two automata, say $A_1$ and $A_2$, by connecting the
-accepting states of one atomaton to the
+accepting states of $A_1$ to the initial state of $A_2$:
-initial state of the other:
 \begin{center}
 \begin{tabular}{ccc}
 \begin{tikzpicture}[scale=0.8]
 \end{tabular}
 \end{center}
 \noindent
-On ``paper'' we can build the corresponding graph using the disjoint union of
+On ``paper'' we can define the corresponding graph in terms of the disjoint
-the state nodes and add
+union of the state nodes. Unfortunately in HOL, the definition for disjoint
-two more nodes for the new initial state and the new accepting
-state. Unfortunately in HOL, the definition for disjoint
 union, namely
 \begin{center}
 @{term "UPLUS A\<^isub>1 A\<^isub>2 \<equiv> {(1, x) | x. x \<in> A\<^isub>1} \<union> {(2, y) | y. y \<in> A\<^isub>2}"}
 \end{center}
 \noindent
 changes the type---the disjoint union is not a set, but a set of pairs.
 Using this definition for disjoint unions means we do not have a single type for automata
 and hence will not be able to state properties about \emph{all}
-automata, since there is no type quantification available in HOL. A working
+automata, since there is no type quantification available in HOL. An
-alternative is to give every state node an identity, for example a natural
+alternative, which provides us with a single type for automata, is to give every
+state node an identity, for example a natural
 number, and then be careful renaming these identities apart whenever
-connecting two automata. This results in very clunky proofs
+connecting two automata. This results in clunky proofs
 establishing that properties are invariant under renaming. Similarly,
-combining two automata represented as matrices results in very adhoc
+connecting two automata represented as matrices results in very adhoc
 constructions, which are not pleasant to reason about.
 Because of these problems to do with representing automata, there seems
 to be no substantial formalisation of automata theory and regular languages
 carried out in a HOL-based theorem prover. We are only aware of the
 large formalisation of the automata theory in Nuprl \cite{Constable00} and
-some smaller in Coq \cite{Filliatre97}.
+some smaller formalisations in Coq, for example \cite{Filliatre97}.
 In this paper, we will not attempt to formalise automata theory, but take a completely
 different approach to regular languages. Instead of defining a regular language as one
 where there exists an automaton that recognises all strings of the language, we define
-a regular language as
+a regular language as:
 \begin{definition}[A Regular Language]
 A language @{text A} is regular, provided there is a regular expression that matches all
 strings of @{text "A"}.
 \end{definition}
 \noindent
 The reason is that regular expressions, unlike graphs and matrices, can
 be easily defined as inductive datatype. Therefore a corresponding reasoning
-infrastructure comes in Isabelle for free. The purpose of this paper is to
+infrastructure comes for free. This has recently been used for formalising regular
+expression matching in HOL4 \cite{OwensSlind08}.  The purpose of this paper is to
 show that a central result about regular languages, the Myhill-Nerode theorem,
-can be recreated by only using regular expressions. A corollary of this
+can be recreated by only using regular expressions. This theorem give a necessary
-theorem will be the usual closure properties, including complementation,
+and sufficient condition for when a language is regular. As a corollary of this
-of regular languages.
+theorem we can easily establish the usual closure properties, including
+complementation, for regular languages.\smallskip
 \noindent
-{\bf Contributions:} A proof of the Myhill-Nerode Theorem based on regular expressions. The
+{\bf Contributions:} To our knowledge, our proof of the Myhill-Nerode theorem is the
-finiteness part of this theorem is proved using tagging-functions (which to our knowledge
+first that is based on regular expressions, only. We prove the part of this theorem
-are novel in this context).
+stating that a regular expression has only finitely many partitions using certain
+tagging-functions. Again to our best knowledge, these tagging functions have
+not been used before to establish the Myhill-Nerode theorem.
 *}
 section {* Preliminaries *}
 text {*
-Strings in Isabelle/HOL are lists of characters and the
+Strings in Isabelle/HOL are lists of characters with the \emph{empty string}
-\emph{empty string} is the empty list, written @{term "[]"}. \emph{Languages} are sets of
+being represented by the empty list, written @{term "[]"}. \emph{Languages}
-strings. The language containing all strings is written in Isabelle/HOL as @{term "UNIV::string set"}.
+are sets of strings. The language containing all strings is written in
-The notation for the quotient of a language @{text A} according to a relation @{term REL} is
+Isabelle/HOL as @{term "UNIV::string set"}.  The notation for the quotient
-@{term "A // REL"}. The concatenation of two languages is written @{term "A ;; B"}; a language
+of a language @{text A} according to a relation @{term REL} is @{term "A //
-raised  tow the power $n$ is written @{term "A \<up> n"}. Both concepts are defined as
+REL"}. The concatenation of two languages is written @{term "A ;; B"}; a
+language raised to the power $n$ is written @{term "A \<up> n"}. Both concepts
+are defined as
 \begin{center}
 @{thm Seq_def[THEN eq_reflection, where A1="A" and B1="B"]}
 \hspace{7mm}
 @{thm pow.simps(1)[THEN eq_reflection, where A1="A"]}
 \noindent
 where @{text "@"} is the usual list-append operation. The Kleene-star of a language @{text A}
 is defined as the union over all powers, namely @{thm Star_def}.
-Regular expressions are defined as the following datatype
-\begin{center}
-@{text r} @{text "::="}
-@{term NULL}\hspace{1.5mm}@{text"|"}\hspace{1.5mm}
-@{term EMPTY}\hspace{1.5mm}@{text"|"}\hspace{1.5mm}
-@{term "CHAR c"}\hspace{1.5mm}@{text"|"}\hspace{1.5mm}
-@{term "SEQ r r"}\hspace{1.5mm}@{text"|"}\hspace{1.5mm}
-@{term "ALT r r"}\hspace{1.5mm}@{text"|"}\hspace{1.5mm}
-@{term "STAR r"}
-\end{center}
 Central to our proof will be the solution of equational systems
 involving regular expressions. For this we will use the following ``reverse''
 version of Arden's lemma.
 \begin{lemma}[Reverse Arden's Lemma]\mbox{}\\
 From @{text "(*)"} it follows then that
 @{term s} must be element in @{term "(\<Union>m\<in>{0..k}. B ;; (A \<up> m))"}. This in turn
 implies that @{term s} is in @{term "(\<Union>n. B ;; (A \<up> n))"}. Using Lemma ??? this
 is equal to @{term "B ;; A\<star>"}, as we needed to show.\qed
 \end{proof}
+\noindent
+Regular expressions are defined as the following inductive datatype
+\begin{center}
+@{text r} @{text "::="}
+@{term NULL}\hspace{1.5mm}@{text"|"}\hspace{1.5mm}
+@{term EMPTY}\hspace{1.5mm}@{text"|"}\hspace{1.5mm}
+@{term "CHAR c"}\hspace{1.5mm}@{text"|"}\hspace{1.5mm}
+@{term "SEQ r r"}\hspace{1.5mm}@{text"|"}\hspace{1.5mm}
+@{term "ALT r r"}\hspace{1.5mm}@{text"|"}\hspace{1.5mm}
+@{term "STAR r"}
+\end{center}
+\noindent
+The language matched by a regular expression is defined as usual:
+\begin{center}
+\begin{tabular}{c@ {\hspace{10mm}}c}
+\begin{tabular}{rcl}
+@{thm (lhs) L_rexp.simps(1)} & @{text "\<equiv>"} & @{thm (rhs) L_rexp.simps(1)}\\
+@{thm (lhs) L_rexp.simps(2)} & @{text "\<equiv>"} & @{thm (rhs) L_rexp.simps(2)}\\
+@{thm (lhs) L_rexp.simps(3)[where c="c"]} & @{text "\<equiv>"} & @{thm (rhs) L_rexp.simps(3)[where c="c"]}\\
+\end{tabular}
+&
+\begin{tabular}{rcl}
+@{thm (lhs) L_rexp.simps(4)[where ?r1.0="r\<^isub>1" and ?r2.0="r\<^isub>2"]} & @{text "\<equiv>"} &
+@{thm (rhs) L_rexp.simps(4)[where ?r1.0="r\<^isub>1" and ?r2.0="r\<^isub>2"]}\\
+@{thm (lhs) L_rexp.simps(5)[where ?r1.0="r\<^isub>1" and ?r2.0="r\<^isub>2"]} & @{text "\<equiv>"} &
+@{thm (rhs) L_rexp.simps(5)[where ?r1.0="r\<^isub>1" and ?r2.0="r\<^isub>2"]}\\
+@{thm (lhs) L_rexp.simps(6)[where r="r"]} & @{text "\<equiv>"} &
+@{thm (rhs) L_rexp.simps(6)[where r="r"]}\\
+\end{tabular}
+\end{tabular}
+\end{center}
 *}
 section {* Finite Partitions Imply Regularity of a Language *}
 text {*

changeset 67	7478be786f87
parent 66	828ea293b61f
child 70	8ab3a06577cf