(*<*)
theory Paper
imports "../Myhill" "LaTeXsugar"
begin
declare [[show_question_marks = false]]
consts
REL :: "(string \<times> string) \<Rightarrow> bool"
UPLUS :: "'a set \<Rightarrow> 'a set \<Rightarrow> (nat \<times> 'a) set"
notation (latex output)
str_eq_rel ("\<approx>\<^bsub>_\<^esub>") and
Seq (infixr "\<cdot>" 100) and
Star ("_\<^bsup>\<star>\<^esup>") and
pow ("_\<^bsup>_\<^esup>" [100, 100] 100) and
Suc ("_+1" [100] 100) and
quotient ("_ \<^raw:\ensuremath{\!\sslash\!}> _" [90, 90] 90) and
REL ("\<approx>") and
UPLUS ("_ \<^raw:\ensuremath{\uplus}> _" [90, 90] 90)
(*>*)
section {* Introduction *}
text {*
Regular languages are an important and well-understood subject in Computer
Science, with many beautiful theorems and many useful algorithms. There is a
wide range of textbooks on this subject, many of which are aimed at students
and contain very detailed ``pencil-and-paper'' proofs
(e.g.~\cite{Kozen97}). It seems natural to exercise theorem provers by
formalising these theorems and by verifying formally the algorithms.
There is however a problem: the typical approach to regular languages is to
introduce finite automata and then define everything in terms of them. For
example, a regular language is normally defined as one whose strings are
recognised by a finite deterministic automaton. This approach has many
benefits. Among them is that it is easy to convince oneself from the fact that
regular languages are closed under complementation: one just has to exchange
the accepting and non-accepting states in the corresponding automaton to
obtain an automaton for the complement language. The problem, however, lies with
formalising such reasoning in a theorem prover, in our case
Isabelle/HOL. Automata need to be represented as graphs or matrices, neither
of which can be defined as inductive datatype.\footnote{In some works
functions are used to represent state transitions, but also they are not
inductive datatypes.} This means we have to build our own reasoning
infrastructure for them, as neither Isabelle/HOL nor HOL4 nor HOLlight support
them with libraries.
Even worse, reasoning about graphs and matrices can be a real hassle in HOL-based
theorem provers. Consider for example the operation of sequencing
two automata, say $A_1$ and $A_2$, by connecting the
accepting states of one atomaton to the
initial state of the other:
\begin{center}
\begin{tabular}{ccc}
\begin{tikzpicture}[scale=0.8]
%\draw[step=2mm] (-1,-1) grid (1,1);
\draw[rounded corners=1mm, very thick] (-1.0,-0.3) rectangle (-0.2,0.3);
\draw[rounded corners=1mm, very thick] ( 0.2,-0.3) rectangle ( 1.0,0.3);
\node (A) at (-1.0,0.0) [circle, very thick, draw, fill=white, inner sep=0.4mm] {};
\node (B) at ( 0.2,0.0) [circle, very thick, draw, fill=white, inner sep=0.4mm] {};
\node (C) at (-0.2, 0.13) [circle, very thick, draw, fill=white, inner sep=0.4mm] {};
\node (D) at (-0.2,-0.13) [circle, very thick, draw, fill=white, inner sep=0.4mm] {};
\node (E) at (1.0, 0.2) [circle, very thick, draw, fill=white, inner sep=0.4mm] {};
\node (F) at (1.0,-0.0) [circle, very thick, draw, fill=white, inner sep=0.4mm] {};
\node (G) at (1.0,-0.2) [circle, very thick, draw, fill=white, inner sep=0.4mm] {};
\draw (-0.6,0.0) node {\footnotesize$A_1$};
\draw ( 0.6,0.0) node {\footnotesize$A_2$};
\end{tikzpicture}
&
\raisebox{1.1mm}{\bf\Large$\;\;\;\Rightarrow\,\;\;$}
&
\begin{tikzpicture}[scale=0.8]
%\draw[step=2mm] (-1,-1) grid (1,1);
\draw[rounded corners=1mm, very thick] (-1.0,-0.3) rectangle (-0.2,0.3);
\draw[rounded corners=1mm, very thick] ( 0.2,-0.3) rectangle ( 1.0,0.3);
\node (A) at (-1.0,0.0) [circle, very thick, draw, fill=white, inner sep=0.4mm] {};
\node (B) at ( 0.2,0.0) [circle, very thick, draw, fill=white, inner sep=0.4mm] {};
\node (C) at (-0.2, 0.13) [circle, very thick, draw, fill=white, inner sep=0.4mm] {};
\node (D) at (-0.2,-0.13) [circle, very thick, draw, fill=white, inner sep=0.4mm] {};
\node (E) at (1.0, 0.2) [circle, very thick, draw, fill=white, inner sep=0.4mm] {};
\node (F) at (1.0,-0.0) [circle, very thick, draw, fill=white, inner sep=0.4mm] {};
\node (G) at (1.0,-0.2) [circle, very thick, draw, fill=white, inner sep=0.4mm] {};
\draw (C) to [very thick, bend left=45] (B);
\draw (D) to [very thick, bend right=45] (B);
\draw (-0.6,0.0) node {\footnotesize$A_1$};
\draw ( 0.6,0.0) node {\footnotesize$A_2$};
\end{tikzpicture}
\end{tabular}
\end{center}
\noindent
On ``paper'' we can build the corresponding graph using the disjoint union of
the state nodes and add
two more nodes for the new initial state and the new accepting
state. Unfortunately in HOL, the definition for disjoint
union, namely
\begin{center}
@{term "UPLUS A\<^isub>1 A\<^isub>2 \<equiv> {(1, x) | x. x \<in> A\<^isub>1} \<union> {(2, y) | y. y \<in> A\<^isub>2}"}
\end{center}
\noindent
changes the type---the disjoint union is not a set, but a set of pairs.
Using this definition for disjoint unions means we do not have a single type for automata
and hence will not be able to state properties about \emph{all}
automata, since there is no type quantification available in HOL. A working
alternative is to give every state node an identity, for example a natural
number, and then be careful renaming these identities apart whenever
connecting two automata. This results in very clunky proofs
establishing that properties are invariant under renaming. Similarly,
combining two automata represented as matrices results in very adhoc
constructions, which are not pleasant to reason about.
Because of these problems to do with representing automata, there seems
to be no substantial formalisation of automata theory and regular languages
carried out in a HOL-based theorem prover. We are only aware of the
large formalisation of the automata theory in Nuprl \cite{Constable00} and
some smaller in Coq \cite{Filliatre97}.
In this paper, we will not attempt to formalise automata theory, but take a completely
different approach to regular languages. Instead of defining a regular language as one
where there exists an automaton that recognises all strings of the language, we define
a regular language as
\begin{definition}[A Regular Language]
A language @{text A} is regular, provided there is a regular expression that matches all
strings of @{text "A"}.
\end{definition}
\noindent
The reason is that regular expressions, unlike graphs and matrices, can
be easily defined as inductive datatype. Therefore a corresponding reasoning
infrastructure comes in Isabelle for free. The purpose of this paper is to
show that a central result about regular languages, the Myhill-Nerode theorem,
can be recreated by only using regular expressions. A corollary of this
theorem will be the usual closure properties, including complementation,
of regular languages.
\noindent
{\bf Contributions:} A proof of the Myhill-Nerode Theorem based on regular expressions. The
finiteness part of this theorem is proved using tagging-functions (which to our knowledge
are novel in this context).
*}
section {* Preliminaries *}
text {*
Strings in Isabelle/HOL are lists of characters and the
\emph{empty string} is the empty list, written @{term "[]"}. \emph{Languages} are sets of
strings. The language containing all strings is written in Isabelle/HOL as @{term "UNIV::string set"}.
The notation for the quotient of a language @{text A} according to a relation @{term REL} is
@{term "A // REL"}. The concatenation of two languages is written @{term "A ;; B"}; a language
raised tow the power $n$ is written @{term "A \<up> n"}. Both concepts are defined as
\begin{center}
@{thm Seq_def[THEN eq_reflection, where A1="A" and B1="B"]}
\hspace{7mm}
@{thm pow.simps(1)[THEN eq_reflection, where A1="A"]}
\hspace{7mm}
@{thm pow.simps(2)[THEN eq_reflection, where A1="A" and n1="n"]}
\end{center}
\noindent
where @{text "@"} is the usual list-append operation. The Kleene-star of a language @{text A}
is defined as the union over all powers, namely @{thm Star_def}.
Regular expressions are defined as the following datatype
\begin{center}
@{text r} @{text "::="}
@{term NULL}\hspace{1.5mm}@{text"|"}\hspace{1.5mm}
@{term EMPTY}\hspace{1.5mm}@{text"|"}\hspace{1.5mm}
@{term "CHAR c"}\hspace{1.5mm}@{text"|"}\hspace{1.5mm}
@{term "SEQ r r"}\hspace{1.5mm}@{text"|"}\hspace{1.5mm}
@{term "ALT r r"}\hspace{1.5mm}@{text"|"}\hspace{1.5mm}
@{term "STAR r"}
\end{center}
Central to our proof will be the solution of equational systems
involving regular expressions. For this we will use the following ``reverse''
version of Arden's lemma.
\begin{lemma}[Reverse Arden's Lemma]\mbox{}\\
If @{thm (prem 1) ardens_revised} then
@{thm (lhs) ardens_revised} has the unique solution
@{thm (rhs) ardens_revised}.
\end{lemma}
\begin{proof}
For the right-to-left direction we assume @{thm (rhs) ardens_revised} and show
that @{thm (lhs) ardens_revised} holds. From Lemma ??? we have @{term "A\<star> = {[]} \<union> A ;; A\<star>"},
which is equal to @{term "A\<star> = {[]} \<union> A\<star> ;; A"}. Adding @{text B} to both
sides gives @{term "B ;; A\<star> = B ;; ({[]} \<union> A\<star> ;; A)"}, whose right-hand side
is equal to @{term "(B ;; A\<star>) ;; A \<union> B"}. This completes this direction.
For the other direction we assume @{thm (lhs) ardens_revised}. By a simple induction
on @{text n}, we can establish the property
\begin{center}
@{text "(*)"}\hspace{5mm} @{thm (concl) ardens_helper}
\end{center}
\noindent
Using this property we can show that @{term "B ;; (A \<up> n) \<subseteq> X"} holds for
all @{text n}. From this we can infer @{term "B ;; A\<star> \<subseteq> X"} using Lemma ???.
For the inclusion in the other direction we assume a string @{text s}
with length @{text k} is element in @{text X}. Since @{thm (prem 1) ardens_revised}
we know that @{term "s \<notin> X ;; (A \<up> Suc k)"} since its length is only @{text k}
(the strings in @{term "X ;; (A \<up> Suc k)"} are all longer).
From @{text "(*)"} it follows then that
@{term s} must be element in @{term "(\<Union>m\<in>{0..k}. B ;; (A \<up> m))"}. This in turn
implies that @{term s} is in @{term "(\<Union>n. B ;; (A \<up> n))"}. Using Lemma ??? this
is equal to @{term "B ;; A\<star>"}, as we needed to show.\qed
\end{proof}
*}
section {* Finite Partitions Imply Regularity of a Language *}
text {*
\begin{theorem}
Given a language @{text A}.
@{thm[mode=IfThen] hard_direction[where Lang="A"]}
\end{theorem}
*}
section {* Regular Expressions Generate Finitely Many Partitions *}
text {*
\begin{theorem}
Given @{text "r"} is a regular expressions, then @{thm rexp_imp_finite}.
\end{theorem}
\begin{proof}
By induction on the structure of @{text r}. The cases for @{const NULL}, @{const EMPTY}
and @{const CHAR} are straightforward, because we can easily establish
\begin{center}
\begin{tabular}{l}
@{thm quot_null_eq}\\
@{thm quot_empty_subset}\\
@{thm quot_char_subset}
\end{tabular}
\end{center}
\end{proof}
*}
section {* Conclusion and Related Work *}
(*<*)
end
(*>*)