(*<*)
theory Paper
imports "../Myhill" "LaTeXsugar"
begin
declare [[show_question_marks = false]]
consts
REL :: "(string \<times> string) \<Rightarrow> bool"
UPLUS :: "'a set \<Rightarrow> 'a set \<Rightarrow> (nat \<times> 'a) set"
abbreviation
"EClass x R \<equiv> R `` {x}"
notation (latex output)
str_eq_rel ("\<approx>\<^bsub>_\<^esub>") and
str_eq ("_ \<approx>\<^bsub>_\<^esub> _") and
Seq (infixr "\<cdot>" 100) and
Star ("_\<^bsup>\<star>\<^esup>") and
pow ("_\<^bsup>_\<^esup>" [100, 100] 100) and
Suc ("_+1" [100] 100) and
quotient ("_ \<^raw:\ensuremath{\!\sslash\!}> _" [90, 90] 90) and
REL ("\<approx>") and
UPLUS ("_ \<^raw:\ensuremath{\uplus}> _" [90, 90] 90) and
L ("\<^raw:\ensuremath{\cal{L}}>'(_')" [0] 101) and
Lam ("\<lambda>'(_')" [100] 100) and
Trn ("_, _" [100, 100] 100) and
EClass ("\<lbrakk>_\<rbrakk>\<^bsub>_\<^esub>" [100, 100] 100) and
transition ("_ \<^raw:\ensuremath{\stackrel{\text{>_\<^raw:}}{\Longmapsto}}> _" [100, 100, 100] 100)
(*>*)
section {* Introduction *}
text {*
Regular languages are an important and well-understood subject in Computer
Science, with many beautiful theorems and many useful algorithms. There is a
wide range of textbooks on this subject, many of which are aimed at students
and contain very detailed ``pencil-and-paper'' proofs
(e.g.~\cite{Kozen97}). It seems natural to exercise theorem provers by
formalising these theorems and by verifying formally the algorithms.
There is however a problem: the typical approach to regular languages is to
introduce finite automata and then define everything in terms of them. For
example, a regular language is normally defined as one whose strings are
recognised by a finite deterministic automaton. This approach has many
benefits. Among them is the fact that it is easy to convince oneself that
regular languages are closed under complementation: one just has to exchange
the accepting and non-accepting states in the corresponding automaton to
obtain an automaton for the complement language. The problem, however, lies with
formalising such reasoning in a HOL-based theorem prover, in our case
Isabelle/HOL. Automata are build up from states and transitions that
need to be represented as graphs, matrices or functions, none
of which can be defined as inductive datatype.
In case of graphs and matrices, this means we have to build our own
reasoning infrastructure for them, as neither Isabelle/HOL nor HOL4 nor
HOLlight support them with libraries. Even worse, reasoning about graphs and
matrices can be a real hassle in HOL-based theorem provers. Consider for
example the operation of sequencing two automata, say $A_1$ and $A_2$, by
connecting the accepting states of $A_1$ to the initial state of $A_2$:
\begin{center}
\begin{tabular}{ccc}
\begin{tikzpicture}[scale=0.8]
%\draw[step=2mm] (-1,-1) grid (1,1);
\draw[rounded corners=1mm, very thick] (-1.0,-0.3) rectangle (-0.2,0.3);
\draw[rounded corners=1mm, very thick] ( 0.2,-0.3) rectangle ( 1.0,0.3);
\node (A) at (-1.0,0.0) [circle, very thick, draw, fill=white, inner sep=0.4mm] {};
\node (B) at ( 0.2,0.0) [circle, very thick, draw, fill=white, inner sep=0.4mm] {};
\node (C) at (-0.2, 0.13) [circle, very thick, draw, fill=white, inner sep=0.4mm] {};
\node (D) at (-0.2,-0.13) [circle, very thick, draw, fill=white, inner sep=0.4mm] {};
\node (E) at (1.0, 0.2) [circle, very thick, draw, fill=white, inner sep=0.4mm] {};
\node (F) at (1.0,-0.0) [circle, very thick, draw, fill=white, inner sep=0.4mm] {};
\node (G) at (1.0,-0.2) [circle, very thick, draw, fill=white, inner sep=0.4mm] {};
\draw (-0.6,0.0) node {\footnotesize$A_1$};
\draw ( 0.6,0.0) node {\footnotesize$A_2$};
\end{tikzpicture}
&
\raisebox{1.1mm}{\bf\Large$\;\;\;\Rightarrow\,\;\;$}
&
\begin{tikzpicture}[scale=0.8]
%\draw[step=2mm] (-1,-1) grid (1,1);
\draw[rounded corners=1mm, very thick] (-1.0,-0.3) rectangle (-0.2,0.3);
\draw[rounded corners=1mm, very thick] ( 0.2,-0.3) rectangle ( 1.0,0.3);
\node (A) at (-1.0,0.0) [circle, very thick, draw, fill=white, inner sep=0.4mm] {};
\node (B) at ( 0.2,0.0) [circle, very thick, draw, fill=white, inner sep=0.4mm] {};
\node (C) at (-0.2, 0.13) [circle, very thick, draw, fill=white, inner sep=0.4mm] {};
\node (D) at (-0.2,-0.13) [circle, very thick, draw, fill=white, inner sep=0.4mm] {};
\node (E) at (1.0, 0.2) [circle, very thick, draw, fill=white, inner sep=0.4mm] {};
\node (F) at (1.0,-0.0) [circle, very thick, draw, fill=white, inner sep=0.4mm] {};
\node (G) at (1.0,-0.2) [circle, very thick, draw, fill=white, inner sep=0.4mm] {};
\draw (C) to [very thick, bend left=45] (B);
\draw (D) to [very thick, bend right=45] (B);
\draw (-0.6,0.0) node {\footnotesize$A_1$};
\draw ( 0.6,0.0) node {\footnotesize$A_2$};
\end{tikzpicture}
\end{tabular}
\end{center}
\noindent
On ``paper'' we can define the corresponding graph in terms of the disjoint
union of the state nodes. Unfortunately in HOL, the definition for disjoint
union, namely
%
\begin{equation}\label{disjointunion}
@{term "UPLUS A\<^isub>1 A\<^isub>2 \<equiv> {(1, x) | x. x \<in> A\<^isub>1} \<union> {(2, y) | y. y \<in> A\<^isub>2}"}
\end{equation}
\noindent
changes the type---the disjoint union is not a set, but a set of pairs.
Using this definition for disjoint unions means we do not have a single type for automata
and hence will not be able to state properties about \emph{all}
automata, since there is no type quantification available in HOL. An
alternative, which provides us with a single type for automata, is to give every
state node an identity, for example a natural
number, and then be careful to rename these identities apart whenever
connecting two automata. This results in clunky proofs
establishing that properties are invariant under renaming. Similarly,
connecting two automata represented as matrices results in very adhoc
constructions, which are not pleasant to reason about.
Functions are much better supported in Isabelle/HOL, but they still lead to similar
problems as with graphs. Composing two non-deterministic automata in parallel
poses still the problem of how to implement disjoint unions. Nipkow \cite{Nipkow98}
dismisses the option using identities, because it leads to messy proofs. He
opts for a variant of \eqref{disjointunion}, but writes
\begin{quote}
\it ``If the reader finds the above treatment in terms of bit lists revoltingly
concrete, I cannot disagree.''
\end{quote}
\noindent
Moreover, it is not so clear how to conveniently impose a finiteness condition
upon functions in order to represent \emph{finite} automata. The best is
probably to resort to more advanced reasoning frameworks, such as \emph{locales}.
Because of these problems to do with representing automata, there seems
to be no substantial formalisation of automata theory and regular languages
carried out in a HOL-based theorem prover. Nipkow establishes in
\cite{Nipkow98} the link between regular expressions and automata in
the context of lexing. The only larger formalisations of automata theory
are carried out in Nuprl \cite{Constable00} and in Coq (for example
\cite{Filliatre97}).
In this paper, we will not attempt to formalise automata theory in
Isabelle/HOL, but take a completely different approach to regular
languages. Instead of defining a regular language as one where there exists
an automaton that recognises all strings of the language, we define a
regular language as:
\begin{definition}
A language @{text A} is \emph{regular}, provided there is a regular expression that matches all
strings of @{text "A"}.
\end{definition}
\noindent
The reason is that regular expressions, unlike graphs and matrices, can
be easily defined as inductive datatype. Consequently a corresponding reasoning
infrastructure comes for free. This has recently been exploited in HOL4 with a formalisation
of regular expression matching based on derivatives \cite{OwensSlind08}. The purpose of this paper is to
show that a central result about regular languages---the Myhill-Nerode theorem---can
be recreated by only using regular expressions. This theorem gives necessary
and sufficient conditions for when a language is regular. As a corollary of this
theorem we can easily establish the usual closure properties, including
complementation, for regular languages.\smallskip
\noindent
{\bf Contributions:} To our knowledge, our proof of the Myhill-Nerode theorem is the
first that is based on regular expressions, only. We prove the part of this theorem
stating that a regular expression has only finitely many partitions using certain
tagging-functions. Again to our best knowledge, these tagging functions have
not been used before to establish the Myhill-Nerode theorem.
*}
section {* Preliminaries *}
text {*
Strings in Isabelle/HOL are lists of characters with the \emph{empty string}
being represented by the empty list, written @{term "[]"}. \emph{Languages}
are sets of strings. The language containing all strings is written in
Isabelle/HOL as @{term "UNIV::string set"}. The concatenation of two languages
is written @{term "A ;; B"} and a language raised to the power $n$ is written
@{term "A \<up> n"}. Their definitions are
\begin{center}
@{thm Seq_def[THEN eq_reflection, where A1="A" and B1="B"]}
\hspace{7mm}
@{thm pow.simps(1)[THEN eq_reflection, where A1="A"]}
\hspace{7mm}
@{thm pow.simps(2)[THEN eq_reflection, where A1="A" and n1="n"]}
\end{center}
\noindent
where @{text "@"} is the usual list-append operation. The Kleene-star of a language @{text A}
is defined as the union over all powers, namely @{thm Star_def}. In the paper
we will often make use of the following properties.
\begin{proposition}\label{langprops}\mbox{}\\
\begin{tabular}{@ {}ll@ {\hspace{10mm}}ll}
(i) & @{thm star_cases} & (ii) & @{thm[mode=IfThen] pow_length}\\
(iii) & @{thm seq_Union_left} &
\end{tabular}
\end{proposition}
\noindent
We omit the proofs of these properties, but invite the reader to consult
our formalisation.\footnote{Available at ???}
The notation for the quotient of a language @{text A} according to an
equivalence relation @{term REL} is @{term "A // REL"}. We will write
@{text "\<lbrakk>x\<rbrakk>\<^isub>\<approx>"} for the equivalence class defined
as @{text "{y | y \<approx> x}"}.
Central to our proof will be the solution of equational systems
involving sets of languages. For this we will use Arden's lemma \cite{Brzozowski64}
which solves equations of the form @{term "X = A ;; X \<union> B"} provided
@{term "[] \<notin> A"}. However we will need the following ``reverse''
version of Arden's lemma.
\begin{lemma}[Reverse Arden's Lemma]\label{arden}\mbox{}\\
If @{thm (prem 1) ardens_revised} then
@{thm (lhs) ardens_revised} has the unique solution
@{thm (rhs) ardens_revised}.
\end{lemma}
\begin{proof}
For the right-to-left direction we assume @{thm (rhs) ardens_revised} and show
that @{thm (lhs) ardens_revised} holds. From Prop.~\ref{langprops}@{text "(i)"}
we have @{term "A\<star> = {[]} \<union> A ;; A\<star>"},
which is equal to @{term "A\<star> = {[]} \<union> A\<star> ;; A"}. Adding @{text B} to both
sides gives @{term "B ;; A\<star> = B ;; ({[]} \<union> A\<star> ;; A)"}, whose right-hand side
is equal to @{term "(B ;; A\<star>) ;; A \<union> B"}. This completes this direction.
For the other direction we assume @{thm (lhs) ardens_revised}. By a simple induction
on @{text n}, we can establish the property
\begin{center}
@{text "(*)"}\hspace{5mm} @{thm (concl) ardens_helper}
\end{center}
\noindent
Using this property we can show that @{term "B ;; (A \<up> n) \<subseteq> X"} holds for
all @{text n}. From this we can infer @{term "B ;; A\<star> \<subseteq> X"} using the definition
of @{text "\<star>"}.
For the inclusion in the other direction we assume a string @{text s}
with length @{text k} is element in @{text X}. Since @{thm (prem 1) ardens_revised}
we know by Prop.~\ref{langprops}@{text "(ii)"} that
@{term "s \<notin> X ;; (A \<up> Suc k)"} since its length is only @{text k}
(the strings in @{term "X ;; (A \<up> Suc k)"} are all longer).
From @{text "(*)"} it follows then that
@{term s} must be element in @{term "(\<Union>m\<in>{0..k}. B ;; (A \<up> m))"}. This in turn
implies that @{term s} is in @{term "(\<Union>n. B ;; (A \<up> n))"}. Using Prop.~\ref{langprops}@{text "(iii)"}
this is equal to @{term "B ;; A\<star>"}, as we needed to show.\qed
\end{proof}
\noindent
Regular expressions are defined as the following inductive datatype
\begin{center}
@{text r} @{text "::="}
@{term NULL}\hspace{1.5mm}@{text"|"}\hspace{1.5mm}
@{term EMPTY}\hspace{1.5mm}@{text"|"}\hspace{1.5mm}
@{term "CHAR c"}\hspace{1.5mm}@{text"|"}\hspace{1.5mm}
@{term "SEQ r r"}\hspace{1.5mm}@{text"|"}\hspace{1.5mm}
@{term "ALT r r"}\hspace{1.5mm}@{text"|"}\hspace{1.5mm}
@{term "STAR r"}
\end{center}
\noindent
and the language matched by a regular expression is defined as:
\begin{center}
\begin{tabular}{c@ {\hspace{10mm}}c}
\begin{tabular}{rcl}
@{thm (lhs) L_rexp.simps(1)} & @{text "\<equiv>"} & @{thm (rhs) L_rexp.simps(1)}\\
@{thm (lhs) L_rexp.simps(2)} & @{text "\<equiv>"} & @{thm (rhs) L_rexp.simps(2)}\\
@{thm (lhs) L_rexp.simps(3)[where c="c"]} & @{text "\<equiv>"} & @{thm (rhs) L_rexp.simps(3)[where c="c"]}\\
\end{tabular}
&
\begin{tabular}{rcl}
@{thm (lhs) L_rexp.simps(4)[where ?r1.0="r\<^isub>1" and ?r2.0="r\<^isub>2"]} & @{text "\<equiv>"} &
@{thm (rhs) L_rexp.simps(4)[where ?r1.0="r\<^isub>1" and ?r2.0="r\<^isub>2"]}\\
@{thm (lhs) L_rexp.simps(5)[where ?r1.0="r\<^isub>1" and ?r2.0="r\<^isub>2"]} & @{text "\<equiv>"} &
@{thm (rhs) L_rexp.simps(5)[where ?r1.0="r\<^isub>1" and ?r2.0="r\<^isub>2"]}\\
@{thm (lhs) L_rexp.simps(6)[where r="r"]} & @{text "\<equiv>"} &
@{thm (rhs) L_rexp.simps(6)[where r="r"]}\\
\end{tabular}
\end{tabular}
\end{center}
*}
section {* Finite Partitions Imply Regularity of a Language *}
text {*
The key definition in the Myhill-Nerode theorem is the
\emph{Myhill-Nerode relation}, which states that w.r.t.~a language two
strings are related, provided there is no distinguishing extension in this
language. This can be defined as:
\begin{definition}[Myhill-Nerode Relation]\mbox{}\\
@{thm str_eq_def[simplified str_eq_rel_def Pair_Collect]}
\end{definition}
\noindent
It is easy to see that @{term "\<approx>A"} is an equivalence relation, which
partitions the set of all strings, @{text "UNIV"}, into a set of disjoint
equivalence classes. One direction of the Myhill-Nerode theorem establishes
that if there are finitely many equivalence classes, then the language is
regular. In our setting we therefore have to show:
\begin{theorem}\label{myhillnerodeone}
@{thm[mode=IfThen] hard_direction}
\end{theorem}
\noindent
To prove this theorem, we define the set @{term "finals A"} as those equivalence
classes that contain strings of @{text A}, namely
%
\begin{equation}
@{thm finals_def}
\end{equation}
\noindent
It is straightforward to show that @{thm lang_is_union_of_finals} and
@{thm finals_in_partitions} hold.
Therefore if we know that there exists a regular expression for every
equivalence class in @{term "finals A"} (which by assumption must be
a finite set), then we can combine these regular expressions with @{const ALT}
and obtain a regular expression that matches every string in @{text A}.
We prove Thm.~\ref{myhillnerodeone} by giving a method that can calculate a
regular expression for \emph{every} equivalence class, not just the ones
in @{term "finals A"}. We
first define a notion of \emph{transition} between equivalence classes
%
\begin{equation}
@{thm transition_def}
\end{equation}
\noindent
which means that if we concatenate all strings matching the regular expression @{text r}
to the end of all strings in the equivalence class @{text Y}, we obtain a subset of
@{text X}. Note that we do not define an automaton here, we merely relate two sets
(w.r.t.~a regular expression).
Next we build an equational system that
contains an equation for each equivalence class. Suppose we have
the equivalence classes @{text "X\<^isub>1,\<dots>,X\<^isub>n"}, there must be one and only one that
contains the empty string @{text "[]"} (since equivalence classes are disjoint).
Let us assume @{text "[] \<in> X\<^isub>1"}. We build the following equational system
\begin{center}
\begin{tabular}{rcl}
@{text "X\<^isub>1"} & @{text "="} & @{text "(Y\<^isub>1\<^isub>1, CHAR c\<^isub>1\<^isub>1) + \<dots> + (Y\<^isub>1\<^isub>p, CHAR c\<^isub>1\<^isub>p) + \<lambda>(EMPTY)"} \\
@{text "X\<^isub>2"} & @{text "="} & @{text "(Y\<^isub>2\<^isub>1, CHAR c\<^isub>2\<^isub>1) + \<dots> + (Y\<^isub>2\<^isub>o, CHAR c\<^isub>2\<^isub>o)"} \\
& $\vdots$ \\
@{text "X\<^isub>n"} & @{text "="} & @{text "(Y\<^isub>n\<^isub>1, CHAR c\<^isub>n\<^isub>1) + \<dots> + (Y\<^isub>n\<^isub>q, CHAR c\<^isub>n\<^isub>q)"}\\
\end{tabular}
\end{center}
\noindent
where the pairs @{text "(Y\<^isub>i\<^isub>j, CHAR c\<^isub>i\<^isub>j)"} stand for all transitions
@{term "Y\<^isub>i\<^isub>j \<Turnstile>(CHAR c\<^isub>i\<^isub>j)\<Rightarrow> X\<^isub>i"}. The term @{text "\<lambda>(EMPTY)"} acts as a marker for the equivalence
class containing @{text "[]"}. (Note that we mark, roughly speaking, the
single ``initial'' state in the equational system, which is different from
the method by Brzozowski \cite{Brzozowski64}, since for his purposes he needs to mark
the ``terminal'' states.) Overloading the function @{text L} for the two kinds of terms in the
equational system as follows
\begin{center}
@{thm L_rhs_e.simps(2)[where X="Y" and r="r", THEN eq_reflection]}\hspace{10mm}
@{thm L_rhs_e.simps(1)[where r="r", THEN eq_reflection]}
\end{center}
\noindent
we can prove for @{text "X\<^isub>2\<^isub>.\<^isub>.\<^isub>n"} that the following equations
%
\begin{equation}\label{inv1}
@{text "X\<^isub>i = \<calL>(Y\<^isub>i\<^isub>1, CHAR c\<^isub>i\<^isub>1) \<union> \<dots> \<union> \<calL>(Y\<^isub>i\<^isub>q, CHAR c\<^isub>i\<^isub>q)"}.
\end{equation}
\noindent
hold. Similarly for @{text "X\<^isub>1"} we can show the following equation
%
\begin{equation}\label{inv2}
@{text "X\<^isub>1 = \<calL>(Y\<^isub>i\<^isub>1, CHAR c\<^isub>i\<^isub>1) \<union> \<dots> \<union> \<calL>(Y\<^isub>i\<^isub>p, CHAR c\<^isub>i\<^isub>p) \<union> \<calL>(\<lambda>(EMPTY))"}.
\end{equation}
\noindent
The reason for adding the @{text \<lambda>}-marker to our equational system is
to obtain this equation, which only holds in this form since none of
the other terms contain the empty string.
Our proof of Thm.~\ref{myhillnerodeone}
will be by transforming the equational system into a \emph{solved form}
maintaining the invariants \eqref{inv1} and \eqref{inv2}. From the solved
form we will be able to read off the regular expressions using our
variant of Arden's Lemma (Lem.~\ref{arden}).
*}
section {* Regular Expressions Generate Finitely Many Partitions *}
text {*
\begin{theorem}
Given @{text "r"} is a regular expressions, then @{thm rexp_imp_finite}.
\end{theorem}
\begin{proof}
By induction on the structure of @{text r}. The cases for @{const NULL}, @{const EMPTY}
and @{const CHAR} are straightforward, because we can easily establish
\begin{center}
\begin{tabular}{l}
@{thm quot_null_eq}\\
@{thm quot_empty_subset}\\
@{thm quot_char_subset}
\end{tabular}
\end{center}
\end{proof}
*}
section {* Conclusion and Related Work *}
(*<*)
end
(*>*)