(*<*)
theory Paper
imports "../Myhill" "LaTeXsugar"
begin

declare [[show_question_marks = false]]

consts
  REL :: "(string \<times> string) \<Rightarrow> bool"
  UPLUS :: "'a set \<Rightarrow> 'a set \<Rightarrow> (nat \<times> 'a) set"


notation (latex output)
  str_eq_rel ("\<approx>\<^bsub>_\<^esub>") and
  Seq (infixr "\<cdot>" 100) and
  Star ("_\<^bsup>\<star>\<^esup>") and
  pow ("_\<^bsup>_\<^esup>" [100, 100] 100) and
  Suc ("_+1" [100] 100) and
  quotient ("_ \<^raw:\ensuremath{\!\sslash\!}> _" [90, 90] 90) and
  REL ("\<approx>") and
  UPLUS ("_ \<^raw:\ensuremath{\uplus}> _" [90, 90] 90) and
  L ("L '(_')" [0] 101)
(*>*)

section {* Introduction *}

text {*
  Regular languages are an important and well-understood subject in Computer
  Science, with many beautiful theorems and many useful algorithms. There is a
  wide range of textbooks on this subject, many of which are aimed at students
  and contain very detailed ``pencil-and-paper'' proofs
  (e.g.~\cite{Kozen97}). It seems natural to exercise theorem provers by
  formalising these theorems and by formally verifying the algorithms.

  There is however a problem: the typical approach to regular languages is to
  introduce finite automata and then define everything in terms of them. For
  example, a regular language is normally defined as one whose strings are
  recognised by a finite deterministic automaton. This approach has many
  benefits. Among them is that it is easy to convince oneself that regular
  languages are closed under complementation: one just has to exchange the
  accepting and non-accepting states in the corresponding automaton to obtain
  an automaton for the complement language. The problem, however, lies with
  formalising such reasoning in a HOL-based theorem prover, in our case
  Isabelle/HOL. Automata consist of states and transitions. They need to be
  represented as graphs or matrices, neither of which can be defined as an
  inductive datatype.\footnote{In some works functions are used to represent
  state transitions, but they are also not inductive datatypes.} This means we
  have to build our own reasoning infrastructure for them, as neither
  Isabelle/HOL nor HOL4 nor HOLlight supports them with libraries.

  Even worse, reasoning about graphs and matrices can be a real hassle in
  HOL-based theorem provers. Consider for example the operation of sequencing
  two automata, say $A_1$ and $A_2$, by connecting the accepting states of
  $A_1$ to the initial state of $A_2$:

  \begin{center}
  \begin{tabular}{ccc}
  \begin{tikzpicture}[scale=0.8]
  %\draw[step=2mm] (-1,-1) grid (1,1);

  \draw[rounded corners=1mm, very thick] (-1.0,-0.3) rectangle (-0.2,0.3);
  \draw[rounded corners=1mm, very thick] ( 0.2,-0.3) rectangle ( 1.0,0.3);

  \node (A) at (-1.0,0.0) [circle, very thick, draw, fill=white, inner sep=0.4mm] {};
  \node (B) at ( 0.2,0.0) [circle, very thick, draw, fill=white, inner sep=0.4mm] {};

  \node (C) at (-0.2, 0.13) [circle, very thick, draw, fill=white, inner sep=0.4mm] {};
  \node (D) at (-0.2,-0.13) [circle, very thick, draw, fill=white, inner sep=0.4mm] {};

  \node (E) at (1.0, 0.2) [circle, very thick, draw, fill=white, inner sep=0.4mm] {};
  \node (F) at (1.0,-0.0) [circle, very thick, draw, fill=white, inner sep=0.4mm] {};
  \node (G) at (1.0,-0.2) [circle, very thick, draw, fill=white, inner sep=0.4mm] {};

  \draw (-0.6,0.0) node {\footnotesize$A_1$};
  \draw ( 0.6,0.0) node {\footnotesize$A_2$};
  \end{tikzpicture}

  &

  \raisebox{1.1mm}{\bf\Large$\;\;\;\Rightarrow\,\;\;$}

  &

  \begin{tikzpicture}[scale=0.8]
  %\draw[step=2mm] (-1,-1) grid (1,1);

  \draw[rounded corners=1mm, very thick] (-1.0,-0.3) rectangle (-0.2,0.3);
  \draw[rounded corners=1mm, very thick] ( 0.2,-0.3) rectangle ( 1.0,0.3);

  \node (A) at (-1.0,0.0) [circle, very thick, draw, fill=white, inner sep=0.4mm] {};
  \node (B) at ( 0.2,0.0) [circle, very thick, draw, fill=white, inner sep=0.4mm] {};

  \node (C) at (-0.2, 0.13) [circle, very thick, draw, fill=white, inner sep=0.4mm] {};
  \node (D) at (-0.2,-0.13) [circle, very thick, draw, fill=white, inner sep=0.4mm] {};

  \node (E) at (1.0, 0.2) [circle, very thick, draw, fill=white, inner sep=0.4mm] {};
  \node (F) at (1.0,-0.0) [circle, very thick, draw, fill=white, inner sep=0.4mm] {};
  \node (G) at (1.0,-0.2) [circle, very thick, draw, fill=white, inner sep=0.4mm] {};

  \draw (C) to [very thick, bend left=45] (B);
  \draw (D) to [very thick, bend right=45] (B);

  \draw (-0.6,0.0) node {\footnotesize$A_1$};
  \draw ( 0.6,0.0) node {\footnotesize$A_2$};
  \end{tikzpicture}

  \end{tabular}
  \end{center}

  \noindent
  On ``paper'' we can define the corresponding graph in terms of the disjoint
  union of the state nodes. Unfortunately in HOL, the definition for disjoint
  union, namely

  \begin{center}
  @{term "UPLUS A\<^isub>1 A\<^isub>2 \<equiv> {(1, x) | x. x \<in> A\<^isub>1} \<union> {(2, y) | y. y \<in> A\<^isub>2}"}
  \end{center}

  \noindent
  changes the type---the disjoint union is not a set, but a set of pairs.
  Using this definition for disjoint unions means we do not have a single type
  for automata and hence will not be able to state properties about \emph{all}
  automata, since there is no type quantification available in HOL. An
  alternative, which provides us with a single type for automata, is to give
  every state node an identity, for example a natural number, and then be
  careful to rename these identities apart whenever connecting two automata.
  This results in clunky proofs establishing that properties are invariant
  under renaming. Similarly, connecting two automata represented as matrices
  results in very ad hoc constructions, which are not pleasant to reason about.

  Because of these problems with representing automata, there seems to be no
  substantial formalisation of automata theory and regular languages carried
  out in a HOL-based theorem prover. We are only aware of the large
  formalisation of automata theory in Nuprl \cite{Constable00} and some
  smaller formalisations in Coq, for example \cite{Filliatre97}.

  In this paper, we will not attempt to formalise automata theory, but take a
  completely different approach to regular languages. Instead of defining a
  regular language as one where there exists an automaton that recognises all
  strings of the language, we define a regular language as:

  \begin{definition}[A Regular Language]
  A language @{text A} is regular, provided there is a regular expression that
  matches all strings of @{text "A"}.
  \end{definition}

  \noindent
  The reason is that regular expressions, unlike graphs and matrices, can be
  easily defined as an inductive datatype. Therefore a corresponding reasoning
  infrastructure comes for free. This has recently been used for formalising
  regular expression matching in HOL4 \cite{OwensSlind08}. The purpose of this
  paper is to show that a central result about regular languages, the
  Myhill-Nerode theorem, can be recreated by using only regular expressions.
  This theorem gives a necessary and sufficient condition for when a language
  is regular. As a corollary of this theorem we can easily establish the usual
  closure properties, including complementation, for regular
  languages.\smallskip

  \noindent
  {\bf Contributions:} To our knowledge, our proof of the Myhill-Nerode
  theorem is the first that is based solely on regular expressions. Using
  certain tagging functions, we prove the part of this theorem stating that
  the language of a regular expression gives rise to only finitely many
  partitions. Again, to the best of our knowledge, such tagging functions have
  not been used before to establish the Myhill-Nerode theorem.
*}

section {* Preliminaries *}

text {*
  Strings in Isabelle/HOL are lists of characters with the \emph{empty string}
  being represented by the empty list, written @{term "[]"}. \emph{Languages}
  are sets of strings. The language containing all strings is written in
  Isabelle/HOL as @{term "UNIV::string set"}. The notation for the quotient
  of a language @{text A} according to a relation @{term REL} is @{term "A //
  REL"}. The concatenation of two languages is written @{term "A ;; B"}; a
  language raised to the power $n$ is written @{term "A \<up> n"}. Both concepts
  are defined as

  \begin{center}
  @{thm Seq_def[THEN eq_reflection, where A1="A" and B1="B"]}
  \hspace{7mm}
  @{thm pow.simps(1)[THEN eq_reflection, where A1="A"]}
  \hspace{7mm}
  @{thm pow.simps(2)[THEN eq_reflection, where A1="A" and n1="n"]}
  \end{center}

  \noindent
  where @{text "@"} is the usual list-append operation. The Kleene-star of a
  language @{text A} is defined as the union over all powers, namely
  @{thm Star_def}.
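
  \noindent
  Spelled out in conventional notation these definitions read as follows (a
  sketch of what the statements above amount to, where we assume the successor
  case of the power prefixes a copy of @{text A}):

  \begin{center}
  $A \cdot B \equiv \{s_1 \,@\, s_2 \mid s_1 \in A \wedge s_2 \in B\}$\hspace{7mm}
  $A^0 \equiv \{[]\}$\hspace{7mm}
  $A^{n+1} \equiv A \cdot A^n$\hspace{7mm}
  $A^\star \equiv \bigcup_n A^n$
  \end{center}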

  Central to our proof will be the solution of equational systems
  involving regular expressions. For this we will use the following ``reverse''
  version of Arden's lemma.

  \begin{lemma}[Reverse Arden's Lemma]\mbox{}\\
  If @{thm (prem 1) ardens_revised} then
  @{thm (lhs) ardens_revised} has the unique solution
  @{thm (rhs) ardens_revised}.
  \end{lemma}
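
  \noindent
  Written out (a sketch, as can be read off from the proof below): if the
  empty string is not in $A$, then the equation $X = X \cdot A \,\cup\, B$ has
  the unique solution $X = B \cdot A^\star$.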

  \begin{proof}
  For the right-to-left direction we assume @{thm (rhs) ardens_revised} and show
  that @{thm (lhs) ardens_revised} holds. From Lemma ??? we have @{term "A\<star> = {[]} \<union> A ;; A\<star>"},
  which is equal to @{term "A\<star> = {[]} \<union> A\<star> ;; A"}. Concatenating @{text B} to both
  sides gives @{term "B ;; A\<star> = B ;; ({[]} \<union> A\<star> ;; A)"}, whose right-hand side
  is equal to @{term "(B ;; A\<star>) ;; A \<union> B"}. This completes this direction.
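
  \noindent
  Spelled out as a single calculation (writing concatenation as $\cdot$ and
  using its distributivity over union), this direction reads

  \begin{center}
  $B \cdot A^\star \;=\; B \cdot (\{[]\} \cup A^\star \cdot A)
    \;=\; B \,\cup\, B \cdot (A^\star \cdot A)
    \;=\; (B \cdot A^\star) \cdot A \,\cup\, B$
  \end{center}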

  For the other direction we assume @{thm (lhs) ardens_revised}. By a simple induction
  on @{text n}, we can establish the property

  \begin{center}
  @{text "(*)"}\hspace{5mm} @{thm (concl) ardens_helper}
  \end{center}
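
  \noindent
  (Assuming @{text ardens_helper} takes the form suggested by its uses below,
  this property states that
  $X = (\bigcup_{m \in \{0..n\}} B \cdot A^m) \,\cup\, X \cdot A^{n+1}$.)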

  \noindent
  Using this property we can show that @{term "B ;; (A \<up> n) \<subseteq> X"} holds for
  all @{text n}. From this we can infer @{term "B ;; A\<star> \<subseteq> X"} using Lemma ???.
  For the inclusion in the other direction we assume that a string @{text s}
  of length @{text k} is an element of @{text X}. Because @{thm (prem 1) ardens_revised},
  we know that @{term "s \<notin> X ;; (A \<up> Suc k)"}, since its length is only @{text k}
  (the strings in @{term "X ;; (A \<up> Suc k)"} are all longer).
  From @{text "(*)"} it then follows that
  @{term s} must be an element of @{term "(\<Union>m\<in>{0..k}. B ;; (A \<up> m))"}. This in turn
  implies that @{term s} is in @{term "(\<Union>n. B ;; (A \<up> n))"}. Using Lemma ??? this
  is equal to @{term "B ;; A\<star>"}, as we needed to show.\qed
  \end{proof}

  \noindent
  Regular expressions are defined as the following inductive datatype

  \begin{center}
  @{text r} @{text "::="}
  @{term NULL}\hspace{1.5mm}@{text"|"}\hspace{1.5mm}
  @{term EMPTY}\hspace{1.5mm}@{text"|"}\hspace{1.5mm}
  @{term "CHAR c"}\hspace{1.5mm}@{text"|"}\hspace{1.5mm}
  @{term "SEQ r r"}\hspace{1.5mm}@{text"|"}\hspace{1.5mm}
  @{term "ALT r r"}\hspace{1.5mm}@{text"|"}\hspace{1.5mm}
  @{term "STAR r"}
  \end{center}
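
  \noindent
  In Isabelle/HOL this grammar corresponds to a datatype declaration along the
  following lines (a sketch only; the actual declaration is part of the
  imported theories):

  \begin{center}
  @{text "datatype rexp = NULL | EMPTY | CHAR char | SEQ rexp rexp | ALT rexp rexp | STAR rexp"}
  \end{center}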

  \noindent
  The language matched by a regular expression is defined as usual:

  \begin{center}
  \begin{tabular}{c@ {\hspace{10mm}}c}
  \begin{tabular}{rcl}
  @{thm (lhs) L_rexp.simps(1)} & @{text "\<equiv>"} & @{thm (rhs) L_rexp.simps(1)}\\
  @{thm (lhs) L_rexp.simps(2)} & @{text "\<equiv>"} & @{thm (rhs) L_rexp.simps(2)}\\
  @{thm (lhs) L_rexp.simps(3)[where c="c"]} & @{text "\<equiv>"} & @{thm (rhs) L_rexp.simps(3)[where c="c"]}\\
  \end{tabular}
  &
  \begin{tabular}{rcl}
  @{thm (lhs) L_rexp.simps(4)[where ?r1.0="r\<^isub>1" and ?r2.0="r\<^isub>2"]} & @{text "\<equiv>"} &
  @{thm (rhs) L_rexp.simps(4)[where ?r1.0="r\<^isub>1" and ?r2.0="r\<^isub>2"]}\\
  @{thm (lhs) L_rexp.simps(5)[where ?r1.0="r\<^isub>1" and ?r2.0="r\<^isub>2"]} & @{text "\<equiv>"} &
  @{thm (rhs) L_rexp.simps(5)[where ?r1.0="r\<^isub>1" and ?r2.0="r\<^isub>2"]}\\
  @{thm (lhs) L_rexp.simps(6)[where r="r"]} & @{text "\<equiv>"} &
  @{thm (rhs) L_rexp.simps(6)[where r="r"]}\\
  \end{tabular}
  \end{tabular}
  \end{center}
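
  \noindent
  That is, assuming the standard clauses: @{text "L (NULL)"} is the empty
  language, @{text "L (EMPTY)"} contains only the empty string, @{text "L (CHAR c)"}
  contains only the one-character string @{text "[c]"}, @{text "L (SEQ r\<^isub>1 r\<^isub>2)"}
  is the concatenation of the two languages, @{text "L (ALT r\<^isub>1 r\<^isub>2)"} their
  union, and @{text "L (STAR r)"} the Kleene-star of @{text "L r"}.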
*}

section {* Finite Partitions Imply Regularity of a Language *}

text {*
  \begin{theorem}
  Given a language @{text A}.
  @{thm[mode=IfThen] hard_direction[where Lang="A"]}
  \end{theorem}
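
  \noindent
  Here @{text "\<approx>A"} stands for the Myhill-Nerode relation determined by @{text A};
  assuming the definition used in the accompanying theories, two strings are
  related, $x \approx_{A} y$, provided for every string $z$ we have $x\,@\,z \in A$
  exactly when $y\,@\,z \in A$. The theorem therefore states that if this relation
  partitions the set of all strings into finitely many equivalence classes, then
  some regular expression matches exactly the strings of @{text A}.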
*}


section {* Regular Expressions Generate Finitely Many Partitions *}

text {*

  \begin{theorem}
  For every regular expression @{text "r"}, we have @{thm rexp_imp_finite}.
  \end{theorem}

  \begin{proof}
  By induction on the structure of @{text r}. The cases for @{const NULL}, @{const EMPTY}
  and @{const CHAR} are straightforward, because we can easily establish

  \begin{center}
  \begin{tabular}{l}
  @{thm quot_null_eq}\\
  @{thm quot_empty_subset}\\
  @{thm quot_char_subset}
  \end{tabular}
  \end{center}
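
  \noindent
  (A sketch of what these facts amount to, assuming the forms proved in the
  accompanying theories: for the empty language the relation relates all
  strings, so the quotient is a singleton partition; for the languages
  @{text "{[]}"} and @{text "{[c]}"} the quotient is contained in an explicitly
  given two-element, respectively three-element, set of equivalence classes.
  In each case finiteness of the partition is immediate.)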

  \end{proof}
*}


section {* Conclusion and Related Work *}

(*<*)
end
(*>*)