regexp: comparison Journal/Paper.thy

equal deleted inserted replaced

-:8749db46d5e6
+:07a269d9642b
 pow ("_\<^bsup>_\<^esup>" [100, 100] 100) and
 Suc ("_+1" [100] 100) and
 quotient ("_ \<^raw:\ensuremath{\!\sslash\!}> _" [90, 90] 90) and
 REL ("\<approx>") and
 UPLUS ("_ \<^raw:\ensuremath{\uplus}> _" [90, 90] 90) and
+lang ("\<^raw:\ensuremath{\cal{L}}>" 101) and
 lang ("\<^raw:\ensuremath{\cal{L}}>'(_')" [0] 101) and
 lang_trm ("\<^raw:\ensuremath{\cal{L}}>'(_')" [0] 101) and
 Lam ("\<lambda>'(_')" [100] 100) and
 Trn ("'(_, _')" [100, 100] 100) and
 EClass ("\<lbrakk>_\<rbrakk>\<^bsub>_\<^esub>" [100, 100] 100) and
 tag_Star ("\<star>tag _" [100] 100) and
 tag_Star ("\<star>tag _ _" [100, 100] 100) and
 tag_eq ("\<^raw:$\threesim$>\<^bsub>_\<^esub>") and
 Delta ("\<Delta>'(_')") and
 nullable ("\<delta>'(_')") and
-Cons ("_ :: _" [100, 100] 100)
+Cons ("_ :: _" [100, 100] 100) and
+Rev ("Rev _" [1000] 100) and
+Der ("Der _ _" [1000, 1000] 100) and
+ONE ("ONE" 999) and
+ZERO ("ZERO" 999)
 lemma meta_eq_app:
 shows "f \<equiv> \<lambda>x. g x \<Longrightarrow> f x \<equiv> g x"
 by auto
 involve subtle side-conditions which are easily overlooked, but which make
 formal reasoning rather painful. For example Kozen's proof of the
 Myhill-Nerode theorem requires that automata do not have inaccessible
 states \cite{Kozen97}. Another subtle side-condition is completeness of
 automata, that is automata need to have total transition functions and at most one
-`sink' state from which there is no connection to a final state (Brozowski
+`sink' state from which there is no connection to a final state (Brzozowski
 mentions this side-condition in the context of state complexity
-of automata \cite{Brozowski10}). Such side-conditions mean that if we define a regular
+of automata \cite{Brzozowski10}). Such side-conditions mean that if we define a regular
 language as one for which there exists \emph{a} finite automaton that
 recognises all its strings (see Def.~\ref{baddef}), then we need a lemma which
 ensures that another equivalent can be found satisfying the
 side-condition. Unfortunately, such `little' and `obvious' lemmas make
 a formalisation of automata theory a hair-pulling experience.
 literature, but take a different approach to regular languages than is
 usually taken. Instead of defining a regular language as one where there
 exists an automaton that recognises all its strings, we define a
 regular language as:
-\begin{dfntn}
+\begin{dfntn}\label{regular}
-A language @{text A} is \emph{regular}, provided there is a regular expression that matches all
+A language @{text A} is \emph{regular}, provided there is a regular expression
-strings of @{text "A"}.
+that matches all strings of @{text "A"}.
 \end{dfntn}
 \noindent
 The reason is that regular expressions, unlike graphs, matrices and
 functions, can be easily defined as an inductive datatype. A reasoning
 @{text "x\<^isub>p < x"} and the rest @{term "x\<^isub>s @ z \<in> A\<star>"}. For example the empty string
 @{text "[]"} would do.
 There are potentially many such prefixes, but there can only be finitely many of them (the
 string @{text x} is finite). Let us therefore choose the longest one and call it
 @{text "x\<^bsub>pmax\<^esub>"}. Now for the rest of the string @{text "x\<^isub>s @ z"} we
-know it is in @{term "A\<star>"} and it is not the empty string. By Lemma~\ref{langprops}@{text "(i)"},
+know it is in @{term "A\<star>"} and it is not the empty string. By Prop.~\ref{langprops}@{text "(i)"},
 we can separate
 this string into two parts, say @{text "a"} and @{text "b"}, such that @{text "a \<in> A"}
 and @{term "b \<in> A\<star>"}. Now @{text a} must be strictly longer than @{text "x\<^isub>s"},
 otherwise @{text "x\<^bsub>pmax\<^esub>"} is not the longest prefix. That means @{text a}
 `overlaps' with @{text z}, splitting it into two components @{text "z\<^isub>a"} and
 @{thm (lhs) Ders_def} & @{text "\<equiv>"} & @{thm (rhs) Ders_def}\\
 \end{tabular}
 \end{center}
 \noindent
-Now clearly we have the following relation between the Myhill-Nerode relation
+Clearly we have the following relation between the Myhill-Nerode relation
-(Definition~\ref{myhillneroderel}) and left-quotients
+(Def.~\ref{myhillneroderel}) and left-quotients
 \begin{equation}\label{mhders}
 @{term "x \<approx>A y"} \hspace{4mm}\text{if and only if}\hspace{4mm} @{term "Ders x A = Ders y A"}
 \end{equation}
 \noindent
 It is realtively easy to establish the following properties for left-quotients:
-\begin{center}
+\begin{equation}
-\begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{2mm}}l}
+\mbox{\begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{2mm}}l}
 @{thm (lhs) Der_zero}  & $=$ & @{thm (rhs) Der_zero}\\
 @{thm (lhs) Der_one}   & $=$ & @{thm (rhs) Der_one}\\
 @{thm (lhs) Der_atom}  & $=$ & @{thm (rhs) Der_atom}\\
 @{thm (lhs) Der_union} & $=$ & @{thm (rhs) Der_union}\\
 @{thm (lhs) Der_conc}  & $=$ & @{thm (rhs) Der_conc}\\
 @{thm (lhs) Der_star}  & $=$ & @{thm (rhs) Der_star}\\
-\end{tabular}
+@{thm (lhs) Ders_nil}  & $=$ & @{thm (rhs) Ders_nil}\\
-\end{center}
+@{thm (lhs) Ders_cons} & $=$ & @{thm (rhs) Ders_cons}\\
+%@{thm (lhs) Ders_append[where ?s1.0="s\<^isub>1" and ?s2.0="s\<^isub>2"]}  & $=$
+%   & @{thm (rhs) Ders_append[where ?s1.0="s\<^isub>1" and ?s2.0="s\<^isub>2"]}\\
+\end{tabular}}
+\end{equation}
 \noindent
 where @{text "\<Delta>"} is a function that tests whether the empty string
 is in the language and returns @{term "{[]}"} or @{term "{}"}, respectively.
 The only interesting case above is the last one where we use Prop.~\ref{langprops}
 in order to infer that @{term "Der c (A\<star>) = Der c (A \<cdot> A\<star>)"}. We can
 then complete the proof by observing that @{term "Delta A \<cdot> Der c (A\<star>) \<subseteq> (Der c A) \<cdot> A\<star>"}.
 Brzozowski observed that the left-quotients for languages of regular
 expressions can be calculated directly via the notion of \emph{derivatives
-of a regular expressions} \cite{Brzozowski64} which we define in Isabelle/HOL as
+of a regular expression} \cite{Brzozowski64}, which we define in Isabelle/HOL as
 follows:
 \begin{center}
 \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1.5mm}}l@ {}}
 @{thm (lhs) der.simps(1)}  & @{text "\<equiv>"} & @{thm (rhs) der.simps(1)}\\
 @{thm (lhs) ders.simps(2)}  & @{text "\<equiv>"} & @{thm (rhs) ders.simps(2)}\\
 \end{tabular}
 \end{center}
 \noindent
-The function @{term "nullable r"} tests whether the regular expression
+The last two lines extend @{const der} to strings (list of characters) where
-can recognise the empty string:
+list-cons is written \mbox{@{text "_ :: _"}}. The function @{term "nullable r"} needed
+in the @{const Times}-case tests whether a regular expression can recognise
+the empty string:
 \begin{center}
 \begin{tabular}{c@ {\hspace{10mm}}c}
 \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1.5mm}}l@ {}}
 @{thm (lhs) nullable.simps(1)}  & @{text "\<equiv>"} & @{thm (rhs) nullable.simps(1)}\\
 \end{tabular}
 \end{tabular}
 \end{center}
 \noindent
-Brzozowski proved
+By induction on the regular expression @{text r}, respectively the string @{text s},
+one can easily show that left-quotients and derivatives relate as follows
+\cite{Sakarovitch09}:
 \begin{equation}\label{Dersders}
-\mbox{\begin{tabular}{l}
+\mbox{\begin{tabular}{c}
 @{thm Der_der}\\
 @{thm Ders_ders}
 \end{tabular}}
 \end{equation}
 \noindent
-where the first is by induction on @{text r} and the second by a simple
-calculation.
 The importance in the context of the Myhill-Nerode theorem is that
-we can use \eqref{mhders} and \eqref{Dersders} in order to derive
+we can use \eqref{mhders} and the equations above in order to
+establish that @{term "x \<approx>(lang r) y"} if and only if
-\begin{center}
+@{term "lang (ders x r) = lang (ders y r)"}. From this we obtain
-@{term "x \<approx>(lang r) y"} \hspace{4mm}if and only if\hspace{4mm}
-@{term "lang (ders x r) = lang (ders y r)"}
+\begin{equation}
-\end{center}
+@{term "x \<approx>(lang r) y"}\hspace{4mm}\mbox{provided}\hspace{4mm}@{term "ders x r = ders y r"}
+\end{equation}
-\noindent
-which means @{term "x \<approx>(lang r) y"} provided @{term "ders x r = ders y r"}.
-Consequently, we can use as the tagging relation
+\noindent
-@{text "\<^raw:$\threesim$>\<^bsub>(\<lambda>x. ders x r)\<^esub>"}, which we know refines @{term "\<approx>(lang r)"}.
+Consequently, we can use as the tagging relation @{text
-This almost helps us because Brozowski also proved that there for every
+"\<^raw:$\threesim$>\<^bsub>(\<lambda>x. ders x r)\<^esub>"}, since it refines
-language there are only
+@{term "\<approx>(lang r)"}. However, in order to be useful in the Myhill-Nerode
-finitely `dissimilar' derivatives for a regular expression. Two regulare
+theorem, we have to show that for a language there are only finitely many
-expressions are similar if they can be identified using the using the
+derivatives. Unfortunately, this is not true in general: Sakarovitch gives
-ACI-identities
+an example with infinitely many derivatives
+\cite[Page~141]{Sakarovitch09}. What Brzozowski \cite{Brzozowski64} proved is
+that for every language there \emph{are} finitely `dissimilar' derivatives for a
+regular expression. Two regular expressions are said to be \emph{similar}
+provided they can be identified using the using the @{text "ACI"}-identities:
 \begin{center}
 \begin{tabular}{cl}
-(A) & @{term "Plus (Plus r\<^isub>1 r\<^isub>2) r\<^isub>3"} $\equiv$ @{term "Plus r\<^isub>1 (Plus r\<^isub>2 r\<^isub>3)"}\\
+(@{text A}) & @{term "Plus (Plus r\<^isub>1 r\<^isub>2) r\<^isub>3"} $\equiv$ @{term "Plus r\<^isub>1 (Plus r\<^isub>2 r\<^isub>3)"}\\
-(C) & @{term "Plus r\<^isub>1 r\<^isub>2"} $\equiv$ @{term "Plus r\<^isub>2 r\<^isub>1"}\\
+(@{text C}) & @{term "Plus r\<^isub>1 r\<^isub>2"} $\equiv$ @{term "Plus r\<^isub>2 r\<^isub>1"}\\
-(I) & @{term "Plus r r"} $\equiv$ @{term "r"}\\
+(@{text I}) & @{term "Plus r r"} $\equiv$ @{term "r"}\\
 \end{tabular}
 \end{center}
 \noindent
-Without the indentification, we unfortunately obtain infinitely many
+Carrying this idea through menas we now have to reasoning modulo @{text "ACI"}.
-derivations (an example is given in \cite[Page~141]{Sakarovitch09}).
+This can be done, but it is very painful in a theorem prover (since there is
-Reasoning modulo ACI can be done, but it is very painful in a theorem prover.
+no direct characterisation of the set of dissimlar derivatives).
-Fortunately, there is a much simpler approach using partial
+Fortunately, there is a much simpler approach using \emph{partial
-derivatives introduced by Antimirov \cite{Antimirov95}.
+derivatives}. They were introduced by Antimirov \cite{Antimirov95} and can be defined
+in Isabelle/HOL as follows:
 \begin{center}
 \begin{tabular}{@ {}l@ {\hspace{1mm}}c@ {\hspace{1.5mm}}l@ {}}
 @{thm (lhs) pder.simps(1)}  & @{text "\<equiv>"} & @{thm (rhs) pder.simps(1)}\\
 @{thm (lhs) pder.simps(2)}  & @{text "\<equiv>"} & @{thm (rhs) pder.simps(2)}\\
 @{thm (lhs) pders.simps(1)}  & @{text "\<equiv>"} & @{thm (rhs) pders.simps(1)}\\
 @{thm (lhs) pders.simps(2)}  & @{text "\<equiv>"} & @{thm (rhs) pders.simps(2)}\\
 \end{tabular}
 \end{center}
+\noindent
+Unlike `simple' derivatives, these functions return a set of regular
+expressions. In the @{const Times} and @{const Star} cases we use
+\begin{center}
+@{text "TIMESS rs r \<equiv> {TIMES r' r | r' \<in> rs}"}
+\end{center}
+\noindent
+in order to `sequence' a regular expressions with respect to a set of regular
+expresions. Note that in the last case we first build the set of partial derivatives
+w.r.t~@{text c}, then build the image of this set under the function @{term "pders s"}
+and finally `union up' all resulting sets. Note also that in some sense, partial
+derivatives have the @{text "ACI"}-identities already build in. Antimirov
+\cite{Antimirov95}
+showed
+\begin{equation}
+\mbox{\begin{tabular}{c}
+@{thm Der_pder}\\
+@{thm Ders_pders}
+\end{tabular}}
+\end{equation}
+\noindent
+and proved that for every language and regular expression there are only finitely
+many partial derivatives.
 *}
 section {* Closure Properties *}
+text {*
+The beautiful characteristics of regular languages is that they are closed
+under many operations. Closure under union, sequencencing and Kleene-star
+are trivial to establish given our definition of regularity (Def.~\ref{regular}).
+More interesting is the closure under complement, because
+it seems difficult to construct a regular expression for the complement
+language by direct means. However the existence of such a regular expression
+can now be easily proved using the Myhill-Nerode theorem since
+\begin{center}
+@{term "s\<^isub>1 \<approx>A s\<^isub>2"} if and only if @{term "s\<^isub>1 \<approx>(-A) s\<^isub>2"}
+\end{center}
+\noindent
+holds for any strings @{text "s\<^isub>1"} and @{text
+"s\<^isub>2"}. Therefore @{text A} and the complement language @{term "-A"}
+give rise to the same partitions.
+Once closure under complement is established, closure under intersection
+and set difference is also easy, because
+\begin{center}
+\begin{tabular}{c}
+@{term "A \<inter> B = - (- A \<union> - B)"}\\
+@{term "A - B = - (- A \<union> B)"}
+\end{tabular}
+\end{center}
+\noindent
+Closure of regular languages under reversal, which means
+\begin{center}
+@{text "A\<^bsup>-1\<^esup> \<equiv> {s\<^bsup>-1\<^esup> | s \<in> A}"}
+\end{center}
+\noindent
+can be shown with the help of the following operation defined on regular
+expressions
+\begin{center}
+\begin{tabular}{r@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
+@{thm (lhs) Rev.simps(1)} & @{text "\<equiv>"} & @{thm (rhs) Rev.simps(1)}\\
+@{thm (lhs) Rev.simps(2)} & @{text "\<equiv>"} & @{thm (rhs) Rev.simps(2)}\\
+@{thm (lhs) Rev.simps(3)} & @{text "\<equiv>"} & @{thm (rhs) Rev.simps(3)}\\
+@{thm (lhs) Rev.simps(4)[where ?r1.0="r\<^isub>1" and ?r2.0="r\<^isub>2"]} & @{text "\<equiv>"} &
+@{thm (rhs) Rev.simps(4)[where ?r1.0="r\<^isub>1" and ?r2.0="r\<^isub>2"]}\\
+@{thm (lhs) Rev.simps(5)[where ?r1.0="r\<^isub>1" and ?r2.0="r\<^isub>2"]} & @{text "\<equiv>"} &
+@{thm (rhs) Rev.simps(5)[where ?r1.0="r\<^isub>1" and ?r2.0="r\<^isub>2"]}\\
+@{thm (lhs) Rev.simps(6)} & @{text "\<equiv>"} & @{thm (rhs) Rev.simps(6)}\\
+\end{tabular}
+\end{center}
+In connection with left-quotient, the perhaps surprising fact is that
+regular languages are closed under any left-quotient.
+*}
 section {* Conclusion and Related Work *}
 text {*
+\noindent
 In this paper we took the view that a regular language is one where there
 exists a regular expression that matches all of its strings. Regular
 expressions can conveniently be defined as a datatype in HOL-based theorem
 provers. For us it was therefore interesting to find out how far we can push
 this point of view. We have established in Isabelle/HOL both directions
 of the Myhill-Nerode theorem.
 %
 \begin{thrm}[The Myhill-Nerode Theorem]\mbox{}\\
 A language @{text A} is regular if and only if @{thm (rhs) Myhill_Nerode}.
 \end{thrm}
-%
 \noindent
-Having formalised this theorem means we
+Having formalised this theorem means we pushed our point of view quite
-pushed our point of view quite far. Using this theorem we can obviously prove when a language
+far. Using this theorem we can obviously prove when a language is \emph{not}
-is \emph{not} regular---by establishing that it has infinitely many
+regular---by establishing that it has infinitely many equivalence classes
-equivalence classes generated by the Myhill-Nerode relation (this is usually
+generated by the Myhill-Nerode relation (this is usually the purpose of the
-the purpose of the pumping lemma \cite{Kozen97}).  We can also use it to
+pumping lemma \cite{Kozen97}).  We can also use it to establish the standard
-establish the standard textbook results about closure properties of regular
+textbook results about closure properties of regular languages. Interesting
-languages. Interesting is the case of closure under complement, because
+is the case of closure under complement, because it seems difficult to
-it seems difficult to construct a regular expression for the complement
+construct a regular expression for the complement language by direct
-language by direct means. However the existence of such a regular expression
+means. However the existence of such a regular expression can be easily
-can be easily proved using the Myhill-Nerode theorem since
+proved using the Myhill-Nerode theorem.  Proving the existence of such a
-%
+regular expression via automata using the standard method would be quite
-\begin{center}
+involved. It includes the steps: regular expression @{text "\<Rightarrow>"}
-@{term "s\<^isub>1 \<approx>A s\<^isub>2"} if and only if @{term "s\<^isub>1 \<approx>(-A) s\<^isub>2"}
+non-deterministic automaton @{text "\<Rightarrow>"} deterministic automaton @{text "\<Rightarrow>"}
-\end{center}
+complement automaton @{text "\<Rightarrow>"} regular expression.
-%
-\noindent
-holds for any strings @{text "s\<^isub>1"} and @{text
-"s\<^isub>2"}. Therefore @{text A} and the complement language @{term "-A"} give rise to the same
-partitions.  Proving the existence of such a regular expression via automata
-using the standard method would
-be quite involved. It includes the
-steps: regular expression @{text "\<Rightarrow>"} non-deterministic automaton @{text
-"\<Rightarrow>"} deterministic automaton @{text "\<Rightarrow>"} complement automaton @{text "\<Rightarrow>"}
-regular expression.
 While regular expressions are convenient in formalisations, they have some
 limitations. One is that there seems to be no method of calculating a
 minimal regular expression (for example in terms of length) for a regular
 language, like there is
 whose correctness has been formally established, we refer the reader to
 Owens and Slind \cite{OwensSlind08}.
 Our formalisation consists of 780 lines of Isabelle/Isar code for the first
-direction and 460 for the second, plus around 300 lines of standard material about
+direction and 460 for the second, plus around 300 lines of standard material
-regular languages. While this might be seen as too large to count as a
+about regular languages. While this might be seen large, it should be seen
-concise proof pearl, this should be seen in the context of the work done by
+in the context of the work done by Constable at al \cite{Constable00} who
-Constable at al \cite{Constable00} who formalised the Myhill-Nerode theorem
+formalised the Myhill-Nerode theorem in Nuprl using automata. They write
-in Nuprl using automata. They write that their four-member team needed
+that their four-member team needed something on the magnitude of 18 months
-something on the magnitude of 18 months for their formalisation. The
+for their formalisation. The estimate for our formalisation is that we
-estimate for our formalisation is that we needed approximately 3 months and
+needed approximately 3 months and this included the time to find our proof
-this included the time to find our proof arguments. Unlike Constable et al,
+arguments. Unlike Constable et al, who were able to follow the proofs from
-who were able to follow the proofs from \cite{HopcroftUllman69}, we had to
+\cite{HopcroftUllman69}, we had to find our own arguments.  So for us the
-find our own arguments.  So for us the formalisation was not the
+formalisation was not the bottleneck. It is hard to gauge the size of a
-bottleneck. It is hard to gauge the size of a formalisation in Nurpl, but
+formalisation in Nurpl, but from what is shown in the Nuprl Math Library
-from what is shown in the Nuprl Math Library about their development it
+about their development it seems substantially larger than ours. The code of
-seems substantially larger than ours. The code of ours can be found in the
+ours can be found in the Mercurial Repository at
-Mercurial Repository at
 \mbox{\url{http://www4.in.tum.de/~urbanc/regexp.html}}.
 Our proof of the first direction is very much inspired by \emph{Brzozowski's
 algebraic method} used to convert a finite automaton to a regular
 expression \cite{Brzozowski64}. The close connection can be seen by considering the equivalence

changeset 186	07a269d9642b
parent 185	8749db46d5e6
child 187	9f46a9571e37