regexp: comparison Journal/Paper.thy


-:5bbe63876f84
+:fa8d33d13cb6
 HOLlight support them with libraries. Even worse, reasoning about graphs and
 matrices can be a real hassle in HOL-based theorem provers, because
 we have to be able to combine automata.  Consider for
 example the operation of sequencing two automata, say $A_1$ and $A_2$, by
 connecting the accepting states of $A_1$ to the initial state of $A_2$:
-%
 \begin{center}
 \begin{tabular}{ccc}
 \begin{tikzpicture}[scale=1]
 %\draw[step=2mm] (-1,-1) grid (1,1);
 \end{equation}
 \noindent
 changes the type---the disjoint union is not a set, but a set of
 pairs. Using this definition for disjoint union means we do not have a
-single type for automata. As a result we will not be able to define a regular
+single type for automata. As a result we will not be able to define a
-language as one for which there exists an automaton that recognises all its
+regular language as one for which there exists an automaton that recognises
-strings (Definition~\ref{baddef}). This is because we cannot make a definition in HOL that is polymorphic in
+all its strings (Definition~\ref{baddef}). This is because we cannot make a
-the state type and there is no type quantification available in HOL (unlike
+definition in HOL that is polymorphic in the state type and there is no type
-in Coq, for example).\footnote{Slind already pointed out this problem in an email
+quantification available in HOL (unlike in Coq, for example).\footnote{Slind
-to the HOL4 mailing list on 21st April 2005.}
+already pointed out this problem in an email to the HOL4 mailing list on
+21st April 2005.}
 An alternative, which provides us with a single type for automata, is to give every
 state node an identity, for example a natural
 number, and then be careful to rename these identities apart whenever
 connecting two automata. This results in clunky proofs
 Let us return to our proof of the second direction in the Myhill-Nerode
 theorem. The point of the above calculations is to use
 @{text "\<^raw:$\threesim$>\<^bsub>(\<lambda>x. ders x r)\<^esub>"} as tagging-relation.
-\begin{proof}[Proof of Theorem~\ref{myhillnerodetwo}]
+\begin{proof}[Proof of Theorem~\ref{myhillnerodetwo} (second version)]
 Using \eqref{mhders}
 and \eqref{Derspders} we can easily infer that
 \begin{center}
 @{term "x \<approx>(lang r) y"}\hspace{4mm}\mbox{provided}\hspace{4mm}@{term "pders x r = pders y r"}
 are only finitely many equivalence classes of @{text "\<^raw:$\threesim$>\<^bsub>(\<lambda>x. ders x r)\<^esub>"},
 and we can again conclude the second part of the Myhill-Nerode theorem.
 \end{proof}
 *}
-section {* Closure Properties *}
+section {* Closure Properties of Regular Languages *}
 text {*
 \noindent
-The real beauty of regular languages is that they are closed
+The beauty of regular languages is that they are closed under many set
-under almost all set operations. Closure under union, concatenation and Kleene-star
+operations. Closure under union, concatenation and Kleene-star are trivial
-are trivial to establish given our definition of regularity (Definition~\ref{regular}).
+to establish given our definition of regularity (recall Definition~\ref{regular}).
-More interesting is the closure under complement, because
+More interesting is the closure under complement, because it seems difficult
-it seems difficult to construct a regular expression for the complement
+to construct a regular expression for the complement language by direct
-language by direct means. However the existence of such a regular expression
+means. However, the existence of such a regular expression can now be easily
-can now be easily proved using the Myhill-Nerode theorem since
+proved using the Myhill-Nerode theorem since
 \begin{center}
 @{term "s\<^isub>1 \<approx>A s\<^isub>2"} if and only if @{term "s\<^isub>1 \<approx>(-A) s\<^isub>2"}
 \end{center}
 \noindent
 holds for any strings @{text "s\<^isub>1"} and @{text
 "s\<^isub>2"}. Therefore @{text A} and the complement language @{term "-A"}
-give rise to the same partitions. So if one is finite, the other is too and the
+give rise to the same partitions. So if one is finite, the other is too, and
-other way around.
+the other way around. Proving the existence of such a regular expression via
+automata using the standard method would be quite involved. It includes the
+steps: regular expression @{text "\<Rightarrow>"} non-deterministic automaton @{text
+"\<Rightarrow>"} deterministic automaton @{text "\<Rightarrow>"} complement automaton @{text "\<Rightarrow>"}
+regular expression. This is clearly not something one wants to formalise in a theorem
+prover in which it is cumbersome to reason about automata.
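 The equivalence used in this argument can be checked by a short calculation.
 As a sketch, assume the Myhill-Nerode relation is defined in the usual way,
 namely that two strings are related if and only if no extension $z$
 distinguishes them with respect to $A$. Then
 \begin{center}
 \begin{tabular}{r@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
 $s_1 \approx_A s_2$ & $\Leftrightarrow$ & $\forall z.\; (s_1 @ z \in A \;\Leftrightarrow\; s_2 @ z \in A)$\\
  & $\Leftrightarrow$ & $\forall z.\; (s_1 @ z \notin A \;\Leftrightarrow\; s_2 @ z \notin A)$\\
  & $\Leftrightarrow$ & $\forall z.\; (s_1 @ z \in -A \;\Leftrightarrow\; s_2 @ z \in -A)$\\
  & $\Leftrightarrow$ & $s_1 \approx_{-A} s_2$\\
 \end{tabular}
 \end{center}
 \noindent
 where the second step negates both sides of the inner equivalence and the
 third step uses the definition of the complement language.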
 Once closure under complement is established, closure under intersection
 and set difference is also easy, because
 \begin{center}
 @{term "A - B = - (- A \<union> B)"}
 \end{tabular}
 \end{center}
 \noindent
-Closure of regular languages under reversal, which means
+Closure of regular languages under reversal, that is
 \begin{center}
 @{text "A\<^bsup>-1\<^esup> \<equiv> {s\<^bsup>-1\<^esup> | s \<in> A}"}
 \end{center}
 \noindent
-can be shown with the help of the following operation defined on regular
+can be shown with the help of the following operation defined recursively over
-expressions
+regular expressions
 \begin{center}
 \begin{tabular}{r@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
 @{thm (lhs) Rev.simps(1)} & @{text "\<equiv>"} & @{thm (rhs) Rev.simps(1)}\\
 @{thm (lhs) Rev.simps(2)} & @{text "\<equiv>"} & @{thm (rhs) Rev.simps(2)}\\
 @{thm (lhs) Rev.simps(6)} & @{text "\<equiv>"} & @{thm (rhs) Rev.simps(6)}\\
 \end{tabular}
 \end{center}
 \noindent
-For this operation we can so
+For this operation we can show
 \begin{center}
 @{text "(\<calL>(r))\<^bsup>-1\<^esup>"}~@{text "="}~@{thm (rhs) rev_lang}
 \end{center}
 \noindent
-from which closure under reversal follows.
+from which closure under reversal of regular languages follows.
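 As a small illustration, consider the regular expression
 $r = a \cdot (b + c)^{\star}$. The notation is an assumption made only for
 this example: $a$, $b$ and $c$ stand for atoms, $+$ for alternative, $\cdot$
 for sequence and ${}^{\star}$ for Kleene-star. Assuming further that the
 operation swaps the arguments of sequences and is applied homomorphically to
 the other constructors, we obtain
 \begin{center}
 \begin{tabular}{r@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
 $\mathit{Rev}\,(a \cdot (b + c)^{\star})$ & $=$ & $\mathit{Rev}\,((b + c)^{\star}) \cdot \mathit{Rev}\,(a)$\\
  & $=$ & $(\mathit{Rev}\,(b) + \mathit{Rev}\,(c))^{\star} \cdot a$\\
  & $=$ & $(b + c)^{\star} \cdot a$\\
 \end{tabular}
 \end{center}
 \noindent
 For instance the string $abc$ is matched by $r$, and its reversal $cba$ is
 matched by $(b + c)^{\star} \cdot a$.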
-The perhaps the most surprising fact is that regular languages are closed under any
+A perhaps surprising fact is that regular languages are closed under any
 left-quotient. Define
 \begin{center}
 @{abbrev "Ders_lang B A"}
 \end{center}
 \noindent
-and assume @{text A} is regular. From this we know there exists a regular
+and assume @{text B} is any language and @{text A} is regular. Then @{term
-expression @{text r} such that @{term "A = lang r"}. We also know that
+"Ders_lang B A"} is regular. To see this consider the following argument
-@{term "pders_lang B r"} is finite. By definition and Lemma~\ref{Derspders}
+using partial derivatives: From @{text A} being regular we know there exists
+a regular expression @{text r} such that @{term "A = lang r"}. We also know
+that @{term "pders_lang B r"} is finite for every language @{text B} and
+regular expression @{text r} (recall Theorem~\ref{antimirov}). By definition
+and Lemma~\ref{Derspders} therefore
 \begin{equation}\label{eqq}
 @{term "Ders_lang B (lang r) = (\<Union> lang ` (pders_lang B r))"}
 \end{equation}
 \noindent
-Since there are only finitely many regular expressions in @{term "pders_lang B r"}
+Since there are only finitely many regular expressions in @{term "pders_lang
-by Theorem~\ref{antimirov}, we know that the right-hand side of \eqref{eqq}, is
+B r"}, we know by \eqref{uplus} that
-equal to @{term "lang (\<Uplus>(pders_lang B r))"} using \eqref{uplus}. Hence
+the right-hand side of \eqref{eqq} is equal to \mbox{@{term "lang (\<Uplus>(pders_lang B
-the regular expression @{term "pders_lang B r"} verifies that @{term "Ders_lang B A"}
+r))"}}. Thus the regular expression @{term "\<Uplus>(pders_lang B r)"} verifies that
-is regular.
+@{term "Ders_lang B A"} is regular.
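 To illustrate that $B$ need not be regular for this argument, consider the
 following sketch. The notation is an assumption made only for this example:
 $1$ stands for the regular expression matching only the empty string,
 $\cdot$ for sequence and ${}^{\star}$ for Kleene-star; we also assume the
 standard clauses of Antimirov for partial derivatives. Let
 $r = a^{\star} \cdot b$ and $B = \{a^n b^n \mid n \geq 1\}$. For the string
 $ab$ the set of partial derivatives of $r$ is $\{1\}$. For every string
 $a^n b^n$ with $n \geq 2$ the set is empty, because no string in
 $\mathcal{L}(r)$ contains two occurrences of $b$. Hence the set of partial
 derivatives of $r$ with respect to all strings in $B$ is $\{1\}$ and
 \begin{center}
 $\bigcup_{r' \in \{1\}} \mathcal{L}(r') \;=\; \mathcal{L}(1) \;=\; \{[]\}$
 \end{center}
 \noindent
 This agrees with the left-quotient calculated directly: the only string $s$
 such that $a^n b^n @ s \in \mathcal{L}(r)$ for some $n \geq 1$ is the empty
 string (taking $n = 1$).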
 *}
 section {* Conclusion and Related Work *}
 pumping lemma \cite{Kozen97}).  We can also use it to establish the standard
 textbook results about closure properties of regular languages. Interesting
 is the case of closure under complement, because it seems difficult to
 construct a regular expression for the complement language by direct
 means. However, the existence of such a regular expression can be easily
-proved using the Myhill-Nerode theorem.  Proving the existence of such a
+proved using the Myhill-Nerode theorem.
-regular expression via automata using the standard method would be quite
-involved. It includes the steps: regular expression @{text "\<Rightarrow>"}
+Our insistence on regular expressions for proving the Myhill-Nerode theorem
-non-deterministic automaton @{text "\<Rightarrow>"} deterministic automaton @{text "\<Rightarrow>"}
+arose from the limitations of HOL on which the popular theorem provers HOL4,
-complement automaton @{text "\<Rightarrow>"} regular expression.
+HOLlight and Isabelle/HOL are based. In order to guarantee consistency,
+formalisations can only extend HOL by definitions that introduce a notion in
+terms of already existing concepts. A convenient definition for automata
-While regular expressions are convenient in formalisations, they have some
+(based on graphs) uses a polymorphic type for the state nodes. This allows us
-limitations. One is that there seems to be no method of calculating a
+to use the standard operation of disjoint union in order to compose two
-minimal regular expression (for example in terms of length) for a regular
+automata. But we cannot use such a polymorphic definition of automata in HOL
-language, like there is
+as part of the definition for regularity of a language (a set of strings).
-for automata. On the other hand, efficient regular expression matching,
+Consider the following attempt
-without using automata, poses no problem \cite{OwensReppyTuron09}.
-For an implementation of a simple regular expression matcher,
-whose correctness has been formally established, we refer the reader to
+\begin{center}
-Owens and Slind \cite{OwensSlind08}.
+@{text "is_regular A \<equiv> \<exists>M(\<alpha>). is_finite_automata (M) \<and> \<calL>(M) = A"}
+\end{center}
+\noindent
+where the definiens is polymorphic in the type of the automaton @{text
+"M"}, but the definiendum @{text "is_regular"} is not. Such definitions are
+excluded in HOL, because they lead easily to inconsistencies (see
+\cite{PittsHOL4} for a simple example). Also HOL does not contain
+type-quantifiers which would allow us to get rid of the polymorphism by
+quantifying over the type-variable @{text "\<alpha>"}. Therefore when defining
+regularity in terms of automata, the only natural way out in HOL is to use state
+nodes with an identity, for example a natural number. Unfortunately, the
+consequence is that we have to be careful when combining two automata so
+that there is no clash between two such states. This makes formalisations
+quite fiddly and unpleasant. Regular expressions proved much more convenient
+for reasoning in HOL and we showed they can be used for establishing the
+Myhill-Nerode theorem.
+While regular expressions are convenient, they have some limitations. One is
+that there seems to be no method of calculating a minimal regular expression
+(for example in terms of length) for a regular language, like there is for
+automata. On the other hand, efficient regular expression matching, without
+using automata, poses no problem \cite{OwensReppyTuron09}.  For an
+implementation of a simple regular expression matcher, whose correctness has
+been formally established, we refer the reader to Owens and Slind
+\cite{OwensSlind08}.
 Our formalisation consists of 780 lines of Isabelle/Isar code for the first
 direction and 460 for the second, plus around 300 lines of standard material
 about regular languages. The formalisation about derivatives and partial
-derivatives shown in Section~\ref{derivatives} consists of 390 lines of code.
+derivatives shown in Section~\ref{derivatives} consists of 390 lines of
-While this might be seen large, it should be seen
+code.  The algorithm for solving equational systems, which we used in the
+first direction, is conceptually not that complicated. Still, the use of sets
+over which the algorithm operates means it is not as easy to formalise as
+one might wish. It seems sets cannot be avoided since the `input' of the
+algorithm consists of equivalence classes and we cannot see how to
+reformulate the theory so that we can use lists, which are usually easier to
+reason about in a theorem prover.
+While our formalisation might be seen as large, it should be seen
 in the context of the work done by Constable et al.\ \cite{Constable00} who
 formalised the Myhill-Nerode theorem in Nuprl using automata. They write
 that their four-member team needed something on the magnitude of 18 months
 for their formalisation. The estimate for our formalisation is that we
 needed approximately 3 months and this included the time to find our proof
 formalisation in Nuprl, but from what is shown in the Nuprl Math Library
 about their development it seems substantially larger than ours. The code of
 ours can be found in the Mercurial Repository at
 \mbox{\url{http://www4.in.tum.de/~urbanc/regexp.html}}.
 Our proof of the first direction is very much inspired by \emph{Brzozowski's
-algebraic method} used to convert a finite automaton to a regular
+algebraic method} used to convert a finite automaton to a regular expression
-expression \cite{Brzozowski64}. The close connection can be seen by considering the equivalence
+\cite{Brzozowski64}. The close connection can be seen by considering the
-classes as the states of the minimal automaton for the regular language.
+equivalence classes as the states of the minimal automaton for the regular
-However there are some subtle differences. Since we identify equivalence
+language.  However there are some subtle differences. Since we identify
-classes with the states of the automaton, then the most natural choice is to
+equivalence classes with the states of the automaton, the most natural
-characterise each state with the set of strings starting from the initial
+choice is to characterise each state with the set of strings starting from
-state leading up to that state. Usually, however, the states are characterised as the
+the initial state leading up to that state. Usually, however, the states are
-strings starting from that state leading to the terminal states.  The first
+characterised as the strings starting from that state leading to the
-choice has consequences about how the initial equational system is set up. We have
+terminal states.  The first choice has consequences for how the initial
-the $\lambda$-term on our `initial state', while Brzozowski has it on the
+equational system is set up. We have the $\lambda$-term on our `initial
-terminal states. This means we also need to reverse the direction of Arden's
+state', while Brzozowski has it on the terminal states. This means we also
-Lemma.
+need to reverse the direction of Arden's Lemma. We have not found anything
+in the literature about this way of proving the first direction of the
-This is also where our method shines, because we can completely
+Myhill-Nerode theorem, but it appears to be folklore.
-side-step the standard argument \cite{Kozen97} where automata need
-to be composed, which as stated in the Introduction is not so easy
+We presented two proofs for the second direction of the Myhill-Nerode
-to formalise in a HOL-based theorem prover. However, it is also the
+theorem: one direct proof using tagging-functions and another using partial
-direction where we had to spend most of the `conceptual' time, as
+derivatives. These proofs are where our method shines, because we can
-our proof-argument based on tagging-functions is new for
+completely side-step the standard argument \cite{Kozen97} where automata
-establishing the Myhill-Nerode theorem. All standard proofs of this
+need to be composed. However, it is also the direction where we had to spend
-direction proceed by arguments over automata.\medskip
+most of the `conceptual' time, as our first proof based on
+tagging-functions is new for establishing the Myhill-Nerode theorem. All
+standard proofs of this direction proceed by arguments over automata.
+Our indirect proof for the second direction arose from the interest in
+Brzozowski's derivatives for regular expression matching. A corresponding
+regular expression matcher has been formalised in HOL4 in
+\cite{OwensSlind08}. In our opinion, this formalisation is considerably
+slicker than, for example, the approach to regular expression matching taken
+in \cite{Harper99} and \cite{Yi06}. While Brzozowski's derivatives lead to
+simple regular expression matchers and he proved that there are only
+finitely many dissimilar derivatives for every regular expression, this
+result is not as straightforward to formalise in a theorem prover. The
+reason is that the set of dissimilar derivatives is not defined inductively,
+but in terms of an ACI-equivalence relation.
+\medskip
 We expect that the development of Krauss \& Nipkow gets easier by
 using partial derivatives.\medskip
 \noindent

changeset 196	fa8d33d13cb6
parent 194	5347d7556487
child 197	cf1c17431dab