polishing of the closure section and conclusion
author urbanc
Mon, 15 Aug 2011 21:09:08 +0000
changeset 196 fa8d33d13cb6
parent 195 5bbe63876f84
child 197 cf1c17431dab
polishing of the closure section and conclusion
Journal/Paper.thy
Journal/document/root.bib
--- a/Journal/Paper.thy	Fri Aug 12 17:08:58 2011 +0000
+++ b/Journal/Paper.thy	Mon Aug 15 21:09:08 2011 +0000
@@ -183,7 +183,6 @@
   we have to be able to combine automata.  Consider for
   example the operation of sequencing two automata, say $A_1$ and $A_2$, by
   connecting the accepting states of $A_1$ to the initial state of $A_2$:
-  %  
 
   \begin{center}
   \begin{tabular}{ccc}
@@ -252,12 +251,14 @@
   \noindent
   changes the type---the disjoint union is not a set, but a set of
   pairs. Using this definition for disjoint union means we do not have a
-  single type for automata. As a result we will not be able to define a regular
-  language as one for which there exists an automaton that recognises all its
-  strings (Definition~\ref{baddef}). This is because we cannot make a definition in HOL that is polymorphic in 
-  the state type and there is no type quantification available in HOL (unlike 
-  in Coq, for example).\footnote{Slind already pointed out this problem in an email 
-  to the HOL4 mailing list on 21st April 2005.}
+  single type for automata. As a result we will not be able to define a
+  regular language as one for which there exists an automaton that recognises
+  all its strings (Definition~\ref{baddef}). This is because we cannot make a
+  definition in HOL that is polymorphic in the state type and there is no type
+  quantification available in HOL (unlike in Coq, for example).\footnote{Slind
+  already pointed out this problem in an email to the HOL4 mailing list on
+  21st April 2005.}
+
 
   An alternative, which provides us with a single type for automata, is to give every 
   state node an identity, for example a natural
@@ -1888,7 +1889,7 @@
   @{text "\<^raw:$\threesim$>\<^bsub>(\<lambda>x. ders x r)\<^esub>"} as tagging-relation. 
 
 
-  \begin{proof}[Proof of Theorem~\ref{myhillnerodetwo}]
+  \begin{proof}[Proof of Theorem~\ref{myhillnerodetwo} (second version)]
   Using \eqref{mhders}
   and \eqref{Derspders} we can easily infer that
 
@@ -1916,18 +1917,18 @@
   \end{proof}
 *}
 
-section {* Closure Properties *}
+section {* Closure Properties of Regular Languages *}
 
 text {*
   \noindent
-  The real beauty of regular languages is that they are closed
-  under almost all set operations. Closure under union, concatenation and Kleene-star
-  are trivial to establish given our definition of regularity (Definition~\ref{regular}).
-  More interesting is the closure under complement, because
-  it seems difficult to construct a regular expression for the complement
-  language by direct means. However the existence of such a regular expression
-  can now be easily proved using the Myhill-Nerode theorem since 
-  
+  The beauty of regular languages is that they are closed under many set
+  operations. Closure under union, concatenation and Kleene-star are trivial
+  to establish given our definition of regularity (recall Definition~\ref{regular}).
+  More interesting is the closure under complement, because it seems difficult
+  to construct a regular expression for the complement language by direct
+  means. However the existence of such a regular expression can now be easily
+  proved using the Myhill-Nerode theorem since
+
   \begin{center}
   @{term "s\<^isub>1 \<approx>A s\<^isub>2"} if and only if @{term "s\<^isub>1 \<approx>(-A) s\<^isub>2"}
   \end{center}
@@ -1935,8 +1936,13 @@
   \noindent
   holds for any strings @{text "s\<^isub>1"} and @{text
   "s\<^isub>2"}. Therefore @{text A} and the complement language @{term "-A"}
-  give rise to the same partitions. So if one is finite, the other is too and the
-  other way around.
+  give rise to the same partitions. So if one is finite, the other is too, and
+  vice versa. Proving the existence of such a regular expression via automata
+  using the standard method would be quite involved. It requires the steps:
+  regular expression @{text "\<Rightarrow>"} non-deterministic automaton @{text "\<Rightarrow>"}
+  deterministic automaton @{text "\<Rightarrow>"} complement automaton @{text "\<Rightarrow>"} regular
+  expression. This is clearly not something one wants to formalise in a theorem
+  prover in which it is cumbersome to reason about automata.
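+
+  To see why the equivalence displayed above holds, recall that the
+  Myhill-Nerode relation relates two strings whenever extending both with the
+  same string leads to the same membership status: for all strings $x$, $y$
+  and $z$ (writing $xz$ for their concatenation)
+
+  \begin{center}
+  $(xz \in A \;\longleftrightarrow\; yz \in A)
+   \quad\mbox{if and only if}\quad
+   (xz \in -A \;\longleftrightarrow\; yz \in -A)$
+  \end{center}
+
+  \noindent
+  because a biconditional is preserved when both of its sides are negated.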
 
   Once closure under complement is established, closure under intersection
   and set difference is also easy, because
@@ -1949,15 +1955,15 @@
   \end{center}
 
   \noindent
-  Closure of regular languages under reversal, which means 
+  Closure of regular languages under reversal, that is
 
   \begin{center}
   @{text "A\<^bsup>-1\<^esup> \<equiv> {s\<^bsup>-1\<^esup> | s \<in> A}"}
   \end{center}
 
   \noindent 
-  can be shown with the help of the following operation defined on regular
-  expressions
+  can be shown with the help of the following operation defined recursively over 
+  regular expressions
 
   \begin{center}
   \begin{tabular}{r@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
@@ -1973,38 +1979,42 @@
   \end{center}
 
   \noindent
-  For this operation we can so
+  For this operation we can show
 
   \begin{center}
   @{text "(\<calL>(r))\<^bsup>-1\<^esup>"}~@{text "="}~@{thm (rhs) rev_lang}
   \end{center}
   
   \noindent
-  from which closure under reversal follows.
+  from which closure under reversal of regular languages follows.
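+
+  As a small worked instance (the regular expression is chosen purely for
+  illustration): applying this operation to $(a \cdot b)^\star \cdot c$ swaps
+  the order of all concatenations and yields $c \cdot (b \cdot a)^\star$, whose
+  language contains exactly the reversed strings of the original language.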
 
-  The perhaps the most surprising fact is that regular languages are closed under any 
-  left-quotient. Define 
+  A perhaps surprising fact is that regular languages are closed under any
+  left-quotient. Define
 
   \begin{center}
   @{abbrev "Ders_lang B A"}
   \end{center}
 
   \noindent
-  and assume @{text A} is regular. From this we know there exists a regular
-  expression @{text r} such that @{term "A = lang r"}. We also know that
-  @{term "pders_lang B r"} is finite. By definition and Lemma~\ref{Derspders}
+  and assume @{text B} is any language and @{text A} is regular; then @{term
+  "Ders_lang B A"} is regular. To see this, consider the following argument
+  using partial derivatives: from @{text A} being regular we know there exists
+  a regular expression @{text r} such that @{term "A = lang r"}. We also know
+  that @{term "pders_lang B r"} is finite for every language @{text B} and
+  regular expression @{text r} (recall Theorem~\ref{antimirov}). By definition
+  and Lemma~\ref{Derspders} we therefore have
+
   
   \begin{equation}\label{eqq}
   @{term "Ders_lang B (lang r) = (\<Union> lang ` (pders_lang B r))"}
   \end{equation}
 
   \noindent
-  Since there are only finitely many regular expressions in @{term "pders_lang B r"}
-  by Theorem~\ref{antimirov}, we know that the right-hand side of \eqref{eqq}, is 
-  equal to @{term "lang (\<Uplus>(pders_lang B r))"} using \eqref{uplus}. Hence
-  the regular expression @{term "pders_lang B r"} verifies that @{term "Ders_lang B A"} 
-  is regular.
-
+  Since there are only finitely many regular expressions in @{term "pders_lang
+  B r"}, we know by \eqref{uplus} that the right-hand side of \eqref{eqq} is
+  equal to \mbox{@{term "lang (\<Uplus>(pders_lang B r))"}}. Thus the regular
+  expression @{term "\<Uplus>(pders_lang B r)"} verifies that @{term "Ders_lang B A"}
+  is regular.
 *}
 
 
@@ -2033,29 +2043,61 @@
   is the case of closure under complement, because it seems difficult to
   construct a regular expression for the complement language by direct
   means. However the existence of such a regular expression can be easily
-  proved using the Myhill-Nerode theorem.  Proving the existence of such a
-  regular expression via automata using the standard method would be quite
-  involved. It includes the steps: regular expression @{text "\<Rightarrow>"}
-  non-deterministic automaton @{text "\<Rightarrow>"} deterministic automaton @{text "\<Rightarrow>"}
-  complement automaton @{text "\<Rightarrow>"} regular expression.
+  proved using the Myhill-Nerode theorem.  
+
+  Our insistence on regular expressions for proving the Myhill-Nerode theorem
+  arose from the limitations of HOL on which the popular theorem provers HOL4,
+  HOLlight and Isabelle/HOL are based. In order to guarantee consistency,
+  formalisations can only extend HOL by definitions that introduce a notion in
+  terms of already existing concepts. A convenient definition for automata
+  (based on graphs) use a polymorphic type for the state nodes. This allows us
+  to use the standard operation of disjoint union in order to compose two
+  automata. But we cannot use such a polymorphic definition of automata in HOL
+  as part of the definition for regularity of a language (a set of strings).
+  Consider the following attempt
 
 
-  While regular expressions are convenient in formalisations, they have some
-  limitations. One is that there seems to be no method of calculating a
-  minimal regular expression (for example in terms of length) for a regular
-  language, like there is
-  for automata. On the other hand, efficient regular expression matching,
-  without using automata, poses no problem \cite{OwensReppyTuron09}.
-  For an implementation of a simple regular expression matcher,
-  whose correctness has been formally established, we refer the reader to
-  Owens and Slind \cite{OwensSlind08}.
+  \begin{center}
+  @{text "is_regular A \<equiv> \<exists>M(\<alpha>). is_finite_automata (M) \<and> \<calL>(M) = A"}
+  \end{center}
 
+  \noindent
+  which means the definiens is polymorphic in the type of the automata @{text
+  "M"}, but the definiendum @{text "is_regular"} is not. Such definitions are
+  excluded in HOL, because they lead easily to inconsistencies (see
+  \cite{PittsHOL4} for a simple example). Also, HOL does not contain
+  type-quantifiers which would allow us to get rid of the polymorphism by
+  quantifying over the type-variable @{text "\<alpha>"}. Therefore, when defining
+  regularity in terms of automata, the only natural way out in HOL is to use
+  state nodes with an identity, for example a natural number. Unfortunately, the
+  consequence is that we have to be careful when combining two automata so
+  that there is no clash between two such states. This makes formalisations
+  quite fiddly and unpleasant. Regular expressions proved much more convenient
+  for reasoning in HOL and we showed they can be used for establishing the
+  Myhill-Nerode theorem.
+
+  While regular expressions are convenient, they have some limitations. One is
+  that there seems to be no method of calculating a minimal regular expression
+  (for example in terms of length) for a regular language, like there is for
+  automata. On the other hand, efficient regular expression matching, without
+  using automata, poses no problem \cite{OwensReppyTuron09}.  For an
+  implementation of a simple regular expression matcher, whose correctness has
+  been formally established, we refer the reader to Owens and Slind
+  \cite{OwensSlind08}.
 
   Our formalisation consists of 780 lines of Isabelle/Isar code for the first
   direction and 460 for the second, plus around 300 lines of standard material
-  about regular languages. The formalisation about derivatives and partial 
-  derivatives shown in Section~\ref{derivatives} consists of 390 lines of code.
-  While this might be seen large, it should be seen
+  about regular languages. The formalisation about derivatives and partial
+  derivatives shown in Section~\ref{derivatives} consists of 390 lines of
+  code.  The algorithm for solving equational systems, which we used in the
+  first direction, is conceptually not that complicated. Still, the use of sets
+  over which the algorithm operates means it is not as easy to formalise as
+  one might wish. It seems sets cannot be avoided since the `input' of the
+  algorithm consists of equivalence classes and we cannot see how to
+  reformulate the theory so that we can use lists, which are usually easier to
+  reason about in a theorem prover.
+
+  While our formalisation might seem large, it should be seen
  in the context of the work done by Constable et al \cite{Constable00} who
   formalised the Myhill-Nerode theorem in Nuprl using automata. They write
   that their four-member team needed something on the magnitude of 18 months
@@ -2069,29 +2111,46 @@
   ours can be found in the Mercurial Repository at
   \mbox{\url{http://www4.in.tum.de/~urbanc/regexp.html}}.
 
+  Our proof of the first direction is very much inspired by \emph{Brzozowski's
+  algebraic method} used to convert a finite automaton to a regular expression
+  \cite{Brzozowski64}. The close connection can be seen by considering the
+  equivalence classes as the states of the minimal automaton for the regular
+  language.  However, there are some subtle differences. Since we identify
+  equivalence classes with the states of the automaton, the most natural
+  choice is to characterise each state with the set of strings starting from
+  the initial state leading up to that state. Usually, however, the states are
+  characterised as the strings starting from that state leading to the
+  terminal states.  The first choice has consequences for how the initial
+  equational system is set up. We have the $\lambda$-term on our `initial
+  state', while Brzozowski has it on the terminal states. This means we also
+  need to reverse the direction of Arden's Lemma. We have not found anything
+  in the literature about this way of proving the first direction of the
+  Myhill-Nerode theorem, but it appears to be folklore.
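+
+  (For orientation: in its two standard forms, Arden's Lemma states that,
+  provided the empty string is not in $A$, the equation $X = X \cdot A \cup B$
+  has the unique solution $X = B \cdot A^\star$, while $X = A \cdot X \cup B$
+  has the unique solution $X = A^\star \cdot B$.)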
 
-  Our proof of the first direction is very much inspired by \emph{Brzozowski's
-  algebraic method} used to convert a finite automaton to a regular
-  expression \cite{Brzozowski64}. The close connection can be seen by considering the equivalence
-  classes as the states of the minimal automaton for the regular language.
-  However there are some subtle differences. Since we identify equivalence
-  classes with the states of the automaton, then the most natural choice is to
-  characterise each state with the set of strings starting from the initial
-  state leading up to that state. Usually, however, the states are characterised as the
-  strings starting from that state leading to the terminal states.  The first
-  choice has consequences about how the initial equational system is set up. We have
-  the $\lambda$-term on our `initial state', while Brzozowski has it on the
-  terminal states. This means we also need to reverse the direction of Arden's
-  Lemma.
+  We presented two proofs for the second direction of the Myhill-Nerode
+  theorem: one direct proof using tagging-functions and another using partial
+  derivatives. These proofs are where our method shines, because we can
+  completely side-step the standard argument \cite{Kozen97} where automata
+  need to be composed. However, it is also the direction where we had to spend
+  most of the `conceptual' time, as our first proof based on
+  tagging-functions is new for establishing the Myhill-Nerode theorem. All
+  standard proofs of this direction proceed by arguments over automata.
+  
+  Our indirect proof for the second direction arose from our interest in
+  Brzozowski's derivatives for regular expression matching. A corresponding
+  regular expression matcher has been formalised in HOL4 in
+  \cite{OwensSlind08}. In our opinion, this formalisation is considerably
+  slicker than for example the approach to regular expression matching taken
+  in \cite{Harper99} and \cite{Yi06}. While Brzozowski's derivatives lead to
+  simple regular expression matchers and he proved that there are only
+  finitely many dissimilar derivatives for every regular expression, this
+  result is not as straightforward to formalise in a theorem prover. The
+  reason is that the set of dissimilar derivatives is not defined inductively,
+  but in terms of an ACI-equivalence relation.
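+
+  (Here ACI refers to the associativity, commutativity and idempotence of the
+  $+$ operator on regular expressions: $r + (s + t) \equiv (r + s) + t$,
+  \mbox{$r + s \equiv s + r$} and $r + r \equiv r$.)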
 
-  This is also where our method shines, because we can completely
-  side-step the standard argument \cite{Kozen97} where automata need
-  to be composed, which as stated in the Introduction is not so easy
-  to formalise in a HOL-based theorem prover. However, it is also the
-  direction where we had to spend most of the `conceptual' time, as
-  our proof-argument based on tagging-functions is new for
-  establishing the Myhill-Nerode theorem. All standard proofs of this
-  direction proceed by arguments over automata.\medskip
+  
+
+  \medskip
 
   We expect that the development of Krauss \& Nipkow gets easier by
   using partial derivatives.\medskip
--- a/Journal/document/root.bib	Fri Aug 12 17:08:58 2011 +0000
+++ b/Journal/document/root.bib	Mon Aug 15 21:09:08 2011 +0000
@@ -1,4 +1,30 @@
+@article{Harper99,
+  author    = {R.~Harper},
+  title     = {Proof-Directed Debugging},
+  journal   = {Journal of Functional Programming},
+  volume    = {9},
+  number    = {4},
+  year      = {1999},
+  pages     = {463-469}
+}
 
+@article{Yi06,
+  author    = {K.~Yi},
+  title     = {Educational Pearl: 'Proof-directed debugging' revisited
+               for a first-order version},
+  journal   = {Journal of Functional Programming},
+  volume    = {16},
+  number    = {6},
+  year      = {2006},
+  pages     = {663-670}
+}
+
+
+@Manual{PittsHOL4,
+  title = 	 {{S}yntax and {S}emantics},
+  author = 	 {A.~M.~Pitts},
+  note = 	 {Part of the documentation for the HOL4 system.}
+}
 
 @article{OwensReppyTuron09,
   author = {S.~Owens and J.~Reppy and A.~Turon},
@@ -63,7 +89,7 @@
 @article{Brzozowski64,
  author = {J.~A.~Brzozowski},
  title = {{D}erivatives of {R}egular {E}xpressions},
- journal = {J.~ACM},
+ journal = {Journal of the ACM},
  volume = {11},
  issue = {4},
  year = {1964},