250 \end{equation} |
249 \end{equation} |
251 |
250 |
252 \noindent |
251 \noindent |
253 changes the type---the disjoint union is not a set, but a set of |
252 changes the type---the disjoint union is not a set, but a set of |
254 pairs. Using this definition for disjoint union means we do not have a |
253 pairs. Using this definition for disjoint union means we do not have a |
255 single type for automata. As a result we will not be able to define a regular |
254 single type for automata. As a result we will not be able to define a |
256 language as one for which there exists an automaton that recognises all its |
255 regular language as one for which there exists an automaton that recognises |
257 strings (Definition~\ref{baddef}). This is because we cannot make a definition in HOL that is polymorphic in |
256 all its strings (Definition~\ref{baddef}). This is because we cannot make a |
258 the state type and there is no type quantification available in HOL (unlike |
257 definition in HOL that is polymorphic in the state type and there is no type |
259 in Coq, for example).\footnote{Slind already pointed out this problem in an email |
258 quantification available in HOL (unlike in Coq, for example).\footnote{Slind |
260 to the HOL4 mailing list on 21st April 2005.} |
259 already pointed out this problem in an email to the HOL4 mailing list on |
|
260 21st April 2005.} |
|
261 |
261 |
262 |
262 An alternative, which provides us with a single type for automata, is to give every |
263 An alternative, which provides us with a single type for automata, is to give every |
263 state node an identity, for example a natural |
264 state node an identity, for example a natural |
264 number, and then be careful to rename these identities apart whenever |
265 number, and then be careful to rename these identities apart whenever |
265 connecting two automata. This results in clunky proofs |
266 connecting two automata. This results in clunky proofs |
1914 are only finitely many equivalence classes of @{text "\<^raw:$\threesim$>\<^bsub>(\<lambda>x. ders x r)\<^esub>"}, |
1915 are only finitely many equivalence classes of @{text "\<^raw:$\threesim$>\<^bsub>(\<lambda>x. ders x r)\<^esub>"}, |
1915 and we can again conclude the second part of the Myhill-Nerode theorem. |
1916 and we can again conclude the second part of the Myhill-Nerode theorem. |
1916 \end{proof} |
1917 \end{proof} |
1917 *} |
1918 *} |
1918 |
1919 |
1919 section {* Closure Properties *} |
1920 section {* Closure Properties of Regular Languages *} |
1920 |
1921 |
1921 text {* |
1922 text {* |
1922 \noindent |
1923 \noindent |
1923 The real beauty of regular languages is that they are closed |
1924 The beauty of regular languages is that they are closed under many set |
1924 under almost all set operations. Closure under union, concatenation and Kleene-star |
1925 operations. Closure under union, concatenation and Kleene-star are trivial |
1925 are trivial to establish given our definition of regularity (Definition~\ref{regular}). |
1926 to establish given our definition of regularity (recall Definition~\ref{regular}). |
1926 More interesting is the closure under complement, because |
1927 More interesting is the closure under complement, because it seems difficult |
1927 it seems difficult to construct a regular expression for the complement |
1928 to construct a regular expression for the complement language by direct |
1928 language by direct means. However the existence of such a regular expression |
1929 means. However the existence of such a regular expression can now be easily |
1929 can now be easily proved using the Myhill-Nerode theorem since |
1930 proved using the Myhill-Nerode theorem since |
1930 |
1931 |
1931 \begin{center} |
1932 \begin{center} |
1932 @{term "s\<^isub>1 \<approx>A s\<^isub>2"} if and only if @{term "s\<^isub>1 \<approx>(-A) s\<^isub>2"} |
1933 @{term "s\<^isub>1 \<approx>A s\<^isub>2"} if and only if @{term "s\<^isub>1 \<approx>(-A) s\<^isub>2"} |
1933 \end{center} |
1934 \end{center} |
1934 |
1935 |
1935 \noindent |
1936 \noindent |
1936 holds for any strings @{text "s\<^isub>1"} and @{text |
1937 holds for any strings @{text "s\<^isub>1"} and @{text |
1937 "s\<^isub>2"}. Therefore @{text A} and the complement language @{term "-A"} |
1938 "s\<^isub>2"}. Therefore @{text A} and the complement language @{term "-A"} |
1938 give rise to the same partitions. So if one is finite, the other is too and the |
1939 give rise to the same partitions. So if one is finite, the other is too, and |
1939 other way around. |
1940 the other way around. Proving the existence of such a regular expression via |
|
1941 automata using the standard method would be quite involved. It includes the |
|
1942 steps: regular expression @{text "\<Rightarrow>"} non-deterministic automaton @{text |
|
1943 "\<Rightarrow>"} deterministic automaton @{text "\<Rightarrow>"} complement automaton @{text "\<Rightarrow>"} |
|
1944 regular expression. Clearly not something you want to formalise in a theorem |
|
1945 prover in which it is cumbersome to reason about automata. |
1940 |
1946 |
1941 Once closure under complement is established, closure under intersection |
1947 Once closure under complement is established, closure under intersection |
1942 and set difference is also easy, because |
1948 and set difference is also easy, because |
1943 |
1949 |
1944 \begin{center} |
1950 \begin{center} |
1971 @{thm (lhs) Rev.simps(6)} & @{text "\<equiv>"} & @{thm (rhs) Rev.simps(6)}\\ |
1977 @{thm (lhs) Rev.simps(6)} & @{text "\<equiv>"} & @{thm (rhs) Rev.simps(6)}\\ |
1972 \end{tabular} |
1978 \end{tabular} |
1973 \end{center} |
1979 \end{center} |
1974 |
1980 |
1975 \noindent |
1981 \noindent |
1976 For this operation we can so |
1982 For this operation we can show |
1977 |
1983 |
1978 \begin{center} |
1984 \begin{center} |
1979 @{text "(\<calL>(r))\<^bsup>-1\<^esup>"}~@{text "="}~@{thm (rhs) rev_lang} |
1985 @{text "(\<calL>(r))\<^bsup>-1\<^esup>"}~@{text "="}~@{thm (rhs) rev_lang} |
1980 \end{center} |
1986 \end{center} |
1981 |
1987 |
1982 \noindent |
1988 \noindent |
1983 from which closure under reversal follows. |
1989 from which closure under reversal of regular languages follows. |
1984 |
1990 |
1985 The perhaps the most surprising fact is that regular languages are closed under any |
1991 A perhaps surprising fact is that regular languages are closed under any |
1986 left-quotient. Define |
1992 left-quotient. Define |
1987 |
1993 |
1988 \begin{center} |
1994 \begin{center} |
1989 @{abbrev "Ders_lang B A"} |
1995 @{abbrev "Ders_lang B A"} |
1990 \end{center} |
1996 \end{center} |
1991 |
1997 |
1992 \noindent |
1998 \noindent |
1993 and assume @{text A} is regular. From this we know there exists a regular |
1999 and assume @{text B} is any language and @{text A} is regular, then @{term |
1994 expression @{text r} such that @{term "A = lang r"}. We also know that |
2000 "Ders_lang B A"} is regular. To see this consider the following argument |
1995 @{term "pders_lang B r"} is finite. By definition and Lemma~\ref{Derspders} |
2001 using partial derivatives: From @{text A} being regular we know there exists |
|
2002 a regular expression @{text r} such that @{term "A = lang r"}. We also know |
|
2003 that @{term "pders_lang B r"} is finite for every language @{text B} and |
|
2004 regular expression @{text r} (recall Theorem~\ref{antimirov}). By definition |
|
2005 and Lemma~\ref{Derspders} therefore |
|
2006 |
1996 |
2007 |
1997 \begin{equation}\label{eqq} |
2008 \begin{equation}\label{eqq} |
1998 @{term "Ders_lang B (lang r) = (\<Union> lang ` (pders_lang B r))"} |
2009 @{term "Ders_lang B (lang r) = (\<Union> lang ` (pders_lang B r))"} |
1999 \end{equation} |
2010 \end{equation} |
2000 |
2011 |
2001 \noindent |
2012 \noindent |
2002 Since there are only finitely many regular expressions in @{term "pders_lang B r"} |
2013 Since there are only finitely many regular expressions in @{term "pders_lang |
2003 by Theorem~\ref{antimirov}, we know that the right-hand side of \eqref{eqq}, is |
2014 B r"}, we know by \eqref{uplus} that there exists a regular expression that |
2004 equal to @{term "lang (\<Uplus>(pders_lang B r))"} using \eqref{uplus}. Hence |
2015 the right-hand side of \eqref{eqq} is equal to \mbox{@{term "lang (\<Uplus>(pders_lang B |
2005 the regular expression @{term "pders_lang B r"} verifies that @{term "Ders_lang B A"} |
2016 r))"}}. Thus the regular expression @{term "\<Uplus>(pders_lang B r)"} verifies that |
2006 is regular. |
2017 @{term "Ders_lang B A"} is regular. |
2007 |
|
2008 *} |
2018 *} |
2009 |
2019 |
2010 |
2020 |
2011 section {* Conclusion and Related Work *} |
2021 section {* Conclusion and Related Work *} |
2012 |
2022 |
2031 pumping lemma \cite{Kozen97}). We can also use it to establish the standard |
2041 pumping lemma \cite{Kozen97}). We can also use it to establish the standard |
2032 textbook results about closure properties of regular languages. Interesting |
2042 textbook results about closure properties of regular languages. Interesting |
2033 is the case of closure under complement, because it seems difficult to |
2043 is the case of closure under complement, because it seems difficult to |
2034 construct a regular expression for the complement language by direct |
2044 construct a regular expression for the complement language by direct |
2035 means. However the existence of such a regular expression can be easily |
2045 means. However the existence of such a regular expression can be easily |
2036 proved using the Myhill-Nerode theorem. Proving the existence of such a |
2046 proved using the Myhill-Nerode theorem. |
2037 regular expression via automata using the standard method would be quite |
2047 |
2038 involved. It includes the steps: regular expression @{text "\<Rightarrow>"} |
2048 Our insistence on regular expressions for proving the Myhill-Nerode theorem |
2039 non-deterministic automaton @{text "\<Rightarrow>"} deterministic automaton @{text "\<Rightarrow>"} |
2049 arose from the limitations of HOL on which the popular theorem provers HOL4, |
2040 complement automaton @{text "\<Rightarrow>"} regular expression. |
2050 HOLlight and Isabelle/HOL are based. In order to guarantee consistency, |
2041 |
2051 formalisations can only extend HOL by definitions that introduce a notion in |
2042 |
2052 terms of already existing concepts. A convenient definition for automata |
2043 While regular expressions are convenient in formalisations, they have some |
2053 (based on graphs) use a polymorphic type for the state nodes. This allows us |
2044 limitations. One is that there seems to be no method of calculating a |
2054 to use the standard operation of disjoint union in order to compose two |
2045 minimal regular expression (for example in terms of length) for a regular |
2055 automata. But we cannot use such a polymorphic definition of automata in HOL |
2046 language, like there is |
2056 as part of the definition for regularity of a language (a set of strings). |
2047 for automata. On the other hand, efficient regular expression matching, |
2057 Consider the following attempt |
2048 without using automata, poses no problem \cite{OwensReppyTuron09}. |
2058 |
2049 For an implementation of a simple regular expression matcher, |
2059 |
2050 whose correctness has been formally established, we refer the reader to |
2060 \begin{center} |
2051 Owens and Slind \cite{OwensSlind08}. |
2061 @{text "is_regular A \<equiv> \<exists>M(\<alpha>). is_finite_automata (M) \<and> \<calL>(M) = A"} |
2052 |
2062 \end{center} |
|
2063 |
|
2064 \noindent |
|
2065 which means the definiens is polymorphic in the type of the automata @{text |
|
2066 "M"}, but the definiendum @{text "is_regular"} is not. Such definitions are |
|
2067 excluded in HOL, because they lead easily to inconsistencies (see |
|
2068 \cite{PittsHOL4} for a simple example). Also HOL does not contain |
|
2069 type-quantifiers which would allow us to get rid of the polymorphism by |
|
2070 quantifying over the type-variable @{text "\<alpha>"}. Therefore when defining |
|
2071 regularity in terms of automata, the only natural way out in HOL is to use state |
|
2072 nodes with an identity, for example a natural number. Unfortunatly, the |
|
2073 consequence is that we have to be careful when combining two automata so |
|
2074 that there is no clash between two such states. This makes formalisations |
|
2075 quite fiddly and unpleasant. Regular expressions proved much more convenient |
|
2076 for reasoning in HOL and we showed they can be used for establishing the |
|
2077 Myhill-Nerode theorem. |
|
2078 |
|
2079 While regular expressions are convenient, they have some limitations. One is |
|
2080 that there seems to be no method of calculating a minimal regular expression |
|
2081 (for example in terms of length) for a regular language, like there is for |
|
2082 automata. On the other hand, efficient regular expression matching, without |
|
2083 using automata, poses no problem \cite{OwensReppyTuron09}. For an |
|
2084 implementation of a simple regular expression matcher, whose correctness has |
|
2085 been formally established, we refer the reader to Owens and Slind |
|
2086 \cite{OwensSlind08}. |
2053 |
2087 |
2054 Our formalisation consists of 780 lines of Isabelle/Isar code for the first |
2088 Our formalisation consists of 780 lines of Isabelle/Isar code for the first |
2055 direction and 460 for the second, plus around 300 lines of standard material |
2089 direction and 460 for the second, plus around 300 lines of standard material |
2056 about regular languages. The formalisation about derivatives and partial |
2090 about regular languages. The formalisation about derivatives and partial |
2057 derivatives shown in Section~\ref{derivatives} consists of 390 lines of code. |
2091 derivatives shown in Section~\ref{derivatives} consists of 390 lines of |
2058 While this might be seen large, it should be seen |
2092 code. The algorithm for solving equational systems, which we used in the |
|
2093 first direction, is conceptually not that complicated. Still the use of sets |
|
2094 over which the algorithm operates, means it is not as easy to formalise as |
|
2095 one might wish. It seems sets cannot be avoided since the `input' of the |
|
2096 algorithm consists of equivalence classes and we cannot see how to |
|
2097 reformulate the theory so that we can use lists, which are usually easier to |
|
2098 reason about in a theorem prover. |
|
2099 |
|
2100 While our formalisation might be seen large, it should be seen |
2059 in the context of the work done by Constable at al \cite{Constable00} who |
2101 in the context of the work done by Constable at al \cite{Constable00} who |
2060 formalised the Myhill-Nerode theorem in Nuprl using automata. They write |
2102 formalised the Myhill-Nerode theorem in Nuprl using automata. They write |
2061 that their four-member team needed something on the magnitude of 18 months |
2103 that their four-member team needed something on the magnitude of 18 months |
2062 for their formalisation. The estimate for our formalisation is that we |
2104 for their formalisation. The estimate for our formalisation is that we |
2063 needed approximately 3 months and this included the time to find our proof |
2105 needed approximately 3 months and this included the time to find our proof |
2067 formalisation in Nurpl, but from what is shown in the Nuprl Math Library |
2109 formalisation in Nurpl, but from what is shown in the Nuprl Math Library |
2068 about their development it seems substantially larger than ours. The code of |
2110 about their development it seems substantially larger than ours. The code of |
2069 ours can be found in the Mercurial Repository at |
2111 ours can be found in the Mercurial Repository at |
2070 \mbox{\url{http://www4.in.tum.de/~urbanc/regexp.html}}. |
2112 \mbox{\url{http://www4.in.tum.de/~urbanc/regexp.html}}. |
2071 |
2113 |
2072 |
|
2073 Our proof of the first direction is very much inspired by \emph{Brzozowski's |
2114 Our proof of the first direction is very much inspired by \emph{Brzozowski's |
2074 algebraic method} used to convert a finite automaton to a regular |
2115 algebraic method} used to convert a finite automaton to a regular expression |
2075 expression \cite{Brzozowski64}. The close connection can be seen by considering the equivalence |
2116 \cite{Brzozowski64}. The close connection can be seen by considering the |
2076 classes as the states of the minimal automaton for the regular language. |
2117 equivalence classes as the states of the minimal automaton for the regular |
2077 However there are some subtle differences. Since we identify equivalence |
2118 language. However there are some subtle differences. Since we identify |
2078 classes with the states of the automaton, then the most natural choice is to |
2119 equivalence classes with the states of the automaton, then the most natural |
2079 characterise each state with the set of strings starting from the initial |
2120 choice is to characterise each state with the set of strings starting from |
2080 state leading up to that state. Usually, however, the states are characterised as the |
2121 the initial state leading up to that state. Usually, however, the states are |
2081 strings starting from that state leading to the terminal states. The first |
2122 characterised as the strings starting from that state leading to the |
2082 choice has consequences about how the initial equational system is set up. We have |
2123 terminal states. The first choice has consequences about how the initial |
2083 the $\lambda$-term on our `initial state', while Brzozowski has it on the |
2124 equational system is set up. We have the $\lambda$-term on our `initial |
2084 terminal states. This means we also need to reverse the direction of Arden's |
2125 state', while Brzozowski has it on the terminal states. This means we also |
2085 Lemma. |
2126 need to reverse the direction of Arden's Lemma. We have not found anything |
2086 |
2127 in the literature about this way of proving the first direction of the |
2087 This is also where our method shines, because we can completely |
2128 Myhill-Nerode theorem, but it appears to be folklore. |
2088 side-step the standard argument \cite{Kozen97} where automata need |
2129 |
2089 to be composed, which as stated in the Introduction is not so easy |
2130 We presented two proofs for the second direction of the Myhill-Nerode |
2090 to formalise in a HOL-based theorem prover. However, it is also the |
2131 theorem. One direct proof using tagging-functions and another using partial |
2091 direction where we had to spend most of the `conceptual' time, as |
2132 derivatives. These proofs is where our method shines, because we can |
2092 our proof-argument based on tagging-functions is new for |
2133 completely side-step the standard argument \cite{Kozen97} where automata |
2093 establishing the Myhill-Nerode theorem. All standard proofs of this |
2134 need to be composed. However, it is also the direction where we had to spend |
2094 direction proceed by arguments over automata.\medskip |
2135 most of the `conceptual' time, as our first proof based on |
|
2136 tagging-functions is new for establishing the Myhill-Nerode theorem. All |
|
2137 standard proofs of this direction proceed by arguments over automata. |
|
2138 |
|
2139 Our indirect proof for the second direction arose from the interest in |
|
2140 Brzozowski's derivatives for regular expression matching. A corresponding |
|
2141 regular expression matcher has been formalised in HOL4 in |
|
2142 \cite{OwensSlind08}. In our opinion, this formalisation is considerably |
|
2143 slicker than for example the approach to regular expression matching taken |
|
2144 in \cite{Harper99} and \cite{Yi06}. While Brzozowski's derivatives lead to |
|
2145 simple regular expression matchers and he proved that there are only |
|
2146 finitely many dissimilar derivatives for every regular expression, this |
|
2147 result is not as straightforward to formalise in a theorem prover. The |
|
2148 reason is that the set of dissimilar derivatives is not defined inductively, |
|
2149 but in terms of an ACI-equivalence relation. |
|
2150 |
|
2151 |
|
2152 |
|
2153 \medskip |
2095 |
2154 |
2096 We expect that the development of Krauss \& Nipkow gets easier by |
2155 We expect that the development of Krauss \& Nipkow gets easier by |
2097 using partial derivatives.\medskip |
2156 using partial derivatives.\medskip |
2098 |
2157 |
2099 \noindent |
2158 \noindent |