  transition ("_ \<^raw:\ensuremath{\stackrel{\text{>_\<^raw:}}{\Longmapsto}}> _" [100, 100, 100] 100) and
  Setalt ("\<^raw:\ensuremath{\bigplus}>_" [1000] 999) and
  Append_rexp2 ("_ \<^raw:\ensuremath{\triangleleft}> _" [100, 100] 100) and
  Append_rexp_rhs ("_ \<^raw:\ensuremath{\triangleleft}> _" [100, 100] 50) and

  uminus ("\<^raw:\ensuremath{\overline{\isa{>_\<^raw:}}}>" [100] 100) and
  tag_Plus ("+tag _ _" [100, 100] 100) and
  tag_Plus ("+tag _ _ _" [100, 100, 100] 100) and
  tag_Times ("\<times>tag _ _" [100, 100] 100) and
  tag_Times ("\<times>tag _ _ _" [100, 100, 100] 100) and
  tag_Star ("\<star>tag _" [100] 100) and

Logic (HOL), for example HOL4, HOLlight or Isabelle/HOL. For the development
presented in this paper we will use Isabelle/HOL. HOL is a predicate calculus
that allows quantification over predicate variables. Its type system is
based on Church's Simple Theory of Types \cite{Church40}. Although many
mathematical concepts can be conveniently expressed in HOL, there are some
limitations that hurt badly when attempting a simple-minded formalisation
of regular languages in it.

The typical approach (for example \cite{HopcroftUllman69,Kozen97}) to
regular languages is to introduce finite deterministic automata and then
define everything in terms of them. For example, a regular language is

lexing. Berghofer and Reiter \cite{BerghoferReiter09} formalise automata
working over bit strings in the context of Presburger arithmetic. The only
larger formalisations of automata theory are carried out in Nuprl
\cite{Constable00} and in Coq, e.g.~\cite{Filliatre97,Almeidaetal10}.

One might also consider automata as convenient `vehicles' for
establishing properties about regular languages. However, paper proofs
about automata often involve subtle side-conditions which are easily
overlooked, but which make formal reasoning rather painful. For example,
Kozen's proof of the Myhill-Nerode theorem requires that automata do not
have inaccessible states \cite{Kozen97}. Another subtle side-condition is
completeness of automata, that is, automata need to have total transition
functions and at most one `sink' state from which there is no connection to
a final state (Brzozowski mentions this side-condition in the context of
state complexity of automata \cite{Brzozowski10}). Such side-conditions mean
that if we define a regular language as one for which there exists \emph{a}
finite automaton that recognises all its strings (see
Definition~\ref{baddef}), then we need a lemma which ensures that an
equivalent automaton satisfying the side-conditions can be found, and we
also need to make sure that our operations on automata preserve them.
Unfortunately, such `little' and `obvious' lemmas make a formalisation of
automata theory a hair-pulling experience.

In this paper, we will not attempt to formalise automata theory in
Isabelle/HOL nor will we attempt to formalise automata proofs from the
literature, but take a different approach to regular languages than is
usually taken. Instead of defining a regular language as one where there

main purpose of this paper is to show that a central result about regular
languages---the Myhill-Nerode theorem---can be recreated using only
regular expressions. This theorem gives necessary and sufficient conditions
for when a language is regular. As a corollary of this theorem we can easily
establish the usual closure properties, including complementation, for
regular languages. In one example we also use the continuation lemma, which
is based on the Myhill-Nerode relation, for establishing the non-regularity
of languages \cite{rosenberg06}.\medskip

\noindent
{\bf Contributions:} There is an extensive literature on regular languages.
To the best of our knowledge, our proof of the Myhill-Nerode theorem is the
first that is based only on regular expressions. The part of this theorem stating

finitely many terms of the form @{text "(Y\<^isub>i\<^isub>j, ATOM c\<^isub>i\<^isub>j)"} in a right-hand side
since by assumption there are only finitely many
equivalence classes and only finitely many characters.
The term @{text "\<lambda>(ONE)"} in the first equation acts as a marker for the initial state, that
is the equivalence class
containing the empty string @{text "[]"}.\footnote{Note that we mark, roughly speaking, the
single `initial' state in the equational system, which is different from
the method by Brzozowski \cite{Brzozowski64}, where he marks the
`terminal' states. We are forced to set up the equational system in our
way, because the Myhill-Nerode relation determines the `direction' of the
transitions---the successor `state' of an equivalence class @{text Y} can

\noindent
holds for any strings @{text "s\<^isub>1"} and @{text "s\<^isub>2"}. Therefore
@{text A} and the complement language @{term "-A"} give rise to the same
partitions. So if one is finite, the other is too, and vice versa. As noted
earlier, our algorithm for solving equational systems actually calculates
the regular expression. Calculating this regular expression via automata
using the standard method would be quite involved. It includes the steps:
regular expression @{text "\<Rightarrow>"} non-deterministic automaton @{text "\<Rightarrow>"}
deterministic automaton @{text "\<Rightarrow>"} complement automaton @{text "\<Rightarrow>"}
regular expression. Clearly, this is not something one wants to formalise in
a theorem prover in which it is cumbersome to reason about automata.

Since there are only finitely many regular expressions in @{term "pderivs_lang
B r"}, we know by \eqref{uplus} that there exists a regular expression so that
the right-hand side of \eqref{eqq} is equal to the language \mbox{@{term "lang (\<Uplus>(pderivs_lang B
r))"}}. Thus the regular expression @{term "\<Uplus>(pderivs_lang B r)"} verifies that
@{term "Deriv_lang B A"} is regular.
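
For readers who wish to experiment with partial derivatives, the following
sketch computes them for concrete regular expressions. It is only an
illustration outside the Isabelle formalisation: the Haskell datatype and the
names @{text "pder"} and @{text "pders"} are our own choice and merely mirror
the constants used above, while @{text "pderivs_lang"} in the text
additionally collects such sets for all strings of a language @{text B}.

\begin{verbatim}
import qualified Data.Set as S

data Rexp = Zero | One | Atom Char
          | Plus Rexp Rexp | Times Rexp Rexp | Star Rexp
          deriving (Eq, Ord, Show)

-- nullable r: does the language of r contain the empty string?
nullable :: Rexp -> Bool
nullable Zero        = False
nullable One         = True
nullable (Atom _)    = False
nullable (Plus r s)  = nullable r || nullable s
nullable (Times r s) = nullable r && nullable s
nullable (Star _)    = True

-- partial derivative of r with respect to a single character
pder :: Char -> Rexp -> S.Set Rexp
pder _ Zero        = S.empty
pder _ One         = S.empty
pder c (Atom d)    = if c == d then S.singleton One else S.empty
pder c (Plus r s)  = pder c r `S.union` pder c s
pder c (Times r s) =
  let left = S.map (`Times` s) (pder c r)
  in if nullable r then left `S.union` pder c s else left
pder c (Star r)    = S.map (`Times` Star r) (pder c r)

-- partial derivatives with respect to a whole string
pders :: String -> Rexp -> S.Set Rexp
pders []     r = S.singleton r
pders (c:cs) r = S.unions [ pders cs r' | r' <- S.toList (pder c r) ]

-- e.g. pders "ab" (Times (Atom 'a') (Atom 'b')) == S.fromList [One]
\end{verbatim}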
|
Even more surprising is the fact that for every language @{text A}, the language
consisting of all substrings of @{text A} is regular \cite{Shallit08, Gasarch09}.
A substring can be obtained by striking out zero or more characters from a
string. This can be defined inductively by the following three rules:

\begin{center}
@{thm[mode=Axiom] emb0[where bs="x"]}\hspace{10mm}
@{thm[mode=Rule] emb1[where as="x" and b="c" and bs="y"]}\hspace{10mm}
@{thm[mode=Rule] emb2[where as="x" and a="c" and bs="y"]}
\end{center}
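
These rules give a simple executable test, shown below as a sketch outside
the formalisation (the name @{text "isSubseqOf"} is our own choice); it
decides the relation @{text "x \<preceq> y"} by greedily matching the characters of
@{text x} against @{text y}, which is equivalent to the inductive definition
above.

\begin{verbatim}
-- x `isSubseqOf` y: can x be obtained from y by striking out characters?
isSubseqOf :: String -> String -> Bool
isSubseqOf []     _  = True             -- the empty string embeds into anything
isSubseqOf _      [] = False            -- a non-empty string never embeds into []
isSubseqOf (a:as) (b:bs)
  | a == b    = isSubseqOf as bs        -- match the character (third rule)
  | otherwise = isSubseqOf (a:as) bs    -- strike out b (second rule)
\end{verbatim}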
|
\noindent
It can be easily proved that @{text "\<preceq>"} is a partial order. Now define the
languages of substrings and superstrings of a language @{text A}:

\begin{center}
\begin{tabular}{l}
@{thm SUBSEQ_def}\\
@{thm SUPSEQ_def}
\end{tabular}
\end{center}

\noindent
We want to establish

\begin{lmm}\label{subseqreg}
For any language @{text A}, the languages @{term "SUBSEQ A"} and @{term "SUPSEQ A"}
are regular.
\end{lmm}

\noindent
Our proof follows the one given in \cite[92--95]{Shallit08}, except that we use
Higman's lemma, which is already proved in the Isabelle/HOL library. Higman's
lemma allows us to prove that every antichain @{text A}, that is every set
satisfying

\begin{equation}\label{higman}
@{text "\<forall>x, y \<in> A."}~@{term "x \<noteq> y \<longrightarrow> \<not>(x \<preceq> y) \<and> \<not>(y \<preceq> x)"}
\end{equation}

\noindent
is finite.
|
It is straightforward to prove the following properties of @{term SUPSEQ}

\begin{equation}\label{supseqprops}
\mbox{\begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
@{thm (lhs) SUPSEQ_simps(1)} & @{text "\<equiv>"} & @{thm (rhs) SUPSEQ_simps(1)}\\
@{thm (lhs) SUPSEQ_simps(2)} & @{text "\<equiv>"} & @{thm (rhs) SUPSEQ_simps(2)}\\
@{thm (lhs) SUPSEQ_atom} & @{text "\<equiv>"} & @{thm (rhs) SUPSEQ_atom}\\
@{thm (lhs) SUPSEQ_union} & @{text "\<equiv>"} & @{thm (rhs) SUPSEQ_union}\\
@{thm (lhs) SUPSEQ_conc} & @{text "\<equiv>"} & @{thm (rhs) SUPSEQ_conc}\\
@{thm (lhs) SUPSEQ_star} & @{text "\<equiv>"} & @{thm (rhs) SUPSEQ_star}
\end{tabular}}
\end{equation}

\noindent
whereby the last equation follows from the fact that @{term "A\<star>"} contains the
empty string. With these properties at our disposal we can establish the lemma
|
\begin{lmm}
If @{text A} is regular, then so is @{term "SUPSEQ A"}.
\end{lmm}

\begin{proof}
Since our alphabet is finite, we can find a regular expression, written @{const Allreg}, that
matches every single-character string. With this regular expression we can inductively define
the operation @{text "r\<up>"} over regular expressions

\begin{center}
\begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l}
@{thm (lhs) UP.simps(1)} & @{text "\<equiv>"} & @{thm (rhs) UP.simps(1)}\\
@{thm (lhs) UP.simps(2)} & @{text "\<equiv>"} & @{thm (rhs) UP.simps(2)}\\
@{thm (lhs) UP.simps(3)} & @{text "\<equiv>"} & @{thm (rhs) UP.simps(3)}\\
@{thm (lhs) UP.simps(4)[where ?r1.0="r\<^isub>1" and ?r2.0="r\<^isub>2"]} &
@{text "\<equiv>"} & @{thm (rhs) UP.simps(4)[where ?r1.0="r\<^isub>1" and ?r2.0="r\<^isub>2"]}\\
@{thm (lhs) UP.simps(5)[where ?r1.0="r\<^isub>1" and ?r2.0="r\<^isub>2"]} &
@{text "\<equiv>"} & @{thm (rhs) UP.simps(5)[where ?r1.0="r\<^isub>1" and ?r2.0="r\<^isub>2"]}\\
@{thm (lhs) UP.simps(6)} & @{text "\<equiv>"} & @{thm (rhs) UP.simps(6)}
\end{tabular}
\end{center}

\noindent
and using \eqref{supseqprops} establish that @{thm lang_UP}. This shows
that @{term "SUPSEQ A"} is regular provided @{text A} is.
\end{proof}
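
To give a feel for what @{text "r\<up>"} computes, the following sketch spells
the construction out for a concrete, assumed two-letter alphabet. It is an
illustration outside the formalisation: the datatype is the same as in the
earlier sketch, the names @{text "allreg"} and @{text "up"} are our own, and
the right-hand sides are our reading of the equations guided by the
@{term SUPSEQ} properties in \eqref{supseqprops}, not a verbatim copy of the
formal definition.

\begin{verbatim}
data Rexp = Zero | One | Atom Char
          | Plus Rexp Rexp | Times Rexp Rexp | Star Rexp
          deriving (Eq, Show)

-- an assumed finite alphabet; allreg matches every single-character string
alphabet :: [Char]
alphabet = "ab"

allreg :: Rexp
allreg = foldr1 Plus (map Atom alphabet)

-- up r is intended to match exactly the superstrings of strings matched by r
up :: Rexp -> Rexp
up Zero          = Zero
up One           = Star allreg
up (Atom c)      = Times (Star allreg) (Times (Atom c) (Star allreg))
up (Plus r1 r2)  = Plus (up r1) (up r2)
up (Times r1 r2) = Times (up r1) (up r2)
up (Star _)      = Star allreg   -- SUPSEQ of a star language is the set of all strings
\end{verbatim}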
|
\noindent
Now we can prove the main lemma, namely

\begin{lmm}\label{mset}
For every language @{text A}, there exists a finite language @{text M} such that
\begin{center}
\mbox{@{term "SUPSEQ M = SUPSEQ A"}}\;.
\end{center}
\end{lmm}

\begin{proof}
For @{text M} we take the set of all minimal elements of @{text A}. An element @{text x}
is minimal in @{text A} provided

\begin{center}
@{thm minimal_def}
\end{center}

\noindent
By Higman's lemma \eqref{higman} we know that @{text "M"} is finite, since
any two distinct minimal elements are incomparable, that is @{text M} is an
antichain.
It is also straightforward to show that @{term "SUPSEQ M \<subseteq> SUPSEQ A"}. For
the other direction assume @{term "x \<in> SUPSEQ A"}. From this we can obtain
a @{text y} such that @{term "y \<in> A"} and @{term "y \<preceq> x"}. Since we know that
the relation \mbox{@{term "{(y, x). y \<preceq> x \<and> x \<noteq> y}"}} is well-founded, there must
be a minimal element @{text "z"} such that @{term "z \<in> A"} and @{term "z \<preceq> y"},
and hence by transitivity also \mbox{@{term "z \<preceq> x"}} (here we deviate from the argument
given in \cite{Shallit08}, because Isabelle/HOL already provides an extensive infrastructure
for reasoning about well-foundedness). Since @{term "z"} is
minimal and in @{term A}, we also know that @{term z} is in @{term M}.
From this together with @{term "z \<preceq> x"}, we can infer that @{term x} is in
@{term "SUPSEQ M"} as required.
\end{proof}
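
The construction in this proof can be tried out directly when @{text A} is
finite. The sketch below (again outside the formalisation, with names of our
own choosing, and repeating @{text "isSubseqOf"} from the earlier sketch so
that it is self-contained) computes the minimal elements and checks that
every string of the language has one of them below it.

\begin{verbatim}
isSubseqOf :: String -> String -> Bool
isSubseqOf []     _  = True
isSubseqOf _      [] = False
isSubseqOf (a:as) (b:bs)
  | a == b    = isSubseqOf as bs
  | otherwise = isSubseqOf (a:as) bs

-- the minimal elements of a finite, duplicate-free language
minimalElems :: [String] -> [String]
minimalElems xs = [ x | x <- xs, null [ y | y <- xs, y /= x, y `isSubseqOf` x ] ]

-- every string of the language lies above some minimal element, which is
-- the inclusion SUPSEQ A <= SUPSEQ M used in the proof above
covered :: [String] -> Bool
covered xs = and [ or [ m `isSubseqOf` x | m <- minimalElems xs ] | x <- xs ]

-- e.g. minimalElems ["ab", "ba", "aab"] == ["ab", "ba"]
--      covered ["ab", "ba", "aab"]      == True
\end{verbatim}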
|
\noindent
This lemma allows us to establish the second part of Lemma~\ref{subseqreg}.

\begin{proof}[The Second Part of Lemma~\ref{subseqreg}]
Given any language @{text A}, by Lemma~\ref{mset} we know there exists
a finite, and thus regular, language @{text M} such that @{term "SUPSEQ M = SUPSEQ A"}.
Since @{text M} is regular, so is @{term "SUPSEQ M"} and therefore also
@{term "SUPSEQ A"}, which establishes the second part.
\end{proof}

\noindent
In order to establish the first part of Lemma~\ref{subseqreg} we use the
property proved in \cite{Shallit08}

\begin{equation}\label{compl}
@{thm SUBSEQ_complement}
\end{equation}

\begin{proof}[The First Part of Lemma~\ref{subseqreg}]
By the second part of Lemma~\ref{subseqreg}, we know the right-hand side of \eqref{compl}
is regular, which means the complement of @{term "SUBSEQ A"} is regular. But since
we have already established that regularity is preserved under complement,
@{term "SUBSEQ A"} must also be regular.
\end{proof}
*}


section {* Conclusion and Related Work *}
