411 A language @{text A} is \emph{regular}, provided there is a regular expression |
411 A language @{text A} is \emph{regular}, provided there is a regular expression |
412 that matches all strings of @{text "A"}. |
412 that matches all strings of @{text "A"}. |
413 \end{dfntn} |
413 \end{dfntn} |
414 |
414 |
415 \noindent |
415 \noindent |
|
416 And then `forget' automata. |
416 The reason is that regular expressions, unlike graphs, matrices and |
417 The reason is that regular expressions, unlike graphs, matrices and |
417 functions, can be easily defined as an inductive datatype. A reasoning |
418 functions, can be easily defined as an inductive datatype. A reasoning |
418 infrastructure (like induction and recursion) comes then for free in |
419 infrastructure (like induction and recursion) comes for free in |
419 HOL. Moreover, no side-conditions will be needed for regular expressions, |
420 HOL. Moreover, no side-conditions will be needed for regular expressions, |
420 like we need for automata. This convenience of regular expressions has |
421 like we need for automata. This convenience of regular expressions has |
421 recently been exploited in HOL4 with a formalisation of regular expression |
422 recently been exploited in HOL4 with a formalisation of regular expression |
422 matching based on derivatives \cite{OwensSlind08} and with an equivalence |
423 matching based on derivatives \cite{OwensSlind08} and with an equivalence |
423 checker for regular expressions in Isabelle/HOL \cite{KraussNipkow11}. The |
424 checker for regular expressions in Isabelle/HOL \cite{KraussNipkow11}. The |
2111 the right-hand side of \eqref{eqq} is equal to the language \mbox{@{term "lang (\<Uplus>(pderivs_lang B |
2112 the right-hand side of \eqref{eqq} is equal to the language \mbox{@{term "lang (\<Uplus>(pderivs_lang B |
2112 r))"}}. Thus the regular expression @{term "\<Uplus>(pderivs_lang B r)"} verifies that |
2113 r))"}}. Thus the regular expression @{term "\<Uplus>(pderivs_lang B r)"} verifies that |
2113 @{term "Deriv_lang B A"} is regular. |
2114 @{term "Deriv_lang B A"} is regular. |
2114 |
2115 |
2115 Even more surprising is the fact that for \emph{every} language @{text A}, the language |
2116 Even more surprising is the fact that for \emph{every} language @{text A}, the language |
2116 consisting of all substrings of @{text A} is regular \cite{Shallit08, Gasarch09}. |
2117 consisting of all substrings of @{text A} is regular \cite{Haines69} (see also |
|
2118 \cite{Shallit08, Gasarch09}). |
2117 A \emph{substring} can be obtained |
2119 A \emph{substring} can be obtained |
2118 by striking out zero or more characters from a string. This can be defined |
2120 by striking out zero or more characters from a string. This can be defined |
2119 inductively in Isabelle/HOL by the following three rules: |
2121 inductively in Isabelle/HOL by the following three rules: |
2120 |
2122 |
2121 \begin{center} |
2123 \begin{center} |
2137 \end{center} |
2139 \end{center} |
2138 |
2140 |
2139 \noindent |
2141 \noindent |
2140 We like to establish |
2142 We like to establish |
2141 |
2143 |
2142 \begin{lmm}\label{subseqreg} |
2144 \begin{thrm}[Haines \cite{Haines69}]\label{subseqreg} |
2143 For every language @{text A}, the languages @{text "(i)"} @{term "SUBSEQ A"} and |
2145 For every language @{text A}, the languages @{text "(i)"} @{term "SUBSEQ A"} and |
2144 @{text "(ii)"} @{term "SUPSEQ A"} |
2146 @{text "(ii)"} @{term "SUPSEQ A"} |
2145 are regular. |
2147 are regular. |
2146 \end{lmm} |
2148 \end{thrm} |
2147 |
2149 |
2148 \noindent |
2150 \noindent |
2149 Our proof follows the one given in \cite[Pages 92--95]{Shallit08}, except that we use |
2151 Our proof follows the one given in \cite[Pages 92--95]{Shallit08}, except that we use |
2150 Higman's Lemma, which is already proved in the Isabelle/HOL library \cite{Berghofer03}. |
2152 Higman's Lemma, which is already proved in the Isabelle/HOL library |
|
2153 \cite{Berghofer03}.\footnote{Unfortunately, Berghofer's formalisation of Higman's lemma |
|
2154 is restricted to 2-letter alphabets, |
|
2155 which means also our formalisation of Theorem~\ref{subseqreg} is `tainted' with |
|
2156 this constraint. However our methodology is applicable to any alphabet of finite size.} |
2151 Higman's Lemma allows us to infer that every set @{text A} of antichains, satisfying |
2157 Higman's Lemma allows us to infer that every set @{text A} of antichains, satisfying |
2152 |
2158 |
2153 \begin{equation}\label{higman} |
2159 \begin{equation}\label{higman} |
2154 @{text "\<forall>x, y \<in> A."}~@{term "x \<noteq> y \<longrightarrow> \<not>(x \<preceq> y) \<and> \<not>(y \<preceq> x)"} |
2160 @{text "\<forall>x, y \<in> A."}~@{term "x \<noteq> y \<longrightarrow> \<not>(x \<preceq> y) \<and> \<not>(y \<preceq> x)"} |
2155 \end{equation} |
2161 \end{equation} |
2156 |
2162 |
2157 \noindent |
2163 \noindent |
2158 is finite. |
2164 is finite. |
2159 |
2165 |
2160 The first step in our proof of Lemma~\ref{subseqreg} is to establish the |
2166 The first step in our proof of Theorem~\ref{subseqreg} is to establish the |
2161 following simple properties for @{term SUPSEQ} |
2167 following simple properties for @{term SUPSEQ} |
2162 |
2168 |
2163 \begin{equation}\label{supseqprops} |
2169 \begin{equation}\label{supseqprops} |
2164 \mbox{\begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l} |
2170 \mbox{\begin{tabular}{l@ {\hspace{1mm}}c@ {\hspace{1mm}}l} |
2165 @{thm (lhs) SUPSEQ_simps(1)} & @{text "\<equiv>"} & @{thm (rhs) SUPSEQ_simps(1)}\\ |
2171 @{thm (lhs) SUPSEQ_simps(1)} & @{text "\<equiv>"} & @{thm (rhs) SUPSEQ_simps(1)}\\ |
2201 and use \eqref{supseqprops} to establish that @{thm lang_UP} holds. This shows |
2207 and use \eqref{supseqprops} to establish that @{thm lang_UP} holds. This shows |
2202 that @{term "SUBSEQ A"} is regular, provided @{text A} is. |
2208 that @{term "SUBSEQ A"} is regular, provided @{text A} is. |
2203 \end{proof} |
2209 \end{proof} |
2204 |
2210 |
2205 \noindent |
2211 \noindent |
2206 Now we can prove our main lemma w.r.t.~@{const "SUPSEQ"}, namely |
2212 Now we can prove the main lemma w.r.t.~@{const "SUPSEQ"}, namely |
2207 |
2213 |
2208 \begin{lmm}\label{mset} |
2214 \begin{lmm}\label{mset} |
2209 For every language @{text A}, there exists a finite language @{text M} such that |
2215 For every language @{text A}, there exists a finite language @{text M} such that |
2210 \begin{center} |
2216 \begin{center} |
2211 \mbox{@{term "SUPSEQ M = SUPSEQ A"}}\;. |
2217 \mbox{@{term "SUPSEQ M = SUPSEQ A"}}\;. |
2236 From this together with \mbox{@{term "z \<preceq> x"}}, we can infer that @{term x} is in |
2242 From this together with \mbox{@{term "z \<preceq> x"}}, we can infer that @{term x} is in |
2237 @{term "SUPSEQ M"}, as required. |
2243 @{term "SUPSEQ M"}, as required. |
2238 \end{proof} |
2244 \end{proof} |
2239 |
2245 |
2240 \noindent |
2246 \noindent |
2241 This lemma allows us to establish the second part of Lemma~\ref{subseqreg}. |
2247 This lemma allows us to establish the second part of Theorem~\ref{subseqreg}. |
2242 |
2248 |
2243 \begin{proof}[Proof of the Second Part of Lemma~\ref{subseqreg}] |
2249 \begin{proof}[Proof of the Second Part of Theorem~\ref{subseqreg}] |
2244 Given any language @{text A}, by Lemma~\ref{mset} we know there exists |
2250 Given any language @{text A}, by Lemma~\ref{mset} we know there exists |
2245 a finite, and thus regular, language @{text M}. We further have @{term "SUPSEQ M = SUPSEQ A"}, |
2251 a finite, and thus regular, language @{text M}. We further have @{term "SUPSEQ M = SUPSEQ A"}, |
2246 which establishes the second part. |
2252 which establishes the second part. |
2247 \end{proof} |
2253 \end{proof} |
2248 |
2254 |
2249 \noindent |
2255 \noindent |
2250 In order to establish the first part of Lemma~\ref{subseqreg}, we use the |
2256 In order to establish the first part of this theorem, we use the |
2251 property proved in \cite{Shallit08}, namely that |
2257 property proved in \cite{Shallit08}, namely that |
2252 |
2258 |
2253 \begin{equation}\label{compl} |
2259 \begin{equation}\label{compl} |
2254 @{thm SUBSEQ_complement} |
2260 @{thm SUBSEQ_complement} |
2255 \end{equation} |
2261 \end{equation} |
2256 |
2262 |
2257 \noindent |
2263 \noindent |
2258 holds. Now the first part of~\ref{subseqreg} is a simple consequence of the second part. |
2264 holds. Now the first part of Theorem~\ref{subseqreg} is a simple consequence of the second part. |
2259 |
2265 |
2260 \begin{proof}[Proof of the First Part of Lemma~\ref{subseqreg}] |
2266 \begin{proof}[Proof of the First Part of Theorem~\ref{subseqreg}] |
2261 By the second part of Lemma~\ref{subseqreg}, we know the right-hand side of \eqref{compl} |
2267 By the second part, we know the right-hand side of \eqref{compl} |
2262 is regular, which means @{term "- SUBSEQ A"} is regular. But since |
2268 is regular, which means @{term "- SUBSEQ A"} is regular. But since |
2263 we established already that regularity is preserved under complement, also @{term "SUBSEQ A"} |
2269 we established already that regularity is preserved under complement, also @{term "SUBSEQ A"} |
2264 must be regular. |
2270 must be regular. |
2265 \end{proof} |
2271 \end{proof} |
2266 |
2272 |
2282 they need to be related by the Myhill-Nerode relation. |
2288 they need to be related by the Myhill-Nerode relation. |
2283 |
2289 |
2284 Using this lemma, it is straightforward to establish that the language |
2290 Using this lemma, it is straightforward to establish that the language |
2285 \mbox{@{text "A \<equiv> \<Union>\<^isub>n a\<^sup>n @ b\<^sup>n"}} is not regular (@{text "a\<^sup>n"} stands |
2291 \mbox{@{text "A \<equiv> \<Union>\<^isub>n a\<^sup>n @ b\<^sup>n"}} is not regular (@{text "a\<^sup>n"} stands |
2286 for the strings consisting of @{text n} times the character a; similarly for |
2292 for the strings consisting of @{text n} times the character a; similarly for |
2287 @{text "b\<^isub>n"}). For this consider the infinite set @{text "B \<equiv> \<Union>\<^isub>n a\<^sup>n"}. |
2293 @{text "b\<^isup>n"}). For this consider the infinite set @{text "B \<equiv> \<Union>\<^isub>n a\<^sup>n"}. |
2288 |
2294 |
2289 \begin{lmm} |
2295 \begin{lmm} |
2290 No two distinct strings in @{text "B"} are Myhill-Nerode related by @{text A}. |
2296 No two distinct strings in set @{text "B"} are Myhill-Nerode related by language @{text A}. |
2291 \end{lmm} |
2297 \end{lmm} |
2292 |
2298 |
2293 \begin{proof} |
2299 \begin{proof} |
2294 After unfolding the definitions, we need to establish that for @{term "i \<noteq> j"}, |
2300 After unfolding the definitions, we need to establish that for @{term "i \<noteq> j"}, |
2295 the equality \mbox{@{text "a\<^sup>i @ b\<^sup>j = a\<^sup>n @ b\<^sup>n"}} leads to a contradiction. This is clearly the case |
2301 the equality \mbox{@{text "a\<^sup>i @ b\<^sup>j = a\<^sup>n @ b\<^sup>n"}} leads to a contradiction. This is clearly the case |
2426 Our formalisation of the Myhill-Nerode theorem consists of 780 lines of |
2432 Our formalisation of the Myhill-Nerode theorem consists of 780 lines of |
2427 Isabelle/Isar code for the first direction and 460 for the second (the one |
2433 Isabelle/Isar code for the first direction and 460 for the second (the one |
2428 based on tagging-functions), plus around 300 lines of standard material |
2434 based on tagging-functions), plus around 300 lines of standard material |
2429 about regular languages. The formalisation of derivatives and partial |
2435 about regular languages. The formalisation of derivatives and partial |
2430 derivatives shown in Section~\ref{derivatives} consists of 390 lines of |
2436 derivatives shown in Section~\ref{derivatives} consists of 390 lines of |
2431 code. The closure properties in Section~\ref{closure} can be established in |
2437 code. The closure properties in Section~\ref{closure} (except Theorem~\ref{subseqreg}) |
2432 190 lines of code. The algorithm for solving equational systems, which we |
2438 can be established in |
|
2439 190 lines of code. The Continuation Lemma and the non-regularity of @{text "a\<^sup>n b\<^sup>n"} |
|
2440 require 80 lines of code. |
|
2441 The algorithm for solving equational systems, which we |
2433 used in the first direction, is conceptually relatively simple. Still the |
2442 used in the first direction, is conceptually relatively simple. Still the |
2434 use of sets over which the algorithm operates means it is not as easy to |
2443 use of sets over which the algorithm operates means it is not as easy to |
2435 formalise as one might hope. However, it seems sets cannot be avoided since |
2444 formalise as one might hope. However, it seems sets cannot be avoided since |
2436 the `input' of the algorithm consists of equivalence classes and we cannot |
2445 the `input' of the algorithm consists of equivalence classes and we cannot |
2437 see how to reformulate the theory so that we can use lists. Lists would be |
2446 see how to reformulate the theory so that we can use lists. Lists would be |
2465 \cite{myhillnerodeafp11}.\medskip |
2474 \cite{myhillnerodeafp11}.\medskip |
2466 |
2475 |
2467 \noindent |
2476 \noindent |
2468 {\bf Acknowledgements:} |
2477 {\bf Acknowledgements:} |
2469 We are grateful for the comments we received from Larry Paulson. Tobias |
2478 We are grateful for the comments we received from Larry Paulson. Tobias |
2470 Nipkow made us aware of the properties in Lemma~\ref{subseqreg} and Tjark |
2479 Nipkow made us aware of the properties in Theorem~\ref{subseqreg} and Tjark |
2471 Weber helped us with proving them. |
2480 Weber helped us with proving them. |
2472 *} |
2481 *} |
2473 |
2482 |
2474 |
2483 |
2475 (*<*) |
2484 (*<*) |