335 An alternative, which provides us with a single type for states of automata, |
335 An alternative, which provides us with a single type for states of automata, |
336 is to give every state node an identity, for example a natural number, and |
336 is to give every state node an identity, for example a natural number, and |
337 then be careful to rename these identities apart whenever connecting two |
337 then be careful to rename these identities apart whenever connecting two |
338 automata. This results in clunky proofs establishing that properties are |
338 automata. This results in clunky proofs establishing that properties are |
339 invariant under renaming. Similarly, connecting two automata represented as |
339 invariant under renaming. Similarly, connecting two automata represented as |
340 matrices results in very adhoc constructions, which are not pleasant to |
340 matrices results in messy constructions, which are not pleasant to |
341 reason about. |
341 formally reason about. \citeN[Page 67]{Braibant12}, for example, writes that there are no |
|
342 problems with reasoning about matrices, but that there is an |
|
343 ``intrinsic difficulty of working with rectangular matrices'' in some |
|
344 parts of his formalisation of formal languages in Coq. |
342 |
345 |
343 Functions are much better supported in Isabelle/HOL, but they still lead to |
346 Functions are much better supported in Isabelle/HOL, but they still lead to |
344 similar problems as with graphs. Composing, for example, two |
347 similar problems as with graphs. Composing, for example, two |
345 non-deterministic automata in parallel requires also the formalisation of |
348 non-deterministic automata in parallel requires also the formalisation of |
346 disjoint unions. \citeN{Nipkow98} dismisses for this the option of |
349 disjoint unions. \citeN{Nipkow98} dismisses for this the option of |
374 %condition upon functions in order to represent \emph{finite} automata. The |
377 %condition upon functions in order to represent \emph{finite} automata. The |
375 %best is probably to resort to more advanced reasoning frameworks, such as |
378 %best is probably to resort to more advanced reasoning frameworks, such as |
376 %\emph{locales} or \emph{type classes}, which are \emph{not} available in all |
379 %\emph{locales} or \emph{type classes}, which are \emph{not} available in all |
377 %HOL-based theorem provers. |
380 %HOL-based theorem provers. |
378 |
381 |
379 Because of these problems to do with representing automata, there seems to |
382 Because of these problems to do with representing automata, formalising |
380 be no substantial formalisation of automata theory and regular languages |
383 automata theory is surprisingly not as easy as one might think, despite the |
381 carried out in HOL-based theorem provers. \citeN{Nipkow98} establishes |
384 sometimes very detailed, but informal, textbook proofs. \citeN{LammichTuerk12} |
|
385 formalised Hopcroft's algorithm using an automata library of 27 kloc in |
|
386 Isabelle/HOL. There they use matrices for representing automata. |
|
387 Functions are used by \citeN{Nipkow98}, who establishes |
382 the link between regular expressions and automata in the context of |
388 the link between regular expressions and automata in the context of |
383 lexing. \citeN{BerghoferReiter09} formalise automata |
389 lexing. \citeN{BerghoferReiter09} use functions as well for formalising automata |
384 working over bit strings in the context of Presburger arithmetic. The only |
390 working over bit strings in the context of Presburger arithmetic. |
385 larger formalisations of automata theory are carried out in Nuprl by |
391 A Larger formalisation of automata theory, including the Myhill-Nerode theorem, |
386 \citeN{Constable00} and in Coq by \citeN{Filliatre97}, \citeN{Almeidaetal10} |
392 was carried out in Nuprl by \citeN{Constable00} which also uses functions. |
387 and \citeN{Braibant12}. |
393 Other large formailsations of automata theory in Coq are by |
|
394 \citeN{Filliatre97} who essentially uses graphs and by \citeN{Almeidaetal10} |
|
395 and \citeN{Braibant12}, who both use matrices. Many of these works, |
|
396 like \citeN{Nipkow98} or \citeN{Braibant12}, mention intrinsic problems with |
|
397 their representation of |
|
398 automata which made them `fight' their respective theorem prover. |
388 |
399 |
389 %Also, one might consider automata as just convenient `vehicles' for |
400 %Also, one might consider automata as just convenient `vehicles' for |
390 %establishing properties about regular languages. However, paper proofs |
401 %establishing properties about regular languages. However, paper proofs |
391 %about automata often involve subtle side-conditions which are easily |
402 %about automata often involve subtle side-conditions which are easily |
392 %overlooked, but which make formal reasoning rather painful. For example |
403 %overlooked, but which make formal reasoning rather painful. For example |
426 HOL. %Moreover, no side-conditions will be needed for regular expressions, |
437 HOL. %Moreover, no side-conditions will be needed for regular expressions, |
427 %like we need for automata. |
438 %like we need for automata. |
428 This convenience of regular expressions has |
439 This convenience of regular expressions has |
429 recently been exploited in HOL4 with a formalisation of regular expression |
440 recently been exploited in HOL4 with a formalisation of regular expression |
430 matching based on derivatives by \citeN{OwensSlind08}, and with an equivalence |
441 matching based on derivatives by \citeN{OwensSlind08}, and with an equivalence |
431 checker for regular expressions in Isabelle/HOL by \citeN{KraussNipkow11}. The |
442 checker for regular expressions in Isabelle/HOL by \citeN{KraussNipkow11} |
|
443 and in Matida by \citeN{Asperti12} and in Coq by \citeN{CoquandSiles12}. The |
432 main purpose of this paper is to show that a central result about regular |
444 main purpose of this paper is to show that a central result about regular |
433 languages---the Myhill-Nerode Theorem---can be recreated by only using |
445 languages---the Myhill-Nerode Theorem---can be recreated by only using |
434 regular expressions. This theorem gives necessary and sufficient conditions |
446 regular expressions. This theorem gives necessary and sufficient conditions |
435 for when a language is regular. As a corollary of this theorem we can easily |
447 for when a language is regular. As a corollary of this theorem we can easily |
436 establish the usual closure properties, including complementation, for |
448 establish the usual closure properties, including complementation, for |
2078 \end{center} |
2091 \end{center} |
2079 |
2092 |
2080 \noindent |
2093 \noindent |
2081 and assume @{text B} is any language and @{text A} is regular, then @{term |
2094 and assume @{text B} is any language and @{text A} is regular, then @{term |
2082 "Deriv_lang B A"} is regular. To see this consider the following argument |
2095 "Deriv_lang B A"} is regular. To see this consider the following argument |
2083 using partial derivatives: From @{text A} being regular we know there exists |
2096 using partial derivatives (which we used in Section~\ref{derivatives}): From @{text A} |
|
2097 being regular we know there exists |
2084 a regular expression @{text r} such that @{term "A = lang r"}. We also know |
2098 a regular expression @{text r} such that @{term "A = lang r"}. We also know |
2085 that @{term "pderivs_lang B r"} is finite for every language @{text B} and |
2099 that @{term "pderivs_lang B r"} is finite for every language @{text B} and |
2086 regular expression @{text r} (recall Theorem~\ref{antimirov}). By definition |
2100 regular expression @{text r} (recall Theorem~\ref{antimirov}). By definition |
2087 and \eqref{Derspders} we have |
2101 and \eqref{Derspders} we have |
2088 % |
2102 % |
2251 holds. Now the first part of Theorem~\ref{subseqreg} is a simple consequence of the second part. |
2265 holds. Now the first part of Theorem~\ref{subseqreg} is a simple consequence of the second part. |
2252 |
2266 |
2253 \begin{proof}[of the First Part of Theorem~\ref{subseqreg}] |
2267 \begin{proof}[of the First Part of Theorem~\ref{subseqreg}] |
2254 By the second part, we know the right-hand side of \eqref{compl} |
2268 By the second part, we know the right-hand side of \eqref{compl} |
2255 is regular, which means @{term "- SUBSEQ A"} is regular. But since |
2269 is regular, which means @{term "- SUBSEQ A"} is regular. But since |
2256 we established already that regularity is preserved under complement, also @{term "SUBSEQ A"} |
2270 we established already that regularity is preserved under complement (using Myhill-Nerode), |
2257 must be regular. |
2271 also @{term "SUBSEQ A"} must be regular. |
2258 \end{proof} |
2272 \end{proof} |
2259 |
2273 |
2260 Finally we like to show that the Myhill-Nerode Theorem is also convenient for establishing |
2274 Finally we like to show that the Myhill-Nerode Theorem is also convenient for establishing |
2261 the non-regularity of languages. For this we use the following version of the Continuation |
2275 the non-regularity of languages. For this we use the following version of the Continuation |
2262 Lemma (see for example~\cite{Rosenberg06}). |
2276 Lemma (see for example~\cite{Rosenberg06}). |
2416 Theorem~\ref{antimirov}). However, since partial derivatives use sets of |
2430 Theorem~\ref{antimirov}). However, since partial derivatives use sets of |
2417 regular expressions, one needs to carefully analyse whether the resulting |
2431 regular expressions, one needs to carefully analyse whether the resulting |
2418 algorithm is still executable. Given the infrastructure for |
2432 algorithm is still executable. Given the infrastructure for |
2419 executable sets introduced by \citeN{Haftmann09} in Isabelle/HOL, it should. |
2433 executable sets introduced by \citeN{Haftmann09} in Isabelle/HOL, it should. |
2420 |
2434 |
|
2435 We started out by claiming that in a theorem prover it is eaiser to reason |
|
2436 about regular expressions than about automta. Here are some numbers: |
2421 Our formalisation of the Myhill-Nerode Theorem consists of 780 lines of |
2437 Our formalisation of the Myhill-Nerode Theorem consists of 780 lines of |
2422 Isabelle/Isar code for the first direction and 460 for the second (the one |
2438 Isabelle/Isar code for the first direction and 460 for the second (the one |
2423 based on tagging-functions), plus around 300 lines of standard material |
2439 based on tagging-functions), plus around 300 lines of standard material |
2424 about regular languages. The formalisation of derivatives and partial |
2440 about regular languages. The formalisation of derivatives and partial |
2425 derivatives shown in Section~\ref{derivatives} consists of 390 lines of |
2441 derivatives shown in Section~\ref{derivatives} consists of 390 lines of |
2428 100 lines of code. The Continuation Lemma and the non-regularity of @{text "a\<^sup>n b\<^sup>n"} |
2444 100 lines of code. The Continuation Lemma and the non-regularity of @{text "a\<^sup>n b\<^sup>n"} |
2429 require 70 lines of code. |
2445 require 70 lines of code. |
2430 The algorithm for solving equational systems, which we |
2446 The algorithm for solving equational systems, which we |
2431 used in the first direction, is conceptually relatively simple. Still the |
2447 used in the first direction, is conceptually relatively simple. Still the |
2432 use of sets over which the algorithm operates means it is not as easy to |
2448 use of sets over which the algorithm operates means it is not as easy to |
2433 formalise as one might hope. However, it seems sets cannot be avoided since |
2449 formalise as one might wish. However, it seems sets cannot be avoided since |
2434 the `input' of the algorithm consists of equivalence classes and we cannot |
2450 the `input' of the algorithm consists of equivalence classes and we cannot |
2435 see how to reformulate the theory so that we can use lists or matrices. Lists would be |
2451 see how to reformulate the theory so that we can use lists or matrices. Lists would be |
2436 much easier to reason about, since we can define functions over them by |
2452 much easier to reason about, since we can define functions over them by |
2437 recursion. For sets we have to use set-comprehensions, which is slightly |
2453 recursion. For sets we have to use set-comprehensions, which is slightly |
2438 unwieldy. Matrices would allow us to use the slick formalisation by |
2454 unwieldy. Matrices would allow us to use the slick formalisation by |
2443 formalised the Myhill-Nerode Theorem in Nuprl using automata. They write |
2459 formalised the Myhill-Nerode Theorem in Nuprl using automata. They write |
2444 that their four-member team would need something on the magnitude of 18 months |
2460 that their four-member team would need something on the magnitude of 18 months |
2445 for their formalisation of the first eleven chapters of the textbook by \citeN{HopcroftUllman69}, |
2461 for their formalisation of the first eleven chapters of the textbook by \citeN{HopcroftUllman69}, |
2446 which includes the Myhill-Nerode theorem. It is hard to gauge the size of a |
2462 which includes the Myhill-Nerode theorem. It is hard to gauge the size of a |
2447 formalisation in Nurpl, but from what is shown in the Nuprl Math Library |
2463 formalisation in Nurpl, but from what is shown in the Nuprl Math Library |
2448 about their development it seems substantially larger than ours. We attribute |
2464 about their development it seems \emph{substantially} larger than ours. We attribute |
2449 this to our use of regular expressions, which meant we did not need to `fight' |
2465 this to our use of regular expressions, which meant we did not need to `fight' |
2450 the theorem prover. |
2466 the theorem prover. Recently, \citeN{LammichTuerk12} formalised Hopcroft's |
|
2467 algorithm in Isabelle/HOL (in 7000 lines of code) using an automata |
|
2468 library of 27000 lines of code. |
2451 Also, \citeN{Filliatre97} reports that his formalisation in |
2469 Also, \citeN{Filliatre97} reports that his formalisation in |
2452 Coq of automata theory and Kleene's theorem is ``rather big''. |
2470 Coq of automata theory and Kleene's theorem is ``rather big''. |
2453 More recently, \citeN{Almeidaetal10} reported about another |
2471 \citeN{Almeidaetal10} reported about another |
2454 formalisation of regular languages in Coq. Their |
2472 formalisation of regular languages in Coq. Their |
2455 main result is the |
2473 main result is the |
2456 correctness of Mirkin's construction of an automaton from a regular |
2474 correctness of Mirkin's construction of an automaton from a regular |
2457 expression using partial derivatives. This took approximately 10600 lines |
2475 expression using partial derivatives. This took approximately 10600 lines |
2458 of code. Also \citeN{Braibant12} formalised a large part of regular language |
2476 of code. \citeN{Braibant12} formalised a large part of regular language |
2459 theory and Kleene algebras in Coq. While he is mainly interested |
2477 theory and Kleene algebras in Coq. While he is mainly interested |
2460 in implementing decision procedures for Kleene algebras, his library |
2478 in implementing decision procedures for Kleene algebras, his library |
2461 includes a proof of the Myhill-Nerode theorem. He reckons that our |
2479 includes a proof of the Myhill-Nerode theorem. He reckons that our |
2462 ``development is more concise'' than his one based on matrices \cite[Page 67]{Braibant12}. |
2480 ``development is more concise'' than his one based on matrices \cite[Page 67]{Braibant12}. |
|
2481 He writes that there is no conceptual problems with formally reasoning |
|
2482 about matrices for automata, but notes ``intrinsic difficult[ies]'' when working |
|
2483 with matrices in Coq, which is the sort of `fighting' one would encounter also |
|
2484 in other theorem provers. |
|
2485 |
2463 In terms of time, the estimate for our formalisation is that we |
2486 In terms of time, the estimate for our formalisation is that we |
2464 needed approximately 3 months and this included the time to find our proof |
2487 needed approximately 3 months and this included the time to find our proof |
2465 arguments. Unlike \citeN{Constable00}, who were able to follow the Myhill-Nerode |
2488 arguments. Unlike \citeN{Constable00}, who were able to follow the Myhill-Nerode |
2466 proof by \citeN{HopcroftUllman69}, we had to find our own arguments. So for us the |
2489 proof by \citeN{HopcroftUllman69}, we had to find our own arguments. So for us the |
2467 formalisation was not the bottleneck. The code of |
2490 formalisation was not the bottleneck. The code of |
2468 our formalisation can be found in the Archive of Formal Proofs at |
2491 our formalisation \cite{myhillnerodeafp11} can be found in the Archive of Formal Proofs at |
2469 \mbox{\url{http://afp.sourceforge.net/entries/Myhill-Nerode.shtml}} |
2492 \mbox{\url{http://afp.sourceforge.net/entries/Myhill-Nerode.shtml}}.\smallskip |
2470 \cite{myhillnerodeafp11}.\smallskip |
|
2471 |
2493 |
2472 \noindent |
2494 \noindent |
2473 {\bf Acknowledgements:} |
2495 {\bf Acknowledgements:} |
2474 We are grateful for the comments we received from Larry Paulson. Tobias |
2496 We are grateful for the comments we received from Larry Paulson. Tobias |
2475 Nipkow made us aware of the properties in Theorem~\ref{subseqreg} and Tjark |
2497 Nipkow made us aware of the properties in Theorem~\ref{subseqreg} and Tjark |