regexp: Attic/Myhill.thy@92ca56c1a199


theory Myhill
  imports Myhill_2
begin

section {* Preliminaries \label{sec_prelim}*}

subsection {* Finite automata and \mht  \label{sec_fa_mh} *}

text {*

A {\em determinisitc finite automata (DFA)} $M$ is a 5-tuple 
$(Q, \Sigma, \delta, s, F)$, where:
\begin{enumerate}
  \item $Q$ is a finite set of {\em states}, also denoted $Q_M$.
  \item $\Sigma$ is a finite set of {\em alphabets}, also denoted $\Sigma_M$.
  \item $\delta$ is a {\em transition function} of type @{text "Q \<times> \<Sigma> \<Rightarrow> Q"} (a total function),
           also denoted $\delta_M$.
  \item @{text "s \<in> Q"} is a state called {\em initial state}, also denoted $s_M$.
  \item @{text "F \<subseteq> Q"} is a set of states named {\em accepting states}, also denoted $F_M$.
\end{enumerate}
Therefore, we have $M = (Q_M, \Sigma_M, \delta_M, s_M, F_M)$. Every DFA $M$ can be interpreted as 
a function assigning states to strings, denoted $\dfa{M}$, the  definition of which is as the following:
\begin{equation}
\begin{aligned}
  \dfa{M}([]) &\equiv s_M \\
   \dfa{M}(xa) &\equiv  \delta_M(\dfa{M}(x), a)
\end{aligned}
\end{equation}
A string @{text "x"} is said to be {\em accepted} (or {\em recognized}) by a DFA $M$ if 
$\dfa{M}(x) \in F_M$. The language recoginzed by DFA $M$, denoted
$L(M)$, is defined as:
\begin{equation}
  L(M) \equiv \{x~|~\dfa{M}(x) \in F_M\}
\end{equation}
The standard way of specifying a laugage $\Lang$ as {\em regular} is by stipulating that:
$\Lang = L(M)$ for some DFA $M$.

For any DFA $M$, the DFA obtained by changing initial state to another $p \in Q_M$ is denoted $M_p$, 
which is defined as:
\begin{equation}
    M_p \ \equiv\ (Q_M, \Sigma_M, \delta_M, p, F_M) 
\end{equation}
Two states $p, q \in Q_M$ are said to be {\em equivalent}, denoted $p \approx_M q$, iff.
\begin{equation}\label{m_eq_def}
  L(M_p) = L(M_q)
\end{equation}
It is obvious that $\approx_M$ is an equivalent relation over $Q_M$. and 
the partition induced by $\approx_M$ has $|Q_M|$ equivalent classes.
By overloading $\approx_M$,  an equivalent relation over strings can be defined:
\begin{equation}
  x \approx_M y ~ ~ \equiv ~ ~ \dfa{M}(x) \approx_M \dfa{M}(y)
\end{equation}
It can be proved that the the partition induced by $\approx_M$ also has $|Q_M|$ equivalent classes.
It is also easy to show that: if $x \approx_M y$, then $x \approx_{L(M)} y$, and this means
$\approx_M$ is a more refined equivalent relation than $\approx_{L(M)}$. Since partition induced by
$\approx_M$ is finite, the one induced by $\approx_{L(M)}$ must also be finite, and this is 
one of the two directions of \mht:
\begin{Lem}[\mht , Direction two]
  If a language $\Lang$ is regular (i.e. $\Lang = L(M)$ for some DFA $M$), then 
  the partition induced by $\approx_\Lang$ is finite.
\end{Lem}

The other direction is:
\begin{Lem}[\mht , Direction one]\label{auto_mh_d1}
  If  the partition induced by $\approx_\Lang$ is finite, then
  $\Lang$ is regular (i.e. $\Lang = L(M)$ for some DFA $M$).
\end{Lem}
The $M$ we are seeking when prove lemma \ref{auto_mh_d2} can be constructed out of $\approx_\Lang$,
denoted $M_\Lang$ and defined as the following:
\begin{subequations}
\begin{eqnarray}
  Q_{M_\Lang} ~ & \equiv & ~ \{ \cls{x}{\approx_\Lang}~|~ x \in \Sigma^* \}\\
  \Sigma_{M_\Lang} ~ & \equiv & ~ \Sigma_M \\
  \delta_{M_\Lang} ~ & \equiv & ~ \left (\lambda (\cls{x}{\approx_\Lang}, a).  \cls{xa}{\approx_\Lang} \right) \\
  s_{M_\Lang} ~ & \equiv & ~ \cls{[]}{\approx_\Lang} \\
  F_{M_\Lang} ~ & \equiv & ~ \{ \cls{x}{\approx_\Lang}~|~ x \in \Lang \}
\end{eqnarray}
\end{subequations}
It can be proved that $Q_{M_\Lang}$ is indeed finite and $\Lang = L(M_\Lang)$, so lemma \ref{auto_mh_d1} holds.
It can also be proved that $M_\Lang$ is the minimal DFA (therefore unique) which recoginzes $\Lang$.



*}

subsection {* The objective and the underlying intuition *}

text {*
  It is now obvious from section \ref{sec_fa_mh} that \mht\ can be established easily when
  {\em reglar languages} are defined as ones recognized by finite automata. 
  Under the context where the use of finite automata is forbiden, the situation is quite different.
  The theorem now has to be expressed as:
  \begin{Thm}[\mht , Regular expression version]
      A language $\Lang$ is regular (i.e. $\Lang = L(\re)$ for some {\em regular expression} $\re$)
      iff. the partition induced by $\approx_\Lang$ is finite.
  \end{Thm}
  The proof of this version consists of two directions (if the use of automata are not allowed):
  \begin{description}
    \item [Direction one:] 
      generating a regular expression $\re$ out of the finite partition induced by $\approx_\Lang$,
      such that $\Lang = L(\re)$.
    \item [Direction two:]
          showing the finiteness of the partition induced by $\approx_\Lang$, under the assmption
          that $\Lang$ is recognized by some regular expression $\re$ (i.e. $\Lang = L(\re)$).
  \end{description}
  The development of these two directions consititutes the body of this paper.

*}

section {* Direction @{text "regular language \<Rightarrow>finite partition"} *}

text {*
  Although not used explicitly, the notion of finite autotmata and its relationship with 
  language partition, as outlined in section \ref{sec_fa_mh}, still servers as important intuitive 
  guides in the development of this paper.
  For example, {\em Direction one} follows the {\em Brzozowski algebraic method}
  used to convert finite autotmata to regular expressions, under the intuition that every 
  partition member $\cls{x}{\approx_\Lang}$ is a state in the DFA $M_\Lang$  constructed to prove 
  lemma \ref{auto_mh_d1} of section \ref{sec_fa_mh}.

  The basic idea of Brzozowski method is to extract an equational system out of the 
  transition relationship of the automaton in question. In the equational system, every
  automaton state is represented by an unknown, the solution of which is expected to be 
  a regular expresion characterizing the state in a certain sense. There are two choices of 
  how a automaton state can be characterized.  The first is to characterize by the set of 
  strings leading from the state in question into accepting states. 
  The other choice is to characterize by the set of strings leading from initial state 
  into the state in question. For the second choice, the language recognized the automaton 
  can be characterized by the solution of initial state, while for the second choice, 
  the language recoginzed by the automaton can be characterized by 
  combining solutions of all accepting states by @{text "+"}. Because of the automaton 
  used as our intuitive guide, the $M_\Lang$, the states of which are 
  sets of strings leading from initial state, the second choice is used in this paper.

  Supposing the automaton in Fig \ref{fig_auto_part_rel} is the $M_\Lang$ for some language $\Lang$, 
  and suppose $\Sigma = \{a, b, c, d, e\}$. Under the second choice, the equational system extracted is:
  \begin{subequations}
  \begin{eqnarray}
    X_0 & = & X_1 \cdot c + X_2 \cdot d + \lambda \label{x_0_def_o} \\
    X_1 & = & X_0 \cdot a + X_1 \cdot b + X_2 \cdot d \label{x_1_def_o} \\
    X_2 & = & X_0 \cdot b + X_1 \cdot d + X_2 \cdot a \\
    X_3 & = & \begin{aligned}
                 & X_0 \cdot (c + d + e) + X_1 \cdot (a + e) + X_2 \cdot (b + e) + \\
                 & X_3 \cdot (a + b + c + d + e)
              \end{aligned}
  \end{eqnarray}
  \end{subequations} 

\begin{figure}[h!]
\centering
\begin{minipage}{0.5\textwidth}
\scalebox{0.8}{
\begin{tikzpicture}[ultra thick,>=latex]
  \node[draw,circle,initial] (start) {$X_0$};
  \node[draw,circle,accepting] at ($(start) + (3.5cm,2cm)$) (ac1) {$X_1$};
  \node[draw,circle,accepting] at ($(start) + (3.5cm,-2cm)$) (ac2) {$X_2$};
  \node[draw,circle] at ($(start) + (6.5cm,0cm)$) (ab) {$X_3$};

  \path[->] (start) edge [bend left] node [midway, above] {$a$} (ac1);
  \path[->] (start) edge [bend right] node [midway, below] {$b$} (ac2);
  \path[->] (ac1) edge [loop above] node [midway, above] {$b$} (ac1);
  \path[->] (ac2) edge [loop below] node [midway, below] {$a$} (ac2);
  \path[->] (ac1) edge [bend right] node [midway, left] {$c$} (ac2);
  \path[->] (ac2) edge [bend right] node [midway, right] {$c$} (ac1);
  \path[->] (ac1) edge node [midway, sloped, above] {$d$} (start);
  \path[->] (ac2) edge node [midway, sloped, above] {$d$} (start);

  \path [draw, rounded corners,->,dashed] (start) -- ($(start) + (0cm, 3.7cm)$)
         -- ($(ab) + (0cm, 3.7cm)$)  node[midway,above,sloped]{$\Sigma - \{a, b\}$} -- (ab);
  \path[->,dashed] (ac1) edge node [midway, above, sloped] {$\Sigma - \{b,c,d\}$} (ab);
  \path[->,dashed] (ac2) edge node [midway, below, sloped] {$\Sigma - \{a,c,d\}$} (ab);
  \path[->,dashed] (ab) edge [loop right] node [midway, right] {$\Sigma$} (ab);
\end{tikzpicture}}
\end{minipage}
\caption{An example automaton}\label{fig_auto_part_rel}
\end{figure}

  Every $\cdot$-item on the right side of equations describes some state transtions, except 
  the $\lambda$ in \eqref{x_0_def_o}, which represents empty string @{text "[]"}. 
  The reason is that: every state is characterized by the
  set of incoming strings leading from initial state. For non-initial state, every such
  string can be splitted into a prefix leading into a preceding state and a single character suffix 
  transiting into from the preceding state. The exception happens at
  initial state, where the empty string is a incoming string which can not be splitted. The $\lambda$
  in \eqref{x_0_def_o} is introduce to repsent this indivisible string. There is one and only one
  $\lambda$ in every equational system such obtained, becasue $[]$ can only be contaied in one
  equivalent class (the intial state in $M_\Lang$) and equivalent classes are disjoint. 
  
  Suppose all unknowns ($X_0, X_1, X_2, X_3$) are  solvable, the regular expression charactering 
  laugnage $\Lang$ is $X_1 + X_2$. This paper gives a procedure
  by which arbitrarily picked unknown can be solved. The basic idea to solve $X_i$ is by 
  eliminating all variables other than $X_i$ from the equational system. If
  $X_0$ is the one picked to be solved,  variables $X_1, X_2, X_3$ have to be removed one by 
  one.  The order to remove does not matter as long as the remaing equations are kept valid.
  Suppose $X_1$ is the first one to remove, the action is to replace all occurences of $X_1$ 
  in remaining equations by the right hand side of its characterizing equation, i.e. 
  the $ X_0 \cdot a + X_1 \cdot b + X_2 \cdot d$ in \eqref{x_1_def_o}. However, because
  of the recursive occurence of $X_1$, this replacement does not really removed $X_1$. Arden's
  lemma is invoked to transform recursive equations like \eqref{x_1_def_o} into non-recursive 
  ones. For example, the recursive equation \eqref{x_1_def_o} is transformed into the follwing
  non-recursive one:
  \begin{equation}
     X_1  =  (X_0 \cdot a + X_2 \cdot d) \cdot b^* = X_0 \cdot (a \cdot b^*) + X_2 \cdot (d \cdot b^*)
  \end{equation}
  which, by Arden's lemma, still characterizes $X_1$ correctly. By subsituting
  $(X_0 \cdot a + X_2 \cdot d) \cdot b^*$ for all $X_1$ and removing  \eqref{x_1_def_o},
  we get:
  \begin{subequations}
  \begin{eqnarray}
    X_0 & = & \begin{aligned}
              & (X_0 \cdot (a \cdot b^*) + X_2 \cdot (d \cdot b^*)) \cdot c + 
                X_2 \cdot d + \lambda = \\
              & X_0 \cdot (a \cdot b^* \cdot c) + X_2 \cdot (d \cdot b^* \cdot c) + 
                X_2 \cdot d + \lambda = \\
              &  X_0 \cdot (a \cdot b^* \cdot c) + X_2 \cdot (d \cdot b^* \cdot c + d) + \lambda 
              \end{aligned} \label{x_0_def_1} \\
    X_2 & = & \begin{aligned}
                 & X_0 \cdot b + (X_0 \cdot (a \cdot b^*) + X_2 \cdot (d \cdot b^*)) \cdot d + X_2 \cdot a = \\
                 & X_0 \cdot b + X_0 \cdot (a \cdot b^* \cdot d) + X_2 \cdot (d \cdot b^* \cdot d) + X_2 \cdot a = \\
                 & X_0 \cdot (b + a \cdot b^* \cdot d) + X_2 \cdot (d \cdot b^* \cdot d + a)
              \end{aligned} \\
    X_3 & = & \begin{aligned}
                 & X_0 \cdot (c + d + e) + ((X_0 \cdot a + X_2 \cdot d) \cdot b^*) \cdot (a + e)\\
                 & + X_2 \cdot (b + e) + X_3 \cdot (a + b + c + d + e) \label{x_3_def_1}
              \end{aligned}
  \end{eqnarray}
  \end{subequations}  
Suppose $X_3$ is the one to remove next, since $X_3$ dose not appear in either $X_0$ or $X_2$, 
the removal of equation \eqref{x_3_def_1} changes nothing in the rest equations. Therefore, we get:
 \begin{subequations}
  \begin{eqnarray}
    X_0 & = & X_0 \cdot (a \cdot b^* \cdot c) + X_2 \cdot (d \cdot b^* \cdot c + d) + \lambda  \label{x_0_def_2} \\
    X_2 & = & X_0 \cdot (b + a \cdot b^* \cdot d) + X_2 \cdot (d \cdot b^* \cdot d + a) \label{x_2_def_2}
  \end{eqnarray}
  \end{subequations}  
Actually, since absorbing state like $X_3$ contributes nothing to the language $\Lang$, it could have been removed
at the very beginning of this precedure without affecting the final result. Now, the last unknown to remove 
is $X_2$ and the Arden's transformaton of \eqref{x_2_def_2} is:
\begin{equation} \label{x_2_ardened}
   X_2 ~ = ~  (X_0 \cdot (b + a \cdot b^* \cdot d)) \cdot (d \cdot b^* \cdot d + a)^* =
              X_0 \cdot ((b + a \cdot b^* \cdot d) \cdot (d \cdot b^* \cdot d + a)^*)
\end{equation}
By substituting the right hand side of \eqref{x_2_ardened} into \eqref{x_0_def_2}, we get:
\begin{equation}
\begin{aligned}
  X_0  & = && X_0 \cdot (a \cdot b^* \cdot c) + \\
       &   && X_0 \cdot ((b + a \cdot b^* \cdot d) \cdot (d \cdot b^* \cdot d + a)^*) \cdot 
           (d \cdot b^* \cdot c + d) + \lambda \\
       & =  && X_0 \cdot ((a \cdot b^* \cdot c) + \\
       &   && \hspace{3em} ((b + a \cdot b^* \cdot d) \cdot (d \cdot b^* \cdot d + a)^*) \cdot 
           (d \cdot b^* \cdot c + d)) + \lambda 
\end{aligned}
\end{equation}
By applying Arden's transformation to this, we get the solution of $X_0$ as:
\begin{equation}
\begin{aligned}
  X_0  =  ((a \cdot b^* \cdot c) + 
                ((b + a \cdot b^* \cdot d) \cdot (d \cdot b^* \cdot d + a)^*) \cdot 
                        (d \cdot b^* \cdot c + d))^*
\end{aligned}
\end{equation}
Using the same method, solutions for $X_1$ and $X_2$ can be obtained as well and the
regular expressoin for $\Lang$ is just $X_1 + X_2$. The formalization of this procedure
consititues the first direction of the {\em regular expression} verion of
\mht. Detailed explaination are given in {\bf paper.pdf} and more details and comments
can be found in the formal scripts.
*}

section {* Direction @{text "finite partition \<Rightarrow> regular language"} *}

text {*
  It is well known in the theory of regular languages that
  the existence of finite  language partition amounts to the existence of 
  a minimal automaton, i.e. the $M_\Lang$ constructed in section \ref{sec_prelim}, which recoginzes the 
  same language $\Lang$. The standard way to prove the existence of finite language partition 
  is to construct a automaton out of the regular expression which recoginzes the same language, from
  which the existence of finite language partition follows immediately. As discussed in the introducton of 
  {\bf paper.pdf} as well as in [5], the problem for this approach happens when automata 
  of sub regular expressions are combined to form the automaton of the mother regular expression, 
  no matter what kind of representation is used, the formalization is cubersome, as said 
  by Nipkow in [5]: `{\em a more abstract mathod is clearly desirable}'. In this section, 
  an {\em intrinsically abstract} method is given, which completely avoid the mentioning of automata,
  let along any particular representations.
  *}

text {*
  The main proof structure is a structural induction on regular expressions,
  where base cases (cases for @{const "NULL"}, @{const "EMPTY"}, @{const "CHAR"}) are quite straightforward to
  proof. Real difficulty lies in inductive cases.  By inductive hypothesis, languages defined by
  sub-expressions induce finite partitiions. Under such hypothsis, we need to prove that the language 
  defined by the composite regular expression gives rise to finite partion.  
  The basic idea is to attach a tag @{text "tag(x)"} to every string 
  @{text "x"}. The tagging fuction @{text "tag"} is carefully devised, which returns tags 
  made of equivalent classes of the partitions induced by subexpressoins, and therefore has a finite
  range. Let @{text "Lang"} be the composite language, it is proved that:
  \begin{quote}
  If strings with the same tag are equivalent with respect to @{text "Lang"}, expressed as:
  \[
  @{text "tag(x) = tag(y) \<Longrightarrow> x \<approx>Lang y"}
  \]
  then the partition induced by @{text "Lang"} must be finite.
  \end{quote}
  There are two arguments for this. The first goes as the following:
  \begin{enumerate}
    \item First, the tagging function @{text "tag"} induces an equivalent relation @{text "(=tag=)"} 
          (defiintion of @{text "f_eq_rel"} and lemma @{text "equiv_f_eq_rel"}).
    \item It is shown that: if the range of @{text "tag"} (denoted @{text "range(tag)"}) is finite, 
           the partition given rise by @{text "(=tag=)"} is finite (lemma @{text "finite_eq_f_rel"}).
           Since tags are made from equivalent classes from component partitions, and the inductive
           hypothesis ensures the finiteness of these partitions, it is not difficult to prove
           the finiteness of @{text "range(tag)"}.
    \item It is proved that if equivalent relation @{text "R1"} is more refined than @{text "R2"}
           (expressed as @{text "R1 \<subseteq> R2"}),
           and the partition induced by @{text "R1"} is finite, then the partition induced by @{text "R2"}
           is finite as well (lemma @{text "refined_partition_finite"}).
    \item The injectivity assumption @{text "tag(x) = tag(y) \<Longrightarrow> x \<approx>Lang y"} implies that
            @{text "(=tag=)"} is more refined than @{text "(\<approx>Lang)"}.
    \item Combining the points above, we have: the partition induced by language @{text "Lang"}
          is finite (lemma @{text "tag_finite_imageD"}).
  \end{enumerate}

We could have followed another approach given in  appendix II of Brzozowski's paper [?], where
the set of derivatives of any regular expression can be proved to be finite. 
Since it is easy to show that strings with same derivative are equivalent with respect to the 
language, then the second direction follows. We believe that our
apporoach is easy to formalize, with no need to do complicated derivation calculations
and countings as in [???].
*}


end
author	Christian Urban <christian dot urban at kcl dot ac dot uk>
	Thu, 12 Sep 2013 17:20:48 +0100
changeset 386	92ca56c1a199
parent 170	b1258b7d2789
permissions	-rw-r--r--