afl-material: handouts/ho03.tex@18bef085a7ca

\documentclass{article}
\usepackage{../style}
\usepackage{../langs}
\usepackage{../graphics}


\begin{document}

\section*{Handout 3 (Automata)}

Every formal language course I know of bombards you first with
automata and then to a much, much smaller extend with regular
expressions. As you can see, this course is turned upside
down: regular expressions come first. The reason is that
regular expressions are easier to reason about and the notion
of derivatives, although already quite old, only became more
widely known rather recently. Still let us in this lecture
have a closer look at automata and their relation to regular
expressions. This will help us with understanding why the
regular expression matchers in Python and Ruby are so slow
with certain regular expressions. The central definition
is:\medskip 

\noindent 
A \emph{deterministic finite automaton} (DFA), say $A$, is
defined by a four-tuple written $A(Q, q_0, F, \delta)$ where

\begin{itemize}
\item $Q$ is a finite set of states,
\item $q_0 \in Q$ is the start state,
\item $F \subseteq Q$ are the accepting states, and
\item $\delta$ is the transition function.
\end{itemize}

\noindent The transition function determines how to
``transition'' from one state to the next state with respect
to a character. We have the assumption that these transition
functions do not need to be defined everywhere: so it can be
the case that given a character there is no next state, in
which case we need to raise a kind of ``failure exception''. A
typical example of a DFA is

\begin{center}
\begin{tikzpicture}[>=stealth',very thick,auto,
                    every state/.style={minimum size=0pt,
                    inner sep=2pt,draw=blue!50,very thick,
                    fill=blue!20},scale=2]
\node[state,initial]  (q_0)  {$q_0$};
\node[state] (q_1) [right=of q_0] {$q_1$};
\node[state] (q_2) [below right=of q_0] {$q_2$};
\node[state] (q_3) [right=of q_2] {$q_3$};
\node[state, accepting] (q_4) [right=of q_1] {$q_4$};
\path[->] (q_0) edge node [above]  {$a$} (q_1);
\path[->] (q_1) edge node [above]  {$a$} (q_4);
\path[->] (q_4) edge [loop right] node  {$a, b$} ();
\path[->] (q_3) edge node [right]  {$a$} (q_4);
\path[->] (q_2) edge node [above]  {$a$} (q_3);
\path[->] (q_1) edge node [right]  {$b$} (q_2);
\path[->] (q_0) edge node [above]  {$b$} (q_2);
\path[->] (q_2) edge [loop left] node  {$b$} ();
\path[->] (q_3) edge [bend left=95, looseness=1.3] node [below]  {$b$} (q_0);
\end{tikzpicture}
\end{center}

\noindent In this graphical notation, the accepting state
$q_4$ is indicated with double circles. Note that there can be
more than one accepting state. It is also possible that a DFA
has no accepting states at all, or that the starting state is
also an accepting state. In the case above the transition
function is defined everywhere and can be given as a table as
follows:

\[
\begin{array}{lcl}
(q_0, a) &\rightarrow& q_1\\
(q_0, b) &\rightarrow& q_2\\
(q_1, a) &\rightarrow& q_4\\
(q_1, b) &\rightarrow& q_2\\
(q_2, a) &\rightarrow& q_3\\
(q_2, b) &\rightarrow& q_2\\
(q_3, a) &\rightarrow& q_4\\
(q_3, b) &\rightarrow& q_0\\
(q_4, a) &\rightarrow& q_4\\
(q_4, b) &\rightarrow& q_4\\
\end{array}
\]

We need to define the notion of what language is accepted by
an automaton. For this we lift the transition function
$\delta$ from characters to strings as follows:

\[
\begin{array}{lcl}
\hat{\delta}(q, [])        & \dn & q\\
\hat{\delta}(q, c\!::\!s) & \dn & \hat{\delta}(\delta(q, c), s)\\
\end{array}
\]

\noindent This lifted transition function is often called
``delta-hat''. Given a string, we start in the starting state
and take the first character of the string, follow to the next
sate, then take the second character and so on. Once the
string is exhausted and we end up in an accepting state, then
this string is accepted by the automaton. Otherwise it is not
accepted. So $s$ is in the \emph{language accepted by the
automaton} $A(Q, q_0, F, \delta)$ iff

\[
\hat{\delta}(q_0, s) \in F 
\]

\noindent I let you think about a definition that describes
the set of strings accepted by an automaton.
  

While with DFAs it will always clear that given a character
what the next state is (potentially none), it will be useful
to relax this restriction. That means we have several
potential successor states. We even allow ``silent
transitions'', also called epsilon-transitions. They allow us
to go from one state to the next without having a character
consumed. We label such silent transition with the letter
$\epsilon$. The resulting construction is called a
\emph{non-deterministic finite automaton} (NFA) given also as
a four-tuple $A(Q, q_0, F, \rho)$ where

\begin{itemize}
\item $Q$ is a finite set of states
\item $q_0$ is a start state
\item $F$ are some accepting states with $F \subseteq Q$, and
\item $\rho$ is a transition relation.
\end{itemize}

\noindent
Two typical examples of NFAs are
\begin{center}
\begin{tabular}[t]{c@{\hspace{9mm}}c}
\begin{tikzpicture}[scale=0.7,>=stealth',very thick,
                             every state/.style={minimum size=0pt,draw=blue!50,very thick,fill=blue!20},]
\node[state,initial]  (q_0)  {$q_0$};
\node[state] (q_1) [above=of q_0] {$q_1$};
\node[state, accepting] (q_2) [below=of q_0] {$q_2$};
\path[->] (q_0) edge node [left]  {$\epsilon$} (q_1);
\path[->] (q_0) edge node [left]  {$\epsilon$} (q_2);
\path[->] (q_0) edge [loop right] node  {$a$} ();
\path[->] (q_1) edge [loop above] node  {$a$} ();
\path[->] (q_2) edge [loop below] node  {$b$} ();
\end{tikzpicture} &

\raisebox{20mm}{
\begin{tikzpicture}[scale=0.7,>=stealth',very thick,
                             every state/.style={minimum size=0pt,draw=blue!50,very thick,fill=blue!20},]
\node[state,initial]  (r_1)  {$r_1$};
\node[state] (r_2) [above=of r_1] {$r_2$};
\node[state, accepting] (r_3) [right=of r_1] {$r_3$};
\path[->] (r_1) edge node [below]  {$b$} (r_3);
\path[->] (r_2) edge [bend left] node [above]  {$a$} (r_3);
\path[->] (r_1) edge [bend left] node  [left] {$\epsilon$} (r_2);
\path[->] (r_2) edge [bend left] node  [right] {$a$} (r_1);
\end{tikzpicture}}
\end{tabular}
\end{center}

\noindent There are, however, a number of points you should
note. Every DFA is a NFA, but not vice versa. The $\rho$ in
NFAs is a transition \emph{relation} (DFAs have a transition
function). The difference between a function and a relation is
that a function has always a single output, while a relation
gives, roughly speaking, several outputs. Look at the NFA on
the right-hand side above: if you are currently in the state
$r_2$ and you read a character $a$, then you can transition to
either $r_1$ \emph{or} $r_3$. Which route you take is not
determined. This means if we need to decide whether a string
is accepted by a NFA, we might have to explore all
possibilities. Also there is the special silent transition in
NFAs. As mentioned already this transition means you do not
have to ``consume'' any part of the input string, but
``silently'' change to a different state. In the left picture,
for example, if you are in the starting state, you can 
silently move either to $q_1$ or $q_2$.


\subsection*{Thompson Construction}

The reason for introducing NFAs is that there is a relatively
simple (recursive) translation of regular expressions into
NFAs. Consider the simple regular expressions $\varnothing$,
$\epsilon$ and $c$. They can be translated as follows:

\begin{center}
\begin{tabular}[t]{l@{\hspace{10mm}}l}
\raisebox{1mm}{$\varnothing$} & 
\begin{tikzpicture}[scale=0.7,>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
\node[state, initial]  (q_0)  {$\mbox{}$};
\end{tikzpicture}\\\\
\raisebox{1mm}{$\epsilon$} & 
\begin{tikzpicture}[scale=0.7,>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
\node[state, initial, accepting]  (q_0)  {$\mbox{}$};
\end{tikzpicture}\\\\
\raisebox{2mm}{$c$} & 
\begin{tikzpicture}[scale=0.7,>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
\node[state, initial]  (q_0)  {$\mbox{}$};
\node[state, accepting]  (q_1)  [right=of q_0] {$\mbox{}$};
\path[->] (q_0) edge node [below]  {$c$} (q_1);
\end{tikzpicture}\\\\
\end{tabular}
\end{center}

\noindent The case for the sequence regular expression $r_1
\cdot r_2$ is as follows: We are given by recursion two
automata representing $r_1$ and $r_2$ respectively. 

\begin{center}
\begin{tikzpicture}[node distance=3mm,
                             >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
\node[state, initial]  (q_0)  {$\mbox{}$};
\node (r_1)  [right=of q_0] {$\ldots$};
\node[state, accepting]  (t_1)  [right=of r_1] {$\mbox{}$};
\node[state, accepting]  (t_2)  [above=of t_1] {$\mbox{}$};
\node[state, accepting]  (t_3)  [below=of t_1] {$\mbox{}$};
\node[state, initial]  (a_0)  [right=2.5cm of t_1] {$\mbox{}$};
\node (b_1)  [right=of a_0] {$\ldots$};
\node[state, accepting]  (c_1)  [right=of b_1] {$\mbox{}$};
\node[state, accepting]  (c_2)  [above=of c_1] {$\mbox{}$};
\node[state, accepting]  (c_3)  [below=of c_1] {$\mbox{}$};
\begin{pgfonlayer}{background}
\node (1) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (q_0) (r_1) (t_1) (t_2) (t_3)] {};
\node (2) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (a_0) (b_1) (c_1) (c_2) (c_3)] {};
\node [yshift=2mm] at (1.north) {$r_1$};
\node [yshift=2mm] at (2.north) {$r_2$};
\end{pgfonlayer}
\end{tikzpicture}
\end{center}

\noindent The first automaton has some accepting states. We
obtain an automaton for $r_1\cdot r_2$ by connecting these
accepting states with $\epsilon$-transitions to the starting
state of the second automaton. By doing so we make them
non-accepting like so:

\begin{center}
\begin{tikzpicture}[node distance=3mm,
                             >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
\node[state, initial]  (q_0)  {$\mbox{}$};
\node (r_1)  [right=of q_0] {$\ldots$};
\node[state]  (t_1)  [right=of r_1] {$\mbox{}$};
\node[state]  (t_2)  [above=of t_1] {$\mbox{}$};
\node[state]  (t_3)  [below=of t_1] {$\mbox{}$};
\node[state]  (a_0)  [right=2.5cm of t_1] {$\mbox{}$};
\node (b_1)  [right=of a_0] {$\ldots$};
\node[state, accepting]  (c_1)  [right=of b_1] {$\mbox{}$};
\node[state, accepting]  (c_2)  [above=of c_1] {$\mbox{}$};
\node[state, accepting]  (c_3)  [below=of c_1] {$\mbox{}$};
\path[->] (t_1) edge node [above, pos=0.3]  {$\epsilon$} (a_0);
\path[->] (t_2) edge node [above]  {$\epsilon$} (a_0);
\path[->] (t_3) edge node [below]  {$\epsilon$} (a_0);

\begin{pgfonlayer}{background}
\node (3) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (q_0) (c_1) (c_2) (c_3)] {};
\node [yshift=2mm] at (3.north) {$r_1\cdot r_2$};
\end{pgfonlayer}
\end{tikzpicture}
\end{center}

\noindent The case for the choice regular expression $r_1 +
r_2$ is slightly different: We are given by recursion two
automata representing $r_1$ and $r_2$ respectively. 

\begin{center}
\begin{tikzpicture}[node distance=3mm,
                             >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
\node at (0,0)  (1)  {$\mbox{}$};
\node[state, initial]  (2)  [above right=16mm of 1] {$\mbox{}$};
\node[state, initial]  (3)  [below right=16mm of 1] {$\mbox{}$};

\node (a)  [right=of 2] {$\ldots$};
\node[state, accepting]  (a1)  [right=of a] {$\mbox{}$};
\node[state, accepting]  (a2)  [above=of a1] {$\mbox{}$};
\node[state, accepting]  (a3)  [below=of a1] {$\mbox{}$};

\node (b)  [right=of 3] {$\ldots$};
\node[state, accepting]  (b1)  [right=of b] {$\mbox{}$};
\node[state, accepting]  (b2)  [above=of b1] {$\mbox{}$};
\node[state, accepting]  (b3)  [below=of b1] {$\mbox{}$};
\begin{pgfonlayer}{background}
\node (1) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (2) (a1) (a2) (a3)] {};
\node (2) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (3) (b1) (b2) (b3)] {};
\node [yshift=3mm] at (1.north) {$r_1$};
\node [yshift=3mm] at (2.north) {$r_2$};
\end{pgfonlayer}
\end{tikzpicture}
\end{center}

\noindent Each automaton has a single start state and
potentially several accepting states. We obtain a NFA for the
regular expression $r_1 + r_2$ by introducing a new starting
state and connecting it with an $\epsilon$-transition to the
two starting states above, like so

\begin{center}
\hspace{2cm}\begin{tikzpicture}[node distance=3mm,
                             >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
\node at (0,0) [state, initial]  (1)  {$\mbox{}$};
\node[state]  (2)  [above right=16mm of 1] {$\mbox{}$};
\node[state]  (3)  [below right=16mm of 1] {$\mbox{}$};

\node (a)  [right=of 2] {$\ldots$};
\node[state, accepting]  (a1)  [right=of a] {$\mbox{}$};
\node[state, accepting]  (a2)  [above=of a1] {$\mbox{}$};
\node[state, accepting]  (a3)  [below=of a1] {$\mbox{}$};

\node (b)  [right=of 3] {$\ldots$};
\node[state, accepting]  (b1)  [right=of b] {$\mbox{}$};
\node[state, accepting]  (b2)  [above=of b1] {$\mbox{}$};
\node[state, accepting]  (b3)  [below=of b1] {$\mbox{}$};

\path[->] (1) edge node [above]  {$\epsilon$} (2);
\path[->] (1) edge node [below]  {$\epsilon$} (3);

\begin{pgfonlayer}{background}
\node (3) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (1) (a2) (a3) (b2) (b3)] {};
\node [yshift=3mm] at (3.north) {$r_1+ r_2$};
\end{pgfonlayer}
\end{tikzpicture}
\end{center}

\noindent 
Finally for the $*$-case we have an automaton for $r$

\begin{center}
\begin{tikzpicture}[node distance=3mm,
                             >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
\node at (0,0)  (1)  {$\mbox{}$};
\node[state, initial]  (2)  [right=16mm of 1] {$\mbox{}$};
\node (a)  [right=of 2] {$\ldots$};
\node[state, accepting]  (a1)  [right=of a] {$\mbox{}$};
\node[state, accepting]  (a2)  [above=of a1] {$\mbox{}$};
\node[state, accepting]  (a3)  [below=of a1] {$\mbox{}$};
\begin{pgfonlayer}{background}
\node (1) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (2) (a1) (a2) (a3)] {};
\node [yshift=3mm] at (1.north) {$r$};
\end{pgfonlayer}
\end{tikzpicture}
\end{center}

\noindent and connect its accepting states to a new starting
state via $\epsilon$-transitions. This new starting state is
also an accepting state, because $r^*$ can recognise the
empty string. This gives the following automaton for $r^*$:

\begin{center}
\begin{tikzpicture}[node distance=3mm,
                             >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
\node at (0,0) [state, initial,accepting]  (1)  {$\mbox{}$};
\node[state]  (2)  [right=16mm of 1] {$\mbox{}$};
\node (a)  [right=of 2] {$\ldots$};
\node[state]  (a1)  [right=of a] {$\mbox{}$};
\node[state]  (a2)  [above=of a1] {$\mbox{}$};
\node[state]  (a3)  [below=of a1] {$\mbox{}$};
\path[->] (1) edge node [above]  {$\epsilon$} (2);
\path[->] (a1) edge [bend left=45] node [above]  {$\epsilon$} (1);
\path[->] (a2) edge [bend right] node [below]  {$\epsilon$} (1);
\path[->] (a3) edge [bend left=45] node [below]  {$\epsilon$} (1);
\begin{pgfonlayer}{background}
\node (2) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (1) (a2) (a3)] {};
\node [yshift=3mm] at (2.north) {$r^*$};
\end{pgfonlayer}
\end{tikzpicture}
\end{center}

\noindent This construction of a NFA from a regular expression
was invented by Ken Thompson in 1968.


\subsection*{Subset Construction}

What is interesting that for every NFA we can find a DFA which
recognises the same language. This can be done by the
\emph{subset construction}. Consider again the NFA on the 
left, consisting of nodes labeled $0$, $1$ and $2$. 

\begin{center}
\begin{tabular}{c@{\hspace{10mm}}c}
\begin{tikzpicture}[scale=0.7,>=stealth',very thick,
                    every state/.style={minimum size=0pt,
                    draw=blue!50,very thick,fill=blue!20},
                    baseline=0mm]
\node[state,initial]  (q_0)  {$0$};
\node[state] (q_1) [above=of q_0] {$1$};
\node[state, accepting] (q_2) [below=of q_0] {$2$};
\path[->] (q_0) edge node [left]  {$\epsilon$} (q_1);
\path[->] (q_0) edge node [left]  {$\epsilon$} (q_2);
\path[->] (q_0) edge [loop right] node  {$a$} ();
\path[->] (q_1) edge [loop above] node  {$a$} ();
\path[->] (q_2) edge [loop below] node  {$b$} ();
\end{tikzpicture}
&
\begin{tabular}{r|cl}
nodes & $a$ & $b$\\
\hline
$\varnothing\phantom{\star}$ & $\varnothing$ & $\varnothing$\\
$\{0\}\phantom{\star}$       & $\{0,1,2\}$   & $\{2\}$\\
$\{1\}\phantom{\star}$       & $\{1\}$       & $\varnothing$\\
$\{2\}\star$  & $\varnothing$ & $\{2\}$\\
$\{0,1\}\phantom{\star}$     & $\{0,1,2\}$   & $\{2\}$\\
$\{0,2\}\star$ & $\{0,1,2\}$   & $\{2\}$\\
$\{1,2\}\star$ & $\{1\}$       & $\{2\}$\\
s: $\{0,1,2\}\star$ & $\{0,1,2\}$ & $\{2\}$\\
\end{tabular}
\end{tabular}
\end{center}

\noindent The nodes of the DFA are given by calculating all
subsets of the set of nodes of the NFA (seen in the nodes
column on the right). The table shows the transition function
for the DFA. The first row states that $\varnothing$ is the
sink node which has transitions for $a$ and $b$ to itself.
The next three lines are calculated as follows: 

\begin{itemize}
\item suppose you calculate the entry for the transition for
      $a$ and the node $\{0\}$
\item start from the node $0$ in the NFA
\item do as many $\epsilon$-transition as you can obtaining a
      set of nodes, in this case $\{0,1,2\}$
\item filter out all notes that do not allow an $a$-transition
      from this set, this excludes $2$ which does not permit a
      $a$-transition
\item from the remaining set, do as many $\epsilon$-transition
      as you can, this yields $\{0,1,2\}$     
\item the resulting set specifies the transition from $\{0\}$
      when given an $a$ 
\end{itemize}

\noindent Similarly for the other entries in the rows for
$\{0\}$, $\{1\}$ and $\{2\}$. The other rows are calculated by
just taking the union of the single node entries. For example
for $a$ and $\{0,1\}$ we need to take the union of $\{0,1,2\}$
(for $0$) and $\{1\}$ (for $1$). The starting state of the DFA
can be calculated from the starting state of the NFA, that is
$0$, and then do as many $\epsilon$-transitions as possible.
This gives $\{0,1,2\}$ which is the starting state of the DFA.
One terminal states in the DFA are given by all sets that
contain a $2$, which is the terminal state of the NFA. This
completes the subset construction.

There are two points to note: One is that the resulting DFA
contains a number of ``dead'' nodes that are never reachable
from the starting state (that is that the calculated DFA in
this example is not a minimal DFA). Such dead nodes can be
safely removed without changing the language that is
recognised by the DFA. Another point is that in some cases the
subset construction produces a DFA that does \emph{not}
contain any dead nodes\ldots{}that means it calculates a
minimal DFA. Which in turn means that in some cases the number
of nodes by going from NFAs to DFAs exponentially increases,
namely by $2^n$ (which is the number of subsets you can form
for $n$ nodes). 


\subsection*{Brzozowski's Method}

\subsection*{Automata Minimization}

\subsection*{Regular Languages and Automata}

\end{document}

%%% Local Variables: 
%%% mode: latex
%%% TeX-master: t
%%% End:
author	Christian Urban <christian dot urban at kcl dot ac dot uk>
	Sat, 11 Oct 2014 01:13:13 +0100 (2014-10-11)
changeset 268	18bef085a7ca
parent 251	5b5a68df6d16
child 269	83e6cb90216d
permissions	-rw-r--r--