handouts/ho03.tex
author Christian Urban <christian dot urban at kcl dot ac dot uk>
Mon, 15 Sep 2014 09:36:02 +0100
changeset 251 5b5a68df6d16
parent 217 cd6066f1056a
child 268 18bef085a7ca
permissions -rw-r--r--
updated

\documentclass{article}
\usepackage{../style}
\usepackage{../langs}

\usepackage{xcolor}
\usepackage{tikz}
\usetikzlibrary{arrows}
\usetikzlibrary{automata}
\usetikzlibrary{shapes}
\usetikzlibrary{shadows}
\usetikzlibrary{positioning}
\usetikzlibrary{calc}
\usetikzlibrary{fit}
\usetikzlibrary{backgrounds}


\begin{document}

\section*{Handout 3}

Let us have a closer look at automata and their relation to
regular expressions. This will help us to understand why the
regular expression matchers in Python and Ruby are so slow
with certain regular expressions. 

A \emph{deterministic finite automaton} (DFA), say $A$, is
defined by a four-tuple written $A(Q, q_0, F, \delta)$ where

\begin{itemize}
\item $Q$ is a set of states,
\item $q_0 \in Q$ is the start state,
\item $F \subseteq Q$ are the accepting states, and
\item $\delta$ is the transition function.
\end{itemize}

\noindent The transition function determines how to
``transition'' from one state to the next state with respect
to a character. We have the assumption that these functions do
not need to be defined everywhere: so it can be the case that
given a character there is no next state, in which case we
need to raise a kind of ``raise an exception''. A typical
example of a DFA is

\begin{center}
\begin{tikzpicture}[>=stealth',very thick,auto,
                             every state/.style={minimum size=0pt,inner sep=2pt,draw=blue!50,very thick,fill=blue!20},]
\node[state,initial]  (q_0)  {$q_0$};
\node[state] (q_1) [right=of q_0] {$q_1$};
\node[state] (q_2) [below right=of q_0] {$q_2$};
\node[state] (q_3) [right=of q_2] {$q_3$};
\node[state, accepting] (q_4) [right=of q_1] {$q_4$};
\path[->] (q_0) edge node [above]  {$a$} (q_1);
\path[->] (q_1) edge node [above]  {$a$} (q_4);
\path[->] (q_4) edge [loop right] node  {$a, b$} ();
\path[->] (q_3) edge node [right]  {$a$} (q_4);
\path[->] (q_2) edge node [above]  {$a$} (q_3);
\path[->] (q_1) edge node [right]  {$b$} (q_2);
\path[->] (q_0) edge node [above]  {$b$} (q_2);
\path[->] (q_2) edge [loop left] node  {$b$} ();
\path[->] (q_3) edge [bend left=95, looseness=1.3] node [below]  {$b$} (q_0);
\end{tikzpicture}
\end{center}

\noindent The accepting state $q_4$ is indicated with double
circles. It is possible that a DFA has no accepting states at
all, or that the starting state is also an accepting state. In
the case above the transition function is defined everywhere
and can be given as a table as follows:

\[
\begin{array}{lcl}
(q_0, a) &\rightarrow& q_1\\
(q_0, b) &\rightarrow& q_2\\
(q_1, a) &\rightarrow& q_4\\
(q_1, b) &\rightarrow& q_2\\
(q_2, a) &\rightarrow& q_3\\
(q_2, b) &\rightarrow& q_2\\
(q_3, a) &\rightarrow& q_4\\
(q_3, b) &\rightarrow& q_0\\
(q_4, a) &\rightarrow& q_4\\
(q_4, b) &\rightarrow& q_4\\
\end{array}
\]

\noindent We need to define the notion of what language is
accepted by an automaton. For this we lift the transition
function $\delta$ from characters to strings as follows:

\[
\begin{array}{lcl}
\hat{\delta}(q, "")        & \dn & q\\
\hat{\delta}(q, c\!::\!s) & \dn & \hat{\delta}(\delta(q, c), s)\\
\end{array}
\]

\noindent Given a string this means we start in the starting
state and take the first character of the string, follow to
the next sate, then take the second character and so on. Once
the string is exhausted and we end up in an accepting state,
then this string is accepted. Otherwise it is not accepted. So
$s$ in the \emph{language accepted by the automaton} $A(Q,
q_0, F, \delta)$ iff

\[
\hat{\delta}(q_0, s) \in F 
\]
  

While with DFA it will always clear that given a character
what the next state is, it will be useful to relax this
restriction. The resulting construction is called a
\emph{non-deterministic finite automaton} (NFA) given as a
four-tuple $A(Q, q_0, F, \rho)$ where

\begin{itemize}
\item $Q$ is a finite set of states
\item $q_0$ is a start state
\item $F$ are some accepting states with $F \subseteq Q$, and
\item $\rho$ is a transition relation.
\end{itemize}

\noindent
Two typical examples of NFAs are
\begin{center}
\begin{tabular}[t]{c@{\hspace{9mm}}c}
\begin{tikzpicture}[scale=0.7,>=stealth',very thick,
                             every state/.style={minimum size=0pt,draw=blue!50,very thick,fill=blue!20},]
\node[state,initial]  (q_0)  {$q_0$};
\node[state] (q_1) [above=of q_0] {$q_1$};
\node[state, accepting] (q_2) [below=of q_0] {$q_2$};
\path[->] (q_0) edge node [left]  {$\epsilon$} (q_1);
\path[->] (q_0) edge node [left]  {$\epsilon$} (q_2);
\path[->] (q_0) edge [loop right] node  {$a$} ();
\path[->] (q_1) edge [loop above] node  {$a$} ();
\path[->] (q_2) edge [loop below] node  {$b$} ();
\end{tikzpicture} &

\raisebox{20mm}{
\begin{tikzpicture}[scale=0.7,>=stealth',very thick,
                             every state/.style={minimum size=0pt,draw=blue!50,very thick,fill=blue!20},]
\node[state,initial]  (r_1)  {$r_1$};
\node[state] (r_2) [above=of r_1] {$r_2$};
\node[state, accepting] (r_3) [right=of r_1] {$r_3$};
\path[->] (r_1) edge node [below]  {$b$} (r_3);
\path[->] (r_2) edge [bend left] node [above]  {$a$} (r_3);
\path[->] (r_1) edge [bend left] node  [left] {$\epsilon$} (r_2);
\path[->] (r_2) edge [bend left] node  [right] {$a$} (r_1);
\end{tikzpicture}}
\end{tabular}
\end{center}

\noindent There are a number of points you should note. Every
DFA is a NFA, but not vice versa. The $\rho$ in NFAs is a
transition \emph{relation} (DFAs have a transition function).
The difference between a function and a relation is that a
function has always a single output, while a relation gives,
roughly speaking, several outputs. Look at the NFA on the
right-hand side above: if you are currently in the state $r_2$
and you read a character $a$, then you can transition to $r_1$
\emph{or} $r_3$. Which route you take is not determined. This
means if we need to decide whether a string is accepted by a
NFA, we might have to explore all possibilities. Also there is
a special transition in NFAs which is called
\emph{epsilon-transition} or \emph{silent transition}. This
transition means you do not have to ``consume'' no part of the
input string, but ``silently'' change to a different state.

The reason for introducing NFAs is that there is a relatively
simple (recursive) translation of regular expressions into
NFAs. Consider the simple regular expressions $\varnothing$,
$\epsilon$ and $c$. They can be translated as follows:

\begin{center}
\begin{tabular}[t]{l@{\hspace{10mm}}l}
\raisebox{1mm}{$\varnothing$} & 
\begin{tikzpicture}[scale=0.7,>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
\node[state, initial]  (q_0)  {$\mbox{}$};
\end{tikzpicture}\\\\
\raisebox{1mm}{$\epsilon$} & 
\begin{tikzpicture}[scale=0.7,>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
\node[state, initial, accepting]  (q_0)  {$\mbox{}$};
\end{tikzpicture}\\\\
\raisebox{2mm}{$c$} & 
\begin{tikzpicture}[scale=0.7,>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
\node[state, initial]  (q_0)  {$\mbox{}$};
\node[state, accepting]  (q_1)  [right=of q_0] {$\mbox{}$};
\path[->] (q_0) edge node [below]  {$c$} (q_1);
\end{tikzpicture}\\\\
\end{tabular}
\end{center}

\noindent The case for the sequence regular expression $r_1
\cdot r_2$ is as follows: We are given by recursion two
automata representing $r_1$ and $r_2$ respectively. 

\begin{center}
\begin{tikzpicture}[node distance=3mm,
                             >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
\node[state, initial]  (q_0)  {$\mbox{}$};
\node (r_1)  [right=of q_0] {$\ldots$};
\node[state, accepting]  (t_1)  [right=of r_1] {$\mbox{}$};
\node[state, accepting]  (t_2)  [above=of t_1] {$\mbox{}$};
\node[state, accepting]  (t_3)  [below=of t_1] {$\mbox{}$};
\node[state, initial]  (a_0)  [right=2.5cm of t_1] {$\mbox{}$};
\node (b_1)  [right=of a_0] {$\ldots$};
\node[state, accepting]  (c_1)  [right=of b_1] {$\mbox{}$};
\node[state, accepting]  (c_2)  [above=of c_1] {$\mbox{}$};
\node[state, accepting]  (c_3)  [below=of c_1] {$\mbox{}$};
\begin{pgfonlayer}{background}
\node (1) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (q_0) (r_1) (t_1) (t_2) (t_3)] {};
\node (2) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (a_0) (b_1) (c_1) (c_2) (c_3)] {};
\node [yshift=2mm] at (1.north) {$r_1$};
\node [yshift=2mm] at (2.north) {$r_2$};
\end{pgfonlayer}
\end{tikzpicture}
\end{center}

\noindent The first automaton has some accepting states. We
obtain an automaton for $r_1\cdot r_2$ by connecting these
accepting states with $\epsilon$-transitions to the starting
state of the second automaton. By doing so we make them
non-accepting like so:

\begin{center}
\begin{tikzpicture}[node distance=3mm,
                             >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
\node[state, initial]  (q_0)  {$\mbox{}$};
\node (r_1)  [right=of q_0] {$\ldots$};
\node[state]  (t_1)  [right=of r_1] {$\mbox{}$};
\node[state]  (t_2)  [above=of t_1] {$\mbox{}$};
\node[state]  (t_3)  [below=of t_1] {$\mbox{}$};
\node[state]  (a_0)  [right=2.5cm of t_1] {$\mbox{}$};
\node (b_1)  [right=of a_0] {$\ldots$};
\node[state, accepting]  (c_1)  [right=of b_1] {$\mbox{}$};
\node[state, accepting]  (c_2)  [above=of c_1] {$\mbox{}$};
\node[state, accepting]  (c_3)  [below=of c_1] {$\mbox{}$};
\path[->] (t_1) edge node [above, pos=0.3]  {$\epsilon$} (a_0);
\path[->] (t_2) edge node [above]  {$\epsilon$} (a_0);
\path[->] (t_3) edge node [below]  {$\epsilon$} (a_0);

\begin{pgfonlayer}{background}
\node (3) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (q_0) (c_1) (c_2) (c_3)] {};
\node [yshift=2mm] at (3.north) {$r_1\cdot r_2$};
\end{pgfonlayer}
\end{tikzpicture}
\end{center}

\noindent The case for the choice regular expression $r_1 +
r_2$ is slightly different: We are given by recursion two
automata representing $r_1$ and $r_2$ respectively. 

\begin{center}
\begin{tikzpicture}[node distance=3mm,
                             >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
\node at (0,0)  (1)  {$\mbox{}$};
\node[state, initial]  (2)  [above right=16mm of 1] {$\mbox{}$};
\node[state, initial]  (3)  [below right=16mm of 1] {$\mbox{}$};

\node (a)  [right=of 2] {$\ldots$};
\node[state, accepting]  (a1)  [right=of a] {$\mbox{}$};
\node[state, accepting]  (a2)  [above=of a1] {$\mbox{}$};
\node[state, accepting]  (a3)  [below=of a1] {$\mbox{}$};

\node (b)  [right=of 3] {$\ldots$};
\node[state, accepting]  (b1)  [right=of b] {$\mbox{}$};
\node[state, accepting]  (b2)  [above=of b1] {$\mbox{}$};
\node[state, accepting]  (b3)  [below=of b1] {$\mbox{}$};
\begin{pgfonlayer}{background}
\node (1) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (2) (a1) (a2) (a3)] {};
\node (2) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (3) (b1) (b2) (b3)] {};
\node [yshift=3mm] at (1.north) {$r_1$};
\node [yshift=3mm] at (2.north) {$r_2$};
\end{pgfonlayer}
\end{tikzpicture}
\end{center}

\noindent Each automaton has a single start state and
potentially several accepting states. We obtain a NFA for the
regular expression $r_1 + r_2$ by introducing a new starting
state and connecting it with an $\epsilon$-transition to the
two starting states above, like so

\begin{center}
\hspace{2cm}\begin{tikzpicture}[node distance=3mm,
                             >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
\node at (0,0) [state, initial]  (1)  {$\mbox{}$};
\node[state]  (2)  [above right=16mm of 1] {$\mbox{}$};
\node[state]  (3)  [below right=16mm of 1] {$\mbox{}$};

\node (a)  [right=of 2] {$\ldots$};
\node[state, accepting]  (a1)  [right=of a] {$\mbox{}$};
\node[state, accepting]  (a2)  [above=of a1] {$\mbox{}$};
\node[state, accepting]  (a3)  [below=of a1] {$\mbox{}$};

\node (b)  [right=of 3] {$\ldots$};
\node[state, accepting]  (b1)  [right=of b] {$\mbox{}$};
\node[state, accepting]  (b2)  [above=of b1] {$\mbox{}$};
\node[state, accepting]  (b3)  [below=of b1] {$\mbox{}$};

\path[->] (1) edge node [above]  {$\epsilon$} (2);
\path[->] (1) edge node [below]  {$\epsilon$} (3);

\begin{pgfonlayer}{background}
\node (3) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (1) (a2) (a3) (b2) (b3)] {};
\node [yshift=3mm] at (3.north) {$r_1+ r_2$};
\end{pgfonlayer}
\end{tikzpicture}
\end{center}

\noindent 
Finally for the $*$-case we have an automaton for $r$

\begin{center}
\begin{tikzpicture}[node distance=3mm,
                             >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
\node at (0,0)  (1)  {$\mbox{}$};
\node[state, initial]  (2)  [right=16mm of 1] {$\mbox{}$};
\node (a)  [right=of 2] {$\ldots$};
\node[state, accepting]  (a1)  [right=of a] {$\mbox{}$};
\node[state, accepting]  (a2)  [above=of a1] {$\mbox{}$};
\node[state, accepting]  (a3)  [below=of a1] {$\mbox{}$};
\begin{pgfonlayer}{background}
\node (1) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (2) (a1) (a2) (a3)] {};
\node [yshift=3mm] at (1.north) {$r$};
\end{pgfonlayer}
\end{tikzpicture}
\end{center}

\noindent and connect its accepting states to a new starting
state via $\epsilon$-transitions. This new starting state is
also an accepting state, because $r^*$ can also recognise the
empty string. This gives the following automaton for $r^*$:

\begin{center}
\begin{tikzpicture}[node distance=3mm,
                             >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
\node at (0,0) [state, initial,accepting]  (1)  {$\mbox{}$};
\node[state]  (2)  [right=16mm of 1] {$\mbox{}$};
\node (a)  [right=of 2] {$\ldots$};
\node[state]  (a1)  [right=of a] {$\mbox{}$};
\node[state]  (a2)  [above=of a1] {$\mbox{}$};
\node[state]  (a3)  [below=of a1] {$\mbox{}$};
\path[->] (1) edge node [above]  {$\epsilon$} (2);
\path[->] (a1) edge [bend left=45] node [above]  {$\epsilon$} (1);
\path[->] (a2) edge [bend right] node [below]  {$\epsilon$} (1);
\path[->] (a3) edge [bend left=45] node [below]  {$\epsilon$} (1);
\begin{pgfonlayer}{background}
\node (2) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (1) (a2) (a3)] {};
\node [yshift=3mm] at (2.north) {$r^*$};
\end{pgfonlayer}
\end{tikzpicture}
\end{center}

\noindent This construction of a NFA from a regular expression
was invented by Ken Thompson in 1968.

\end{document}

%%% Local Variables: 
%%% mode: latex
%%% TeX-master: t
%%% End: