handouts/ho03.tex
author Christian Urban <christian dot urban at kcl dot ac dot uk>
Sun, 01 Dec 2013 10:17:17 +0000
changeset 217 cd6066f1056a
parent 144 0cb61bed557d
child 251 5b5a68df6d16
permissions -rw-r--r--
updated handouts

\documentclass{article}
\usepackage{hyperref}
\usepackage{amssymb}
\usepackage{amsmath}
\usepackage[T1]{fontenc}
\usepackage{listings}
\usepackage{xcolor}
\usepackage{tikz}
\usetikzlibrary{arrows}
\usetikzlibrary{automata}
\usetikzlibrary{shapes}
\usetikzlibrary{shadows}
\usetikzlibrary{positioning}
\usetikzlibrary{calc}
\usetikzlibrary{fit}
\usetikzlibrary{backgrounds}
\usepackage{../langs}

\newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}%


\begin{document}

\section*{Handout 3}

Let us have a closer look at automata and their relation to regular expressions. This will help us to understand
why the regular expression matchers in Python and Ruby are so slow with certain regular expressions. 

A \emph{deterministic finite automaton} (DFA), say $A$, is defined by  a four-tuple written $A(Q, q_0, F, \delta)$ where

\begin{itemize}
\item $Q$ is a set of states,
\item $q_0 \in Q$ is the start state,
\item $F \subseteq Q$ are the accepting states, and
\item $\delta$ is the transition function.
\end{itemize}

\noindent
The transition function determines how to ``transition'' from one state to the next state with respect to a character.
We have the assumption that these functions do not need to be defined everywhere: so it can be the case that
given a character there is no next state, in which case we need to raise a kind of ``raise an exception''.  A typical 
example of a DFA is

\begin{center}
\begin{tikzpicture}[>=stealth',very thick,auto,
                             every state/.style={minimum size=0pt,inner sep=2pt,draw=blue!50,very thick,fill=blue!20},]
\node[state,initial]  (q_0)  {$q_0$};
\node[state] (q_1) [right=of q_0] {$q_1$};
\node[state] (q_2) [below right=of q_0] {$q_2$};
\node[state] (q_3) [right=of q_2] {$q_3$};
\node[state, accepting] (q_4) [right=of q_1] {$q_4$};
\path[->] (q_0) edge node [above]  {$a$} (q_1);
\path[->] (q_1) edge node [above]  {$a$} (q_4);
\path[->] (q_4) edge [loop right] node  {$a, b$} ();
\path[->] (q_3) edge node [right]  {$a$} (q_4);
\path[->] (q_2) edge node [above]  {$a$} (q_3);
\path[->] (q_1) edge node [right]  {$b$} (q_2);
\path[->] (q_0) edge node [above]  {$b$} (q_2);
\path[->] (q_2) edge [loop left] node  {$b$} ();
\path[->] (q_3) edge [bend left=95, looseness=1.3] node [below]  {$b$} (q_0);
\end{tikzpicture}
\end{center}

\noindent
The accepting state $q_4$ is indicated with double circles. It is possible that a DFA has no
accepting states at all, or that the starting state is also an accepting state.
In the case above the transition function is defined everywhere and can be given as a table
as follows:

\[
\begin{array}{lcl}
(q_0, a) &\rightarrow& q_1\\
(q_0, b) &\rightarrow& q_2\\
(q_1, a) &\rightarrow& q_4\\
(q_1, b) &\rightarrow& q_2\\
(q_2, a) &\rightarrow& q_3\\
(q_2, b) &\rightarrow& q_2\\
(q_3, a) &\rightarrow& q_4\\
(q_3, b) &\rightarrow& q_0\\
(q_4, a) &\rightarrow& q_4\\
(q_4, b) &\rightarrow& q_4\\
\end{array}
\]

\noindent
We need to define the notion of what language is accepted by an automaton. For this we 
lift the transition function $\delta$ from characters to strings as follows:

\[
\begin{array}{lcl}
\hat{\delta}(q, "")        & \dn & q\\
\hat{\delta}(q, c\!::\!s) & \dn & \hat{\delta}(\delta(q, c), s)\\
\end{array}
\]

\noindent
Given a string this means we start in the starting state and take the first character of the string,
follow to the next sate, then take the second character and so on. Once the string is exhausted
and we end up in an accepting state, then this string is accepted. Otherwise it is not accepted. 
So $s$ in the \emph{language accepted by the automaton} $A(Q, q_0, F, \delta)$ iff

\[
\hat{\delta}(q_0, s) \in F 
\]
  

While with DFA it will always clear that given a character what the next state is, it will be useful to relax 
this restriction. The resulting construction is called a \emph{non-deterministic finite automaton} (NFA) given
as a four-tuple $A(Q, q_0, F, \rho)$ where

\begin{itemize}
\item $Q$ is a finite set of states
\item $q_0$ is a start state
\item $F$ are some accepting states with $F \subseteq Q$, and
\item $\rho$ is a transition relation.
\end{itemize}

\noindent
Two typical examples of NFAs are
\begin{center}
\begin{tabular}[t]{c@{\hspace{9mm}}c}
\begin{tikzpicture}[scale=0.7,>=stealth',very thick,
                             every state/.style={minimum size=0pt,draw=blue!50,very thick,fill=blue!20},]
\node[state,initial]  (q_0)  {$q_0$};
\node[state] (q_1) [above=of q_0] {$q_1$};
\node[state, accepting] (q_2) [below=of q_0] {$q_2$};
\path[->] (q_0) edge node [left]  {$\epsilon$} (q_1);
\path[->] (q_0) edge node [left]  {$\epsilon$} (q_2);
\path[->] (q_0) edge [loop right] node  {$a$} ();
\path[->] (q_1) edge [loop above] node  {$a$} ();
\path[->] (q_2) edge [loop below] node  {$b$} ();
\end{tikzpicture} &

\raisebox{20mm}{
\begin{tikzpicture}[scale=0.7,>=stealth',very thick,
                             every state/.style={minimum size=0pt,draw=blue!50,very thick,fill=blue!20},]
\node[state,initial]  (r_1)  {$r_1$};
\node[state] (r_2) [above=of r_1] {$r_2$};
\node[state, accepting] (r_3) [right=of r_1] {$r_3$};
\path[->] (r_1) edge node [below]  {$b$} (r_3);
\path[->] (r_2) edge [bend left] node [above]  {$a$} (r_3);
\path[->] (r_1) edge [bend left] node  [left] {$\epsilon$} (r_2);
\path[->] (r_2) edge [bend left] node  [right] {$a$} (r_1);
\end{tikzpicture}}
\end{tabular}
\end{center}

\noindent
There are a number of points you should note. Every DFA is a NFA, but not vice versa.
The $\rho$ in NFAs is a transition \emph{relation} 
(DFAs have a transition function). The difference between a function and a relation is that a function 
has always a single output, while a relation gives, roughly speaking, several outputs. Look
at the NFA on the right-hand side above: if you are currently in the state $r_2$ and you read a
character $a$, then you can transition to $r_1$ \emph{or} $r_3$. Which route you take is not
determined. This means if we need to decide whether a string is accepted by a NFA, we might have 
to explore all possibilities. Also there is a special transition in NFAs which is called \emph{epsilon-transition}
or \emph{silent transition}. This transition means you do not have to ``consume'' no part of
the input string, but ``silently'' change to a different state.

The reason for introducing NFAs is that there is a relatively simple (recursive) translation of regular expressions into
NFAs. Consider the simple regular expressions $\varnothing$, $\epsilon$ and $c$. They can be translated
as follows:

\begin{center}
\begin{tabular}[t]{l@{\hspace{10mm}}l}
\raisebox{1mm}{$\varnothing$} & 
\begin{tikzpicture}[scale=0.7,>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
\node[state, initial]  (q_0)  {$\mbox{}$};
\end{tikzpicture}\\\\
\raisebox{1mm}{$\epsilon$} & 
\begin{tikzpicture}[scale=0.7,>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
\node[state, initial, accepting]  (q_0)  {$\mbox{}$};
\end{tikzpicture}\\\\
\raisebox{2mm}{$c$} & 
\begin{tikzpicture}[scale=0.7,>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
\node[state, initial]  (q_0)  {$\mbox{}$};
\node[state, accepting]  (q_1)  [right=of q_0] {$\mbox{}$};
\path[->] (q_0) edge node [below]  {$c$} (q_1);
\end{tikzpicture}\\\\
\end{tabular}
\end{center}

\noindent
The case for the sequence regular expression $r_1 \cdot r_2$ is as follows: We are given by recursion
two automata representing $r_1$ and $r_2$ respectively. 

\begin{center}
\begin{tikzpicture}[node distance=3mm,
                             >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
\node[state, initial]  (q_0)  {$\mbox{}$};
\node (r_1)  [right=of q_0] {$\ldots$};
\node[state, accepting]  (t_1)  [right=of r_1] {$\mbox{}$};
\node[state, accepting]  (t_2)  [above=of t_1] {$\mbox{}$};
\node[state, accepting]  (t_3)  [below=of t_1] {$\mbox{}$};
\node[state, initial]  (a_0)  [right=2.5cm of t_1] {$\mbox{}$};
\node (b_1)  [right=of a_0] {$\ldots$};
\node[state, accepting]  (c_1)  [right=of b_1] {$\mbox{}$};
\node[state, accepting]  (c_2)  [above=of c_1] {$\mbox{}$};
\node[state, accepting]  (c_3)  [below=of c_1] {$\mbox{}$};
\begin{pgfonlayer}{background}
\node (1) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (q_0) (r_1) (t_1) (t_2) (t_3)] {};
\node (2) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (a_0) (b_1) (c_1) (c_2) (c_3)] {};
\node [yshift=2mm] at (1.north) {$r_1$};
\node [yshift=2mm] at (2.north) {$r_2$};
\end{pgfonlayer}
\end{tikzpicture}
\end{center}

\noindent
The first automaton has some accepting states. We obtain an automaton for $r_1\cdot r_2$ by connecting
these accepting states with $\epsilon$-transitions to the starting state of the second automaton. By doing 
so we make them non-accepting like so:

\begin{center}
\begin{tikzpicture}[node distance=3mm,
                             >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
\node[state, initial]  (q_0)  {$\mbox{}$};
\node (r_1)  [right=of q_0] {$\ldots$};
\node[state]  (t_1)  [right=of r_1] {$\mbox{}$};
\node[state]  (t_2)  [above=of t_1] {$\mbox{}$};
\node[state]  (t_3)  [below=of t_1] {$\mbox{}$};
\node[state]  (a_0)  [right=2.5cm of t_1] {$\mbox{}$};
\node (b_1)  [right=of a_0] {$\ldots$};
\node[state, accepting]  (c_1)  [right=of b_1] {$\mbox{}$};
\node[state, accepting]  (c_2)  [above=of c_1] {$\mbox{}$};
\node[state, accepting]  (c_3)  [below=of c_1] {$\mbox{}$};
\path[->] (t_1) edge node [above, pos=0.3]  {$\epsilon$} (a_0);
\path[->] (t_2) edge node [above]  {$\epsilon$} (a_0);
\path[->] (t_3) edge node [below]  {$\epsilon$} (a_0);

\begin{pgfonlayer}{background}
\node (3) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (q_0) (c_1) (c_2) (c_3)] {};
\node [yshift=2mm] at (3.north) {$r_1\cdot r_2$};
\end{pgfonlayer}
\end{tikzpicture}
\end{center}

\noindent
The case for the choice regular expression $r_1 + r_2$ is slightly different: We are given by recursion
two automata representing $r_1$ and $r_2$ respectively. 

\begin{center}
\begin{tikzpicture}[node distance=3mm,
                             >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
\node at (0,0)  (1)  {$\mbox{}$};
\node[state, initial]  (2)  [above right=16mm of 1] {$\mbox{}$};
\node[state, initial]  (3)  [below right=16mm of 1] {$\mbox{}$};

\node (a)  [right=of 2] {$\ldots$};
\node[state, accepting]  (a1)  [right=of a] {$\mbox{}$};
\node[state, accepting]  (a2)  [above=of a1] {$\mbox{}$};
\node[state, accepting]  (a3)  [below=of a1] {$\mbox{}$};

\node (b)  [right=of 3] {$\ldots$};
\node[state, accepting]  (b1)  [right=of b] {$\mbox{}$};
\node[state, accepting]  (b2)  [above=of b1] {$\mbox{}$};
\node[state, accepting]  (b3)  [below=of b1] {$\mbox{}$};
\begin{pgfonlayer}{background}
\node (1) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (2) (a1) (a2) (a3)] {};
\node (2) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (3) (b1) (b2) (b3)] {};
\node [yshift=3mm] at (1.north) {$r_1$};
\node [yshift=3mm] at (2.north) {$r_2$};
\end{pgfonlayer}
\end{tikzpicture}
\end{center}

\noindent
Each automaton has a single start state and potentially several accepting states. We obtain a
NFA for the regular expression $r_1 + r_2$ by introducing a new starting state and connecting it
with an $\epsilon$-transition to the two starting states above, like so

\begin{center}
\hspace{2cm}\begin{tikzpicture}[node distance=3mm,
                             >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
\node at (0,0) [state, initial]  (1)  {$\mbox{}$};
\node[state]  (2)  [above right=16mm of 1] {$\mbox{}$};
\node[state]  (3)  [below right=16mm of 1] {$\mbox{}$};

\node (a)  [right=of 2] {$\ldots$};
\node[state, accepting]  (a1)  [right=of a] {$\mbox{}$};
\node[state, accepting]  (a2)  [above=of a1] {$\mbox{}$};
\node[state, accepting]  (a3)  [below=of a1] {$\mbox{}$};

\node (b)  [right=of 3] {$\ldots$};
\node[state, accepting]  (b1)  [right=of b] {$\mbox{}$};
\node[state, accepting]  (b2)  [above=of b1] {$\mbox{}$};
\node[state, accepting]  (b3)  [below=of b1] {$\mbox{}$};

\path[->] (1) edge node [above]  {$\epsilon$} (2);
\path[->] (1) edge node [below]  {$\epsilon$} (3);

\begin{pgfonlayer}{background}
\node (3) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (1) (a2) (a3) (b2) (b3)] {};
\node [yshift=3mm] at (3.north) {$r_1+ r_2$};
\end{pgfonlayer}
\end{tikzpicture}
\end{center}

\noindent 
Finally for the $*$-case we have an automaton for $r$

\begin{center}
\begin{tikzpicture}[node distance=3mm,
                             >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
\node at (0,0)  (1)  {$\mbox{}$};
\node[state, initial]  (2)  [right=16mm of 1] {$\mbox{}$};
\node (a)  [right=of 2] {$\ldots$};
\node[state, accepting]  (a1)  [right=of a] {$\mbox{}$};
\node[state, accepting]  (a2)  [above=of a1] {$\mbox{}$};
\node[state, accepting]  (a3)  [below=of a1] {$\mbox{}$};
\begin{pgfonlayer}{background}
\node (1) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (2) (a1) (a2) (a3)] {};
\node [yshift=3mm] at (1.north) {$r$};
\end{pgfonlayer}
\end{tikzpicture}
\end{center}

\noindent
and connect its accepting states to a new starting state via $\epsilon$-transitions. This new
starting state is also an accepting state, because $r^*$ can also recognise the empty string.
This gives the following automaton for $r^*$:

\begin{center}
\begin{tikzpicture}[node distance=3mm,
                             >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
\node at (0,0) [state, initial,accepting]  (1)  {$\mbox{}$};
\node[state]  (2)  [right=16mm of 1] {$\mbox{}$};
\node (a)  [right=of 2] {$\ldots$};
\node[state]  (a1)  [right=of a] {$\mbox{}$};
\node[state]  (a2)  [above=of a1] {$\mbox{}$};
\node[state]  (a3)  [below=of a1] {$\mbox{}$};
\path[->] (1) edge node [above]  {$\epsilon$} (2);
\path[->] (a1) edge [bend left=45] node [above]  {$\epsilon$} (1);
\path[->] (a2) edge [bend right] node [below]  {$\epsilon$} (1);
\path[->] (a3) edge [bend left=45] node [below]  {$\epsilon$} (1);
\begin{pgfonlayer}{background}
\node (2) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (1) (a2) (a3)] {};
\node [yshift=3mm] at (2.north) {$r^*$};
\end{pgfonlayer}
\end{tikzpicture}
\end{center}

\noindent
This construction of a NFA from a regular expression was invented by Ken Thompson in 1968.

\end{document}

%%% Local Variables: 
%%% mode: latex
%%% TeX-master: t
%%% End: