afl-material: comparison handouts/ho03.tex

equal deleted inserted replaced

-:1aa28135a2da
+:e3fd4c5995ef
 \section*{Handout 3}
 Let us have a closer look at automata and their relation to regular expressions. This will help us to understand
 why the regular expression matchers in Python and Ruby are so slow with certain regular expressions.
-A deterministic finite automaton (DFA), say $A$, is defined by  a four-tuple $A(Q, q_0, F, \delta)$ where
+A \emph{deterministic finite automaton} (DFA), say $A$, is defined by  a four-tuple written $A(Q, q_0, F, \delta)$ where
 \begin{itemize}
 \item $Q$ is a set of states,
 \item $q_0 \in Q$ is the start state,
 \item $F \subseteq Q$ are the accepting states, and
 \item $\delta$ is the transition function.
 \end{itemize}
 \noindent
-The transition function determines
+The transition function determines how to ``transition'' from one state to the next state with respect to a character.
-A typical example of a DFA is
+We have the assumption that these functions do not need to be defined everywhere: so it can be the case that
+given a character there is no next state, in which case we need to raise a kind of ``raise an exception''.  A typical
+example of a DFA is
 \begin{center}
 \begin{tikzpicture}[>=stealth',very thick,auto,
 every state/.style={minimum size=0pt,inner sep=2pt,draw=blue!50,very thick,fill=blue!20},]
 \node[state,initial]  (q_0)  {$q_0$};
 \path[->] (q_2) edge [loop left] node  {$b$} ();
 \path[->] (q_3) edge [bend left=95, looseness=1.3] node [below]  {$b$} (q_0);
 \end{tikzpicture}
 \end{center}
+\noindent
+The accepting state $q_4$ is indicated with double circles. It is possible that a DFA has no
+accepting states at all, or that the starting state is also an accepting state.
+In the case above the transition function is defined everywhere and can be given as a table
+as follows:
+\[
+\begin{array}{lcl}
+(q_0, a) &\rightarrow& q_1\\
+(q_0, b) &\rightarrow& q_2\\
+(q_1, a) &\rightarrow& q_4\\
+(q_1, b) &\rightarrow& q_2\\
+(q_2, a) &\rightarrow& q_3\\
+(q_2, b) &\rightarrow& q_2\\
+(q_3, a) &\rightarrow& q_4\\
+(q_3, b) &\rightarrow& q_0\\
+(q_4, a) &\rightarrow& q_4\\
+(q_4, b) &\rightarrow& q_4\\
+\end{array}
+\]
+\noindent
+We need to define the notion of what language is accepted by an automaton. For this we
+lift the transition function $\delta$ from characters to strings as follows:
+\[
+\begin{array}{lcl}
+\hat{\delta}(q, "")        & \dn & q\\
+\hat{\delta}(q, c\!::\!s) & \dn & \hat{\delta}(\delta(q, c), s)\\
+\end{array}
+\]
+\noindent
+Given a string this means we start in the starting state and take the first character of the string,
+follow to the next sate, then take the second character and so on. Once the string is exhausted
+and we end up in an accepting state, then this string is accepted. Otherwise it is not accepted.
+So $s$ in the \emph{language accepted by the automaton} $A(Q, q_0, F, \delta)$ iff
+\[
+\hat{\delta}(q_0, s) \in F
+\]
+While with DFA it will always clear that given a character what the next state is, it will be useful to relax
+this restriction. The resulting construction is called a \emph{non-deterministic finite automaton} (NFA) given
+as a four-tuple $A(Q, q_0, F, \rho)$ where
+\begin{itemize}
+\item $Q$ is a finite set of states
+\item $q_0$ is a start state
+\item $F$ are some accepting states with $F \subseteq Q$, and
+\item $\rho$ is a transition relation.
+\end{itemize}
+\noindent
+Two typical examples of NFAs are
+\begin{center}
+\begin{tabular}[t]{c@{\hspace{9mm}}c}
+\begin{tikzpicture}[scale=0.7,>=stealth',very thick,
+every state/.style={minimum size=0pt,draw=blue!50,very thick,fill=blue!20},]
+\node[state,initial]  (q_0)  {$q_0$};
+\node[state] (q_1) [above=of q_0] {$q_1$};
+\node[state, accepting] (q_2) [below=of q_0] {$q_2$};
+\path[->] (q_0) edge node [left]  {$\epsilon$} (q_1);
+\path[->] (q_0) edge node [left]  {$\epsilon$} (q_2);
+\path[->] (q_0) edge [loop right] node  {$a$} ();
+\path[->] (q_1) edge [loop above] node  {$a$} ();
+\path[->] (q_2) edge [loop below] node  {$b$} ();
+\end{tikzpicture} &
+\raisebox{20mm}{
+\begin{tikzpicture}[scale=0.7,>=stealth',very thick,
+every state/.style={minimum size=0pt,draw=blue!50,very thick,fill=blue!20},]
+\node[state,initial]  (r_1)  {$r_1$};
+\node[state] (r_2) [above=of r_1] {$r_2$};
+\node[state, accepting] (r_3) [right=of r_1] {$r_3$};
+\path[->] (r_1) edge node [below]  {$b$} (r_3);
+\path[->] (r_2) edge [bend left] node [above]  {$a$} (r_3);
+\path[->] (r_1) edge [bend left] node  [left] {$\epsilon$} (r_2);
+\path[->] (r_2) edge [bend left] node  [right] {$a$} (r_1);
+\end{tikzpicture}}
+\end{tabular}
+\end{center}
+\noindent
+There are a number of points you should note. Every DFA is a NFA, but not vice versa.
+The $\rho$ in NFAs is a transition \emph{relation}
+(DFAs have a transition function). The difference between a function and a relation is that a function
+has always a single output, while a relation gives, roughly speaking, several outputs. Look
+at the NFA on the right-hand side above: if you are currently in the state $r_2$ and you read a
+character $a$, then you can transition to $r_1$ \emph{or} $r_3$. Which route you take is not
+determined. This means if we need to decide whether a string is accepted by a NFA, we might have
+to explore all possibilities. Also there is a special transition in NFAs which is called \emph{epsilon-transition}
+or \emph{silent transition}. This transition means you do not have to ``consume'' no part of
+the input string, but ``silently'' change to a different state.
+The reason for introducing NFAs is that there is a relatively simple (recursive) translation of regular expressions into
+NFAs. Consider the simple regular expressions $\varnothing$, $\epsilon$ and $c$. They can be translated
+as follows:
+\begin{center}
+\begin{tabular}[t]{l@{\hspace{10mm}}l}
+\raisebox{1mm}{$\varnothing$} &
+\begin{tikzpicture}[scale=0.7,>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
+\node[state, initial]  (q_0)  {$\mbox{}$};
+\end{tikzpicture}\\\\
+\raisebox{1mm}{$\epsilon$} &
+\begin{tikzpicture}[scale=0.7,>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
+\node[state, initial, accepting]  (q_0)  {$\mbox{}$};
+\end{tikzpicture}\\\\
+\raisebox{2mm}{$c$} &
+\begin{tikzpicture}[scale=0.7,>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
+\node[state, initial]  (q_0)  {$\mbox{}$};
+\node[state, accepting]  (q_1)  [right=of q_0] {$\mbox{}$};
+\path[->] (q_0) edge node [below]  {$c$} (q_1);
+\end{tikzpicture}\\\\
+\end{tabular}
+\end{center}
+\noindent
+The case for the sequence regular expression $r_1 \cdot r_2$ is as follows: We are given by recursion
+two automata representing $r_1$ and $r_2$ respectively.
+\begin{center}
+\begin{tikzpicture}[node distance=3mm,
+>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
+\node[state, initial]  (q_0)  {$\mbox{}$};
+\node (r_1)  [right=of q_0] {$\ldots$};
+\node[state, accepting]  (t_1)  [right=of r_1] {$\mbox{}$};
+\node[state, accepting]  (t_2)  [above=of t_1] {$\mbox{}$};
+\node[state, accepting]  (t_3)  [below=of t_1] {$\mbox{}$};
+\node[state, initial]  (a_0)  [right=2.5cm of t_1] {$\mbox{}$};
+\node (b_1)  [right=of a_0] {$\ldots$};
+\node[state, accepting]  (c_1)  [right=of b_1] {$\mbox{}$};
+\node[state, accepting]  (c_2)  [above=of c_1] {$\mbox{}$};
+\node[state, accepting]  (c_3)  [below=of c_1] {$\mbox{}$};
+\begin{pgfonlayer}{background}
+\node (1) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (q_0) (r_1) (t_1) (t_2) (t_3)] {};
+\node (2) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (a_0) (b_1) (c_1) (c_2) (c_3)] {};
+\node [yshift=2mm] at (1.north) {$r_1$};
+\node [yshift=2mm] at (2.north) {$r_2$};
+\end{pgfonlayer}
+\end{tikzpicture}
+\end{center}
+\noindent
+The first automaton has some accepting states. We obtain an automaton for $r_1\cdot r_2$ by connecting
+these accepting states with $\epsilon$-transitions to the starting state of the second automaton. By doing
+so we make them non-accepting like so:
+\begin{center}
+\begin{tikzpicture}[node distance=3mm,
+>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
+\node[state, initial]  (q_0)  {$\mbox{}$};
+\node (r_1)  [right=of q_0] {$\ldots$};
+\node[state]  (t_1)  [right=of r_1] {$\mbox{}$};
+\node[state]  (t_2)  [above=of t_1] {$\mbox{}$};
+\node[state]  (t_3)  [below=of t_1] {$\mbox{}$};
+\node[state]  (a_0)  [right=2.5cm of t_1] {$\mbox{}$};
+\node (b_1)  [right=of a_0] {$\ldots$};
+\node[state, accepting]  (c_1)  [right=of b_1] {$\mbox{}$};
+\node[state, accepting]  (c_2)  [above=of c_1] {$\mbox{}$};
+\node[state, accepting]  (c_3)  [below=of c_1] {$\mbox{}$};
+\path[->] (t_1) edge node [above, pos=0.3]  {$\epsilon$} (a_0);
+\path[->] (t_2) edge node [above]  {$\epsilon$} (a_0);
+\path[->] (t_3) edge node [below]  {$\epsilon$} (a_0);
+\begin{pgfonlayer}{background}
+\node (3) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (q_0) (c_1) (c_2) (c_3)] {};
+\node [yshift=2mm] at (3.north) {$r_1\cdot r_2$};
+\end{pgfonlayer}
+\end{tikzpicture}
+\end{center}
+\noindent
+The case for the choice regular expression $r_1 + r_2$ is slightly different: We are given by recursion
+two automata representing $r_1$ and $r_2$ respectively.
+\begin{center}
+\begin{tikzpicture}[node distance=3mm,
+>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
+\node at (0,0)  (1)  {$\mbox{}$};
+\node[state, initial]  (2)  [above right=16mm of 1] {$\mbox{}$};
+\node[state, initial]  (3)  [below right=16mm of 1] {$\mbox{}$};
+\node (a)  [right=of 2] {$\ldots$};
+\node[state, accepting]  (a1)  [right=of a] {$\mbox{}$};
+\node[state, accepting]  (a2)  [above=of a1] {$\mbox{}$};
+\node[state, accepting]  (a3)  [below=of a1] {$\mbox{}$};
+\node (b)  [right=of 3] {$\ldots$};
+\node[state, accepting]  (b1)  [right=of b] {$\mbox{}$};
+\node[state, accepting]  (b2)  [above=of b1] {$\mbox{}$};
+\node[state, accepting]  (b3)  [below=of b1] {$\mbox{}$};
+\begin{pgfonlayer}{background}
+\node (1) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (2) (a1) (a2) (a3)] {};
+\node (2) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (3) (b1) (b2) (b3)] {};
+\node [yshift=3mm] at (1.north) {$r_1$};
+\node [yshift=3mm] at (2.north) {$r_2$};
+\end{pgfonlayer}
+\end{tikzpicture}
+\end{center}
+\noindent
+Each automaton has a single start state and potentially several accepting states. We obtain a
+NFA for the regular expression $r_1 + r_2$ by introducing a new starting state and connecting it
+with an $\epsilon$-transition to the two starting states above, like so
+\begin{center}
+\hspace{2cm}\begin{tikzpicture}[node distance=3mm,
+>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
+\node at (0,0) [state, initial]  (1)  {$\mbox{}$};
+\node[state]  (2)  [above right=16mm of 1] {$\mbox{}$};
+\node[state]  (3)  [below right=16mm of 1] {$\mbox{}$};
+\node (a)  [right=of 2] {$\ldots$};
+\node[state, accepting]  (a1)  [right=of a] {$\mbox{}$};
+\node[state, accepting]  (a2)  [above=of a1] {$\mbox{}$};
+\node[state, accepting]  (a3)  [below=of a1] {$\mbox{}$};
+\node (b)  [right=of 3] {$\ldots$};
+\node[state, accepting]  (b1)  [right=of b] {$\mbox{}$};
+\node[state, accepting]  (b2)  [above=of b1] {$\mbox{}$};
+\node[state, accepting]  (b3)  [below=of b1] {$\mbox{}$};
+\path[->] (1) edge node [above]  {$\epsilon$} (2);
+\path[->] (1) edge node [below]  {$\epsilon$} (3);
+\begin{pgfonlayer}{background}
+\node (3) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (1) (a2) (a3) (b2) (b3)] {};
+\node [yshift=3mm] at (3.north) {$r_1+ r_2$};
+\end{pgfonlayer}
+\end{tikzpicture}
+\end{center}
+\noindent
+Finally for the $*$-case we have an automaton for $r$
+\begin{center}
+\begin{tikzpicture}[node distance=3mm,
+>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
+\node at (0,0)  (1)  {$\mbox{}$};
+\node[state, initial]  (2)  [right=16mm of 1] {$\mbox{}$};
+\node (a)  [right=of 2] {$\ldots$};
+\node[state, accepting]  (a1)  [right=of a] {$\mbox{}$};
+\node[state, accepting]  (a2)  [above=of a1] {$\mbox{}$};
+\node[state, accepting]  (a3)  [below=of a1] {$\mbox{}$};
+\begin{pgfonlayer}{background}
+\node (1) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (2) (a1) (a2) (a3)] {};
+\node [yshift=3mm] at (1.north) {$r$};
+\end{pgfonlayer}
+\end{tikzpicture}
+\end{center}
+\noindent
+and connect its accepting states to a new starting state via $\epsilon$-transitions. This new
+starting state is also an accepting state, because $r^*$ can also recognise the empty string.
+This gives the following automaton for $r^*$:
+\begin{center}
+\begin{tikzpicture}[node distance=3mm,
+>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
+\node at (0,0) [state, initial,accepting]  (1)  {$\mbox{}$};
+\node[state]  (2)  [right=16mm of 1] {$\mbox{}$};
+\node (a)  [right=of 2] {$\ldots$};
+\node[state]  (a1)  [right=of a] {$\mbox{}$};
+\node[state]  (a2)  [above=of a1] {$\mbox{}$};
+\node[state]  (a3)  [below=of a1] {$\mbox{}$};
+\path[->] (1) edge node [above]  {$\epsilon$} (2);
+\path[->] (a1) edge [bend left=45] node [above]  {$\epsilon$} (1);
+\path[->] (a2) edge [bend right] node [below]  {$\epsilon$} (1);
+\path[->] (a3) edge [bend left=45] node [below]  {$\epsilon$} (1);
+\begin{pgfonlayer}{background}
+\node (2) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (1) (a2) (a3)] {};
+\node [yshift=3mm] at (2.north) {$r^*$};
+\end{pgfonlayer}
+\end{tikzpicture}
+\end{center}
 \end{document}
 %%% Local Variables:
 %%% mode: latex
 %%% TeX-master: t

changeset 143	e3fd4c5995ef
parent 142	1aa28135a2da
child 144	0cb61bed557d