handouts/ho03.tex
changeset 143 e3fd4c5995ef
parent 142 1aa28135a2da
child 144 0cb61bed557d
equal deleted inserted replaced
142:1aa28135a2da 143:e3fd4c5995ef
    59 \section*{Handout 3}
    59 \section*{Handout 3}
    60 
    60 
    61 Let us have a closer look at automata and their relation to regular expressions. This will help us to understand
    61 Let us have a closer look at automata and their relation to regular expressions. This will help us to understand
    62 why the regular expression matchers in Python and Ruby are so slow with certain regular expressions. 
    62 why the regular expression matchers in Python and Ruby are so slow with certain regular expressions. 
    63 
    63 
    64 A deterministic finite automaton (DFA), say $A$, is defined by  a four-tuple $A(Q, q_0, F, \delta)$ where
    64 A \emph{deterministic finite automaton} (DFA), say $A$, is defined by  a four-tuple written $A(Q, q_0, F, \delta)$ where
    65 
    65 
    66 \begin{itemize}
    66 \begin{itemize}
    67 \item $Q$ is a set of states,
    67 \item $Q$ is a set of states,
    68 \item $q_0 \in Q$ is the start state,
    68 \item $q_0 \in Q$ is the start state,
    69 \item $F \subseteq Q$ are the accepting states, and
    69 \item $F \subseteq Q$ are the accepting states, and
    70 \item $\delta$ is the transition function.
    70 \item $\delta$ is the transition function.
    71 \end{itemize}
    71 \end{itemize}
    72 
    72 
    73 \noindent
    73 \noindent
    74 The transition function determines 
    74 The transition function determines how to ``transition'' from one state to the next state with respect to a character.
    75 A typical example of a DFA is
    75 We have the assumption that these functions do not need to be defined everywhere: so it can be the case that
       
    76 given a character there is no next state, in which case we need to raise a kind of ``raise an exception''.  A typical 
       
    77 example of a DFA is
    76 
    78 
    77 \begin{center}
    79 \begin{center}
    78 \begin{tikzpicture}[>=stealth',very thick,auto,
    80 \begin{tikzpicture}[>=stealth',very thick,auto,
    79                              every state/.style={minimum size=0pt,inner sep=2pt,draw=blue!50,very thick,fill=blue!20},]
    81                              every state/.style={minimum size=0pt,inner sep=2pt,draw=blue!50,very thick,fill=blue!20},]
    80 \node[state,initial]  (q_0)  {$q_0$};
    82 \node[state,initial]  (q_0)  {$q_0$};
    92 \path[->] (q_2) edge [loop left] node  {$b$} ();
    94 \path[->] (q_2) edge [loop left] node  {$b$} ();
    93 \path[->] (q_3) edge [bend left=95, looseness=1.3] node [below]  {$b$} (q_0);
    95 \path[->] (q_3) edge [bend left=95, looseness=1.3] node [below]  {$b$} (q_0);
    94 \end{tikzpicture}
    96 \end{tikzpicture}
    95 \end{center}
    97 \end{center}
    96 
    98 
       
    99 \noindent
       
   100 The accepting state $q_4$ is indicated with double circles. It is possible that a DFA has no
       
   101 accepting states at all, or that the starting state is also an accepting state.
       
   102 In the case above the transition function is defined everywhere and can be given as a table
       
   103 as follows:
       
   104 
       
   105 \[
       
   106 \begin{array}{lcl}
       
   107 (q_0, a) &\rightarrow& q_1\\
       
   108 (q_0, b) &\rightarrow& q_2\\
       
   109 (q_1, a) &\rightarrow& q_4\\
       
   110 (q_1, b) &\rightarrow& q_2\\
       
   111 (q_2, a) &\rightarrow& q_3\\
       
   112 (q_2, b) &\rightarrow& q_2\\
       
   113 (q_3, a) &\rightarrow& q_4\\
       
   114 (q_3, b) &\rightarrow& q_0\\
       
   115 (q_4, a) &\rightarrow& q_4\\
       
   116 (q_4, b) &\rightarrow& q_4\\
       
   117 \end{array}
       
   118 \]
       
   119 
       
   120 \noindent
       
   121 We need to define the notion of what language is accepted by an automaton. For this we 
       
   122 lift the transition function $\delta$ from characters to strings as follows:
       
   123 
       
   124 \[
       
   125 \begin{array}{lcl}
       
   126 \hat{\delta}(q, "")        & \dn & q\\
       
   127 \hat{\delta}(q, c\!::\!s) & \dn & \hat{\delta}(\delta(q, c), s)\\
       
   128 \end{array}
       
   129 \]
       
   130 
       
   131 \noindent
       
   132 Given a string this means we start in the starting state and take the first character of the string,
       
   133 follow to the next sate, then take the second character and so on. Once the string is exhausted
       
   134 and we end up in an accepting state, then this string is accepted. Otherwise it is not accepted. 
       
   135 So $s$ in the \emph{language accepted by the automaton} $A(Q, q_0, F, \delta)$ iff
       
   136 
       
   137 \[
       
   138 \hat{\delta}(q_0, s) \in F 
       
   139 \]
       
   140   
       
   141 
       
   142 While with DFA it will always clear that given a character what the next state is, it will be useful to relax 
       
   143 this restriction. The resulting construction is called a \emph{non-deterministic finite automaton} (NFA) given
       
   144 as a four-tuple $A(Q, q_0, F, \rho)$ where
       
   145 
       
   146 \begin{itemize}
       
   147 \item $Q$ is a finite set of states
       
   148 \item $q_0$ is a start state
       
   149 \item $F$ are some accepting states with $F \subseteq Q$, and
       
   150 \item $\rho$ is a transition relation.
       
   151 \end{itemize}
       
   152 
       
   153 \noindent
       
   154 Two typical examples of NFAs are
       
   155 \begin{center}
       
   156 \begin{tabular}[t]{c@{\hspace{9mm}}c}
       
   157 \begin{tikzpicture}[scale=0.7,>=stealth',very thick,
       
   158                              every state/.style={minimum size=0pt,draw=blue!50,very thick,fill=blue!20},]
       
   159 \node[state,initial]  (q_0)  {$q_0$};
       
   160 \node[state] (q_1) [above=of q_0] {$q_1$};
       
   161 \node[state, accepting] (q_2) [below=of q_0] {$q_2$};
       
   162 \path[->] (q_0) edge node [left]  {$\epsilon$} (q_1);
       
   163 \path[->] (q_0) edge node [left]  {$\epsilon$} (q_2);
       
   164 \path[->] (q_0) edge [loop right] node  {$a$} ();
       
   165 \path[->] (q_1) edge [loop above] node  {$a$} ();
       
   166 \path[->] (q_2) edge [loop below] node  {$b$} ();
       
   167 \end{tikzpicture} &
       
   168 
       
   169 \raisebox{20mm}{
       
   170 \begin{tikzpicture}[scale=0.7,>=stealth',very thick,
       
   171                              every state/.style={minimum size=0pt,draw=blue!50,very thick,fill=blue!20},]
       
   172 \node[state,initial]  (r_1)  {$r_1$};
       
   173 \node[state] (r_2) [above=of r_1] {$r_2$};
       
   174 \node[state, accepting] (r_3) [right=of r_1] {$r_3$};
       
   175 \path[->] (r_1) edge node [below]  {$b$} (r_3);
       
   176 \path[->] (r_2) edge [bend left] node [above]  {$a$} (r_3);
       
   177 \path[->] (r_1) edge [bend left] node  [left] {$\epsilon$} (r_2);
       
   178 \path[->] (r_2) edge [bend left] node  [right] {$a$} (r_1);
       
   179 \end{tikzpicture}}
       
   180 \end{tabular}
       
   181 \end{center}
       
   182 
       
   183 \noindent
       
   184 There are a number of points you should note. Every DFA is a NFA, but not vice versa.
       
   185 The $\rho$ in NFAs is a transition \emph{relation} 
       
   186 (DFAs have a transition function). The difference between a function and a relation is that a function 
       
   187 has always a single output, while a relation gives, roughly speaking, several outputs. Look
       
   188 at the NFA on the right-hand side above: if you are currently in the state $r_2$ and you read a
       
   189 character $a$, then you can transition to $r_1$ \emph{or} $r_3$. Which route you take is not
       
   190 determined. This means if we need to decide whether a string is accepted by a NFA, we might have 
       
   191 to explore all possibilities. Also there is a special transition in NFAs which is called \emph{epsilon-transition}
       
   192 or \emph{silent transition}. This transition means you do not have to ``consume'' no part of
       
   193 the input string, but ``silently'' change to a different state.
       
   194 
       
   195 The reason for introducing NFAs is that there is a relatively simple (recursive) translation of regular expressions into
       
   196 NFAs. Consider the simple regular expressions $\varnothing$, $\epsilon$ and $c$. They can be translated
       
   197 as follows:
       
   198 
       
   199 \begin{center}
       
   200 \begin{tabular}[t]{l@{\hspace{10mm}}l}
       
   201 \raisebox{1mm}{$\varnothing$} & 
       
   202 \begin{tikzpicture}[scale=0.7,>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
       
   203 \node[state, initial]  (q_0)  {$\mbox{}$};
       
   204 \end{tikzpicture}\\\\
       
   205 \raisebox{1mm}{$\epsilon$} & 
       
   206 \begin{tikzpicture}[scale=0.7,>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
       
   207 \node[state, initial, accepting]  (q_0)  {$\mbox{}$};
       
   208 \end{tikzpicture}\\\\
       
   209 \raisebox{2mm}{$c$} & 
       
   210 \begin{tikzpicture}[scale=0.7,>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
       
   211 \node[state, initial]  (q_0)  {$\mbox{}$};
       
   212 \node[state, accepting]  (q_1)  [right=of q_0] {$\mbox{}$};
       
   213 \path[->] (q_0) edge node [below]  {$c$} (q_1);
       
   214 \end{tikzpicture}\\\\
       
   215 \end{tabular}
       
   216 \end{center}
       
   217 
       
   218 \noindent
       
   219 The case for the sequence regular expression $r_1 \cdot r_2$ is as follows: We are given by recursion
       
   220 two automata representing $r_1$ and $r_2$ respectively. 
       
   221 
       
   222 \begin{center}
       
   223 \begin{tikzpicture}[node distance=3mm,
       
   224                              >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
       
   225 \node[state, initial]  (q_0)  {$\mbox{}$};
       
   226 \node (r_1)  [right=of q_0] {$\ldots$};
       
   227 \node[state, accepting]  (t_1)  [right=of r_1] {$\mbox{}$};
       
   228 \node[state, accepting]  (t_2)  [above=of t_1] {$\mbox{}$};
       
   229 \node[state, accepting]  (t_3)  [below=of t_1] {$\mbox{}$};
       
   230 \node[state, initial]  (a_0)  [right=2.5cm of t_1] {$\mbox{}$};
       
   231 \node (b_1)  [right=of a_0] {$\ldots$};
       
   232 \node[state, accepting]  (c_1)  [right=of b_1] {$\mbox{}$};
       
   233 \node[state, accepting]  (c_2)  [above=of c_1] {$\mbox{}$};
       
   234 \node[state, accepting]  (c_3)  [below=of c_1] {$\mbox{}$};
       
   235 \begin{pgfonlayer}{background}
       
   236 \node (1) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (q_0) (r_1) (t_1) (t_2) (t_3)] {};
       
   237 \node (2) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (a_0) (b_1) (c_1) (c_2) (c_3)] {};
       
   238 \node [yshift=2mm] at (1.north) {$r_1$};
       
   239 \node [yshift=2mm] at (2.north) {$r_2$};
       
   240 \end{pgfonlayer}
       
   241 \end{tikzpicture}
       
   242 \end{center}
       
   243 
       
   244 \noindent
       
   245 The first automaton has some accepting states. We obtain an automaton for $r_1\cdot r_2$ by connecting
       
   246 these accepting states with $\epsilon$-transitions to the starting state of the second automaton. By doing 
       
   247 so we make them non-accepting like so:
       
   248 
       
   249 \begin{center}
       
   250 \begin{tikzpicture}[node distance=3mm,
       
   251                              >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
       
   252 \node[state, initial]  (q_0)  {$\mbox{}$};
       
   253 \node (r_1)  [right=of q_0] {$\ldots$};
       
   254 \node[state]  (t_1)  [right=of r_1] {$\mbox{}$};
       
   255 \node[state]  (t_2)  [above=of t_1] {$\mbox{}$};
       
   256 \node[state]  (t_3)  [below=of t_1] {$\mbox{}$};
       
   257 \node[state]  (a_0)  [right=2.5cm of t_1] {$\mbox{}$};
       
   258 \node (b_1)  [right=of a_0] {$\ldots$};
       
   259 \node[state, accepting]  (c_1)  [right=of b_1] {$\mbox{}$};
       
   260 \node[state, accepting]  (c_2)  [above=of c_1] {$\mbox{}$};
       
   261 \node[state, accepting]  (c_3)  [below=of c_1] {$\mbox{}$};
       
   262 \path[->] (t_1) edge node [above, pos=0.3]  {$\epsilon$} (a_0);
       
   263 \path[->] (t_2) edge node [above]  {$\epsilon$} (a_0);
       
   264 \path[->] (t_3) edge node [below]  {$\epsilon$} (a_0);
       
   265 
       
   266 \begin{pgfonlayer}{background}
       
   267 \node (3) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (q_0) (c_1) (c_2) (c_3)] {};
       
   268 \node [yshift=2mm] at (3.north) {$r_1\cdot r_2$};
       
   269 \end{pgfonlayer}
       
   270 \end{tikzpicture}
       
   271 \end{center}
       
   272 
       
   273 \noindent
       
   274 The case for the choice regular expression $r_1 + r_2$ is slightly different: We are given by recursion
       
   275 two automata representing $r_1$ and $r_2$ respectively. 
       
   276 
       
   277 \begin{center}
       
   278 \begin{tikzpicture}[node distance=3mm,
       
   279                              >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
       
   280 \node at (0,0)  (1)  {$\mbox{}$};
       
   281 \node[state, initial]  (2)  [above right=16mm of 1] {$\mbox{}$};
       
   282 \node[state, initial]  (3)  [below right=16mm of 1] {$\mbox{}$};
       
   283 
       
   284 \node (a)  [right=of 2] {$\ldots$};
       
   285 \node[state, accepting]  (a1)  [right=of a] {$\mbox{}$};
       
   286 \node[state, accepting]  (a2)  [above=of a1] {$\mbox{}$};
       
   287 \node[state, accepting]  (a3)  [below=of a1] {$\mbox{}$};
       
   288 
       
   289 \node (b)  [right=of 3] {$\ldots$};
       
   290 \node[state, accepting]  (b1)  [right=of b] {$\mbox{}$};
       
   291 \node[state, accepting]  (b2)  [above=of b1] {$\mbox{}$};
       
   292 \node[state, accepting]  (b3)  [below=of b1] {$\mbox{}$};
       
   293 \begin{pgfonlayer}{background}
       
   294 \node (1) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (2) (a1) (a2) (a3)] {};
       
   295 \node (2) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (3) (b1) (b2) (b3)] {};
       
   296 \node [yshift=3mm] at (1.north) {$r_1$};
       
   297 \node [yshift=3mm] at (2.north) {$r_2$};
       
   298 \end{pgfonlayer}
       
   299 \end{tikzpicture}
       
   300 \end{center}
       
   301 
       
   302 \noindent
       
   303 Each automaton has a single start state and potentially several accepting states. We obtain a
       
   304 NFA for the regular expression $r_1 + r_2$ by introducing a new starting state and connecting it
       
   305 with an $\epsilon$-transition to the two starting states above, like so
       
   306 
       
   307 \begin{center}
       
   308 \hspace{2cm}\begin{tikzpicture}[node distance=3mm,
       
   309                              >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
       
   310 \node at (0,0) [state, initial]  (1)  {$\mbox{}$};
       
   311 \node[state]  (2)  [above right=16mm of 1] {$\mbox{}$};
       
   312 \node[state]  (3)  [below right=16mm of 1] {$\mbox{}$};
       
   313 
       
   314 \node (a)  [right=of 2] {$\ldots$};
       
   315 \node[state, accepting]  (a1)  [right=of a] {$\mbox{}$};
       
   316 \node[state, accepting]  (a2)  [above=of a1] {$\mbox{}$};
       
   317 \node[state, accepting]  (a3)  [below=of a1] {$\mbox{}$};
       
   318 
       
   319 \node (b)  [right=of 3] {$\ldots$};
       
   320 \node[state, accepting]  (b1)  [right=of b] {$\mbox{}$};
       
   321 \node[state, accepting]  (b2)  [above=of b1] {$\mbox{}$};
       
   322 \node[state, accepting]  (b3)  [below=of b1] {$\mbox{}$};
       
   323 
       
   324 \path[->] (1) edge node [above]  {$\epsilon$} (2);
       
   325 \path[->] (1) edge node [below]  {$\epsilon$} (3);
       
   326 
       
   327 \begin{pgfonlayer}{background}
       
   328 \node (3) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (1) (a2) (a3) (b2) (b3)] {};
       
   329 \node [yshift=3mm] at (3.north) {$r_1+ r_2$};
       
   330 \end{pgfonlayer}
       
   331 \end{tikzpicture}
       
   332 \end{center}
       
   333 
       
   334 \noindent 
       
   335 Finally for the $*$-case we have an automaton for $r$
       
   336 
       
   337 \begin{center}
       
   338 \begin{tikzpicture}[node distance=3mm,
       
   339                              >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
       
   340 \node at (0,0)  (1)  {$\mbox{}$};
       
   341 \node[state, initial]  (2)  [right=16mm of 1] {$\mbox{}$};
       
   342 \node (a)  [right=of 2] {$\ldots$};
       
   343 \node[state, accepting]  (a1)  [right=of a] {$\mbox{}$};
       
   344 \node[state, accepting]  (a2)  [above=of a1] {$\mbox{}$};
       
   345 \node[state, accepting]  (a3)  [below=of a1] {$\mbox{}$};
       
   346 \begin{pgfonlayer}{background}
       
   347 \node (1) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (2) (a1) (a2) (a3)] {};
       
   348 \node [yshift=3mm] at (1.north) {$r$};
       
   349 \end{pgfonlayer}
       
   350 \end{tikzpicture}
       
   351 \end{center}
       
   352 
       
   353 \noindent
       
   354 and connect its accepting states to a new starting state via $\epsilon$-transitions. This new
       
   355 starting state is also an accepting state, because $r^*$ can also recognise the empty string.
       
   356 This gives the following automaton for $r^*$:
       
   357 
       
   358 \begin{center}
       
   359 \begin{tikzpicture}[node distance=3mm,
       
   360                              >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
       
   361 \node at (0,0) [state, initial,accepting]  (1)  {$\mbox{}$};
       
   362 \node[state]  (2)  [right=16mm of 1] {$\mbox{}$};
       
   363 \node (a)  [right=of 2] {$\ldots$};
       
   364 \node[state]  (a1)  [right=of a] {$\mbox{}$};
       
   365 \node[state]  (a2)  [above=of a1] {$\mbox{}$};
       
   366 \node[state]  (a3)  [below=of a1] {$\mbox{}$};
       
   367 \path[->] (1) edge node [above]  {$\epsilon$} (2);
       
   368 \path[->] (a1) edge [bend left=45] node [above]  {$\epsilon$} (1);
       
   369 \path[->] (a2) edge [bend right] node [below]  {$\epsilon$} (1);
       
   370 \path[->] (a3) edge [bend left=45] node [below]  {$\epsilon$} (1);
       
   371 \begin{pgfonlayer}{background}
       
   372 \node (2) [rounded corners, inner sep=1mm, thick, draw=black!60, fill=black!20, fit= (1) (a2) (a3)] {};
       
   373 \node [yshift=3mm] at (2.north) {$r^*$};
       
   374 \end{pgfonlayer}
       
   375 \end{tikzpicture}
       
   376 \end{center}
       
   377 
       
   378 
       
   379 
    97 \end{document}
   380 \end{document}
    98 
   381 
    99 %%% Local Variables: 
   382 %%% Local Variables: 
   100 %%% mode: latex
   383 %%% mode: latex
   101 %%% TeX-master: t
   384 %%% TeX-master: t