handouts/ho03.tex
changeset 251 5b5a68df6d16
parent 217 cd6066f1056a
child 268 18bef085a7ca
equal deleted inserted replaced
250:b79e704acb72 251:5b5a68df6d16
     1 \documentclass{article}
     1 \documentclass{article}
     2 \usepackage{hyperref}
     2 \usepackage{../style}
     3 \usepackage{amssymb}
     3 \usepackage{../langs}
     4 \usepackage{amsmath}
     4 
     5 \usepackage[T1]{fontenc}
       
     6 \usepackage{listings}
       
     7 \usepackage{xcolor}
     5 \usepackage{xcolor}
     8 \usepackage{tikz}
     6 \usepackage{tikz}
     9 \usetikzlibrary{arrows}
     7 \usetikzlibrary{arrows}
    10 \usetikzlibrary{automata}
     8 \usetikzlibrary{automata}
    11 \usetikzlibrary{shapes}
     9 \usetikzlibrary{shapes}
    12 \usetikzlibrary{shadows}
    10 \usetikzlibrary{shadows}
    13 \usetikzlibrary{positioning}
    11 \usetikzlibrary{positioning}
    14 \usetikzlibrary{calc}
    12 \usetikzlibrary{calc}
    15 \usetikzlibrary{fit}
    13 \usetikzlibrary{fit}
    16 \usetikzlibrary{backgrounds}
    14 \usetikzlibrary{backgrounds}
    17 \usepackage{../langs}
       
    18 
       
    19 \newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}%
       
    20 
    15 
    21 
    16 
    22 \begin{document}
    17 \begin{document}
    23 
    18 
    24 \section*{Handout 3}
    19 \section*{Handout 3}
    25 
    20 
    26 Let us have a closer look at automata and their relation to regular expressions. This will help us to understand
    21 Let us have a closer look at automata and their relation to
    27 why the regular expression matchers in Python and Ruby are so slow with certain regular expressions. 
    22 regular expressions. This will help us to understand why the
    28 
    23 regular expression matchers in Python and Ruby are so slow
    29 A \emph{deterministic finite automaton} (DFA), say $A$, is defined by  a four-tuple written $A(Q, q_0, F, \delta)$ where
    24 with certain regular expressions. 
       
    25 
       
    26 A \emph{deterministic finite automaton} (DFA), say $A$, is
       
    27 defined by a four-tuple written $A(Q, q_0, F, \delta)$ where
    30 
    28 
    31 \begin{itemize}
    29 \begin{itemize}
    32 \item $Q$ is a set of states,
    30 \item $Q$ is a set of states,
    33 \item $q_0 \in Q$ is the start state,
    31 \item $q_0 \in Q$ is the start state,
    34 \item $F \subseteq Q$ are the accepting states, and
    32 \item $F \subseteq Q$ are the accepting states, and
    35 \item $\delta$ is the transition function.
    33 \item $\delta$ is the transition function.
    36 \end{itemize}
    34 \end{itemize}
    37 
    35 
    38 \noindent
    36 \noindent The transition function determines how to
    39 The transition function determines how to ``transition'' from one state to the next state with respect to a character.
    37 ``transition'' from one state to the next state with respect
    40 We have the assumption that these functions do not need to be defined everywhere: so it can be the case that
    38 to a character. We have the assumption that these functions do
    41 given a character there is no next state, in which case we need to raise a kind of ``raise an exception''.  A typical 
    39 not need to be defined everywhere: so it can be the case that
       
    40 given a character there is no next state, in which case we
       
    41 need to raise a kind of ``raise an exception''. A typical
    42 example of a DFA is
    42 example of a DFA is
    43 
    43 
    44 \begin{center}
    44 \begin{center}
    45 \begin{tikzpicture}[>=stealth',very thick,auto,
    45 \begin{tikzpicture}[>=stealth',very thick,auto,
    46                              every state/.style={minimum size=0pt,inner sep=2pt,draw=blue!50,very thick,fill=blue!20},]
    46                              every state/.style={minimum size=0pt,inner sep=2pt,draw=blue!50,very thick,fill=blue!20},]
    59 \path[->] (q_2) edge [loop left] node  {$b$} ();
    59 \path[->] (q_2) edge [loop left] node  {$b$} ();
    60 \path[->] (q_3) edge [bend left=95, looseness=1.3] node [below]  {$b$} (q_0);
    60 \path[->] (q_3) edge [bend left=95, looseness=1.3] node [below]  {$b$} (q_0);
    61 \end{tikzpicture}
    61 \end{tikzpicture}
    62 \end{center}
    62 \end{center}
    63 
    63 
    64 \noindent
    64 \noindent The accepting state $q_4$ is indicated with double
    65 The accepting state $q_4$ is indicated with double circles. It is possible that a DFA has no
    65 circles. It is possible that a DFA has no accepting states at
    66 accepting states at all, or that the starting state is also an accepting state.
    66 all, or that the starting state is also an accepting state. In
    67 In the case above the transition function is defined everywhere and can be given as a table
    67 the case above the transition function is defined everywhere
    68 as follows:
    68 and can be given as a table as follows:
    69 
    69 
    70 \[
    70 \[
    71 \begin{array}{lcl}
    71 \begin{array}{lcl}
    72 (q_0, a) &\rightarrow& q_1\\
    72 (q_0, a) &\rightarrow& q_1\\
    73 (q_0, b) &\rightarrow& q_2\\
    73 (q_0, b) &\rightarrow& q_2\\
    80 (q_4, a) &\rightarrow& q_4\\
    80 (q_4, a) &\rightarrow& q_4\\
    81 (q_4, b) &\rightarrow& q_4\\
    81 (q_4, b) &\rightarrow& q_4\\
    82 \end{array}
    82 \end{array}
    83 \]
    83 \]
    84 
    84 
    85 \noindent
    85 \noindent We need to define the notion of what language is
    86 We need to define the notion of what language is accepted by an automaton. For this we 
    86 accepted by an automaton. For this we lift the transition
    87 lift the transition function $\delta$ from characters to strings as follows:
    87 function $\delta$ from characters to strings as follows:
    88 
    88 
    89 \[
    89 \[
    90 \begin{array}{lcl}
    90 \begin{array}{lcl}
    91 \hat{\delta}(q, "")        & \dn & q\\
    91 \hat{\delta}(q, "")        & \dn & q\\
    92 \hat{\delta}(q, c\!::\!s) & \dn & \hat{\delta}(\delta(q, c), s)\\
    92 \hat{\delta}(q, c\!::\!s) & \dn & \hat{\delta}(\delta(q, c), s)\\
    93 \end{array}
    93 \end{array}
    94 \]
    94 \]
    95 
    95 
    96 \noindent
    96 \noindent Given a string this means we start in the starting
    97 Given a string this means we start in the starting state and take the first character of the string,
    97 state and take the first character of the string, follow to
    98 follow to the next sate, then take the second character and so on. Once the string is exhausted
    98 the next sate, then take the second character and so on. Once
    99 and we end up in an accepting state, then this string is accepted. Otherwise it is not accepted. 
    99 the string is exhausted and we end up in an accepting state,
   100 So $s$ in the \emph{language accepted by the automaton} $A(Q, q_0, F, \delta)$ iff
   100 then this string is accepted. Otherwise it is not accepted. So
       
   101 $s$ in the \emph{language accepted by the automaton} $A(Q,
       
   102 q_0, F, \delta)$ iff
   101 
   103 
   102 \[
   104 \[
   103 \hat{\delta}(q_0, s) \in F 
   105 \hat{\delta}(q_0, s) \in F 
   104 \]
   106 \]
   105   
   107   
   106 
   108 
   107 While with DFA it will always clear that given a character what the next state is, it will be useful to relax 
   109 While with DFA it will always clear that given a character
   108 this restriction. The resulting construction is called a \emph{non-deterministic finite automaton} (NFA) given
   110 what the next state is, it will be useful to relax this
   109 as a four-tuple $A(Q, q_0, F, \rho)$ where
   111 restriction. The resulting construction is called a
       
   112 \emph{non-deterministic finite automaton} (NFA) given as a
       
   113 four-tuple $A(Q, q_0, F, \rho)$ where
   110 
   114 
   111 \begin{itemize}
   115 \begin{itemize}
   112 \item $Q$ is a finite set of states
   116 \item $Q$ is a finite set of states
   113 \item $q_0$ is a start state
   117 \item $q_0$ is a start state
   114 \item $F$ are some accepting states with $F \subseteq Q$, and
   118 \item $F$ are some accepting states with $F \subseteq Q$, and
   143 \path[->] (r_2) edge [bend left] node  [right] {$a$} (r_1);
   147 \path[->] (r_2) edge [bend left] node  [right] {$a$} (r_1);
   144 \end{tikzpicture}}
   148 \end{tikzpicture}}
   145 \end{tabular}
   149 \end{tabular}
   146 \end{center}
   150 \end{center}
   147 
   151 
   148 \noindent
   152 \noindent There are a number of points you should note. Every
   149 There are a number of points you should note. Every DFA is a NFA, but not vice versa.
   153 DFA is a NFA, but not vice versa. The $\rho$ in NFAs is a
   150 The $\rho$ in NFAs is a transition \emph{relation} 
   154 transition \emph{relation} (DFAs have a transition function).
   151 (DFAs have a transition function). The difference between a function and a relation is that a function 
   155 The difference between a function and a relation is that a
   152 has always a single output, while a relation gives, roughly speaking, several outputs. Look
   156 function has always a single output, while a relation gives,
   153 at the NFA on the right-hand side above: if you are currently in the state $r_2$ and you read a
   157 roughly speaking, several outputs. Look at the NFA on the
   154 character $a$, then you can transition to $r_1$ \emph{or} $r_3$. Which route you take is not
   158 right-hand side above: if you are currently in the state $r_2$
   155 determined. This means if we need to decide whether a string is accepted by a NFA, we might have 
   159 and you read a character $a$, then you can transition to $r_1$
   156 to explore all possibilities. Also there is a special transition in NFAs which is called \emph{epsilon-transition}
   160 \emph{or} $r_3$. Which route you take is not determined. This
   157 or \emph{silent transition}. This transition means you do not have to ``consume'' no part of
   161 means if we need to decide whether a string is accepted by a
   158 the input string, but ``silently'' change to a different state.
   162 NFA, we might have to explore all possibilities. Also there is
   159 
   163 a special transition in NFAs which is called
   160 The reason for introducing NFAs is that there is a relatively simple (recursive) translation of regular expressions into
   164 \emph{epsilon-transition} or \emph{silent transition}. This
   161 NFAs. Consider the simple regular expressions $\varnothing$, $\epsilon$ and $c$. They can be translated
   165 transition means you do not have to ``consume'' no part of the
   162 as follows:
   166 input string, but ``silently'' change to a different state.
       
   167 
       
   168 The reason for introducing NFAs is that there is a relatively
       
   169 simple (recursive) translation of regular expressions into
       
   170 NFAs. Consider the simple regular expressions $\varnothing$,
       
   171 $\epsilon$ and $c$. They can be translated as follows:
   163 
   172 
   164 \begin{center}
   173 \begin{center}
   165 \begin{tabular}[t]{l@{\hspace{10mm}}l}
   174 \begin{tabular}[t]{l@{\hspace{10mm}}l}
   166 \raisebox{1mm}{$\varnothing$} & 
   175 \raisebox{1mm}{$\varnothing$} & 
   167 \begin{tikzpicture}[scale=0.7,>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
   176 \begin{tikzpicture}[scale=0.7,>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
   178 \path[->] (q_0) edge node [below]  {$c$} (q_1);
   187 \path[->] (q_0) edge node [below]  {$c$} (q_1);
   179 \end{tikzpicture}\\\\
   188 \end{tikzpicture}\\\\
   180 \end{tabular}
   189 \end{tabular}
   181 \end{center}
   190 \end{center}
   182 
   191 
   183 \noindent
   192 \noindent The case for the sequence regular expression $r_1
   184 The case for the sequence regular expression $r_1 \cdot r_2$ is as follows: We are given by recursion
   193 \cdot r_2$ is as follows: We are given by recursion two
   185 two automata representing $r_1$ and $r_2$ respectively. 
   194 automata representing $r_1$ and $r_2$ respectively. 
   186 
   195 
   187 \begin{center}
   196 \begin{center}
   188 \begin{tikzpicture}[node distance=3mm,
   197 \begin{tikzpicture}[node distance=3mm,
   189                              >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
   198                              >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
   190 \node[state, initial]  (q_0)  {$\mbox{}$};
   199 \node[state, initial]  (q_0)  {$\mbox{}$};
   204 \node [yshift=2mm] at (2.north) {$r_2$};
   213 \node [yshift=2mm] at (2.north) {$r_2$};
   205 \end{pgfonlayer}
   214 \end{pgfonlayer}
   206 \end{tikzpicture}
   215 \end{tikzpicture}
   207 \end{center}
   216 \end{center}
   208 
   217 
   209 \noindent
   218 \noindent The first automaton has some accepting states. We
   210 The first automaton has some accepting states. We obtain an automaton for $r_1\cdot r_2$ by connecting
   219 obtain an automaton for $r_1\cdot r_2$ by connecting these
   211 these accepting states with $\epsilon$-transitions to the starting state of the second automaton. By doing 
   220 accepting states with $\epsilon$-transitions to the starting
   212 so we make them non-accepting like so:
   221 state of the second automaton. By doing so we make them
       
   222 non-accepting like so:
   213 
   223 
   214 \begin{center}
   224 \begin{center}
   215 \begin{tikzpicture}[node distance=3mm,
   225 \begin{tikzpicture}[node distance=3mm,
   216                              >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
   226                              >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
   217 \node[state, initial]  (q_0)  {$\mbox{}$};
   227 \node[state, initial]  (q_0)  {$\mbox{}$};
   233 \node [yshift=2mm] at (3.north) {$r_1\cdot r_2$};
   243 \node [yshift=2mm] at (3.north) {$r_1\cdot r_2$};
   234 \end{pgfonlayer}
   244 \end{pgfonlayer}
   235 \end{tikzpicture}
   245 \end{tikzpicture}
   236 \end{center}
   246 \end{center}
   237 
   247 
   238 \noindent
   248 \noindent The case for the choice regular expression $r_1 +
   239 The case for the choice regular expression $r_1 + r_2$ is slightly different: We are given by recursion
   249 r_2$ is slightly different: We are given by recursion two
   240 two automata representing $r_1$ and $r_2$ respectively. 
   250 automata representing $r_1$ and $r_2$ respectively. 
   241 
   251 
   242 \begin{center}
   252 \begin{center}
   243 \begin{tikzpicture}[node distance=3mm,
   253 \begin{tikzpicture}[node distance=3mm,
   244                              >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
   254                              >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
   245 \node at (0,0)  (1)  {$\mbox{}$};
   255 \node at (0,0)  (1)  {$\mbox{}$};
   262 \node [yshift=3mm] at (2.north) {$r_2$};
   272 \node [yshift=3mm] at (2.north) {$r_2$};
   263 \end{pgfonlayer}
   273 \end{pgfonlayer}
   264 \end{tikzpicture}
   274 \end{tikzpicture}
   265 \end{center}
   275 \end{center}
   266 
   276 
   267 \noindent
   277 \noindent Each automaton has a single start state and
   268 Each automaton has a single start state and potentially several accepting states. We obtain a
   278 potentially several accepting states. We obtain a NFA for the
   269 NFA for the regular expression $r_1 + r_2$ by introducing a new starting state and connecting it
   279 regular expression $r_1 + r_2$ by introducing a new starting
   270 with an $\epsilon$-transition to the two starting states above, like so
   280 state and connecting it with an $\epsilon$-transition to the
       
   281 two starting states above, like so
   271 
   282 
   272 \begin{center}
   283 \begin{center}
   273 \hspace{2cm}\begin{tikzpicture}[node distance=3mm,
   284 \hspace{2cm}\begin{tikzpicture}[node distance=3mm,
   274                              >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
   285                              >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
   275 \node at (0,0) [state, initial]  (1)  {$\mbox{}$};
   286 \node at (0,0) [state, initial]  (1)  {$\mbox{}$};
   313 \node [yshift=3mm] at (1.north) {$r$};
   324 \node [yshift=3mm] at (1.north) {$r$};
   314 \end{pgfonlayer}
   325 \end{pgfonlayer}
   315 \end{tikzpicture}
   326 \end{tikzpicture}
   316 \end{center}
   327 \end{center}
   317 
   328 
   318 \noindent
   329 \noindent and connect its accepting states to a new starting
   319 and connect its accepting states to a new starting state via $\epsilon$-transitions. This new
   330 state via $\epsilon$-transitions. This new starting state is
   320 starting state is also an accepting state, because $r^*$ can also recognise the empty string.
   331 also an accepting state, because $r^*$ can also recognise the
   321 This gives the following automaton for $r^*$:
   332 empty string. This gives the following automaton for $r^*$:
   322 
   333 
   323 \begin{center}
   334 \begin{center}
   324 \begin{tikzpicture}[node distance=3mm,
   335 \begin{tikzpicture}[node distance=3mm,
   325                              >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
   336                              >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},]
   326 \node at (0,0) [state, initial,accepting]  (1)  {$\mbox{}$};
   337 \node at (0,0) [state, initial,accepting]  (1)  {$\mbox{}$};
   338 \node [yshift=3mm] at (2.north) {$r^*$};
   349 \node [yshift=3mm] at (2.north) {$r^*$};
   339 \end{pgfonlayer}
   350 \end{pgfonlayer}
   340 \end{tikzpicture}
   351 \end{tikzpicture}
   341 \end{center}
   352 \end{center}
   342 
   353 
   343 \noindent
   354 \noindent This construction of a NFA from a regular expression
   344 This construction of a NFA from a regular expression was invented by Ken Thompson in 1968.
   355 was invented by Ken Thompson in 1968.
   345 
   356 
   346 \end{document}
   357 \end{document}
   347 
   358 
   348 %%% Local Variables: 
   359 %%% Local Variables: 
   349 %%% mode: latex
   360 %%% mode: latex