1 \documentclass{article} |
1 \documentclass{article} |
2 \usepackage{hyperref} |
2 \usepackage{../style} |
3 \usepackage{amssymb} |
3 \usepackage{../langs} |
4 \usepackage{amsmath} |
4 |
5 \usepackage[T1]{fontenc} |
|
6 \usepackage{listings} |
|
7 \usepackage{xcolor} |
5 \usepackage{xcolor} |
8 \usepackage{tikz} |
6 \usepackage{tikz} |
9 \usetikzlibrary{arrows} |
7 \usetikzlibrary{arrows} |
10 \usetikzlibrary{automata} |
8 \usetikzlibrary{automata} |
11 \usetikzlibrary{shapes} |
9 \usetikzlibrary{shapes} |
12 \usetikzlibrary{shadows} |
10 \usetikzlibrary{shadows} |
13 \usetikzlibrary{positioning} |
11 \usetikzlibrary{positioning} |
14 \usetikzlibrary{calc} |
12 \usetikzlibrary{calc} |
15 \usetikzlibrary{fit} |
13 \usetikzlibrary{fit} |
16 \usetikzlibrary{backgrounds} |
14 \usetikzlibrary{backgrounds} |
17 \usepackage{../langs} |
|
18 |
|
19 \newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}% |
|
20 |
15 |
21 |
16 |
22 \begin{document} |
17 \begin{document} |
23 |
18 |
24 \section*{Handout 3} |
19 \section*{Handout 3} |
25 |
20 |
26 Let us have a closer look at automata and their relation to regular expressions. This will help us to understand |
21 Let us have a closer look at automata and their relation to |
27 why the regular expression matchers in Python and Ruby are so slow with certain regular expressions. |
22 regular expressions. This will help us to understand why the |
28 |
23 regular expression matchers in Python and Ruby are so slow |
29 A \emph{deterministic finite automaton} (DFA), say $A$, is defined by a four-tuple written $A(Q, q_0, F, \delta)$ where |
24 with certain regular expressions. |
|
25 |
|
26 A \emph{deterministic finite automaton} (DFA), say $A$, is |
|
27 defined by a four-tuple written $A(Q, q_0, F, \delta)$ where |
30 |
28 |
31 \begin{itemize} |
29 \begin{itemize} |
32 \item $Q$ is a set of states, |
30 \item $Q$ is a set of states, |
33 \item $q_0 \in Q$ is the start state, |
31 \item $q_0 \in Q$ is the start state, |
34 \item $F \subseteq Q$ are the accepting states, and |
32 \item $F \subseteq Q$ are the accepting states, and |
35 \item $\delta$ is the transition function. |
33 \item $\delta$ is the transition function. |
36 \end{itemize} |
34 \end{itemize} |
37 |
35 |
38 \noindent |
36 \noindent The transition function determines how to |
39 The transition function determines how to ``transition'' from one state to the next state with respect to a character. |
37 ``transition'' from one state to the next state with respect |
40 We have the assumption that these functions do not need to be defined everywhere: so it can be the case that |
38 to a character. We have the assumption that these functions do |
41 given a character there is no next state, in which case we need to raise a kind of ``raise an exception''. A typical |
39 not need to be defined everywhere: so it can be the case that |
|
40 given a character there is no next state, in which case we |
|
41 need to raise a kind of ``raise an exception''. A typical |
42 example of a DFA is |
42 example of a DFA is |
43 |
43 |
44 \begin{center} |
44 \begin{center} |
45 \begin{tikzpicture}[>=stealth',very thick,auto, |
45 \begin{tikzpicture}[>=stealth',very thick,auto, |
46 every state/.style={minimum size=0pt,inner sep=2pt,draw=blue!50,very thick,fill=blue!20},] |
46 every state/.style={minimum size=0pt,inner sep=2pt,draw=blue!50,very thick,fill=blue!20},] |
59 \path[->] (q_2) edge [loop left] node {$b$} (); |
59 \path[->] (q_2) edge [loop left] node {$b$} (); |
60 \path[->] (q_3) edge [bend left=95, looseness=1.3] node [below] {$b$} (q_0); |
60 \path[->] (q_3) edge [bend left=95, looseness=1.3] node [below] {$b$} (q_0); |
61 \end{tikzpicture} |
61 \end{tikzpicture} |
62 \end{center} |
62 \end{center} |
63 |
63 |
64 \noindent |
64 \noindent The accepting state $q_4$ is indicated with double |
65 The accepting state $q_4$ is indicated with double circles. It is possible that a DFA has no |
65 circles. It is possible that a DFA has no accepting states at |
66 accepting states at all, or that the starting state is also an accepting state. |
66 all, or that the starting state is also an accepting state. In |
67 In the case above the transition function is defined everywhere and can be given as a table |
67 the case above the transition function is defined everywhere |
68 as follows: |
68 and can be given as a table as follows: |
69 |
69 |
70 \[ |
70 \[ |
71 \begin{array}{lcl} |
71 \begin{array}{lcl} |
72 (q_0, a) &\rightarrow& q_1\\ |
72 (q_0, a) &\rightarrow& q_1\\ |
73 (q_0, b) &\rightarrow& q_2\\ |
73 (q_0, b) &\rightarrow& q_2\\ |
80 (q_4, a) &\rightarrow& q_4\\ |
80 (q_4, a) &\rightarrow& q_4\\ |
81 (q_4, b) &\rightarrow& q_4\\ |
81 (q_4, b) &\rightarrow& q_4\\ |
82 \end{array} |
82 \end{array} |
83 \] |
83 \] |
84 |
84 |
85 \noindent |
85 \noindent We need to define the notion of what language is |
86 We need to define the notion of what language is accepted by an automaton. For this we |
86 accepted by an automaton. For this we lift the transition |
87 lift the transition function $\delta$ from characters to strings as follows: |
87 function $\delta$ from characters to strings as follows: |
88 |
88 |
89 \[ |
89 \[ |
90 \begin{array}{lcl} |
90 \begin{array}{lcl} |
91 \hat{\delta}(q, "") & \dn & q\\ |
91 \hat{\delta}(q, "") & \dn & q\\ |
92 \hat{\delta}(q, c\!::\!s) & \dn & \hat{\delta}(\delta(q, c), s)\\ |
92 \hat{\delta}(q, c\!::\!s) & \dn & \hat{\delta}(\delta(q, c), s)\\ |
93 \end{array} |
93 \end{array} |
94 \] |
94 \] |
95 |
95 |
96 \noindent |
96 \noindent Given a string this means we start in the starting |
97 Given a string this means we start in the starting state and take the first character of the string, |
97 state and take the first character of the string, follow to |
98 follow to the next sate, then take the second character and so on. Once the string is exhausted |
98 the next sate, then take the second character and so on. Once |
99 and we end up in an accepting state, then this string is accepted. Otherwise it is not accepted. |
99 the string is exhausted and we end up in an accepting state, |
100 So $s$ in the \emph{language accepted by the automaton} $A(Q, q_0, F, \delta)$ iff |
100 then this string is accepted. Otherwise it is not accepted. So |
|
101 $s$ in the \emph{language accepted by the automaton} $A(Q, |
|
102 q_0, F, \delta)$ iff |
101 |
103 |
102 \[ |
104 \[ |
103 \hat{\delta}(q_0, s) \in F |
105 \hat{\delta}(q_0, s) \in F |
104 \] |
106 \] |
105 |
107 |
106 |
108 |
107 While with DFA it will always clear that given a character what the next state is, it will be useful to relax |
109 While with DFA it will always clear that given a character |
108 this restriction. The resulting construction is called a \emph{non-deterministic finite automaton} (NFA) given |
110 what the next state is, it will be useful to relax this |
109 as a four-tuple $A(Q, q_0, F, \rho)$ where |
111 restriction. The resulting construction is called a |
|
112 \emph{non-deterministic finite automaton} (NFA) given as a |
|
113 four-tuple $A(Q, q_0, F, \rho)$ where |
110 |
114 |
111 \begin{itemize} |
115 \begin{itemize} |
112 \item $Q$ is a finite set of states |
116 \item $Q$ is a finite set of states |
113 \item $q_0$ is a start state |
117 \item $q_0$ is a start state |
114 \item $F$ are some accepting states with $F \subseteq Q$, and |
118 \item $F$ are some accepting states with $F \subseteq Q$, and |
143 \path[->] (r_2) edge [bend left] node [right] {$a$} (r_1); |
147 \path[->] (r_2) edge [bend left] node [right] {$a$} (r_1); |
144 \end{tikzpicture}} |
148 \end{tikzpicture}} |
145 \end{tabular} |
149 \end{tabular} |
146 \end{center} |
150 \end{center} |
147 |
151 |
148 \noindent |
152 \noindent There are a number of points you should note. Every |
149 There are a number of points you should note. Every DFA is a NFA, but not vice versa. |
153 DFA is a NFA, but not vice versa. The $\rho$ in NFAs is a |
150 The $\rho$ in NFAs is a transition \emph{relation} |
154 transition \emph{relation} (DFAs have a transition function). |
151 (DFAs have a transition function). The difference between a function and a relation is that a function |
155 The difference between a function and a relation is that a |
152 has always a single output, while a relation gives, roughly speaking, several outputs. Look |
156 function has always a single output, while a relation gives, |
153 at the NFA on the right-hand side above: if you are currently in the state $r_2$ and you read a |
157 roughly speaking, several outputs. Look at the NFA on the |
154 character $a$, then you can transition to $r_1$ \emph{or} $r_3$. Which route you take is not |
158 right-hand side above: if you are currently in the state $r_2$ |
155 determined. This means if we need to decide whether a string is accepted by a NFA, we might have |
159 and you read a character $a$, then you can transition to $r_1$ |
156 to explore all possibilities. Also there is a special transition in NFAs which is called \emph{epsilon-transition} |
160 \emph{or} $r_3$. Which route you take is not determined. This |
157 or \emph{silent transition}. This transition means you do not have to ``consume'' no part of |
161 means if we need to decide whether a string is accepted by a |
158 the input string, but ``silently'' change to a different state. |
162 NFA, we might have to explore all possibilities. Also there is |
159 |
163 a special transition in NFAs which is called |
160 The reason for introducing NFAs is that there is a relatively simple (recursive) translation of regular expressions into |
164 \emph{epsilon-transition} or \emph{silent transition}. This |
161 NFAs. Consider the simple regular expressions $\varnothing$, $\epsilon$ and $c$. They can be translated |
165 transition means you do not have to ``consume'' no part of the |
162 as follows: |
166 input string, but ``silently'' change to a different state. |
|
167 |
|
168 The reason for introducing NFAs is that there is a relatively |
|
169 simple (recursive) translation of regular expressions into |
|
170 NFAs. Consider the simple regular expressions $\varnothing$, |
|
171 $\epsilon$ and $c$. They can be translated as follows: |
163 |
172 |
164 \begin{center} |
173 \begin{center} |
165 \begin{tabular}[t]{l@{\hspace{10mm}}l} |
174 \begin{tabular}[t]{l@{\hspace{10mm}}l} |
166 \raisebox{1mm}{$\varnothing$} & |
175 \raisebox{1mm}{$\varnothing$} & |
167 \begin{tikzpicture}[scale=0.7,>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},] |
176 \begin{tikzpicture}[scale=0.7,>=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},] |
178 \path[->] (q_0) edge node [below] {$c$} (q_1); |
187 \path[->] (q_0) edge node [below] {$c$} (q_1); |
179 \end{tikzpicture}\\\\ |
188 \end{tikzpicture}\\\\ |
180 \end{tabular} |
189 \end{tabular} |
181 \end{center} |
190 \end{center} |
182 |
191 |
183 \noindent |
192 \noindent The case for the sequence regular expression $r_1 |
184 The case for the sequence regular expression $r_1 \cdot r_2$ is as follows: We are given by recursion |
193 \cdot r_2$ is as follows: We are given by recursion two |
185 two automata representing $r_1$ and $r_2$ respectively. |
194 automata representing $r_1$ and $r_2$ respectively. |
186 |
195 |
187 \begin{center} |
196 \begin{center} |
188 \begin{tikzpicture}[node distance=3mm, |
197 \begin{tikzpicture}[node distance=3mm, |
189 >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},] |
198 >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},] |
190 \node[state, initial] (q_0) {$\mbox{}$}; |
199 \node[state, initial] (q_0) {$\mbox{}$}; |
204 \node [yshift=2mm] at (2.north) {$r_2$}; |
213 \node [yshift=2mm] at (2.north) {$r_2$}; |
205 \end{pgfonlayer} |
214 \end{pgfonlayer} |
206 \end{tikzpicture} |
215 \end{tikzpicture} |
207 \end{center} |
216 \end{center} |
208 |
217 |
209 \noindent |
218 \noindent The first automaton has some accepting states. We |
210 The first automaton has some accepting states. We obtain an automaton for $r_1\cdot r_2$ by connecting |
219 obtain an automaton for $r_1\cdot r_2$ by connecting these |
211 these accepting states with $\epsilon$-transitions to the starting state of the second automaton. By doing |
220 accepting states with $\epsilon$-transitions to the starting |
212 so we make them non-accepting like so: |
221 state of the second automaton. By doing so we make them |
|
222 non-accepting like so: |
213 |
223 |
214 \begin{center} |
224 \begin{center} |
215 \begin{tikzpicture}[node distance=3mm, |
225 \begin{tikzpicture}[node distance=3mm, |
216 >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},] |
226 >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},] |
217 \node[state, initial] (q_0) {$\mbox{}$}; |
227 \node[state, initial] (q_0) {$\mbox{}$}; |
233 \node [yshift=2mm] at (3.north) {$r_1\cdot r_2$}; |
243 \node [yshift=2mm] at (3.north) {$r_1\cdot r_2$}; |
234 \end{pgfonlayer} |
244 \end{pgfonlayer} |
235 \end{tikzpicture} |
245 \end{tikzpicture} |
236 \end{center} |
246 \end{center} |
237 |
247 |
238 \noindent |
248 \noindent The case for the choice regular expression $r_1 + |
239 The case for the choice regular expression $r_1 + r_2$ is slightly different: We are given by recursion |
249 r_2$ is slightly different: We are given by recursion two |
240 two automata representing $r_1$ and $r_2$ respectively. |
250 automata representing $r_1$ and $r_2$ respectively. |
241 |
251 |
242 \begin{center} |
252 \begin{center} |
243 \begin{tikzpicture}[node distance=3mm, |
253 \begin{tikzpicture}[node distance=3mm, |
244 >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},] |
254 >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},] |
245 \node at (0,0) (1) {$\mbox{}$}; |
255 \node at (0,0) (1) {$\mbox{}$}; |
262 \node [yshift=3mm] at (2.north) {$r_2$}; |
272 \node [yshift=3mm] at (2.north) {$r_2$}; |
263 \end{pgfonlayer} |
273 \end{pgfonlayer} |
264 \end{tikzpicture} |
274 \end{tikzpicture} |
265 \end{center} |
275 \end{center} |
266 |
276 |
267 \noindent |
277 \noindent Each automaton has a single start state and |
268 Each automaton has a single start state and potentially several accepting states. We obtain a |
278 potentially several accepting states. We obtain a NFA for the |
269 NFA for the regular expression $r_1 + r_2$ by introducing a new starting state and connecting it |
279 regular expression $r_1 + r_2$ by introducing a new starting |
270 with an $\epsilon$-transition to the two starting states above, like so |
280 state and connecting it with an $\epsilon$-transition to the |
|
281 two starting states above, like so |
271 |
282 |
272 \begin{center} |
283 \begin{center} |
273 \hspace{2cm}\begin{tikzpicture}[node distance=3mm, |
284 \hspace{2cm}\begin{tikzpicture}[node distance=3mm, |
274 >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},] |
285 >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},] |
275 \node at (0,0) [state, initial] (1) {$\mbox{}$}; |
286 \node at (0,0) [state, initial] (1) {$\mbox{}$}; |
313 \node [yshift=3mm] at (1.north) {$r$}; |
324 \node [yshift=3mm] at (1.north) {$r$}; |
314 \end{pgfonlayer} |
325 \end{pgfonlayer} |
315 \end{tikzpicture} |
326 \end{tikzpicture} |
316 \end{center} |
327 \end{center} |
317 |
328 |
318 \noindent |
329 \noindent and connect its accepting states to a new starting |
319 and connect its accepting states to a new starting state via $\epsilon$-transitions. This new |
330 state via $\epsilon$-transitions. This new starting state is |
320 starting state is also an accepting state, because $r^*$ can also recognise the empty string. |
331 also an accepting state, because $r^*$ can also recognise the |
321 This gives the following automaton for $r^*$: |
332 empty string. This gives the following automaton for $r^*$: |
322 |
333 |
323 \begin{center} |
334 \begin{center} |
324 \begin{tikzpicture}[node distance=3mm, |
335 \begin{tikzpicture}[node distance=3mm, |
325 >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},] |
336 >=stealth',very thick, every state/.style={minimum size=3pt,draw=blue!50,very thick,fill=blue!20},] |
326 \node at (0,0) [state, initial,accepting] (1) {$\mbox{}$}; |
337 \node at (0,0) [state, initial,accepting] (1) {$\mbox{}$}; |