31 |
30 |
32 The central definition is:\medskip |
31 The central definition is:\medskip |
33 |
32 |
34 \noindent |
33 \noindent |
35 A \emph{deterministic finite automaton} (DFA), say $A$, is |
34 A \emph{deterministic finite automaton} (DFA), say $A$, is |
36 given by a five-tuple written $A(\Sigma, Qs, Q_0, F, \delta)$ where |
35 given by a five-tuple written ${\cal A}(\varSigma, Qs, Q_0, F, \delta)$ where |
37 |
36 |
38 \begin{itemize} |
37 \begin{itemize} |
39 \item $\Sigma$ is an alphabet, |
38 \item $\varSigma$ is an alphabet, |
40 \item $Qs$ is a finite set of states, |
39 \item $Qs$ is a finite set of states, |
41 \item $Q_0 \in Qs$ is the start state, |
40 \item $Q_0 \in Qs$ is the start state, |
42 \item $F \subseteq Qs$ are the accepting states, and |
41 \item $F \subseteq Qs$ are the accepting states, and |
43 \item $\delta$ is the transition function. |
42 \item $\delta$ is the transition function. |
44 \end{itemize} |
43 \end{itemize} |
47 from one state to the next state with respect to a character. We have |
46 from one state to the next state with respect to a character. We have |
48 the assumption that these transition functions do not need to be |
47 the assumption that these transition functions do not need to be |
49 defined everywhere: so it can be the case that given a character there |
48 defined everywhere: so it can be the case that given a character there |
50 is no next state, in which case we need to raise a kind of ``failure |
49 is no next state, in which case we need to raise a kind of ``failure |
51 exception''. That means actually we have \emph{partial} functions as |
50 exception''. That means actually we have \emph{partial} functions as |
52 transitions---see the implementation later on. A typical example of a |
51 transitions---see the Scala implementation of DFAs later on. A |
53 DFA is |
52 typical example of a DFA is |
54 |
53 |
55 \begin{center} |
54 \begin{center} |
56 \begin{tikzpicture}[>=stealth',very thick,auto, |
55 \begin{tikzpicture}[>=stealth',very thick,auto, |
57 every state/.style={minimum size=0pt, |
56 every state/.style={minimum size=0pt, |
58 inner sep=2pt,draw=blue!50,very thick, |
57 inner sep=2pt,draw=blue!50,very thick, |
100 an automaton. For this we lift the transition function |
99 an automaton. For this we lift the transition function |
101 $\delta$ from characters to strings as follows: |
100 $\delta$ from characters to strings as follows: |
102 |
101 |
103 \[ |
102 \[ |
104 \begin{array}{lcl} |
103 \begin{array}{lcl} |
105 \hat{\delta}(q, []) & \dn & q\\ |
104 \widehat{\delta}(q, []) & \dn & q\\ |
106 \hat{\delta}(q, c\!::\!s) & \dn & \hat{\delta}(\delta(q, c), s)\\ |
105 \widehat{\delta}(q, c\!::\!s) & \dn & \widehat{\delta}(\delta(q, c), s)\\ |
107 \end{array} |
106 \end{array} |
108 \] |
107 \] |
109 |
108 |
110 \noindent This lifted transition function is often called |
109 \noindent This lifted transition function is often called |
111 ``delta-hat''. Given a string, we start in the starting state and take |
110 ``delta-hat''. Given a string, we start in the starting state and take |
112 the first character of the string, follow to the next sate, then take |
111 the first character of the string, follow to the next state, then take |
113 the second character and so on. Once the string is exhausted and we |
112 the second character and so on. Once the string is exhausted and we |
114 end up in an accepting state, then this string is accepted by the |
113 end up in an accepting state, then this string is accepted by the |
115 automaton. Otherwise it is not accepted. This also means that if along |
114 automaton. Otherwise it is not accepted. This also means that if along |
116 the way we hit the case where the transition function $\delta$ is not |
115 the way we hit the case where the transition function $\delta$ is not |
117 defined, we need to raise an error. In our implementation we will deal |
116 defined, we need to raise an error. In our implementation we will deal |
118 with this case elegantly by using Scala's \texttt{Try}. So a string |
117 with this case elegantly by using Scala's \texttt{Try}. So a string |
119 $s$ is in the \emph{language accepted by the automaton} $A(\Sigma, Q, Q_0, F, |
118 $s$ is in the \emph{language accepted by the automaton} ${\cal |
120 \delta)$ iff |
119 A}(\varSigma, Q, Q_0, F, \delta)$ iff |
121 |
120 |
122 \[ |
121 \[ |
123 \hat{\delta}(Q_0, s) \in F |
122 \widehat{\delta}(Q_0, s) \in F |
124 \] |
123 \] |
125 |
124 |
126 \noindent I let you think about a definition that describes |
125 \noindent I let you think about a definition that describes |
127 the set of all strings accepted by an automaton. |
126 the set of all strings accepted by an automaton. |
128 |
127 |
140 returns \texttt{false}---that means the string is not accepted. |
139 returns \texttt{false}---that means the string is not accepted. |
141 The example \texttt{delta} implements the DFA example shown |
140 The example \texttt{delta} implements the DFA example shown |
142 earlier in the handout.\label{dfa}} |
141 earlier in the handout.\label{dfa}} |
143 \end{figure} |
142 \end{figure} |
144 |
143 |
145 A simple Scala implementation for DFAs is given in |
144 My take of a simple Scala implementation for DFAs is given in |
146 Figure~\ref{dfa}. As you can see, there are many features of the |
145 Figure~\ref{dfa}. As you can see, there are many features of the |
147 mathematical definition that are quite closely reflected in the |
146 mathematical definition that are quite closely reflected in the |
148 code. In the DFA-class, there is a starting state, called |
147 code. In the DFA-class, there is a starting state, called |
149 \texttt{start}, with the polymorphic type \texttt{A}. There is a |
148 \texttt{start}, with the polymorphic type \texttt{A}. There is a |
150 partial function \texttt{delta} for specifying the transitions---these |
149 partial function \texttt{delta} for specifying the transitions---these |
151 partial functions take a state (of polymorphic type \texttt{A}) and an |
150 partial functions take a state (of polymorphic type \texttt{A}) and an |
152 input (of polymorphic type \texttt{C}) and produce a new state (of |
151 input (of polymorphic type \texttt{C}) and produce a new state (of |
153 type \texttt{A}). For the moment it is OK to assume that \texttt{A} is |
152 type \texttt{A}). For the moment it is OK to assume that \texttt{A} is |
154 some arbitrary type for states and the input is just characters. (The |
153 some arbitrary type for states and the input is just characters. (The |
155 reason for having a polymorphic types for the states and the input of |
154 reason for having polymorphic types for the states and the input of |
156 DFAs will become clearer later on.) |
155 DFAs will become clearer later on.) |
157 |
156 |
158 I used partial functions for representing the transitions in Scala; |
157 The most important point in this implemnetation is that I use Scala's |
159 alternatives would have been \texttt{Maps} or even \texttt{Lists}. One |
158 partial functions for representing the transitions; alternatives would |
160 of the main advantages of using partial functions is that transitions |
159 have been \texttt{Maps} or even \texttt{Lists}. One of the main |
161 can be quite nicely defined by a series of \texttt{case} statements |
160 advantages of using partial functions is that transitions can be quite |
162 (see Lines 28 -- 38 for an example). If you need to represent an |
161 nicely defined by a series of \texttt{case} statements (see Lines 28 |
163 automaton with a sink state (catch-all-state), you can use Scala's |
162 -- 38 for an example). If you need to represent an automaton with a |
164 pattern matching and write something like |
163 sink state (catch-all-state), you can use Scala's pattern matching and |
|
164 write something like |
165 |
165 |
166 {\small\begin{lstlisting}[language=Scala,linebackgroundcolor= |
166 {\small\begin{lstlisting}[language=Scala,linebackgroundcolor= |
167 {\ifodd\value{lstnumber}\color{capri!3}\fi}] |
167 {\ifodd\value{lstnumber}\color{capri!3}\fi}] |
168 abstract class State |
168 abstract class State |
169 ... |
169 ... |
174 case (S1, 'a') => S2 |
174 case (S1, 'a') => S2 |
175 case _ => Sink |
175 case _ => Sink |
176 } |
176 } |
177 \end{lstlisting}} |
177 \end{lstlisting}} |
178 |
178 |
179 \noindent The DFA-class has also an argument for specifying final |
179 \noindent I let you think what this DFA looks like in the graphical |
180 states. In the implementation it not a set of states, as in the |
180 notation. |
181 matemathical definition, but is a function from states to booleans |
181 |
182 (this function is supposed to return true whenever a state is meant to |
182 The DFA-class has also an argument for specifying final states. In the |
183 be final; false otherwise). While this boolean function is different |
183 implementation it not a set of states, as in the matemathical |
184 from the sets of states, Scala allows to use sets for such functions |
184 definition, but a function from states to booleans (this function is |
185 (see Line 40 where the DFA is initialised). Again it will become clear |
185 supposed to return true whenever a state is final; false |
186 later on why I use functions for final states, rather than sets. |
186 otherwise). While this boolean function is different from the sets of |
|
187 states, Scala allows to use sets for such functions (see Line 40 where |
|
188 the DFA is initialised). Again it will become clear later on why I use |
|
189 functions for final states, rather than sets. |
187 |
190 |
188 I let you ponder whether this is a good implementation of DFAs. In |
191 I let you ponder whether this is a good implementation of DFAs. In |
189 doing so I hope you notice that the $Qs$ component (the set of finite |
192 doing so I hope you notice that the $\varSigma$ and $Qs$ components (the |
190 states) is missing from the class definition. This means that the |
193 alphabet and the set of finite states, respectively) are missing from |
191 implementation allows you to do some fishy things you are not meant to |
194 the class definition. This means that the implementation allows you to |
192 do with DFAs. Which fishy things could that be? |
195 do some fishy things you are not meant to do with DFAs. Which fishy |
|
196 things could that be? |
193 |
197 |
194 |
198 |
195 |
199 |
196 \subsection*{Non-Deterministic Finite Automata} |
200 \subsection*{Non-Deterministic Finite Automata} |
197 |
201 |
198 While with DFAs it will always be clear that given a state and a |
202 While with DFAs it is always be clear that given a state and a |
199 character what the next state is (potentially none), it will be useful |
203 character what the next state is (potentially none), it will be useful |
200 to relax this restriction. That means we allow to have several |
204 to relax this restriction. That means we allow states to have several |
201 potential successor states. We even allow more than one starting |
205 potential successor states. We even allow more than one starting |
202 state. The resulting construction is called a \emph{non-deterministic |
206 state. The resulting construction is called a \emph{Non-Deterministic |
203 finite automaton} (NFA) given also as a five-tuple $A(\Sigma, Qs, Q_{0s}, F, |
207 Finite Automaton} (NFA) given also as a five-tuple ${\cal A}(\varSigma, |
204 \rho)$ where |
208 Qs, Q_{0s}, F, \rho)$ where |
205 |
209 |
206 \begin{itemize} |
210 \begin{itemize} |
207 \item $\Sigma$ is an alphabet, |
211 \item $\varSigma$ is an alphabet, |
208 \item $Qs$ is a finite set of states |
212 \item $Qs$ is a finite set of states |
209 \item $Q_{0s}$ is a set of start states ($Q_{0s} \subseteq Qs$) |
213 \item $Q_{0s}$ is a set of start states ($Q_{0s} \subseteq Qs$) |
210 \item $F$ are some accepting states with $F \subseteq Qs$, and |
214 \item $F$ are some accepting states with $F \subseteq Qs$, and |
211 \item $\rho$ is a transition relation. |
215 \item $\rho$ is a transition relation. |
212 \end{itemize} |
216 \end{itemize} |
233 |
237 |
234 \noindent |
238 \noindent |
235 This NFA happens to have only one starting state, but in general there |
239 This NFA happens to have only one starting state, but in general there |
236 could be more. Notice that in state $Q_0$ we might go to state $Q_1$ |
240 could be more. Notice that in state $Q_0$ we might go to state $Q_1$ |
237 \emph{or} to state $Q_2$ when receiving an $a$. Similarly in state |
241 \emph{or} to state $Q_2$ when receiving an $a$. Similarly in state |
238 $Q_1$ and receiving an $a$, we can stay in state $Q_1$ or go to |
242 $Q_1$ and receiving an $a$, we can stay in state $Q_1$ \emph{or} go to |
239 $Q_2$. This kind of choice is not allowed with DFAs. The downside is |
243 $Q_2$. This kind of choice is not allowed with DFAs. The downside of |
240 that when it comes to deciding whether a string is accepted by a NFA |
244 this choice is that when it comes to deciding whether a string is |
241 we potentially have to explore all possibilities. I let you think |
245 accepted by a NFA we potentially have to explore all possibilities. I |
242 which kind of strings the above NFA accepts. |
246 let you think which kind of strings the above NFA accepts. |
243 |
247 |
244 |
248 |
245 There are a number of additional points you should note with |
249 There are a number of additional points you should note with |
246 NFAs. Every DFA is a NFA, but not vice versa. The $\rho$ in NFAs is a |
250 NFAs. Every DFA is a NFA, but not vice versa. The $\rho$ in NFAs is a |
247 transition \emph{relation} (DFAs have a transition function). The |
251 transition \emph{relation} (DFAs have a transition function). The |
258 and I also do not use transition functions that return sets of states |
262 and I also do not use transition functions that return sets of states |
259 (another popular choice for implementing NFAs). For reasons that |
263 (another popular choice for implementing NFAs). For reasons that |
260 become clear in a moment, I use sets of partial functions |
264 become clear in a moment, I use sets of partial functions |
261 instead---see Line 7 in Figure~\ref{nfa}. DFAs have only one such |
265 instead---see Line 7 in Figure~\ref{nfa}. DFAs have only one such |
262 partial function; my NFAs have a set. Another parameter, |
266 partial function; my NFAs have a set. Another parameter, |
263 \texttt{Starts}, is in NFAs a set of states; \texttt{fins} is again a |
267 \texttt{starts}, is in NFAs a set of states; \texttt{fins} is again a |
264 function from states to booleans. The \texttt{next} function |
268 function from states to booleans. The \texttt{next} function |
265 calculates the set of next states reachable from a single state |
269 calculates the set of next states reachable from a single state |
266 \texttt{q} by a character \texttt{c}---this is calculated by going |
270 \texttt{q} by a character~\texttt{c}---this is calculated by going |
267 through all the partial functions in the \texttt{delta}-set and apply |
271 through all the partial functions in the \texttt{delta}-set and apply |
268 \texttt{q} and \texttt{c} (see Line 13). This gives us a set of |
272 \texttt{q} and \texttt{c} (see Line 13). This gives a set of |
269 \texttt{Some}s (in case the application succeeded) and possibly some |
273 \texttt{Some}s (in case the application succeeded) and possibly some |
270 \texttt{None}s (in case the partial function is not defined or produces an |
274 \texttt{None}s (in case the partial function is not defined or produces an |
271 error). The \texttt{None}s are filtered out by the \texttt{flatMap}, |
275 error). The \texttt{None}s are filtered out by the \texttt{flatMap}, |
272 leaving the values inside the \texttt{Some}s. The function |
276 leaving the values inside the \texttt{Some}s. The function |
273 \texttt{nexts} just lifts this function to sets of |
277 \texttt{nexts} just lifts this function to sets of |
274 states. \texttt{Deltas} and \texttt{accept} are similar to the DFA |
278 states. \texttt{Deltas} and \texttt{accept} are similar to the DFA |
275 definitions. |
279 definitions. |
276 |
280 |
277 \begin{figure}[t] |
281 \begin{figure}[p] |
278 \small |
282 \small |
279 \lstinputlisting[numbers=left,linebackgroundcolor= |
283 \lstinputlisting[numbers=left,linebackgroundcolor= |
280 {\ifodd\value{lstnumber}\color{capri!3}\fi}] |
284 {\ifodd\value{lstnumber}\color{capri!3}\fi}] |
281 {../progs/nfa.scala} |
285 {../progs/nfa.scala} |
282 \caption{A Scala implementation of NFAs using sets of partial functions. |
286 \caption{A Scala implementation of NFAs using sets of partial functions. |
286 acceptance of a string in a breath-first fashion. This can be costly |
290 acceptance of a string in a breath-first fashion. This can be costly |
287 way of deciding whether a string is accepted in practical contexts.\label{nfa}} |
291 way of deciding whether a string is accepted in practical contexts.\label{nfa}} |
288 \end{figure} |
292 \end{figure} |
289 |
293 |
290 The reason for using sets of partial functions for specifying the |
294 The reason for using sets of partial functions for specifying the |
291 transitions in NFAs has to do with examples like this one: a |
295 transitions in NFAs has to do with pattern matching. Consider the |
292 popular benchmark regular expression is $(.)^*\cdot a\cdot |
296 following example: a popular benchmark regular expression is |
293 (.)^{\{n\}}\cdot b\cdot c$. A NFA that accepts the same strings |
297 $(.)^*\cdot a\cdot (.)^{\{n\}}\cdot b\cdot c$. The important point to |
294 (for $n=3$) is as follows: |
298 note is that it uses $.$ in order to represent the regular expression |
|
299 that accepts any character. A NFA that accepts the same strings as |
|
300 this regular expression (for $n=3$) is as follows: |
295 |
301 |
296 \begin{center} |
302 \begin{center} |
297 \begin{tikzpicture}[>=stealth',very thick, auto, node distance=7mm, |
303 \begin{tikzpicture}[>=stealth',very thick, auto, node distance=7mm, |
298 every state/.style={minimum size=0pt,inner sep=1pt, |
304 every state/.style={minimum size=0pt,inner sep=1pt, |
299 draw=blue!50,very thick,fill=blue!20},scale=0.5] |
305 draw=blue!50,very thick,fill=blue!20},scale=0.5] |
340 \noindent |
346 \noindent |
341 where the $.$-transitions translate into a |
347 where the $.$-transitions translate into a |
342 underscore-pattern-matching. Recall that in $Q_0$ if we read an $a$ we |
348 underscore-pattern-matching. Recall that in $Q_0$ if we read an $a$ we |
343 can go to $Q_1$ (by the first partial function in the set) and also |
349 can go to $Q_1$ (by the first partial function in the set) and also |
344 stay in $Q_0$ (by the second partial function). Representing such |
350 stay in $Q_0$ (by the second partial function). Representing such |
345 transitions in any other way is somehow awkward; the set of partial |
351 transitions in any other way in Scala seems to be somehow awkward; the |
346 function representation makes them easy to implement. |
352 set of partial function representation makes them easy to implement. |
347 |
353 |
348 Look very careful again at the \texttt{accepts} and \texttt{deltas} |
354 Look very careful again at the \texttt{accepts} and \texttt{deltas} |
349 functions in NFAs and remember that when accepting a string by an NFA |
355 functions in NFAs and remember that when accepting a string by an NFA |
350 we might have to explore all possible transitions (which state to go |
356 we might have to explore all possible transitions (recall which state |
351 to is not unique anymore). The shown implementation achieves this |
357 to go to is not unique anymore with NFAs). The implementation achieves |
352 exploration in a \emph{breath-first search}. This is fine for very |
358 this exploration in a \emph{breadth-first search} manner. This is fine |
353 small NFAs, but can lead to problems when the NFAs are bigger. Take |
359 for very small NFAs, but can lead to problems when the NFAs are |
354 for example the regular expression $(.)^*\cdot a\cdot (.)^{\{n\}}\cdot |
360 bigger. Take for example the regular expression $(.)^*\cdot a\cdot |
355 b\cdot c$ from above. If $n$ is large, say 100 or 1000, then the |
361 (.)^{\{n\}}\cdot b\cdot c$ from above. If $n$ is large, say 100 or |
356 corresponding NFA will have 104, respectively 1004, nodes. The problem |
362 1000, then the corresponding NFA will have 104, respectively 1004, |
357 is that with certain strings this can lead to 1000 ``active'' nodes |
363 nodes. The problem is that with certain strings this can lead to 1000 |
358 which we need to analyse when determining the next states. This can be |
364 ``active'' nodes in the breadth-first search, all of which we need to |
359 a real memory strain in practical applications. As a result, some |
365 analyse when determining the next states. This can be a real memory |
360 regular expression matching engines resort to a \emph{depth-first |
366 strain in practical applications. As result, some regular expression |
361 search} with \emph{backtracking} in unsuccessful cases. In our |
367 matching engines resort to a \emph{depth-first search} with |
362 implementation we could implement a depth-first version of |
368 \emph{backtracking} in unsuccessful cases. In our implementation we |
363 \texttt{accepts} using Scala's \texttt{exists} as follows: |
369 can implement a depth-first version of \texttt{accepts} using Scala's |
|
370 \texttt{exists} as follows: |
364 |
371 |
365 |
372 |
366 {\small\begin{lstlisting}[language=Scala,linebackgroundcolor= |
373 {\small\begin{lstlisting}[language=Scala,linebackgroundcolor= |
367 {\ifodd\value{lstnumber}\color{capri!3}\fi}] |
374 {\ifodd\value{lstnumber}\color{capri!3}\fi}] |
368 def search(q: A, s: List[C]) : Boolean = s match { |
375 def search(q: A, s: List[C]) : Boolean = s match { |
369 case Nil => fins(q) |
376 case Nil => fins(q) |
370 case c::cs => |
377 case c::cs => |
371 delta.exists((d) => Try(search(d(q, c), cs)) getOrElse false) |
378 delta.exists(d => Try(search(d(q, c), cs)) getOrElse false) |
372 } |
379 } |
373 |
380 |
374 def accepts(s: List[C]) : Boolean = |
381 def accepts(s: List[C]) : Boolean = |
375 starts.exists(search(_, s)) |
382 starts.exists(search(_, s)) |
376 \end{lstlisting}} |
383 \end{lstlisting}} |
377 |
384 |
378 \noindent |
385 \noindent |
379 This depth-first way of exploration seems to work efficiently in many |
386 This depth-first way of exploration seems to work efficiently in many |
380 examples and is much less of strain on memory. The problem is that the |
387 examples and is much less of strain on memory. The problem is that the |
381 backtracking can get catastrophic in some examples---remember the |
388 backtracking can get ``catastrophic'' in some examples---remember the |
382 catastrophic backtracking from earlier lectures. This depth-first |
389 catastrophic backtracking from earlier lectures. This depth-first |
383 search with backtracking is the reason for the abysmal performance of |
390 search with backtracking is the reason for the abysmal performance of |
384 regular expression macthing in Java, Ruby and Python. |
391 some regular expression macthings in Java, Ruby and Python. I like to |
|
392 show you this next. |
385 |
393 |
386 %This means if |
394 %This means if |
387 %we need to decide whether a string is accepted by a NFA, we might have |
395 %we need to decide whether a string is accepted by a NFA, we might have |
388 %to explore all possibilities. Also there is the special silent |
396 %to explore all possibilities. Also there is the special silent |
389 %transition in NFAs. As mentioned already this transition means you do |
397 %transition in NFAs. As mentioned already this transition means you do |
393 %%$Q_2$. This silent transition is also often called |
401 %%$Q_2$. This silent transition is also often called |
394 %\emph{$\epsilon$-transition}. |
402 %\emph{$\epsilon$-transition}. |
395 |
403 |
396 |
404 |
397 \subsubsection*{Thompson Construction} |
405 \subsubsection*{Thompson Construction} |
|
406 |
|
407 In order to get an idea what calculations are done in Java \& friends, |
|
408 we need a method for translating regular expressions into |
|
409 automata. The simplest and most well-known method is called |
|
410 \emph{Thompson Construction}, after the Turing Award winner Ken |
|
411 Thompson who implemented this method in early versions of grep???? |
|
412 |
398 |
413 |
399 The reason for introducing NFAs is that there is a relatively |
414 The reason for introducing NFAs is that there is a relatively |
400 simple (recursive) translation of regular expressions into |
415 simple (recursive) translation of regular expressions into |
401 NFAs. Consider the simple regular expressions $\ZERO$, |
416 NFAs. Consider the simple regular expressions $\ZERO$, |
402 $\ONE$ and $c$. They can be translated as follows: |
417 $\ONE$ and $c$. They can be translated as follows: |