113 end up in an accepting state, then this string is accepted by the |
114 end up in an accepting state, then this string is accepted by the |
114 automaton. Otherwise it is not accepted. This also means that if along |
115 automaton. Otherwise it is not accepted. This also means that if along |
115 the way we hit the case where the transition function $\delta$ is not |
116 the way we hit the case where the transition function $\delta$ is not |
116 defined, we need to raise an error. In our implementation we will deal |
117 defined, we need to raise an error. In our implementation we will deal |
117 with this case elegantly by using Scala's \texttt{Try}. So a string |
118 with this case elegantly by using Scala's \texttt{Try}. So a string |
118 $s$ is in the \emph{language accepted by the automaton} $A(Q, Q_0, F, |
119 $s$ is in the \emph{language accepted by the automaton} $A(\Sigma, Q, Q_0, F, |
119 \delta)$ iff |
120 \delta)$ iff |
120 |
121 |
121 \[ |
122 \[ |
122 \hat{\delta}(Q_0, s) \in F |
123 \hat{\delta}(Q_0, s) \in F |
123 \] |
124 \] |
124 |
125 |
125 \noindent I let you think about a definition that describes |
126 \noindent I let you think about a definition that describes |
126 the set of strings accepted by an automaton. |
127 the set of all strings accepted by an automaton. |
127 |
128 |
128 \begin{figure}[p] |
129 \begin{figure}[p] |
129 \small |
130 \small |
130 \lstinputlisting[numbers=left,linebackgroundcolor= |
131 \lstinputlisting[numbers=left,linebackgroundcolor= |
131 {\ifodd\value{lstnumber}\color{capri!3}\fi}] |
132 {\ifodd\value{lstnumber}\color{capri!3}\fi}] |
132 {../progs/dfa.scala} |
133 {../progs/dfa.scala} |
133 \caption{A Scala implementation of DFAs using partial functions. |
134 \caption{A Scala implementation of DFAs using partial functions. |
134 Notice some subtleties: \texttt{deltas} implements the delta-hat |
135 Notice some subtleties: \texttt{deltas} implements the delta-hat |
135 construction by lifting the transition (partial) function to |
136 construction by lifting the transition (partial) function to |
136 lists of \texttt{C}haracters. Since \texttt{delta} is given |
137 lists of characters. Since \texttt{delta} is given |
137 as partial function, it can obviously go ``wrong'' in which |
138 as a partial function, it can obviously go ``wrong'' in which |
138 case the \texttt{Try} in \texttt{accepts} catches the error and |
139 case the \texttt{Try} in \texttt{accepts} catches the error and |
139 returns \texttt{false}---that means the string is not accepted. |
140 returns \texttt{false}---that means the string is not accepted. |
140 The example \texttt{delta} implements the DFA example shown |
141 The example \texttt{delta} implements the DFA example shown |
141 earlier in the handout.\label{dfa}} |
142 earlier in the handout.\label{dfa}} |
142 \end{figure} |
143 \end{figure} |
143 |
144 |
144 A simple Scala implementation for DFAs is given in |
145 A simple Scala implementation for DFAs is given in |
145 Figure~\ref{dfa}. As you can see, there many features of the |
146 Figure~\ref{dfa}. As you can see, there are many features of the |
146 mathematical definition that are quite closely reflected in the |
147 mathematical definition that are quite closely reflected in the |
147 code. In the DFA-class, there is a starting state, called |
148 code. In the DFA-class, there is a starting state, called |
148 \texttt{start}, with the polymorphic type \texttt{A}. (The reason for |
149 \texttt{start}, with the polymorphic type \texttt{A}. There is a |
149 having a polymorphic types for the states and the input of DFAs will |
150 partial function \texttt{delta} for specifying the transitions---these |
150 become clearer later.) There is a partial function \texttt{delta} for |
151 partial functions take a state (of polymorphic type \texttt{A}) and an |
151 specifying the transitions. I used partial functions for representing |
152 input (of polymorphic type \texttt{C}) and produce a new state (of |
152 the transitions in Scala; alternatives would have been \texttt{Maps} |
153 type \texttt{A}). For the moment it is OK to assume that \texttt{A} is |
153 or even \texttt{Lists}. One of the main advantages of using partial |
154 some arbitrary type for states and the input is just characters. (The |
154 functions is that transitions can be quite nicely defined by a series |
155 reason for having a polymorphic types for the states and the input of |
155 of \texttt{case} statements (see Lines 28 -- 38). If you need to |
156 DFAs will become clearer later on.) |
156 represent an automaton with a sink state (catch-all-state), you can |
157 |
157 write something like |
158 I used partial functions for representing the transitions in Scala; |
|
159 alternatives would have been \texttt{Maps} or even \texttt{Lists}. One |
|
160 of the main advantages of using partial functions is that transitions |
|
161 can be quite nicely defined by a series of \texttt{case} statements |
|
162 (see Lines 28 -- 38 for an example). If you need to represent an |
|
163 automaton with a sink state (catch-all-state), you can use Scala's |
|
164 pattern matching and write something like |
158 |
165 |
159 {\small\begin{lstlisting}[language=Scala,linebackgroundcolor= |
166 {\small\begin{lstlisting}[language=Scala,linebackgroundcolor= |
160 {\ifodd\value{lstnumber}\color{capri!3}\fi}] |
167 {\ifodd\value{lstnumber}\color{capri!3}\fi}] |
161 abstract class State |
168 abstract class State |
162 ... |
169 ... |
189 While with DFAs it will always be clear that given a state and a |
198 While with DFAs it will always be clear that given a state and a |
190 character what the next state is (potentially none), it will be useful |
199 character what the next state is (potentially none), it will be useful |
191 to relax this restriction. That means we allow to have several |
200 to relax this restriction. That means we allow to have several |
192 potential successor states. We even allow more than one starting |
201 potential successor states. We even allow more than one starting |
193 state. The resulting construction is called a \emph{non-deterministic |
202 state. The resulting construction is called a \emph{non-deterministic |
194 finite automaton} (NFA) given also as a four-tuple $A(Qs, Q_{0s}, F, |
203 finite automaton} (NFA) given also as a five-tuple $A(\Sigma, Qs, Q_{0s}, F, |
195 \rho)$ where |
204 \rho)$ where |
196 |
205 |
197 \begin{itemize} |
206 \begin{itemize} |
|
207 \item $\Sigma$ is an alphabet, |
198 \item $Qs$ is a finite set of states |
208 \item $Qs$ is a finite set of states |
199 \item $Q_{0s}$ is a set of start states ($Q_{0s} \subseteq Qs$) |
209 \item $Q_{0s}$ is a set of start states ($Q_{0s} \subseteq Qs$) |
200 \item $F$ are some accepting states with $F \subseteq Qs$, and |
210 \item $F$ are some accepting states with $F \subseteq Qs$, and |
201 \item $\rho$ is a transition relation. |
211 \item $\rho$ is a transition relation. |
202 \end{itemize} |
212 \end{itemize} |
203 |
213 |
204 \noindent |
214 \noindent |
205 A typical example of an NFA is |
215 A typical example of a NFA is |
206 |
216 |
207 % A NFA for (ab* + b)*a |
217 % A NFA for (ab* + b)*a |
208 \begin{center} |
218 \begin{center} |
209 \begin{tikzpicture}[scale=0.8,>=stealth',very thick, |
219 \begin{tikzpicture}[>=stealth',very thick, auto, |
210 every state/.style={minimum size=0pt,draw=blue!50,very thick,fill=blue!20},] |
220 every state/.style={minimum size=0pt,inner sep=3pt, |
|
221 draw=blue!50,very thick,fill=blue!20},scale=2] |
211 \node[state,initial] (Q_0) {$Q_0$}; |
222 \node[state,initial] (Q_0) {$Q_0$}; |
212 \node[state] (Q_1) [right=of Q_0] {$Q_1$}; |
223 \node[state] (Q_1) [right=of Q_0] {$Q_1$}; |
213 \node[state, accepting] (Q_2) [right=of Q_1] {$Q_2$}; |
224 \node[state, accepting] (Q_2) [right=of Q_1] {$Q_2$}; |
214 \path[->] (Q_0) edge [loop above] node {$b$} (); |
225 \path[->] (Q_0) edge [loop above] node {$b$} (); |
215 \path[<-] (Q_0) edge node [below] {$b$} (Q_1); |
226 \path[<-] (Q_0) edge node [below] {$b$} (Q_1); |
221 \end{center} |
232 \end{center} |
222 |
233 |
223 \noindent |
234 \noindent |
224 This NFA happens to have only one starting state, but in general there |
235 This NFA happens to have only one starting state, but in general there |
225 could be more. Notice that in state $Q_0$ we might go to state $Q_1$ |
236 could be more. Notice that in state $Q_0$ we might go to state $Q_1$ |
226 \emph{or} to state $Q_2$ when receiving an $a$. Similarly in state $Q_1$ |
237 \emph{or} to state $Q_2$ when receiving an $a$. Similarly in state |
227 and receiving an $a$, we can stay in state $Q_1$ or go to $Q_2$. This |
238 $Q_1$ and receiving an $a$, we can stay in state $Q_1$ or go to |
228 kind of choice is not allowed with DFAs. When it comes to deciding |
239 $Q_2$. This kind of choice is not allowed with DFAs. The downside is |
229 whether a string is accepted by an NFA we potentially need to explore |
240 that when it comes to deciding whether a string is accepted by a NFA |
230 all possibilities. I let you think which kind of strings this NFA |
241 we potentially have to explore all possibilities. I let you think |
231 accepts. |
242 which kind of strings the above NFA accepts. |
232 |
243 |
233 |
244 |
234 There are however a number of additional points you should note. Every |
245 There are a number of additional points you should note with |
235 DFA is a NFA, but not vice versa. The $\rho$ in NFAs is a transition |
246 NFAs. Every DFA is a NFA, but not vice versa. The $\rho$ in NFAs is a |
236 \emph{relation} (DFAs have a transition function). The difference |
247 transition \emph{relation} (DFAs have a transition function). The |
237 between a function and a relation is that a function has always a |
248 difference between a function and a relation is that a function has |
238 single output, while a relation gives, roughly speaking, several |
249 always a single output, while a relation gives, roughly speaking, |
239 outputs. Look again at the NFA above: if you are currently in the |
250 several outputs. Look again at the NFA above: if you are currently in |
240 state $Q_1$ and you read a character $b$, then you can transition to |
251 the state $Q_1$ and you read a character $b$, then you can transition |
241 either $Q_0$ \emph{or} $Q_2$. Which route, or output, you take is not |
252 to either $Q_0$ \emph{or} $Q_2$. Which route, or output, you take is |
242 determined. This non-determinism can be represented by a relation. |
253 not determined. This non-determinism can be represented by a |
243 |
254 relation. |
244 My implementation of NFAs is shown in Figure~\ref{nfa}. Perhaps |
255 |
245 interestingly, I do not use relations for my implementation of NFAs in |
256 My implementation of NFAs in Scala is shown in Figure~\ref{nfa}. |
246 Scala, and I also do not use transition functions which return a set |
257 Perhaps interestingly, I do not actually use relations for my NFAs, |
247 of states (another popular choice for implementing NFAs). For reasons |
258 and I also do not use transition functions that return sets of states |
248 that become clear in a moment, I use sets of partial functions---see |
259 (another popular choice for implementing NFAs). For reasons that |
249 Line 7 in Figure~\ref{nfa}. \texttt{Starts} is now a set of states; |
260 become clear in a moment, I use sets of partial functions |
250 \texttt{fins} is again a function from states to booleans. The |
261 instead---see Line 7 in Figure~\ref{nfa}. DFAs have only one such |
251 \texttt{next} function calculates the set of next states reachable |
262 partial function; my NFAs have a set. Another parameter, |
252 from a state \texttt{q} by a character \texttt{c}---we go through all |
263 \texttt{Starts}, is in NFAs a set of states; \texttt{fins} is again a |
253 partial functions in the \texttt{delta}-set, lift it (this transformes |
264 function from states to booleans. The \texttt{next} function |
254 the partial function into the corresponding \texttt{Option}-function |
265 calculates the set of next states reachable from a single state |
255 and then apply \texttt{q} and \texttt{c}. This gives us a set of |
266 \texttt{q} by a character \texttt{c}---this is calculated by going |
256 \texttt{Some}s and \texttt{None}s, where the \texttt{Some}s are |
267 through all the partial functions in the \texttt{delta}-set and apply |
257 filtered out by the \texttt{flatMap}. Teh function \texttt{nexts} just |
268 \texttt{q} and \texttt{c} (see Line 13). This gives us a set of |
258 lifts this to sets of states. \texttt{Deltas} and \texttt{accept} are |
269 \texttt{Some}s (in case the application succeeded) and possibly some |
259 similar to the DFA definitions. |
270 \texttt{None}s (in case the partial function is not defined or produces an |
|
271 error). The \texttt{None}s are filtered out by the \texttt{flatMap}, |
|
272 leaving the values inside the \texttt{Some}s. The function |
|
273 \texttt{nexts} just lifts this function to sets of |
|
274 states. \texttt{Deltas} and \texttt{accept} are similar to the DFA |
|
275 definitions. |
260 |
276 |
261 \begin{figure}[t] |
277 \begin{figure}[t] |
262 \small |
278 \small |
263 \lstinputlisting[numbers=left,linebackgroundcolor= |
279 \lstinputlisting[numbers=left,linebackgroundcolor= |
264 {\ifodd\value{lstnumber}\color{capri!3}\fi}] |
280 {\ifodd\value{lstnumber}\color{capri!3}\fi}] |
265 {../progs/nfa.scala} |
281 {../progs/nfa.scala} |
266 \caption{A Scala implementation of NFAs using sets of partial functions. |
282 \caption{A Scala implementation of NFAs using sets of partial functions. |
267 Notice some subtleties: \texttt{deltas} implements the delta-hat |
283 Notice some subtleties: Since \texttt{delta} is given |
268 construction by lifting the transition (partial) function to |
284 as a set of partial functions, each of them can obviously go ``wrong'' in which |
269 lists of \texttt{C}haracters. Since \texttt{delta} is given |
285 case the \texttt{Try}. The function \texttt{accepts} implements the |
270 as partial function, it can obviously go ``wrong'' in which |
286 acceptance of a string in a breath-first fashion. This can be costly |
271 case the \texttt{Try} in \texttt{accepts} catches the error and |
287 way of deciding whether a string is accepted in practical contexts.\label{nfa}} |
272 returns \texttt{false}---that means the string is not accepted. |
|
273 The example \texttt{delta} implements the DFA example shown |
|
274 earlier in the handout.\label{nfa}} |
|
275 \end{figure} |
288 \end{figure} |
276 |
289 |
277 |
290 The reason for using sets of partial functions for specifying the |
|
291 transitions in NFAs has to do with examples like this one: a |
|
292 popular benchmark regular expression is $(.)^*\cdot a\cdot |
|
293 (.)^{\{n\}}\cdot b\cdot c$. A NFA that accepts the same strings |
|
294 (for $n=3$) is as follows: |
|
295 |
|
296 \begin{center} |
|
297 \begin{tikzpicture}[>=stealth',very thick, auto, node distance=7mm, |
|
298 every state/.style={minimum size=0pt,inner sep=1pt, |
|
299 draw=blue!50,very thick,fill=blue!20},scale=0.5] |
|
300 \node[state,initial] (Q_0) {$Q_0$}; |
|
301 \node[state] (Q_1) [right=of Q_0] {$Q_1$}; |
|
302 \node[state] (Q_2) [right=of Q_1] {$Q_2$}; |
|
303 \node[state] (Q_3) [right=of Q_2] {$Q_3$}; |
|
304 \node[state] (Q_4) [right=of Q_3] {$Q_4$}; |
|
305 \node[state] (Q_5) [right=of Q_4] {$Q_5$}; |
|
306 \node[state,accepting] (Q_6) [right=of Q_5] {$Q_6$}; |
|
307 \path[->] (Q_0) edge [loop above] node {$.$} (); |
|
308 \path[->] (Q_0) edge node [above] {$a$} (Q_1); |
|
309 \path[->] (Q_1) edge node [above] {$.$} (Q_2); |
|
310 \path[->] (Q_2) edge node [above] {$.$} (Q_3); |
|
311 \path[->] (Q_3) edge node [above] {$.$} (Q_4); |
|
312 \path[->] (Q_4) edge node [above] {$b$} (Q_5); |
|
313 \path[->] (Q_5) edge node [above] {$c$} (Q_6); |
|
314 \end{tikzpicture} |
|
315 \end{center} |
|
316 |
|
317 \noindent |
|
318 The $.$ stands for accepting any single character: for example if we |
|
319 are in $Q_0$ and read an $a$ we can either stay in $Q_0$ (since any |
|
320 character will do for this) or advance to $Q_1$ (but only if it is an |
|
321 $a$). Why this is a good benchmark regular expression is irrelevant |
|
322 here. The point is that this NFA can be conveniently represented by |
|
323 the code: |
|
324 |
|
325 {\small\begin{lstlisting}[language=Scala,linebackgroundcolor= |
|
326 {\ifodd\value{lstnumber}\color{capri!3}\fi}] |
|
327 val delta = Set[(State, Char) :=> State]( |
|
328 { case (Q0, 'a') => Q1 }, |
|
329 { case (Q0, _) => Q0 }, |
|
330 { case (Q1, _) => Q2 }, |
|
331 { case (Q2, _) => Q3 }, |
|
332 { case (Q3, _) => Q4 }, |
|
333 { case (Q4, 'b') => Q5 }, |
|
334 { case (Q5, 'c') => Q6 } |
|
335 ) |
|
336 |
|
337 NFA(Set[State](Q0), delta, Set[State](Q6)) |
|
338 \end{lstlisting}} |
|
339 |
|
340 \noindent |
|
341 where the $.$-transitions translate into a |
|
342 underscore-pattern-matching. Recall that in $Q_0$ if we read an $a$ we |
|
343 can go to $Q_1$ (by the first partial function in the set) and also |
|
344 stay in $Q_0$ (by the second partial function). Representing such |
|
345 transitions in any other way is somehow awkward; the set of partial |
|
346 function representation makes them easy to implement. |
|
347 |
|
348 Look very careful again at the \texttt{accepts} and \texttt{deltas} |
|
349 functions in NFAs and remember that when accepting a string by an NFA |
|
350 we might have to explore all possible transitions (which state to go |
|
351 to is not unique anymore). The shown implementation achieves this |
|
352 exploration in a \emph{breath-first search}. This is fine for very |
|
353 small NFAs, but can lead to problems when the NFAs are bigger. Take |
|
354 for example the regular expression $(.)^*\cdot a\cdot (.)^{\{n\}}\cdot |
|
355 b\cdot c$ from above. If $n$ is large, say 100 or 1000, then the |
|
356 corresponding NFA will have 104, respectively 1004, nodes. The problem |
|
357 is that with certain strings this can lead to 1000 ``active'' nodes |
|
358 which we need to analyse when determining the next states. This can be |
|
359 a real memory strain in practical applications. As a result, some |
|
360 regular expression matching engines resort to a \emph{depth-first |
|
361 search} with \emph{backtracking} in unsuccessful cases. In our |
|
362 implementation we could implement a depth-first version of |
|
363 \texttt{accepts} using Scala's \texttt{exists} as follows: |
|
364 |
|
365 |
|
366 {\small\begin{lstlisting}[language=Scala,linebackgroundcolor= |
|
367 {\ifodd\value{lstnumber}\color{capri!3}\fi}] |
|
368 def search(q: A, s: List[C]) : Boolean = s match { |
|
369 case Nil => fins(q) |
|
370 case c::cs => |
|
371 delta.exists((d) => Try(search(d(q, c), cs)) getOrElse false) |
|
372 } |
|
373 |
|
374 def accepts(s: List[C]) : Boolean = |
|
375 starts.exists(search(_, s)) |
|
376 \end{lstlisting}} |
|
377 |
|
378 \noindent |
|
379 This depth-first way of exploration seems to work efficiently in many |
|
380 examples and is much less of strain on memory. The problem is that the |
|
381 backtracking can get catastrophic in some examples---remember the |
|
382 catastrophic backtracking from earlier lectures. This depth-first |
|
383 search with backtracking is the reason for the abysmal performance of |
|
384 regular expression macthing in Java, Ruby and Python. |
278 |
385 |
279 %This means if |
386 %This means if |
280 %we need to decide whether a string is accepted by a NFA, we might have |
387 %we need to decide whether a string is accepted by a NFA, we might have |
281 %to explore all possibilities. Also there is the special silent |
388 %to explore all possibilities. Also there is the special silent |
282 %transition in NFAs. As mentioned already this transition means you do |
389 %transition in NFAs. As mentioned already this transition means you do |