129 \noindent I let you think about a definition that describes the set of |
129 \noindent I let you think about a definition that describes the set of |
130 all strings accepted by a deterministic finite automaton. |
130 all strings accepted by a deterministic finite automaton. |
131 |
131 |
132 \begin{figure}[p] |
132 \begin{figure}[p] |
133 \small |
133 \small |
134 \lstinputlisting[numbers=left]{../progs/display/dfa.scala} |
134 \lstinputlisting[numbers=left,lastline=43]{../progs/automata/dfa.sc} |
135 \caption{An implementation of DFAs in Scala using partial functions. |
135 \caption{An implementation of DFAs in Scala using partial functions. |
136 Note some subtleties: \texttt{deltas} implements the delta-hat |
136 Note some subtleties: \texttt{deltas} implements the delta-hat |
137 construction by lifting the (partial) transition function to lists |
137 construction by lifting the (partial) transition function to lists |
138 of characters. Since \texttt{delta} is given as a partial function, |
138 of characters. Since \texttt{delta} is given as a partial function, |
139 it can obviously go ``wrong'' in which case the \texttt{Try} in |
139 it can obviously go ``wrong'' in which case the \texttt{Try} in |
140 \texttt{accepts} catches the error and returns \texttt{false}---that |
140 \texttt{accepts} catches the error and returns \texttt{false}---that |
141 means the string is not accepted. The example \texttt{delta} in |
141 means the string is not accepted. The example \texttt{delta} in |
142 Line 28--38 implements the DFA example shown earlier in the |
142 Line 22--43 implements the DFA example shown earlier in the |
143 handout.\label{dfa}} |
143 handout.\label{dfa}} |
144 \end{figure} |
144 \end{figure} |
145 |
145 |
146 My take on an implementation of DFAs in Scala is given in |
146 My take on an implementation of DFAs in Scala is given in |
147 Figure~\ref{dfa}. As you can see, there are many features of the |
147 Figure~\ref{dfa}. As you can see, there are many features of the |
159 The DFA-class has also an argument for specifying final states. In the |
159 The DFA-class has also an argument for specifying final states. In the |
160 implementation it is not a set of states, as in the mathematical |
160 implementation it is not a set of states, as in the mathematical |
161 definition, but a function from states to booleans (this function is |
161 definition, but a function from states to booleans (this function is |
162 supposed to return true whenever a state is final; false |
162 supposed to return true whenever a state is final; false |
163 otherwise). While this boolean function is different from the sets of |
163 otherwise). While this boolean function is different from the sets of |
164 states, Scala allows us to use sets for such functions (see Line 40 where |
164 states, Scala allows us to use sets for such functions (see Line 41 where |
165 the DFA is initialised). Again it will become clear later on why I use |
165 the DFA is initialised). Again it will become clear later on why I use |
166 functions for final states, rather than sets. |
166 functions for final states, rather than sets. |
167 |
167 |
168 The most important point in the implementation is that I use Scala's |
168 The most important point in the implementation is that I use Scala's |
169 partial functions for representing the transitions; alternatives would |
169 partial functions for representing the transitions; alternatives would |
170 have been \texttt{Maps} or even \texttt{Lists}. One of the main |
170 have been \texttt{Maps} or even \texttt{Lists}. One of the main |
171 advantages of using partial functions is that transitions can be quite |
171 advantages of using partial functions is that transitions can be quite |
172 nicely defined by a series of \texttt{case} statements (see Lines 28 |
172 nicely defined by a series of \texttt{case} statements (see Lines 29 |
173 -- 38 for an example). If you need to represent an automaton with a |
173 -- 39 for an example). If you need to represent an automaton with a |
174 sink state (catch-all-state), you can use Scala's pattern matching and |
174 sink state (catch-all-state), you can use Scala's pattern matching and |
175 write something like |
175 write something like |
176 |
176 |
177 {\small\begin{lstlisting}[language=Scala] |
177 {\small\begin{lstlisting}[language=Scala] |
178 abstract class State |
178 abstract class State |
283 NFAs a set of states; \texttt{fins} is again a function from states to |
283 NFAs a set of states; \texttt{fins} is again a function from states to |
284 booleans. The \texttt{next} function calculates the set of next states |
284 booleans. The \texttt{next} function calculates the set of next states |
285 reachable from a single state \texttt{q} by a character~\texttt{c}. In |
285 reachable from a single state \texttt{q} by a character~\texttt{c}. In |
286 case there is no such state---the partial transition function is |
286 case there is no such state---the partial transition function is |
287 undefined---the empty set is returned (see function |
287 undefined---the empty set is returned (see function |
288 \texttt{applyOrElse} in Lines 9 and 10). The function \texttt{nexts} |
288 \texttt{applyOrElse} in Lines 11 and 12). The function \texttt{nexts} |
289 just lifts this function to sets of states. |
289 just lifts this function to sets of states. |
290 |
290 |
291 \begin{figure}[p] |
291 \begin{figure}[p] |
292 \small |
292 \small |
293 \lstinputlisting[numbers=left]{../progs/display/nfa.scala} |
293 \lstinputlisting[numbers=left,lastline=43]{../progs/automata/nfa.sc} |
294 \caption{A Scala implementation of NFAs using partial functions. |
294 \caption{A Scala implementation of NFAs using partial functions. |
295 Notice that the function \texttt{accepts} implements the |
295 Notice that the function \texttt{accepts} implements the |
296 acceptance of a string in a breadth-first search fashion. This can be a costly |
296 acceptance of a string in a breadth-first search fashion. This can be a costly |
297 way of deciding whether a string is accepted or not in applications that need to handle |
297 way of deciding whether a string is accepted or not in applications that need to handle |
298 large NFAs and large inputs.\label{nfa}} |
298 large NFAs and large inputs.\label{nfa}} |
299 \end{figure} |
299 \end{figure} |
300 |
300 |
301 Look very careful at the \texttt{accepts} and \texttt{deltas} |
301 Look very careful at the \texttt{accepts} and \texttt{deltas} |
302 functions in NFAs and remember that when accepting a string by a NFA |
302 functions in NFAs and remember that when accepting a string by a NFA |
303 we might have to explore all possible transitions (recall which state |
303 we might have to explore all possible transitions (recall which state |
304 to go to is not unique anymore with NFAs\ldots{}we need to explore |
304 to go to is not unique any more with NFAs\ldots{}we need to explore |
305 potentially all next states). The implementation achieves this |
305 potentially all next states). The implementation achieves this |
306 exploration through a \emph{breadth-first search}. This is fine for |
306 exploration through a \emph{breadth-first search}. This is fine for |
307 small NFAs, but can lead to real memory problems when the NFAs are |
307 small NFAs, but can lead to real memory problems when the NFAs are |
308 bigger and larger strings need to be processed. As result, some |
308 bigger and larger strings need to be processed. As result, some |
309 regular expression matching engines resort to a \emph{depth-first |
309 regular expression matching engines resort to a \emph{depth-first |
334 |
334 |
335 \subsection*{Epsilon NFAs} |
335 \subsection*{Epsilon NFAs} |
336 |
336 |
337 In order to get an idea what calculations are performed by Java \& |
337 In order to get an idea what calculations are performed by Java \& |
338 friends, we need a method for transforming a regular expression into |
338 friends, we need a method for transforming a regular expression into |
339 an automaton. This automaton should accept exactly those strings that |
339 a corresponding automaton. This automaton should accept exactly those strings that |
340 are accepted by the regular expression. The simplest and most |
340 are accepted by the regular expression. The simplest and most |
341 well-known method for this is called \emph{Thompson Construction}, |
341 well-known method for this is called the \emph{Thompson Construction}, |
342 after the Turing Award winner Ken Thompson. This method is by |
342 after the Turing Award winner Ken Thompson. This method is by |
343 recursion over regular expressions and depends on the non-determinism |
343 recursion over regular expressions and depends on the non-determinism |
344 in NFAs described in the previous section. You will see shortly why |
344 in NFAs described in the previous section. You will see shortly why |
345 this construction works well with NFAs, but is not so straightforward |
345 this construction works well with NFAs, but is not so straightforward |
346 with DFAs. |
346 with DFAs. |
407 unfortunately. If we were to follow the many textbooks on the |
407 unfortunately. If we were to follow the many textbooks on the |
408 subject, we would now start with defining what $\epsilon$NFAs |
408 subject, we would now start with defining what $\epsilon$NFAs |
409 are---that would require extending the transition relation of |
409 are---that would require extending the transition relation of |
410 NFAs. Next, we would show that the $\epsilon$NFAs are equivalent to |
410 NFAs. Next, we would show that the $\epsilon$NFAs are equivalent to |
411 NFAs and so on. Once we have done all this on paper, we would need to |
411 NFAs and so on. Once we have done all this on paper, we would need to |
412 implement $\epsilon$NFAs. Lets try to take a shortcut instead. We are |
412 implement $\epsilon$NFAs. Let's try to take a shortcut instead. We are |
413 not really interested in $\epsilon$NFAs; they are only a convenient |
413 not really interested in $\epsilon$NFAs; they are only a convenient |
414 tool for translating regular expressions into automata. So we are not |
414 tool for translating regular expressions into automata. So we are not |
415 going to implementing them explicitly, but translate them immediately |
415 going to implementing them explicitly, but translate them immediately |
416 into NFAs (in a sense $\epsilon$NFAs are just a convenient API for |
416 into NFAs (in a sense $\epsilon$NFAs are just a convenient API for |
417 lazy people ;o). How does the translation work? Well we have to find |
417 lazy people ;o). How does this translation work? Well we have to find |
418 all transitions of the form |
418 all transitions of the form |
419 |
419 |
420 \[ |
420 \[ |
421 q\stackrel{\epsilon}{\longrightarrow}\ldots\stackrel{\epsilon}{\longrightarrow} |
421 q\stackrel{\epsilon}{\longrightarrow}\ldots\stackrel{\epsilon}{\longrightarrow} |
422 \;\stackrel{a}{\longrightarrow}\; |
422 \;\stackrel{a}{\longrightarrow}\; |
423 \stackrel{\epsilon}{\longrightarrow}\ldots\stackrel{\epsilon}{\longrightarrow} q' |
423 \stackrel{\epsilon}{\longrightarrow}\ldots\stackrel{\epsilon}{\longrightarrow} q' |
424 \] |
424 \] |
425 |
425 |
426 \noindent where somewhere in the ``middle'' is an $a$-transition. We |
426 \noindent where somewhere in the ``middle'' is an $a$-transition (for |
427 replace them with $q \stackrel{a}{\longrightarrow} q'$. Doing this to |
427 a character, say, $a$). We replace them with |
428 the $\epsilon$NFA on the right-hand side above gives the NFA |
428 $q \stackrel{a}{\longrightarrow} q'$. Doing this to the $\epsilon$NFA |
|
429 on the right-hand side above gives the NFA |
429 |
430 |
430 \begin{center} |
431 \begin{center} |
431 \begin{tikzpicture}[>=stealth',very thick, |
432 \begin{tikzpicture}[>=stealth',very thick, |
432 every state/.style={minimum size=0pt,draw=blue!50,very thick,fill=blue!20},] |
433 every state/.style={minimum size=0pt,draw=blue!50,very thick,fill=blue!20},] |
433 \node[state,initial] (r_1) {$R_1$}; |
434 \node[state,initial] (r_1) {$R_1$}; |
447 and verify that I did not forget any transition. |
448 and verify that I did not forget any transition. |
448 |
449 |
449 So in what follows, whenever we are given an $\epsilon$NFA we will |
450 So in what follows, whenever we are given an $\epsilon$NFA we will |
450 replace it by an equivalent NFA. The Scala code for this translation |
451 replace it by an equivalent NFA. The Scala code for this translation |
451 is given in Figure~\ref{enfa}. The main workhorse in this code is a |
452 is given in Figure~\ref{enfa}. The main workhorse in this code is a |
452 function that calculates a fixpoint of function (Lines 5--10). This |
453 function that calculates a fixpoint of function (Lines 6--12). This |
453 function is used for ``discovering'' which states are reachable by |
454 function is used for ``discovering'' which states are reachable by |
454 $\epsilon$-transitions. Once no new state is discovered, a fixpoint is |
455 $\epsilon$-transitions. Once no new state is discovered, a fixpoint is |
455 reached. This is used for example when calculating the starting states |
456 reached. This is used for example when calculating the starting states |
456 of an equivalent NFA (see Line 36): we start with all starting states |
457 of an equivalent NFA (see Line 28): we start with all starting states |
457 of the $\epsilon$NFA and then look for all additional states that can |
458 of the $\epsilon$NFA and then look for all additional states that can |
458 be reached by $\epsilon$-transitions. We keep on doing this until no |
459 be reached by $\epsilon$-transitions. We keep on doing this until no |
459 new state can be reached. This is what the $\epsilon$-closure, named |
460 new state can be reached. This is what the $\epsilon$-closure, named |
460 in the code \texttt{ecl}, calculates. Similarly, an accepting state of |
461 in the code \texttt{ecl}, calculates. Similarly, an accepting state of |
461 the NFA is when we can reach an accepting state of the $\epsilon$NFA |
462 the NFA is when we can reach an accepting state of the $\epsilon$NFA |
462 by $\epsilon$-transitions. |
463 by $\epsilon$-transitions. |
463 |
464 |
464 |
465 |
465 \begin{figure}[p] |
466 \begin{figure}[p] |
466 \small |
467 \small |
467 \lstinputlisting[numbers=left]{../progs/display/enfa.scala} |
468 \lstinputlisting[numbers=left,lastline=43]{../progs/automata/enfa.sc} |
468 |
469 |
469 \caption{A Scala function that translates $\epsilon$NFA into NFAs. The |
470 \caption{A Scala function that translates $\epsilon$NFA into NFAs. The |
470 transition function of $\epsilon$NFA takes as input an \texttt{Option[C]}. |
471 transition function of $\epsilon$NFA takes as input an \texttt{Option[C]}. |
471 \texttt{None} stands for an $\epsilon$-transition; \texttt{Some(c)} |
472 \texttt{None} stands for an $\epsilon$-transition; \texttt{Some(c)} |
472 for a ``proper'' transition consuming a character. The functions in |
473 for a ``proper'' transition consuming a character. The functions in |
473 Lines 18--26 calculate |
474 Lines 19--24 calculate |
474 all states reachable by one or more $\epsilon$-transition for a given |
475 all states reachable by one or more $\epsilon$-transition for a given |
475 set of states. The NFA is constructed in Lines 36--38. |
476 set of states. The NFA is constructed in Lines 30--34. |
476 Note the interesting commands in Lines 5 and 6: their purpose is |
477 Note the interesting commands in Lines 7 and 8: their purpose is |
477 to ensure that \texttt{fixpT} is the tail-recursive version of |
478 to ensure that \texttt{fixpT} is the tail-recursive version of |
478 the fixpoint construction; otherwise we would quickly get a |
479 the fixpoint construction; otherwise we would quickly get a |
479 stack-overflow exception, even on small examples, due to limitations |
480 stack-overflow exception, even on small examples, due to limitations |
480 of the JVM. |
481 of the JVM. |
481 \label{enfa}} |
482 \label{enfa}} |
544 increases this counter whenever a new state is created.\footnote{You might |
545 increases this counter whenever a new state is created.\footnote{You might |
545 have to read up what \emph{companion objects} do in Scala.} |
546 have to read up what \emph{companion objects} do in Scala.} |
546 |
547 |
547 \begin{figure}[p] |
548 \begin{figure}[p] |
548 \small |
549 \small |
549 \lstinputlisting[numbers=left]{../progs/display/thompson1.scala} |
550 \lstinputlisting[numbers=left,linerange={1-20}]{../progs/automata/thompson.sc} |
550 \caption{The first part of the Thompson Construction. Lines 7--16 |
551 \hspace{5mm}\texttt{\dots} |
|
552 \lstinputlisting[numbers=left,linerange={28-45},firstnumber=28]{../progs/automata/thompson.sc} |
|
553 \caption{The first part of the Thompson Construction. Lines 10--19 |
551 implement a way of how to create new states that are all |
554 implement a way of how to create new states that are all |
552 distinct by virtue of a counter. This counter is |
555 distinct by virtue of a counter. This counter is |
553 increased in the companion object of \texttt{TState} |
556 increased in the companion object of \texttt{TState} |
554 whenever a new state is created. The code in Lines 24--40 |
557 whenever a new state is created. The code in Lines 38--45 |
555 constructs NFAs for the simple regular expressions $\ZERO$, $\ONE$ and $c$. |
558 constructs NFAs for the simple regular expressions $\ZERO$, $\ONE$ and $c$. |
556 Compare this code with the pictures given in \eqref{simplecases} on |
559 Compare this code with the pictures given in \eqref{simplecases} on |
557 Page~\pageref{simplecases}. |
560 Page~\pageref{simplecases}. |
558 \label{thompson1}} |
561 \label{thompson1}} |
559 \end{figure} |
562 \end{figure} |
560 |
563 |
561 \begin{figure}[p] |
564 \begin{figure}[p] |
562 \small |
565 \small |
563 \lstinputlisting[numbers=left]{../progs/display/thompson2.scala} |
566 \lstinputlisting[numbers=left,firstline=48,firstnumber=48,lastline=85]{../progs/automata/thompson.sc} |
564 \caption{The second part of the Thompson Construction implementing |
567 \caption{The second part of the Thompson Construction implementing |
565 the composition of NFAs according to $\cdot$, $+$ and ${}^*$. |
568 the composition of NFAs according to $\cdot$, $+$ and ${}^*$. |
566 The implicit class about rich partial functions |
569 The implicit class (Lines 48--54) about rich partial functions |
567 implements the infix operation \texttt{+++} which |
570 implements the infix operation \texttt{+++} which |
568 combines an $\epsilon$NFA transition with a NFA transition |
571 combines an $\epsilon$NFA transition with an NFA transition |
569 (both are given as partial functions---but with different type!).\label{thompson2}} |
572 (both are given as partial functions---but with different type!).\label{thompson2}} |
570 \end{figure} |
573 \end{figure} |
571 |
574 |
572 The case for the sequence regular expression $r_1 \cdot r_2$ is a bit more |
575 The case for the sequence regular expression $r_1 \cdot r_2$ is a bit more |
573 complicated: Say, we are given by recursion two NFAs representing the regular |
576 complicated: Say, we are given by recursion two NFAs representing the regular |
649 \noindent The idea behind this construction is that the start of any |
652 \noindent The idea behind this construction is that the start of any |
650 string is first recognised by the first NFA, then we silently change |
653 string is first recognised by the first NFA, then we silently change |
651 to the second NFA; the ending of the string is recognised by the |
654 to the second NFA; the ending of the string is recognised by the |
652 second NFA...just like matching of a string by the regular expression |
655 second NFA...just like matching of a string by the regular expression |
653 $r_1\cdot r_2$. The Scala code for this construction is given in |
656 $r_1\cdot r_2$. The Scala code for this construction is given in |
654 Figure~\ref{thompson2} in Lines 16--23. The starting states of the |
657 Figure~\ref{thompson2} in Lines 57--65. The starting states of the |
655 $\epsilon$NFA are the starting states of the first NFA (corresponding |
658 $\epsilon$NFA are the starting states of the first NFA (corresponding |
656 to $r_1$); the accepting function is the accepting function of the |
659 to $r_1$); the accepting function is the accepting function of the |
657 second NFA (corresponding to $r_2$). The new transition function is |
660 second NFA (corresponding to $r_2$). The new transition function is |
658 all the ``old'' transitions plus the $\epsilon$-transitions connecting |
661 all the ``old'' transitions plus the $\epsilon$-transitions connecting |
659 the accepting states of the first NFA to the starting states of the |
662 the accepting states of the first NFA to the starting states of the |
660 first NFA (Lines 18 and 19). The $\epsilon$NFA is then immediately |
663 second NFA (Lines 59 and 60). The $\epsilon$NFA is then immediately |
661 translated in a NFA. |
664 translated in a NFA. |
662 |
665 |
663 |
666 |
664 The case for the alternative regular expression $r_1 + r_2$ is |
667 The case for the alternative regular expression $r_1 + r_2$ is |
665 slightly different: We are given by recursion two NFAs representing |
668 slightly different: We are given by recursion two NFAs representing |
666 $r_1$ and $r_2$ respectively. Each NFA has some starting states and |
669 $r_1$ and $r_2$ respectively. Each NFA has some starting states and |
667 some accepting states. We obtain a NFA for the regular expression $r_1 |
670 some accepting states. We obtain a NFA for the regular expression |
668 + r_2$ by composing the transition functions (this crucially depends |
671 $r_1 + r_2$ by composing the transition functions (this crucially |
669 on knowing that the states of each component NFA are distinct---recall |
672 depends on knowing that the states of each component NFA are |
670 we implemented for this to hold some bespoke code for states). We also |
673 distinct---recall we implemented for this to hold by some bespoke code |
671 need to combine the starting states and accepting functions |
674 for \texttt{TState}s). We also need to combine the starting states and |
672 appropriately. |
675 accepting functions appropriately. |
673 |
676 |
674 \begin{center} |
677 \begin{center} |
675 \begin{tabular}[t]{ccc} |
678 \begin{tabular}[t]{ccc} |
676 \begin{tikzpicture}[node distance=3mm, |
679 \begin{tikzpicture}[node distance=3mm, |
677 >=stealth',very thick, |
680 >=stealth',very thick, |
786 \end{tikzpicture} |
789 \end{tikzpicture} |
787 \end{tabular} |
790 \end{tabular} |
788 \end{center} |
791 \end{center} |
789 |
792 |
790 \noindent |
793 \noindent |
791 The corresponding code is in Figure~\ref{thompson2} in Lines 35--43) |
794 The corresponding code is in Figure~\ref{thompson2} in Lines 77--85) |
792 |
795 |
793 To sum up, you can see in the sequence and star cases the need of |
796 To sum up, you can see in the sequence and star cases the need for |
794 having silent $\epsilon$-transitions. Similarly the alternative case |
797 silent $\epsilon$-transitions. Otherwise this construction just |
795 shows the need of the NFA-nondeterminism. It seems awkward to form the |
798 becomes awkward. Similarly the alternative case shows the need of the |
796 `alternative' composition of two DFAs, because DFA do not allow |
799 NFA-nondeterminism. It looks non-obvious to form the `alternative' |
797 several starting and successor states. All these constructions can now |
800 composition of two DFAs, because DFA do not allow several starting and |
798 be put together in the following recursive function: |
801 successor states. All these constructions can now be put together in |
|
802 the following recursive function: |
799 |
803 |
800 |
804 |
801 {\small\begin{lstlisting}[language=Scala] |
805 {\small\begin{lstlisting}[language=Scala] |
802 def thompson(r: Rexp) : NFAt = r match { |
806 def thompson(r: Rexp) : NFAt = r match { |
803 case ZERO => NFA_ZERO() |
807 case ZERO => NFA_ZERO() |
811 |
815 |
812 \noindent |
816 \noindent |
813 It calculates a NFA from a regular expressions. At last we can run |
817 It calculates a NFA from a regular expressions. At last we can run |
814 NFAs for the our evil regular expression examples. The graph on the |
818 NFAs for the our evil regular expression examples. The graph on the |
815 left shows that when translating a regular expression such as |
819 left shows that when translating a regular expression such as |
816 $a^{\{n\}}$ into a NFA, the size can blow up and then even the |
820 $a^{?\{n\}} \cdot a^{\{n\}}$ into a NFA, the size can blow up and then |
817 relative fast (on small examples) breadth-first search can be |
821 even the relative fast (on small examples) breadth-first search can be |
818 slow. The graph on the right shows that with `evil' regular |
822 slow\ldots the red line maxes out at about 15 $n$s. |
819 expressions the depth-first search can be abysmally slow. Even if the |
823 |
820 graphs not completely overlap with the curves of Python, Ruby and |
824 |
821 Java, they are similar enough. OK\ldots now you know why regular |
825 The graph on the right shows that with `evil' regular expressions also |
822 expression matchers in those languages are so slow. |
826 the depth-first search can be abysmally slow. Even if the graphs not |
|
827 completely overlap with the curves of Python, Ruby and Java, they are |
|
828 similar enough. |
823 |
829 |
824 |
830 |
825 \begin{center} |
831 \begin{center} |
826 \begin{tabular}{@{\hspace{-1mm}}c@{\hspace{1mm}}c@{}} |
832 \begin{tabular}{@{\hspace{-1mm}}c@{\hspace{1mm}}c@{}} |
827 \begin{tikzpicture} |
833 \begin{tikzpicture} |
880 ytick={0,5,...,30}, |
886 ytick={0,5,...,30}, |
881 scaled ticks=false, |
887 scaled ticks=false, |
882 axis lines=left, |
888 axis lines=left, |
883 width=5.5cm, |
889 width=5.5cm, |
884 height=4cm, |
890 height=4cm, |
885 legend entries={Python, Java, depth-first NFA}, |
891 legend entries={Python, Java 8, depth-first NFA}, |
886 legend style={at={(0.5,-0.25)},anchor=north,font=\small}, |
892 legend style={at={(0.5,-0.25)},anchor=north,font=\small}, |
887 legend cell align=left] |
893 legend cell align=left] |
888 \addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data}; |
894 \addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data}; |
889 \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data}; |
895 \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data}; |
890 % depth-first search in NFAs |
896 % depth-first search in NFAs |
929 a single state. So with DFAs we do not have to search at all. What is |
937 a single state. So with DFAs we do not have to search at all. What is |
930 perhaps interesting is the fact that for every NFA we can find a DFA |
938 perhaps interesting is the fact that for every NFA we can find a DFA |
931 that also recognises the same language. This might sound a bit |
939 that also recognises the same language. This might sound a bit |
932 paradoxical: NFA $\rightarrow$ decision of acceptance hard; DFA |
940 paradoxical: NFA $\rightarrow$ decision of acceptance hard; DFA |
933 $\rightarrow$ decision easy. But this \emph{is} true\ldots but of course |
941 $\rightarrow$ decision easy. But this \emph{is} true\ldots but of course |
934 there is always a caveat---nothing ever is for free in life. |
942 there is always a caveat---nothing ever is for free in life. Let's see what this |
|
943 actually means. |
935 |
944 |
936 There are actually a number of methods for transforming a NFA into |
945 There are actually a number of methods for transforming a NFA into |
937 an equivalent DFA, but the most famous one is the \emph{subset |
946 an equivalent DFA, but the most famous one is the \emph{subset |
938 construction}. Consider the following NFA where the states are |
947 construction}. Consider the following NFA where the states are |
939 labelled with $0$, $1$ and $2$. |
948 labelled with $0$, $1$ and $2$. |
1194 accepting ($Q_4$) and a state that is non-accepting (the other |
1203 accepting ($Q_4$) and a state that is non-accepting (the other |
1195 states). |
1204 states). |
1196 |
1205 |
1197 In Step 3 we need to fill in more stars according whether |
1206 In Step 3 we need to fill in more stars according whether |
1198 one of the next-state pairs are marked. We have to do this |
1207 one of the next-state pairs are marked. We have to do this |
1199 for every unmarked field until there is no change anymore. |
1208 for every unmarked field until there is no change any more. |
1200 This gives the triangle |
1209 This gives the triangle |
1201 |
1210 |
1202 \begin{center} |
1211 \begin{center} |
1203 \begin{tikzpicture}[scale=0.6,line width=0.8mm] |
1212 \begin{tikzpicture}[scale=0.6,line width=0.8mm] |
1204 \draw (0,0) -- (4,0); |
1213 \draw (0,0) -- (4,0); |
1253 \path[->] (Q_4) edge [loop above] node {$a, b$} (); |
1262 \path[->] (Q_4) edge [loop above] node {$a, b$} (); |
1254 \end{tikzpicture} |
1263 \end{tikzpicture} |
1255 \end{center} |
1264 \end{center} |
1256 |
1265 |
1257 |
1266 |
|
1267 \noindent This minimised DFA is certainly fast when it comes deciding |
|
1268 whether a string is accepted or not. But this is not universally the |
|
1269 case. Suppose you count the nodes in a regular expression (when |
|
1270 represented as tree). If you look carefully at the Thompson |
|
1271 Construction you can see that the constructed NFA has states that grow |
|
1272 linearly in terms of the size of the regular expression. This is good, |
|
1273 but as we have seen earlier deciding whether a string is matched by an |
|
1274 NFA is hard. Translating an NFA into a DFA means deciding whether a |
|
1275 string is matched by a DFA is easy, but the number of states can grow |
|
1276 exponentially, even after minimisation. Say a NFA has $n$ states, then |
|
1277 in the worst case the corresponding minimal DFA that can match the |
|
1278 same language as the NFA might contain $2^n$ of states. Unfortunately |
|
1279 in many interesting cases this worst case bound is the dominant |
|
1280 factor. |
|
1281 |
|
1282 |
1258 By the way, we are not bothering with implementing the above |
1283 By the way, we are not bothering with implementing the above |
1259 minimisation algorithm: while up to now all the transformations used |
1284 minimisation algorithm: while up to now all the transformations used |
1260 some clever composition of functions, the minimisation algorithm |
1285 some clever composition of functions, the minimisation algorithm |
1261 cannot be implemented by just composing some functions. For this we |
1286 cannot be implemented by just composing some functions. For this we |
1262 would require a more concrete representation of the transition |
1287 would require a more concrete representation of the transition |
1263 function (like maps). If we did this, however, then many advantages of |
1288 function (like maps). If we did this, however, then many advantages of |
1264 the functions would be thrown away. So the compromise is to not being |
1289 the functions would be thrown away. So the compromise is to not being |
1265 able to minimise (easily) our DFAs. |
1290 able to minimise (easily) our DFAs. We want to use regular expressions |
|
1291 directly anyway. |
1266 |
1292 |
1267 \subsection*{Brzozowski's Method} |
1293 \subsection*{Brzozowski's Method} |
1268 |
1294 |
1269 I know this handout is already a long, long rant: but after all it is |
1295 I know this handout is already a long, long rant: but after all it is |
1270 a topic that has been researched for more than 60 years. If you |
1296 a topic that has been researched for more than 60 years. If you |