handouts/ho03.tex
changeset 753 d94fdbef1a4f
parent 698 7c7854feccb5
child 764 6718ef6143b8
equal deleted inserted replaced
752:c0bdd4ad69ca 753:d94fdbef1a4f
     4 \usepackage{../langs}
     4 \usepackage{../langs}
     5 \usepackage{../graphics}
     5 \usepackage{../graphics}
     6 
     6 
     7 
     7 
     8 \begin{document}
     8 \begin{document}
     9 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017}
     9 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017, 2020}
    10 
    10 
    11 \section*{Handout 3 (Finite Automata)}
    11 \section*{Handout 3 (Finite Automata)}
    12 
    12 
    13 
    13 
    14 Every formal language and compiler course I know of bombards you first
    14 Every formal language and compiler course I know of bombards you first
   129 \noindent I let you think about a definition that describes the set of
   129 \noindent I let you think about a definition that describes the set of
   130 all strings accepted by a deterministic finite automaton.
   130 all strings accepted by a deterministic finite automaton.
   131 
   131 
   132 \begin{figure}[p]
   132 \begin{figure}[p]
   133 \small
   133 \small
   134 \lstinputlisting[numbers=left]{../progs/display/dfa.scala}
   134 \lstinputlisting[numbers=left,lastline=43]{../progs/automata/dfa.sc}
   135 \caption{An implementation of DFAs in Scala using partial functions.
   135 \caption{An implementation of DFAs in Scala using partial functions.
   136   Note some subtleties: \texttt{deltas} implements the delta-hat
   136   Note some subtleties: \texttt{deltas} implements the delta-hat
   137   construction by lifting the (partial) transition  function to lists
   137   construction by lifting the (partial) transition  function to lists
   138   of characters. Since \texttt{delta} is given as a partial function,
   138   of characters. Since \texttt{delta} is given as a partial function,
   139   it can obviously go ``wrong'' in which case the \texttt{Try} in
   139   it can obviously go ``wrong'' in which case the \texttt{Try} in
   140   \texttt{accepts} catches the error and returns \texttt{false}---that
   140   \texttt{accepts} catches the error and returns \texttt{false}---that
   141   means the string is not accepted.  The example \texttt{delta} in
   141   means the string is not accepted.  The example \texttt{delta} in
   142   Line 28--38 implements the DFA example shown earlier in the
   142   Line 22--43 implements the DFA example shown earlier in the
   143   handout.\label{dfa}}
   143   handout.\label{dfa}}
   144 \end{figure}
   144 \end{figure}
   145 
   145 
   146 My take on an implementation of DFAs in Scala is given in
   146 My take on an implementation of DFAs in Scala is given in
   147 Figure~\ref{dfa}. As you can see, there are many features of the
   147 Figure~\ref{dfa}. As you can see, there are many features of the
   159 The DFA-class has also an argument for specifying final states. In the
   159 The DFA-class has also an argument for specifying final states. In the
   160 implementation it is not a set of states, as in the mathematical
   160 implementation it is not a set of states, as in the mathematical
   161 definition, but a function from states to booleans (this function is
   161 definition, but a function from states to booleans (this function is
   162 supposed to return true whenever a state is final; false
   162 supposed to return true whenever a state is final; false
   163 otherwise). While this boolean function is different from the sets of
   163 otherwise). While this boolean function is different from the sets of
   164 states, Scala allows us to use sets for such functions (see Line 40 where
   164 states, Scala allows us to use sets for such functions (see Line 41 where
   165 the DFA is initialised). Again it will become clear later on why I use
   165 the DFA is initialised). Again it will become clear later on why I use
   166 functions for final states, rather than sets.
   166 functions for final states, rather than sets.
   167 
   167 
   168 The most important point in the implementation is that I use Scala's
   168 The most important point in the implementation is that I use Scala's
   169 partial functions for representing the transitions; alternatives would
   169 partial functions for representing the transitions; alternatives would
   170 have been \texttt{Maps} or even \texttt{Lists}. One of the main
   170 have been \texttt{Maps} or even \texttt{Lists}. One of the main
   171 advantages of using partial functions is that transitions can be quite
   171 advantages of using partial functions is that transitions can be quite
   172 nicely defined by a series of \texttt{case} statements (see Lines 28
   172 nicely defined by a series of \texttt{case} statements (see Lines 29
   173 -- 38 for an example). If you need to represent an automaton with a
   173 -- 39 for an example). If you need to represent an automaton with a
   174 sink state (catch-all-state), you can use Scala's pattern matching and
   174 sink state (catch-all-state), you can use Scala's pattern matching and
   175 write something like
   175 write something like
   176 
   176 
   177 {\small\begin{lstlisting}[language=Scala]
   177 {\small\begin{lstlisting}[language=Scala]
   178 abstract class State
   178 abstract class State
   283 NFAs a set of states; \texttt{fins} is again a function from states to
   283 NFAs a set of states; \texttt{fins} is again a function from states to
   284 booleans. The \texttt{next} function calculates the set of next states
   284 booleans. The \texttt{next} function calculates the set of next states
   285 reachable from a single state \texttt{q} by a character~\texttt{c}. In
   285 reachable from a single state \texttt{q} by a character~\texttt{c}. In
   286 case there is no such state---the partial transition function is
   286 case there is no such state---the partial transition function is
   287 undefined---the empty set is returned (see function
   287 undefined---the empty set is returned (see function
   288 \texttt{applyOrElse} in Lines 9 and 10). The function \texttt{nexts}
   288 \texttt{applyOrElse} in Lines 11 and 12). The function \texttt{nexts}
   289 just lifts this function to sets of states.
   289 just lifts this function to sets of states.
   290  
   290  
   291 \begin{figure}[p]
   291 \begin{figure}[p]
   292 \small
   292 \small
   293 \lstinputlisting[numbers=left]{../progs/display/nfa.scala}
   293 \lstinputlisting[numbers=left,lastline=43]{../progs/automata/nfa.sc}
   294 \caption{A Scala implementation of NFAs using partial functions.
   294 \caption{A Scala implementation of NFAs using partial functions.
   295   Notice that the function \texttt{accepts} implements the
   295   Notice that the function \texttt{accepts} implements the
   296   acceptance of a string in a breadth-first search fashion. This can be a costly
   296   acceptance of a string in a breadth-first search fashion. This can be a costly
   297   way of deciding whether a string is accepted or not in applications that need to handle
   297   way of deciding whether a string is accepted or not in applications that need to handle
   298   large NFAs and large inputs.\label{nfa}}
   298   large NFAs and large inputs.\label{nfa}}
   299 \end{figure}
   299 \end{figure}
   300 
   300 
   301 Look very careful at the \texttt{accepts} and \texttt{deltas}
   301 Look very careful at the \texttt{accepts} and \texttt{deltas}
   302 functions in NFAs and remember that when accepting a string by a NFA
   302 functions in NFAs and remember that when accepting a string by a NFA
   303 we might have to explore all possible transitions (recall which state
   303 we might have to explore all possible transitions (recall which state
   304 to go to is not unique anymore with NFAs\ldots{}we need to explore
   304 to go to is not unique any more with NFAs\ldots{}we need to explore
   305 potentially all next states). The implementation achieves this
   305 potentially all next states). The implementation achieves this
   306 exploration through a \emph{breadth-first search}. This is fine for
   306 exploration through a \emph{breadth-first search}. This is fine for
   307 small NFAs, but can lead to real memory problems when the NFAs are
   307 small NFAs, but can lead to real memory problems when the NFAs are
   308 bigger and larger strings need to be processed. As result, some
   308 bigger and larger strings need to be processed. As result, some
   309 regular expression matching engines resort to a \emph{depth-first
   309 regular expression matching engines resort to a \emph{depth-first
   334 
   334 
   335 \subsection*{Epsilon NFAs}
   335 \subsection*{Epsilon NFAs}
   336 
   336 
   337 In order to get an idea what calculations are performed by Java \&
   337 In order to get an idea what calculations are performed by Java \&
   338 friends, we need a method for transforming a regular expression into
   338 friends, we need a method for transforming a regular expression into
   339 an automaton. This automaton should accept exactly those strings that
   339 a corresponding automaton. This automaton should accept exactly those strings that
   340 are accepted by the regular expression.  The simplest and most
   340 are accepted by the regular expression.  The simplest and most
   341 well-known method for this is called \emph{Thompson Construction},
   341 well-known method for this is called the \emph{Thompson Construction},
   342 after the Turing Award winner Ken Thompson. This method is by
   342 after the Turing Award winner Ken Thompson. This method is by
   343 recursion over regular expressions and depends on the non-determinism
   343 recursion over regular expressions and depends on the non-determinism
   344 in NFAs described in the previous section. You will see shortly why
   344 in NFAs described in the previous section. You will see shortly why
   345 this construction works well with NFAs, but is not so straightforward
   345 this construction works well with NFAs, but is not so straightforward
   346 with DFAs.
   346 with DFAs.
   407 unfortunately.  If we were to follow the many textbooks on the
   407 unfortunately.  If we were to follow the many textbooks on the
   408 subject, we would now start with defining what $\epsilon$NFAs
   408 subject, we would now start with defining what $\epsilon$NFAs
   409 are---that would require extending the transition relation of
   409 are---that would require extending the transition relation of
   410 NFAs. Next, we would show that the $\epsilon$NFAs are equivalent to
   410 NFAs. Next, we would show that the $\epsilon$NFAs are equivalent to
   411 NFAs and so on. Once we have done all this on paper, we would need to
   411 NFAs and so on. Once we have done all this on paper, we would need to
   412 implement $\epsilon$NFAs. Lets try to take a shortcut instead. We are
   412 implement $\epsilon$NFAs. Let's try to take a shortcut instead. We are
   413 not really interested in $\epsilon$NFAs; they are only a convenient
   413 not really interested in $\epsilon$NFAs; they are only a convenient
   414 tool for translating regular expressions into automata. So we are not
   414 tool for translating regular expressions into automata. So we are not
   415 going to implementing them explicitly, but translate them immediately
   415 going to implementing them explicitly, but translate them immediately
   416 into NFAs (in a sense $\epsilon$NFAs are just a convenient API for
   416 into NFAs (in a sense $\epsilon$NFAs are just a convenient API for
   417 lazy people ;o).  How does the translation work? Well we have to find
   417 lazy people ;o).  How does this translation work? Well we have to find
   418 all transitions of the form
   418 all transitions of the form
   419 
   419 
   420 \[
   420 \[
   421 q\stackrel{\epsilon}{\longrightarrow}\ldots\stackrel{\epsilon}{\longrightarrow}
   421 q\stackrel{\epsilon}{\longrightarrow}\ldots\stackrel{\epsilon}{\longrightarrow}
   422 \;\stackrel{a}{\longrightarrow}\;
   422 \;\stackrel{a}{\longrightarrow}\;
   423 \stackrel{\epsilon}{\longrightarrow}\ldots\stackrel{\epsilon}{\longrightarrow} q'
   423 \stackrel{\epsilon}{\longrightarrow}\ldots\stackrel{\epsilon}{\longrightarrow} q'
   424 \]
   424 \]
   425 
   425 
   426 \noindent where somewhere in the ``middle'' is an $a$-transition. We
   426 \noindent where somewhere in the ``middle'' is an $a$-transition (for
   427 replace them with $q \stackrel{a}{\longrightarrow} q'$. Doing this to
   427 a character, say, $a$). We replace them with
   428 the $\epsilon$NFA on the right-hand side above gives the NFA
   428 $q \stackrel{a}{\longrightarrow} q'$. Doing this to the $\epsilon$NFA
       
   429 on the right-hand side above gives the NFA
   429 
   430 
   430 \begin{center}
   431 \begin{center}
   431 \begin{tikzpicture}[>=stealth',very thick,
   432 \begin{tikzpicture}[>=stealth',very thick,
   432     every state/.style={minimum size=0pt,draw=blue!50,very thick,fill=blue!20},]
   433     every state/.style={minimum size=0pt,draw=blue!50,very thick,fill=blue!20},]
   433 \node[state,initial]  (r_1)  {$R_1$};
   434 \node[state,initial]  (r_1)  {$R_1$};
   447 and verify that I did not forget any transition.
   448 and verify that I did not forget any transition.
   448 
   449 
   449 So in what follows, whenever we are given an $\epsilon$NFA we will
   450 So in what follows, whenever we are given an $\epsilon$NFA we will
   450 replace it by an equivalent NFA. The Scala code for this translation
   451 replace it by an equivalent NFA. The Scala code for this translation
   451 is given in Figure~\ref{enfa}. The main workhorse in this code is a
   452 is given in Figure~\ref{enfa}. The main workhorse in this code is a
   452 function that calculates a fixpoint of function (Lines 5--10). This
   453 function that calculates a fixpoint of function (Lines 6--12). This
   453 function is used for ``discovering'' which states are reachable by
   454 function is used for ``discovering'' which states are reachable by
   454 $\epsilon$-transitions. Once no new state is discovered, a fixpoint is
   455 $\epsilon$-transitions. Once no new state is discovered, a fixpoint is
   455 reached. This is used for example when calculating the starting states
   456 reached. This is used for example when calculating the starting states
   456 of an equivalent NFA (see Line 36): we start with all starting states
   457 of an equivalent NFA (see Line 28): we start with all starting states
   457 of the $\epsilon$NFA and then look for all additional states that can
   458 of the $\epsilon$NFA and then look for all additional states that can
   458 be reached by $\epsilon$-transitions. We keep on doing this until no
   459 be reached by $\epsilon$-transitions. We keep on doing this until no
   459 new state can be reached. This is what the $\epsilon$-closure, named
   460 new state can be reached. This is what the $\epsilon$-closure, named
   460 in the code \texttt{ecl}, calculates. Similarly, an accepting state of
   461 in the code \texttt{ecl}, calculates. Similarly, an accepting state of
   461 the NFA is when we can reach an accepting state of the $\epsilon$NFA
   462 the NFA is when we can reach an accepting state of the $\epsilon$NFA
   462 by $\epsilon$-transitions.
   463 by $\epsilon$-transitions.
   463 
   464 
   464 
   465 
   465 \begin{figure}[p]
   466 \begin{figure}[p]
   466 \small
   467 \small
   467 \lstinputlisting[numbers=left]{../progs/display/enfa.scala}
   468 \lstinputlisting[numbers=left,lastline=43]{../progs/automata/enfa.sc}
   468 
   469 
   469 \caption{A Scala function that translates $\epsilon$NFA into NFAs. The
   470 \caption{A Scala function that translates $\epsilon$NFA into NFAs. The
   470   transition function of $\epsilon$NFA takes as input an \texttt{Option[C]}.
   471   transition function of $\epsilon$NFA takes as input an \texttt{Option[C]}.
   471   \texttt{None} stands for an $\epsilon$-transition; \texttt{Some(c)}
   472   \texttt{None} stands for an $\epsilon$-transition; \texttt{Some(c)}
   472   for a ``proper'' transition consuming a character. The functions in
   473   for a ``proper'' transition consuming a character. The functions in
   473   Lines 18--26 calculate
   474   Lines 19--24 calculate
   474   all states reachable by one or more $\epsilon$-transition for a given
   475   all states reachable by one or more $\epsilon$-transition for a given
   475   set of states. The NFA is constructed in Lines 36--38.
   476   set of states. The NFA is constructed in Lines 30--34.
   476   Note the interesting commands in Lines 5 and 6: their purpose is
   477   Note the interesting commands in Lines 7 and 8: their purpose is
   477   to ensure that \texttt{fixpT} is the tail-recursive version of
   478   to ensure that \texttt{fixpT} is the tail-recursive version of
   478   the fixpoint construction; otherwise we would quickly get a
   479   the fixpoint construction; otherwise we would quickly get a
   479   stack-overflow exception, even on small examples, due to limitations
   480   stack-overflow exception, even on small examples, due to limitations
   480   of the JVM.
   481   of the JVM.
   481   \label{enfa}}
   482   \label{enfa}}
   544 increases this counter whenever a new state is created.\footnote{You might
   545 increases this counter whenever a new state is created.\footnote{You might
   545   have to read up what \emph{companion objects} do in Scala.}
   546   have to read up what \emph{companion objects} do in Scala.}
   546 
   547 
   547 \begin{figure}[p]
   548 \begin{figure}[p]
   548 \small
   549 \small
   549 \lstinputlisting[numbers=left]{../progs/display/thompson1.scala}
   550 \lstinputlisting[numbers=left,linerange={1-20}]{../progs/automata/thompson.sc}
   550 \caption{The first part of the Thompson Construction. Lines 7--16
   551 \hspace{5mm}\texttt{\dots}
       
   552 \lstinputlisting[numbers=left,linerange={28-45},firstnumber=28]{../progs/automata/thompson.sc}
       
   553 \caption{The first part of the Thompson Construction. Lines 10--19
   551   implement a way of how to create new states that are all
   554   implement a way of how to create new states that are all
   552   distinct by virtue of a counter. This counter is
   555   distinct by virtue of a counter. This counter is
   553   increased in the companion object of \texttt{TState}
   556   increased in the companion object of \texttt{TState}
   554   whenever a new state is created. The code in Lines 24--40
   557   whenever a new state is created. The code in Lines 38--45
   555   constructs NFAs for the simple regular expressions $\ZERO$, $\ONE$ and $c$.
   558   constructs NFAs for the simple regular expressions $\ZERO$, $\ONE$ and $c$.
   556   Compare this code with the pictures given in \eqref{simplecases} on
   559   Compare this code with the pictures given in \eqref{simplecases} on
   557   Page~\pageref{simplecases}.
   560   Page~\pageref{simplecases}.
   558   \label{thompson1}}
   561   \label{thompson1}}
   559 \end{figure}
   562 \end{figure}
   560 
   563 
   561 \begin{figure}[p]
   564 \begin{figure}[p]
   562 \small
   565 \small
   563 \lstinputlisting[numbers=left]{../progs/display/thompson2.scala}
   566 \lstinputlisting[numbers=left,firstline=48,firstnumber=48,lastline=85]{../progs/automata/thompson.sc}
   564 \caption{The second part of the Thompson Construction implementing
   567 \caption{The second part of the Thompson Construction implementing
   565   the composition of NFAs according to $\cdot$, $+$ and ${}^*$.
   568   the composition of NFAs according to $\cdot$, $+$ and ${}^*$.
   566   The implicit class about rich partial functions
   569   The implicit class (Lines 48--54) about rich partial functions
   567   implements the infix operation \texttt{+++} which
   570   implements the infix operation \texttt{+++} which
   568   combines an $\epsilon$NFA transition with a NFA transition
   571   combines an $\epsilon$NFA transition with an NFA transition
   569   (both are given as partial functions---but with different type!).\label{thompson2}}
   572   (both are given as partial functions---but with different type!).\label{thompson2}}
   570 \end{figure}
   573 \end{figure}
   571 
   574 
   572 The case for the sequence regular expression $r_1 \cdot r_2$ is a bit more
   575 The case for the sequence regular expression $r_1 \cdot r_2$ is a bit more
   573 complicated: Say, we are given by recursion two NFAs representing the regular
   576 complicated: Say, we are given by recursion two NFAs representing the regular
   649 \noindent The idea behind this construction is that the start of any
   652 \noindent The idea behind this construction is that the start of any
   650 string is first recognised by the first NFA, then we silently change
   653 string is first recognised by the first NFA, then we silently change
   651 to the second NFA; the ending of the string is recognised by the
   654 to the second NFA; the ending of the string is recognised by the
   652 second NFA...just like matching of a string by the regular expression
   655 second NFA...just like matching of a string by the regular expression
   653 $r_1\cdot r_2$. The Scala code for this construction is given in
   656 $r_1\cdot r_2$. The Scala code for this construction is given in
   654 Figure~\ref{thompson2} in Lines 16--23. The starting states of the
   657 Figure~\ref{thompson2} in Lines 57--65. The starting states of the
   655 $\epsilon$NFA are the starting states of the first NFA (corresponding
   658 $\epsilon$NFA are the starting states of the first NFA (corresponding
   656 to $r_1$); the accepting function is the accepting function of the
   659 to $r_1$); the accepting function is the accepting function of the
   657 second NFA (corresponding to $r_2$). The new transition function is
   660 second NFA (corresponding to $r_2$). The new transition function is
   658 all the ``old'' transitions plus the $\epsilon$-transitions connecting
   661 all the ``old'' transitions plus the $\epsilon$-transitions connecting
   659 the accepting states of the first NFA to the starting states of the
   662 the accepting states of the first NFA to the starting states of the
   660 first NFA (Lines 18 and 19). The $\epsilon$NFA is then immediately
   663 second NFA (Lines 59 and 60). The $\epsilon$NFA is then immediately
   661 translated in a NFA.
   664 translated in a NFA.
   662 
   665 
   663 
   666 
   664 The case for the alternative regular expression $r_1 + r_2$ is
   667 The case for the alternative regular expression $r_1 + r_2$ is
   665 slightly different: We are given by recursion two NFAs representing
   668 slightly different: We are given by recursion two NFAs representing
   666 $r_1$ and $r_2$ respectively. Each NFA has some starting states and
   669 $r_1$ and $r_2$ respectively. Each NFA has some starting states and
   667 some accepting states. We obtain a NFA for the regular expression $r_1
   670 some accepting states. We obtain a NFA for the regular expression
   668 + r_2$ by composing the transition functions (this crucially depends
   671 $r_1 + r_2$ by composing the transition functions (this crucially
   669 on knowing that the states of each component NFA are distinct---recall
   672 depends on knowing that the states of each component NFA are
   670 we implemented for this to hold some bespoke code for states). We also
   673 distinct---recall we implemented for this to hold by some bespoke code
   671 need to combine the starting states and accepting functions
   674 for \texttt{TState}s). We also need to combine the starting states and
   672 appropriately.
   675 accepting functions appropriately.
   673 
   676 
   674 \begin{center}
   677 \begin{center}
   675 \begin{tabular}[t]{ccc}
   678 \begin{tabular}[t]{ccc}
   676 \begin{tikzpicture}[node distance=3mm,
   679 \begin{tikzpicture}[node distance=3mm,
   677     >=stealth',very thick,
   680     >=stealth',very thick,
   732 \end{tikzpicture}
   735 \end{tikzpicture}
   733 \end{tabular}
   736 \end{tabular}
   734 \end{center}
   737 \end{center}
   735 
   738 
   736 \noindent The code for this construction is in Figure~\ref{thompson2}
   739 \noindent The code for this construction is in Figure~\ref{thompson2}
   737 in Lines 25--33.
   740 in Lines 67--75.
   738 
   741 
   739 Finally for the $*$-case we have a NFA for $r$ and connect its
   742 Finally for the $*$-case we have a NFA for $r$ and connect its
   740 accepting states to a new starting state via
   743 accepting states to a new starting state via
   741 $\epsilon$-transitions. This new starting state is also an accepting
   744 $\epsilon$-transitions. This new starting state is also an accepting
   742 state, because $r^*$ can recognise the empty string.
   745 state, because $r^*$ can recognise the empty string.
   786 \end{tikzpicture}    
   789 \end{tikzpicture}    
   787 \end{tabular}
   790 \end{tabular}
   788 \end{center}
   791 \end{center}
   789 
   792 
   790 \noindent 
   793 \noindent 
   791 The corresponding code is in Figure~\ref{thompson2} in Lines 35--43)
   794 The corresponding code is in Figure~\ref{thompson2} in Lines 77--85)
   792 
   795 
   793 To sum up, you can see in the sequence and star cases the need of
   796 To sum up, you can see in the sequence and star cases the need for
   794 having silent $\epsilon$-transitions. Similarly the alternative case
   797 silent $\epsilon$-transitions. Otherwise this construction just
   795 shows the need of the NFA-nondeterminism. It seems awkward to form the
   798 becomes awkward. Similarly the alternative case shows the need of the
   796 `alternative' composition of two DFAs, because DFA do not allow
   799 NFA-nondeterminism. It looks non-obvious to form the `alternative'
   797 several starting and successor states. All these constructions can now
   800 composition of two DFAs, because DFA do not allow several starting and
   798 be put together in the following recursive function:
   801 successor states. All these constructions can now be put together in
       
   802 the following recursive function:
   799 
   803 
   800 
   804 
   801 {\small\begin{lstlisting}[language=Scala]
   805 {\small\begin{lstlisting}[language=Scala]
   802 def thompson(r: Rexp) : NFAt = r match {
   806 def thompson(r: Rexp) : NFAt = r match {
   803   case ZERO => NFA_ZERO()
   807   case ZERO => NFA_ZERO()
   811 
   815 
   812 \noindent
   816 \noindent
   813 It calculates a NFA from a regular expressions. At last we can run
   817 It calculates a NFA from a regular expressions. At last we can run
   814 NFAs for the our evil regular expression examples.  The graph on the
   818 NFAs for the our evil regular expression examples.  The graph on the
   815 left shows that when translating a regular expression such as
   819 left shows that when translating a regular expression such as
   816 $a^{\{n\}}$ into a NFA, the size can blow up and then even the
   820 $a^{?\{n\}} \cdot a^{\{n\}}$ into a NFA, the size can blow up and then
   817 relative fast (on small examples) breadth-first search can be
   821 even the relative fast (on small examples) breadth-first search can be
   818 slow. The graph on the right shows that with `evil' regular
   822 slow\ldots the red line maxes out at about 15 $n$s.
   819 expressions the depth-first search can be abysmally slow. Even if the
   823 
   820 graphs not completely overlap with the curves of Python, Ruby and
   824 
   821 Java, they are similar enough. OK\ldots now you know why regular
   825 The graph on the right shows that with `evil' regular expressions also
   822 expression matchers in those languages are so slow. 
   826 the depth-first search can be abysmally slow. Even if the graphs not
       
   827 completely overlap with the curves of Python, Ruby and Java, they are
       
   828 similar enough. 
   823 
   829 
   824 
   830 
   825 \begin{center}
   831 \begin{center}
   826 \begin{tabular}{@{\hspace{-1mm}}c@{\hspace{1mm}}c@{}}  
   832 \begin{tabular}{@{\hspace{-1mm}}c@{\hspace{1mm}}c@{}}  
   827 \begin{tikzpicture}
   833 \begin{tikzpicture}
   880     ytick={0,5,...,30},
   886     ytick={0,5,...,30},
   881     scaled ticks=false,
   887     scaled ticks=false,
   882     axis lines=left,
   888     axis lines=left,
   883     width=5.5cm,
   889     width=5.5cm,
   884     height=4cm, 
   890     height=4cm, 
   885     legend entries={Python, Java, depth-first NFA},
   891     legend entries={Python, Java 8, depth-first NFA},
   886     legend style={at={(0.5,-0.25)},anchor=north,font=\small},
   892     legend style={at={(0.5,-0.25)},anchor=north,font=\small},
   887     legend cell align=left]
   893     legend cell align=left]
   888   \addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
   894   \addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
   889   \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};  
   895   \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};  
   890   % depth-first search in NFAs
   896   % depth-first search in NFAs
   900 \end{axis}
   906 \end{axis}
   901 \end{tikzpicture}
   907 \end{tikzpicture}
   902 \end{tabular}
   908 \end{tabular}
   903 \end{center}
   909 \end{center}
   904 
   910 
   905 
   911 \noindent
       
   912 OK\ldots now you know why regular expression matchers in those
       
   913 languages are sometimes so slow. A bit surprising, don't you agree?
   906 
   914 
   907 \subsection*{Subset Construction}
   915 \subsection*{Subset Construction}
   908 
   916 
   909 Of course, some developers of regular expression matchers are aware of
   917 Of course, some developers of regular expression matchers are aware of
   910 these problems with sluggish NFAs and try to address them. One common
   918 these problems with sluggish NFAs and try to address them. One common
   929 a single state. So with DFAs we do not have to search at all.  What is
   937 a single state. So with DFAs we do not have to search at all.  What is
   930 perhaps interesting is the fact that for every NFA we can find a DFA
   938 perhaps interesting is the fact that for every NFA we can find a DFA
   931 that also recognises the same language. This might sound a bit
   939 that also recognises the same language. This might sound a bit
   932 paradoxical: NFA $\rightarrow$ decision of acceptance hard; DFA
   940 paradoxical: NFA $\rightarrow$ decision of acceptance hard; DFA
   933 $\rightarrow$ decision easy. But this \emph{is} true\ldots but of course
   941 $\rightarrow$ decision easy. But this \emph{is} true\ldots but of course
   934 there is always a caveat---nothing ever is for free in life.
   942 there is always a caveat---nothing ever is for free in life. Let's see what this
       
   943 actually means.
   935 
   944 
   936 There are actually a number of methods for transforming a NFA into
   945 There are actually a number of methods for transforming a NFA into
   937 an equivalent DFA, but the most famous one is the \emph{subset
   946 an equivalent DFA, but the most famous one is the \emph{subset
   938   construction}. Consider the following NFA where the states are
   947   construction}. Consider the following NFA where the states are
   939 labelled with $0$, $1$ and $2$.
   948 labelled with $0$, $1$ and $2$.
  1194 accepting ($Q_4$) and a state that is non-accepting (the other
  1203 accepting ($Q_4$) and a state that is non-accepting (the other
  1195 states).
  1204 states).
  1196 
  1205 
  1197 In Step 3 we need to fill in more stars according whether 
  1206 In Step 3 we need to fill in more stars according whether 
  1198 one of the next-state pairs are marked. We have to do this 
  1207 one of the next-state pairs are marked. We have to do this 
  1199 for every unmarked field until there is no change anymore.
  1208 for every unmarked field until there is no change any more.
  1200 This gives the triangle
  1209 This gives the triangle
  1201 
  1210 
  1202 \begin{center}
  1211 \begin{center}
  1203 \begin{tikzpicture}[scale=0.6,line width=0.8mm]
  1212 \begin{tikzpicture}[scale=0.6,line width=0.8mm]
  1204 \draw (0,0) -- (4,0);
  1213 \draw (0,0) -- (4,0);
  1253 \path[->] (Q_4) edge [loop above] node  {$a, b$} ();
  1262 \path[->] (Q_4) edge [loop above] node  {$a, b$} ();
  1254 \end{tikzpicture}
  1263 \end{tikzpicture}
  1255 \end{center}
  1264 \end{center}
  1256 
  1265 
  1257 
  1266 
       
  1267 \noindent This minimised DFA is certainly fast when it comes deciding
       
  1268 whether a string is accepted or not. But this is not universally the
       
  1269 case.  Suppose you count the nodes in a regular expression (when
       
  1270 represented as tree).  If you look carefully at the Thompson
       
  1271 Construction you can see that the constructed NFA has states that grow
       
  1272 linearly in terms of the size of the regular expression. This is good,
       
  1273 but as we have seen earlier deciding whether a string is matched by an
       
  1274 NFA is hard. Translating an NFA into a DFA means deciding whether a
       
  1275 string is matched by a DFA is easy, but the number of states can grow
       
  1276 exponentially, even after minimisation. Say a NFA has $n$ states, then
       
  1277 in the worst case the corresponding minimal DFA that can match the
       
  1278 same language as the NFA might contain $2^n$ of states. Unfortunately
       
  1279 in many interesting cases this worst case bound is the dominant
       
  1280 factor.
       
  1281 
       
  1282 
  1258 By the way, we are not bothering with implementing the above
  1283 By the way, we are not bothering with implementing the above
  1259 minimisation algorithm: while up to now all the transformations used
  1284 minimisation algorithm: while up to now all the transformations used
  1260 some clever composition of functions, the minimisation algorithm
  1285 some clever composition of functions, the minimisation algorithm
  1261 cannot be implemented by just composing some functions. For this we
  1286 cannot be implemented by just composing some functions. For this we
  1262 would require a more concrete representation of the transition
  1287 would require a more concrete representation of the transition
  1263 function (like maps). If we did this, however, then many advantages of
  1288 function (like maps). If we did this, however, then many advantages of
  1264 the functions would be thrown away. So the compromise is to not being
  1289 the functions would be thrown away. So the compromise is to not being
  1265 able to minimise (easily) our DFAs.
  1290 able to minimise (easily) our DFAs. We want to use regular expressions
       
  1291 directly anyway.
  1266 
  1292 
  1267 \subsection*{Brzozowski's Method}
  1293 \subsection*{Brzozowski's Method}
  1268 
  1294 
  1269 I know this handout is already a long, long rant: but after all it is
  1295 I know this handout is already a long, long rant: but after all it is
  1270 a topic that has been researched for more than 60 years. If you
  1296 a topic that has been researched for more than 60 years. If you